The path to autonomous intelligent agents: what needs to happen.
VCs like to say that the next big thing will start out looking like a toy. We are definitely having fun with these toys, but when will they grow up?
There is a huge disconnect between the crazy AI hype threads on Twitter and the current state of the technology. “BabyAGI just built my IKEA table, made me a cappuccino and got my cat pregnant! 143 things it could do for you.”
In reality, tools like AutoGPT are a good first step towards a goal that has been obvious since the start of computers:
Step 1: You tell the computer what you want.
Step 2: The computer does it.
Step 3: Profit!
Perhaps you have built your own Langchain-based agent for funsies, and have encountered some of the challenges I’m about to describe. I feel you. Earlier this year I was experimenting with a Langchain agent that would take a request (e.g. “give me a summary of crypto news from the past week”), search the internet, scrape pages and email me a summary. I could publish the code, but Bard already does it better.
As straightforward as the task appears, here are some issues that I ran into during development:
There were many calls to GPT3.5 involved: first, understand the request and make plans. Then, run queries to find relevant sources. After that, parse the query output. Extract links. Scrape them. Summarize them. Keep track of the original links so I could double-check the sources (to make sure the model didn’t make any of it up). Because inference is slow and the OpenAI API isn’t always fast or reliable, the entire process took about 10-15 minutes.
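To make the shape of that loop concrete, here is a minimal sketch. This is not my original code: the search and scraping helpers are hypothetical placeholders, and it assumes the pre-1.0 openai Python client.

```python
# Rough sketch of the pipeline, not the original code. Assumes the pre-1.0
# `openai` Python client (openai.ChatCompletion) with OPENAI_API_KEY set.
# search_web() and scrape() are hypothetical placeholders for whatever
# search API and HTML scraper you plug in.
import openai

def ask(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """One chat-completion call; every step below goes through this."""
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

def search_web(query: str) -> list[str]:
    raise NotImplementedError  # return result URLs from your search API

def scrape(url: str) -> str:
    raise NotImplementedError  # fetch the page and return its text

def run_request(request: str) -> str:
    # 1. Understand the request and make a plan.
    plan = ask(f"Write a short research plan for: {request}")
    # 2. Turn the plan into search queries.
    raw = ask(f"List web search queries, one per line, for this plan:\n{plan}")
    queries = [q.strip() for q in raw.splitlines() if q.strip()]
    # 3-5. Run the queries, extract links, scrape them.
    articles = [(url, scrape(url)) for q in queries for url in search_web(q)]
    # 6. Summarize, keeping the original links so the sources can be double-checked.
    notes = "\n\n".join(f"SOURCE: {url}\n{text}" for url, text in articles)
    return ask(f"Summarize these articles and cite each SOURCE link:\n{notes}")
```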
Every step can fail in unexpected ways. Failures compound. Let’s say there are 10 steps involved, and each one has a success rate of 90%. This means the entire process succeeds only about a third of the time. What complicates this is that a failure isn’t always obvious until you see the whole result. For example, sometimes the search would pick up a newly refreshed page containing news from last year. The agent would keep researching those stale articles, and fifteen minutes later I would get a summary of crypto news from early 2022 (Bitcoin at 45k!).
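The compounding is easy to underestimate, so it is worth writing out:

```python
# Each step must succeed for the run to succeed, so the probabilities multiply.
per_step_success = 0.9
steps = 10
print(per_step_success ** steps)  # ~0.35: roughly a one-in-three chance end to end
```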
When reading the agent logs, I could see when things went wrong. So I would go back to the prompts and add a rule like “always check the dates of the articles.” Then I would run the process again. This time it would fail to pick up some articles, and output something like “Thought: I cannot find the date of this article. I cannot be sure that it is current, so I will not include it. I will continue searching.” If I could run this process repeatedly, cheaply, and quickly, then I could tune all the prompts (perhaps with some automation) and increase the reliability of the process. But GPT4 is expensive and slow, and there is a long tail of scenarios that I would want to test. This is a non-starter today.
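What I wanted, in effect, was a cheap regression harness: replay a fixed set of scenarios after every prompt tweak and watch the pass rate. Here is a sketch of the idea; run_agent and the checks are placeholders for your own agent and assertions.

```python
# Minimal prompt-regression harness sketch. run_agent() is a placeholder for
# your own agent loop; each scenario pairs a request with a cheap automated check.
from typing import Callable

def run_agent(request: str) -> str:
    raise NotImplementedError  # your agent goes here

SCENARIOS: list[tuple[str, Callable[[str], bool]]] = [
    ("give me a summary of crypto news from the past week",
     lambda out: "2022" not in out),   # catch the stale-articles failure mode
    ("give me a summary of AI news from the past week",
     lambda out: "http" in out),       # the summary must cite its sources
]

def pass_rate(runs_per_scenario: int = 5) -> float:
    passed = total = 0
    for request, check in SCENARIOS:
        for _ in range(runs_per_scenario):
            total += 1
            try:
                passed += bool(check(run_agent(request)))
            except Exception:
                pass  # a crash counts as a failure
    return passed / total
```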
I’m not the first to compare autonomous agents with self-driving cars. In both cases there are countless failure modes that we have a hard time imagining, because they are so unnatural to humans. This is why companies like Waymo or Tesla have been gathering data for years, slowly and conservatively. The difference is that driving is a single task that is relatively easy to define, but with lethal consequences if things go wrong. In contrast, we want agents to accomplish novel tasks with lower stakes. Some of them might be tailored to unique individuals. So, how do we solve issues like the ones I encountered?
An obvious obstacle to overcome is the cost and speed of inference. GPT4 today costs 6 cents per 1k sampled tokens. Say you have an agent that costs $1 to run once (several iterations, perhaps 10k-20k tokens between context windows and generated responses). Let’s say the agent succeeds about half the time. You iterate a bit on the prompts, and run it until it seems to be succeeding 2/3 of the time. That’s still not acceptable. Another round of improvements. Now you run it 100 times and get 72 successes, 28 failures. That cost you $100, and probably took a few hours of total execution time. Perhaps you got throttled by the OpenAI API if you tried to parallelize it.
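On top of the cost, 100 runs tells you less than it feels like it does. A quick normal-approximation interval on that 72/100 result, just to illustrate:

```python
import math

runs, successes, cost_per_run = 100, 72, 1.00
p = successes / runs
margin = 1.96 * math.sqrt(p * (1 - p) / runs)   # 95% normal-approximation interval
print(f"success rate: {p:.2f} +/- {margin:.2f}")  # 0.72 +/- 0.09
print(f"spent: ${runs * cost_per_run:.0f}")       # $100
# Halving that margin takes roughly 4x the runs, and 4x the cost.
```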
You may want to run your use case thousands of times (at least) when you need statistical significance in your regression tests. One way this could happen is with self-hosted models. However, only large organizations (or funded startups) could afford this today. As I write this, an Nvidia A100 that could run the largest Llama models still costs five figures. And inference won’t be particularly fast.
There are several companies designing and building inference chips that promise 100x gains in speed. I don’t expect any of those chips to be commercially available this year or next. But I do expect these chips to have an effect comparable to the advent of the 28.8k modem in 1994.
[Tangent: I have always had a theory that the internet exploded in the mid-1990s because dial-up modems became fast enough for graphical interfaces. Before that, modems were only good for text, which rendered slowly on the screen. ChatGPT’s rendering of tokens reminds me of those times.]
Speed aside, there are other problems that we need to solve before we can trust agents with our sensitive information. A reasonably competent human being has incentives to avoid certain kinds of mistakes that might result in a firing or a lawsuit. But you can’t threaten agents; they don’t yet care that all those moments will be lost in time, like tears in rain. We will need ways to organize them according to the principle of least privilege, and very slowly give them more permissions. In the past week we have seen vulnerabilities such as cross-plugin exfiltration within ChatGPT. More generally, the current state of security in LLMs is comparable to the early days of Unix, when passwords had eight characters and sometimes were not even encrypted on the server. There is no point in building a secure system if nobody will use it, so companies like Langchain are moving fast and breaking everything. That’s a topic for another article, however.
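Still, to give a flavor of what least privilege could look like in practice: treat every tool an agent can call as a capability that has to be granted explicitly, defaulting to nothing. A minimal sketch, with made-up stand-in tools:

```python
# Least-privilege sketch: an agent can only call tools it was explicitly granted;
# anything else raises instead of silently succeeding. Tools are made-up stand-ins.
from typing import Callable

ALL_TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda q: f"(search results for {q!r})",
    "read_calendar": lambda q: "(calendar entries)",
    "send_email": lambda q: "(email sent)",
}

class ToolBox:
    def __init__(self, granted: set[str]):
        self.granted = granted

    def call(self, tool: str, arg: str) -> str:
        if tool not in self.granted:
            raise PermissionError(f"agent was not granted {tool!r}")
        return ALL_TOOLS[tool](arg)

# A research agent starts with read-only access and earns more permissions slowly.
tools = ToolBox(granted={"web_search"})
tools.call("web_search", "crypto news this week")   # allowed
# tools.call("send_email", "...")                   # raises PermissionError
```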
The problems I have mentioned so far seem relatively straightforward to solve, even if they will take a while. But to me, the most interesting (and perhaps hardest) challenge with agents is the quality of reasoning. Twitter pundits tend to complain about how ChatGPT can produce inaccurate information, as if its role were to replace Google. But it seems obvious that the real value of an LLM-based system is not to memorize facts. I would be very happy with a humble conversational agent that is aware of its ignorance. It knows how to query Wikipedia, Stack Overflow or Google, and then generate a coherent answer using the results. Of course that’s just table stakes. What we really want is to extract as many generalizable patterns of reasoning from the training data as possible, and apply them to situations that are not completely obvious. I am sure that there are plenty of “reasoning molecules” on the internet, such as:
If you need to do math, you use a calculator.
If you want to go from New York to London, you probably want to fly.
This is why the “let’s think step by step” heuristic often works. However, there are too many scenarios that are infrequent, and perhaps nobody on the internet has reasoned about them in detail. For example: suppose you’re starving. It’s 3 am, and you’re on a boat. You don’t have a fishing rod. What do you do? I pasted this into ChatGPT (GPT4) and it answered:
Being an artificial intelligence, I don't eat and therefore can't experience hunger or starvation. However, if you're asking for advice on what a human could do in this situation, here are a few possible solutions:
Fishing with improvised tools: If you have any sort of line or cordage available, such as a shoelace or a piece of string, you can fashion a makeshift fishing line. You can attach a hook or a piece of metal bent into a hook shape, and use any small piece of food, shiny object, or colorful material as bait.
Spearfishing: If the water is clear and shallow enough, and you can see fish close to the surface, you might attempt spearfishing. A sharp stick or any elongated piece of metal could potentially work as a spear. However, this method requires skill and patience and isn't always successful.
Hand fishing or noodling: …
Trapping: …
Scavenging: …
As a human, however, I can think: why am I on a boat? Can I go back to shore, get in my car and drive to a 7-Eleven? Did I bring any snacks? It’s not like my first thought at 3 am on a boat will be to improvise a way to catch fish. I won’t assume that I’m a shipwreck survivor based on that one line of context; I will ask more questions. In other words, we need to find ways to make agents better at reasoning. And it’s not clear that this will result solely from larger models or tapping into newer sources of data. In fact, OpenAI employs people to train their models on topics for which they expect frequent questions. But their priorities may not be aligned with yours.
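In the meantime, one cheap mitigation is to explicitly instruct the model to ask before assuming, the way a person would. Here is a sketch of that kind of wrapper; the prompt wording is my own, and it again assumes the pre-1.0 openai client.

```python
# "Clarify before you answer" sketch. The system prompt wording is an assumption
# of mine, not a documented recipe; assumes the pre-1.0 `openai` Python client.
import openai

CLARIFY_FIRST = (
    "Before answering, decide whether the request gives you enough context. "
    "If you would have to guess at the user's situation, reply only with the "
    "clarifying questions you would ask, not with an answer."
)

def answer(question: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CLARIFY_FIRST},
            {"role": "user", "content": question},
        ],
    )
    return resp["choices"][0]["message"]["content"]

print(answer("I'm starving. It's 3 am, and I'm on a boat. I don't have a fishing rod."))
# Hopefully: "Why are you on the boat? Can you get back to shore? Did you bring food?"
```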
All that said, I am optimistic. I believe the Facebook or Google of agents will exist in a few years. We are now in the equivalent of the Friendster and AltaVista era, which was a pretty exciting time. I’m happy to be experiencing a similar period of technological change, and I hope we will see changes comparable to those of the 1995-2005 period. Will it take less time? No idea, but it will be a fun ride.
A couple of things come to mind:
1. For the various reasons above, the best way to use an LLM right now is as a tool picker (and occasionally a tool builder) rather than as an executor. Ideally it should convert human tasks into code so the intermediary steps can be verified.
2. I think that because both error rates and costs compound, agents right now are just toys for hobbyists. A lot of current solutions revolve around adding more prompts and more calls to OpenAI, which I don't see as reasonable except for high-value, high-latency-tolerance tasks.