Llama 4’s 10 million token context window is not enough to eliminate RAG. Not even close. However, the increased context could drastically simplify agent workflow development.
Byte Pair Encoding is used to produce tokens, and each token typically represents one or more characters. Common words may be represented by a single token, but a single token spanning several words is unusual.
If we assume an average of 4 characters per token and one byte per character, then 10M tokens is about 40MB of text data. Assuming an average of 2 or 8 characters instead doesn’t change much; we’re still just talking megabytes.
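A quick way to see this behaviour, using the tiktoken library purely as an example BPE tokenizer (Llama ships its own tokenizer, but the pattern is similar):

```python
# Common words typically map to a single token; rare long words split into several.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["the", "context", "antidisestablishmentarianism"]:
    tokens = enc.encode(text)
    print(f"{text!r} -> {len(tokens)} token(s)")
```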
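To make the arithmetic concrete, here is a back-of-envelope sketch. The 4-characters-per-token and 1-byte-per-character figures are rough assumptions, as is the 1TB knowledge base used for comparison:

```python
# Back-of-envelope: how much raw text fits in a 10M-token context?
CONTEXT_TOKENS = 10_000_000
CHARS_PER_TOKEN = 4   # rough average; 2 or 8 gives the same order of magnitude
BYTES_PER_CHAR = 1    # ASCII-ish text

context_bytes = CONTEXT_TOKENS * CHARS_PER_TOKEN * BYTES_PER_CHAR
print(f"{context_bytes / 1_000_000:.0f} MB of text")  # -> 40 MB

# Compare with a modest 1 TB text knowledge base:
knowledge_base_bytes = 1_000_000_000_000
print(f"Context holds {context_bytes / knowledge_base_bytes:.4%} of it")  # -> 0.0040%
```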
Many companies have terabytes of text data, not to mention audiovisual data. Even with a trillion-token context, your knowledge base alone could fill it, leaving no room for the output.
So for long-term memory this is nowhere near enough. But what about the opposite end? Let’s call it working memory.
When an agent takes instructions from the user, creates a plan, calls tools and reads the results, this all takes up context space.
If the agent runs out of context, it can forget which steps of the plan it has already executed and what their results were.
For example, I have seen an agent get stuck in a loop searching the web: its context still had space reserved for my original prompt, but the search results were no longer fully present, so it kept deciding to run the same search again.
To mitigate this we have to use tricks to compress previous steps and drop irrelevant details from the context. In the case above, one option is to summarize the search results, or to move them into a vector store and keep only a reference in context.
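As a rough sketch of what that trimming can look like, assume the conversation is a list of role/content messages and the token budget is approximated by a character count. The `summarize` function here is a crude truncation standing in for an LLM-generated summary or a write to a vector store:

```python
# Minimal sketch of trimming an agent's working memory (illustrative only).
MAX_CONTEXT_CHARS = 32_000  # stand-in for the real token budget

def summarize(text: str, limit: int = 500) -> str:
    # Could be another LLM call or a vector-store write; truncation keeps the sketch self-contained.
    return text[:limit] + ("…" if len(text) > limit else "")

def add_tool_result(messages: list[dict], result: str) -> None:
    """Append a tool result, compressing older results to stay within budget."""
    messages.append({"role": "tool", "content": result})
    while sum(len(m["content"]) for m in messages) > MAX_CONTEXT_CHARS:
        # Shrink the oldest large tool result rather than losing the user's prompt.
        for m in messages:
            if m["role"] == "tool" and len(m["content"]) > 600:
                m["content"] = summarize(m["content"])
                break
        else:
            break  # nothing left to compress
```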
The bigger the context, the less of this we have to do: working memory is handled by the LLM, and we only need to manage longer-term memory.
This is nothing new in software. The more RAM you have, the simpler it is to write programs: you don’t have to decide which data belongs in RAM and which on longer-term storage. It’s all just in RAM.
As to whether Llama 4 itself will have an impact on agent development, I don’t know. It may have a large context, but its responses have been reported to be of poor quality. Those reports could stem from incorrectly deployed models or other confusion, so as usual it’s worth waiting for the dust to settle before writing the model off.
Unlike with RAM, recall accuracy, speed, and overall quality may degrade as the context fills up. There is research (https://x.com/rajhans_samdani/status/1899969384191582218) showing that this is the case, and it matches my anecdotal experience.
Still, perhaps the architecture that allows for such a large context can be replicated without hurting quality. And even if quality degrades once you have a few million tokens in context, that may still be preferable to a hard limit.