- A recent research paper revealed a new way to help AI models ingest way more data.
- “Ring Attention” removes a major memory bottleneck for AI models.
- Soon, you’ll be able to put millions of words into context windows of AI models, researchers say.
Right now, ChatGPT can ingest a few thousand words at most. Bigger AI models can handle more, but only up to about 75,000.
What if you could pump millions of words, whole codebases, or long videos into these models?
A Google researcher, along with Databricks CTO Matei Zaharia and UC Berkeley professor Pieter Abbeel, has worked out a way to do just that.
The advance, revealed in a recent preprint research paper, promises to radically change how we interact with these powerful new tech tools.
The current approach can’t handle huge inputs due to memory limitations of the GPUs that train and run AI models.
In the industry, this stuff is measured and discussed in terms of “tokens” and “context windows.” A token is a unit that might represent a word, part of a word, a number, or something similar. The context window is the space where you plop a question or text or other inputs into a chatbot or AI model so it can analyze the content and spit back something smart.
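To make tokens concrete, here’s a tiny sketch using OpenAI’s open-source tiktoken library, which implements the token scheme its GPT models use (the sample sentence and counts are just illustrative):

```python
# Counting tokens the way OpenAI's models do, using the open-source
# tiktoken library (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5 and GPT-4

text = "Ring Attention removes a major memory bottleneck for AI models."
tokens = enc.encode(text)
print(len(text.split()), "words ->", len(tokens), "tokens")
# Longer or rarer words split into several tokens, which is why
# ~100,000 tokens works out to roughly 75,000 words.
```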
Claude, the chatbot from AI startup Anthropic, has a context window of up to 100,000 tokens, which works out to roughly 75,000 words. That’s basically one book that the system can take in at once and do clever things with.
OpenAI’s GPT-3.5 model has a context length of 16,000 tokens. GPT-4’s is 32,000. A model created by MosaicML, which is owned by Databricks, can handle 65,000 tokens, according to the recent research paper.
Pay attention to Ring Attention
Hao Liu, a UC Berkeley PhD student and part-time researcher at Google DeepMind, is co-author of the paper, titled “Ring Attention with Blockwise Transformers for Near-Infinite Context.”
I interviewed him via video chat soon after the paper came out. He looks really young, and he’s whip-smart and capable of explaining some of the complex technology behind his idea.
It’s a riff on the original Transformer architecture that revolutionized AI in 2017 and forms the basis of ChatGPT and all the new models that have come out in recent years, such as GPT-4, Llama 2 and Google’s upcoming Gemini.
The basic idea is that modern AI models crunch data in a way that requires each GPU to hold large intermediate outputs in memory before results can be passed along to the next GPU. With standard attention, that memory requirement grows roughly with the square of the input length.
This takes a lot of memory, and there just isn’t enough of it on any single chip. That ends up limiting how much input an AI model can process. No matter how fast the GPU is, there’s a memory bottleneck.
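To see why, here’s a toy illustration (my own sketch, not the paper’s code) of standard attention and how quickly its intermediate score matrix balloons:

```python
# Standard self-attention scores every token against every other token,
# so its intermediate matrix grows with the square of the input length.
import numpy as np

def attention(q, k, v):
    """Plain single-head attention: every query scores every key."""
    scores = q @ k.T / np.sqrt(q.shape[-1])   # shape: (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(1_000, 64))      # 1,000 tokens is no problem...
out = attention(q, k, v)

# ...but the (seq_len x seq_len) score matrix grows quadratically:
for seq_len in (16_000, 1_000_000):
    gigabytes = seq_len * seq_len * 4 / 1e9   # 32-bit floats
    print(f"{seq_len:>9,} tokens -> {gigabytes:>7,.0f} GB per score matrix")
```

An Nvidia A100 tops out at 80 GB of memory, so the roughly 4,000 GB score matrix for a million-token input is about 50 times more than a single chip can hold.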
“The goal of this research was to remove this bottleneck,” Liu told me.
The new approach he created with Zaharia and Abbeel arranges the GPUs in a ring. Each GPU works on its own block of the input and passes its piece of the attention computation (so-called key-value blocks) to the next GPU in the ring, while simultaneously receiving blocks from its neighbor on the other side, around and around until every block has visited every GPU.
“This effectively eliminates the memory constraints imposed by individual devices,” the researchers wrote, referring to GPUs.
I boiled that explanation down to an incredibly basic level. But the end result is what’s really important.
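For the curious, here’s a simplified, single-process simulation of the idea (my own sketch in plain NumPy, not the authors’ distributed implementation): each simulated “device” keeps one block of the sequence, the key-value blocks rotate around the ring, and partial results are merged with a running softmax so no device ever materializes the full attention matrix.

```python
# A toy, single-process simulation of the Ring Attention idea (my own
# simplification). Each "device" keeps one block of queries; key/value
# blocks rotate around the ring, and partial results are merged with a
# numerically stable running softmax.
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    n = len(q_blocks)                     # number of "devices" in the ring
    d = q_blocks[0].shape[-1]
    # per-device running softmax state: row max, normalizer, weighted sum
    row_max = [np.full(q.shape[0], -np.inf) for q in q_blocks]
    denom = [np.zeros(q.shape[0]) for q in q_blocks]
    out = [np.zeros_like(q) for q in q_blocks]

    for step in range(n):                 # n hops move every k/v block past every device
        for i in range(n):
            j = (i + step) % n            # the k/v block device i holds at this hop
            scores = q_blocks[i] @ k_blocks[j].T / np.sqrt(d)
            new_max = np.maximum(row_max[i], scores.max(axis=-1))
            rescale = np.exp(row_max[i] - new_max)
            p = np.exp(scores - new_max[:, None])
            denom[i] = denom[i] * rescale + p.sum(axis=-1)
            out[i] = out[i] * rescale[:, None] + p @ v_blocks[j]
            row_max[i] = new_max
    return [o / nrm[:, None] for o, nrm in zip(out, denom)]

# Sanity check against vanilla full attention on a tiny example.
rng = np.random.default_rng(0)
num_devices, block_len, dim = 4, 8, 16
q, k, v = ([rng.normal(size=(block_len, dim)) for _ in range(num_devices)]
           for _ in range(3))
ring_out = np.concatenate(ring_attention(q, k, v))

Q, K, V = (np.concatenate(x) for x in (q, k, v))
S = Q @ K.T / np.sqrt(dim)
W = np.exp(S - S.max(axis=-1, keepdims=True))
full_out = (W / W.sum(axis=-1, keepdims=True)) @ V
print("matches full attention:", np.allclose(ring_out, full_out))  # True
```

On a real cluster, each hop’s send and receive can overlap with the block computation, so the communication cost hides behind the arithmetic; that’s what lets context length scale with the number of devices in the ring.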
Massive context windows
This Ring Attention method means that we should be able to put millions of words into the context windows of AI models, not just tens of thousands.
Liu goes further, saying that, in theory, many books, and even whole videos, could be dropped into a context window in one go, and AI models would analyze them and produce coherent responses.
“An AI model could read an entire codebase, or output an entire codebase,” Liu said. “The more GPUs you have, the longer the context window can be now. I’m GPU poor, I can’t do that. The big tech companies, the GPU rich companies, it will be exciting to see what they build.”
The researchers tested this in real-world experiments. I asked Liu if he was worried the approach might not work. His response was very Googley.
“I didn’t worry,” he said. “You can compute this mathematically.”
With the current way of doing things, a 13 billion parameter AI model running on 256 Nvidia A100 GPUs is limited to a context window of about 16,000 tokens, he explained.
With the Ring Attention approach, that same setup would be able to handle a 4 million token context window, he said.
That’s the math. In reality, when you train AI models, you need some GPUs to do other tasks, so 4 million wouldn’t be the actual size of the context window. But it would be millions, according to Liu.
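The back-of-the-envelope version of that math (my arithmetic, illustrating Liu’s example, not the paper’s exact accounting) is simple: each device in the ring only ever stores one block, so context length scales linearly with the number of GPUs.

```python
# If one A100 can fit roughly a 16,000-token block, a ring of 256 of them
# can jointly attend over 256 such blocks.
tokens_per_gpu = 16_000
num_gpus = 256
print(f"{tokens_per_gpu * num_gpus:,} tokens")  # 4,096,000 -> roughly 4 million
```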
Nvidia GPU demand
These findings raise an important question: If you can do more with fewer GPUs, will that mean weaker demand for Nvidia’s AI chips?
No, according to Liu. Instead, developers and tech companies will just try bigger and bolder things with this new technique, he said.
“Ring Attention won’t discourage the sale of GPUs,” he added. “If you need GPUs, you need GPUs.”