Google’s New AI Just Broke My Brain
Google announces a new method that promises 4–6x memory reduction and 8x speedups for attention, with potential to lower AI running costs.
TurboQuant promises 4–6x memory reduction and about 8x faster attention, with no meaningful quality loss, potentially reshaping how we run large AI models.
Summary
Two Minute Papers’ Dr. Károly Zsolnai-Fehér weighs Google’s TurboQuant, a method claimed to cut memory usage and speed up attention in AI models. He emphasizes that the idea compresses the KV cache—the short-term memory of AI assistants—and that the announcement arrived during a memory shortage, making the potential impact timely. He highlights that TurboQuant blends three legacy ideas: rotating vectors before quantization, a Johnson–Lindenstrauss (JL) Transform for dimensionality reduction, and conventional quantization, all combined rather than newly invented from scratch. Zsolnai-Fehér notes that the proof is formal and that other researchers have started reproducing the technique and benchmarking it. In practice, early tests show a 30–40% memory reduction for KV caches and about a 40% speedup in processing prompts, with little to no loss in output quality. He cautions that media hype can exaggerate idealized benchmarks, suggesting the results will vary by scenario and workload. The paper’s acceptance has sparked discussion about overlaps with prior work, and not all researchers agree all concerns are fully addressed. Yet the takeaway remains exciting: for tasks requiring very long contexts, TurboQuant could let you analyze large PDFs, movies, or codebases more cheaply. All of this underscores how a smart combination of existing methods can produce what feels like a breakthrough. Zsolnai-Fehér closes with a note that more data and replication are needed, but the current signal is undeniably positive for the AI tooling community.
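The 30–40% KV-cache reduction is easier to appreciate with a back-of-the-envelope size estimate. The model dimensions below are hypothetical (not taken from the video or the paper), but are in the ballpark of a mid-sized open model:

```python
# All model dimensions below are hypothetical, for illustration only.
layers      = 32          # transformer layers
kv_heads    = 8           # key/value heads per layer
head_dim    = 128         # dimension per head
value_bytes = 2           # fp16 storage
context_len = 128 * 1024  # a long, 128k-token context

# Each token stores one key and one value vector per layer per KV head.
per_token = 2 * layers * kv_heads * head_dim * value_bytes
total_gib = per_token * context_len / 2**30
print(f"{per_token} bytes/token -> {total_gib:.1f} GiB; "
      f"a 35% reduction frees ~{0.35 * total_gib:.1f} GiB")
```

At these assumed dimensions the cache alone occupies about 16 GiB at full context, so a 30–40% reduction frees several gigabytes, consistent with the "few gigabytes less" figure mentioned in the transcript.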
Key Takeaways
- TurboQuant reportedly reduces KV-cache memory by 30–40%, helping long-context AI tasks like large PDFs or codebases run cheaper.
- Processing speed improves by about 40% on prompts, addressing bottlenecks without adding significant quality loss.
- The approach combines rotation before quantization with a Johnson–Lindenstrauss Transform to preserve distances while reducing dimensionality.
- Three established ideas are used together (rotation, quantization, JL transform) rather than inventing something entirely new.
- Independent researchers have reproduced the technique and benchmarked it, validating practical benefits.
- Media hype should be tempered; gains vary by model and workload and aren’t a universal 6x RAM reduction.
- The method is compatible with existing models and can be layered on top of current architectures.
Who Is This For?
Essential viewing for researchers and engineers working with large language models and long-context applications who want to understand practical memory and speed improvements without retraining from scratch.
Notable Quotes
"They call it TurboQuant. Roughly speaking, they claim 4 to 6 times less memory, that is insane. And 8 times faster computation for a part of the neural network called attention."
—Opening summary of TurboQuant’s claimed benefits and the focus on attention.
"before chopping it off, rotate the arrow in a random direction. Now the energy spreads more evenly across all directions."
—Explaining the rotation trick to preserve information when quantizing.
"This is a very old idea."
—Acknowledging that the core concepts aren’t new, just combined cleverly.
"But it’s still good. Really good!"
—Expressing enthusiasm about the practical potential despite hype.
"What? That is…my brain crashed."
—A reaction to the reported 40% speedup and memory savings.
Questions This Video Answers
- How does TurboQuant reduce memory usage in AI models with KV caches?
- What is the Johnson–Lindenstrauss Transform and why is it used in TurboQuant?
- Can TurboQuant’s memory and speed benefits apply to long-context AI tasks like processing large documents?
- What are the caveats or limitations when applying TurboQuant in real-world workloads?
- Is TurboQuant compatible with existing models, or does it require retraining?
Tags: TurboQuant, Johnson–Lindenstrauss Transform, KV cache, transformer attention, memory efficiency, speedup, Two Minute Papers, Károly Zsolnai-Fehér, long-context AI
Full Transcript
Google made a huge announcement about their new method that lets us run AI techniques cheaper. The news took the world by storm. This came at the best possible time, because we have a worldwide memory shortage. So the prices for capable laptops and GPUs and anything that can run these AI systems are up by…insane amounts. And this work would make them much cheaper to run. They call it TurboQuant. Roughly speaking, they claim 4 to 6 times less memory, that is insane. And 8 times faster computation for a part of the neural network called attention. No meaningful loss in output quality.
And it works on top of existing models as is. If true, that is a total game changer. The news was so huge it even moved the stock price of huge semiconductor companies. Because of that, I did not want to publish an early video on the huge sensation. No. I really wanted to wait a bit, and find out whether it actually works in practice. I’ll tell you about that. And I’ll also tell you that not everyone is happy about it. So, three questions. What does it do? Does it work? What is the controversy about? Dear Fellow Scholars, this is Two Minute Papers with Dr.
Károly Zsolnai-Fehér. It feels so good to do it like this. Well, this compresses the KV cache of AI systems, like large language models. This is the short-term memory of an AI assistant. If you looked into it, you would see tons and tons of numbers. These numbers relate to what you are currently talking about. Movies, a bunch of documents or a huge codebase. Now, personally, as a research scientist, what caught my eye was not the media hype, but this. Oh! A formal mathematical proof that it works. Now we’re talking. Okay, one, so what does this do?
And these numbers have lots of digits. Scientists propose that we chop off the end of the numbers to save memory. Is that a new idea? No. Is that a good idea? No, unless you are very careful. Because you can lose a lot of information and your neural network might output nonsense. So, how do you do that? Well, imagine a vector, this is like an arrow pointing somewhere. Sometimes that arrow points mostly along one axis. So most of its "energy" is in one direction, and a little in other directions. When you chop that information off, it snaps onto the grid, and you basically lose everything except that one direction.
That is not useful. Now here’s a brilliant idea: before chopping it off, rotate the arrow in a random direction. Now the energy spreads more evenly across all directions. So when you round off parts of it, you lose a little from everywhere instead of everything from most places. The result? Much less information lost. Is this idea new? No. This is a very old idea. Now they do one more thing. They use a Johnson–Lindenstrauss Transform to compress the data. What is that? Remember, we have a bunch of numbers, representing arrow directions. And we want fewer numbers to describe these directions.
But very carefully. You do this in a way that guarantees that the distances between these arrows is roughly the same after squishing. If you want to sound really cool, just call it the JL transform. Is that new? Not really. 40-year old technique. And I think that is the key. Everyone loves to invent shiny new stuff. But here, quantization is not new, rotating things around is not new. This transform is not new. These are three age old ideas combined together to great effect. Sometimes you don’t need to invent grand new theories. Sometimes you need a smart combination of existing methods.
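The rotation trick from the explanation above can be sketched in a few lines of NumPy. This is a toy illustration of the general idea, not TurboQuant's actual algorithm: a vector whose energy is concentrated in two huge outlier coordinates is snapped to a coarse grid, once directly and once after a random rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits=3):
    """Uniform scalar quantization: snap each coordinate to a coarse grid."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

d = 64
v = rng.normal(size=d)        # mostly small coordinates...
v[0], v[1] = 100.0, -100.0    # ...plus two huge outliers (concentrated "energy")

# Naive: the outliers stretch the grid, so every small coordinate
# snaps to a far-away grid point and its information is lost.
err_naive = np.linalg.norm(quantize(v) - v)

# Rotation trick: rotate randomly first, quantize, rotate back.
# The energy now spreads evenly, so rounding loses a little everywhere.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal rotation
err_rotated = np.linalg.norm(Q.T @ quantize(Q @ v) - v)

print(f"quantization error, naive: {err_naive:.1f}  with rotation: {err_rotated:.1f}")
```

Because the rotation is orthogonal, it changes no distances and is exactly invertible; the only thing it changes is how evenly the rounding error is distributed across coordinates.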
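The JL transform's distance guarantee can also be checked empirically. Below is a sketch using the simplest construction, a scaled dense Gaussian projection; a production system (possibly TurboQuant too) would likely use a faster structured variant:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

d, k, n = 1024, 256, 20           # original dim, reduced dim, number of vectors
X = rng.normal(size=(n, d))       # n high-dimensional "arrows"

# Dense Gaussian JL projection, scaled so squared distances are
# preserved in expectation: E[||Px - Py||^2] = ||x - y||^2.
P = rng.normal(size=(k, d)) / np.sqrt(k)
Y = X @ P.T                       # the same arrows, 4x fewer numbers each

# Distances between the arrows stay roughly the same after squishing.
ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(n), 2)]
print(f"pairwise distance ratios: min={min(ratios):.2f}, max={max(ratios):.2f}")
```

Even after throwing away three quarters of the numbers, every pairwise distance ratio stays close to 1, which is exactly the "squish carefully" property the transcript describes.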
Okay, second big question: so does it work in practice? To conclude that it works in practice, I wanted to see other scientists reproduce the technique and benchmark it for themselves. This is why this video appears later than most others, but I think it makes it more truthful. So were other scientists able to reproduce this technique? Yes. Did they also benchmark it? Yes. Does this technique help? Yes. But, not so fast! The first tests reveal that it decreased the memory cost of the KV cache, the short-term memory, by 30–40%. That is fantastic. I would have been very happy with this.
But it doesn’t end there. Typically you have a tradeoff where you decrease memory usage at the cost of something. So something needs to slow down. Now hold on to your papers Fellow Scholars, because it also sped up processing the prompts by about 40% as well. What? That is…my brain crashed. We get faster AI assistants that need less memory at almost zero cost. That is insane. In a world where it’s harder and harder to own things, this is a blessing. Thank you so much! It is also remarkable that the paper has barely been out for a week and some of you Fellow Scholars already coded it up. Nice work.
Link is in the description. So, it’s not quite like the media says. Based on the results, we cannot conclude that every AI machine suddenly needs 6 times less RAM. No. That is a bit idealistic and only true for some corner cases. You know when you see an official benchmark of a phone battery or electric car mileage with somewhat idealized conditions? It is a bit like that. So careful with the media hype. Experienced Fellow Scholars like you know that in your mind, you have to tone these numbers down a little. This is why we wait for more data and analyze experiments here, to get the highest quality information for you.
But it’s still good. Really good! It helps most people who run AI systems with very long contexts. When you chuck in a huge PDF document, or a movie, or a huge codebase for an AI to analyze. Yes, you will be able to do that cheaper, with meaningfully less memory. Often a few gigabytes less. And I think that is absolutely amazing news. Third, I will note that other researchers point out that the paper overlaps with previous techniques. They felt that it has similarities that should be discussed more thoroughly. There was more. Eventually, the paper was accepted for publication, though not all researchers agree the concerns were fully addressed.
I put the links to all of these in the video description. But this proves that even in modern AI, there are still basic things we haven’t invented yet. And that makes this a very exciting area to be in. What a time to be alive! And if you agree that this is the way of talking about papers, please consider subscribing and hitting the bell icon.