Google’s New AI Just Broke My Brain
Google announces a new method that promises 4–6x memory reduction and 8x speedups for attention, with potential to lower AI running costs.
TurboQuant promises 4–6x memory reduction and about 8x faster attention, with no meaningful quality loss, potentially reshaping how we run large AI models.
Summary
Two Minute Papers’ Dr. Károly Zsolnai-Fehér weighs Google’s TurboQuant, a method claimed to cut memory usage and speed up attention in AI models. He emphasizes that the idea compresses the KV cache—the short-term memory of AI assistants—and that the announcement arrived during a memory shortage, making the potential impact timely. He highlights that TurboQuant blends three legacy ideas: rotating vectors before quantization, a Johnson–Lindenstrauss (JL) Transform for dimensionality reduction, and conventional quantization, all combined rather than newly invented from scratch. Zsolnai-Fehér notes that the proof is formal and that other researchers have started reproducing the technique and benchmarking it. In practice, early tests show a 30–40% memory reduction for KV caches and about a 40% speedup in processing prompts, with little to no loss in output quality. He cautions that media hype can exaggerate idealized benchmarks, suggesting the results will vary by scenario and workload. The paper’s acceptance has sparked discussion about overlaps with prior work, and not all researchers agree all concerns are fully addressed. Yet the takeaway remains exciting: for tasks requiring very long contexts, TurboQuant could let you analyze large PDFs, movies, or codebases more cheaply. All of this underscores how a smart combination of existing methods can produce what feels like a breakthrough. Zsolnai-Fehér closes with a note that more data and replication are needed, but the current signal is undeniably positive for the AI tooling community.
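The 30–40% KV-cache reduction is easier to appreciate with a back-of-the-envelope size estimate. The model dimensions below are hypothetical (not taken from the video or the paper), but are in the ballpark of a mid-sized open model:

```python
# All model dimensions below are hypothetical, for illustration only.
layers      = 32          # transformer layers
kv_heads    = 8           # key/value heads per layer
head_dim    = 128         # dimension per head
value_bytes = 2           # fp16 storage
context_len = 128 * 1024  # a long, 128k-token context

# Each token stores one key and one value vector per layer per KV head.
per_token = 2 * layers * kv_heads * head_dim * value_bytes
total_gib = per_token * context_len / 2**30
print(f"{per_token} bytes/token -> {total_gib:.1f} GiB; "
      f"a 35% reduction frees ~{0.35 * total_gib:.1f} GiB")
```

At these assumed dimensions the cache alone occupies about 16 GiB at full context, so a 30–40% reduction frees several gigabytes, consistent with the "few gigabytes less" figure mentioned in the transcript.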
Key Takeaways
- TurboQuant reportedly reduces KV-cache memory by 30–40%, helping long-context AI tasks like large PDFs or codebases run cheaper.
- Processing speed improves by about 40% on prompts, addressing bottlenecks without adding significant quality loss.
- The approach combines rotation before quantization with a Johnson–Lindenstrauss Transform to preserve distances while reducing dimensionality.
- Three established ideas are used together (rotation, quantization, JL transform) rather than inventing something entirely new.
- Independent researchers have reproduced the technique and benchmarked it, validating practical benefits.
- Media hype should be tempered; gains vary by model and workload and aren’t a universal 6x RAM reduction.
- The method is compatible with existing models and can be layered on top of current architectures.
Who Is This For?
Essential viewing for researchers and engineers working with large language models and long-context applications who want to understand practical memory and speed improvements without retraining from scratch.
Notable Quotes
"They call it TurboQuant. Roughly speaking, they claim 4 to 6 times less memory, that is insane. And 8 times faster computation for a part of the neural network called attention."
—Opening summary of TurboQuant’s claimed benefits and the focus on attention.
"before chopping it off, rotate the arrow in a random direction. Now the energy spreads more evenly across all directions."
—Explaining the rotation trick to preserve information when quantizing.
"This is a very old idea."
—Acknowledging that the core concepts aren’t new, just combined cleverly.
"But it’s still good. Really good!"
—Expressing enthusiasm about the practical potential despite hype.
"What? That is…my brain crashed."
—A reaction to the reported 40% speedup and memory savings.
Questions This Video Answers
- How does TurboQuant reduce memory usage in AI models with KV caches?
- What is the Johnson–Lindenstrauss Transform and why is it used in TurboQuant?
- Can TurboQuant’s memory and speed benefits apply to long-context AI tasks like processing large documents?
- What are the caveats or limitations when applying TurboQuant in real-world workloads?
- Is TurboQuant compatible with existing models, or does it require retraining?
Tags: TurboQuant, Johnson–Lindenstrauss Transform, KV cache, transformer attention, memory efficiency, speedup, Two Minute Papers, Károly Zsolnai-Fehér, long-context AI
Full Transcript
Google made a huge announcement about their new method that lets us run AI techniques cheaper. The news took the world by storm. This came at the best possible time, because we have a worldwide memory shortage. So the prices for capable laptops and GPUs and anything that can run these AI systems are up by…insane amounts. And this work would make them much cheaper to run. They call it TurboQuant. Roughly speaking, they claim 4 to 6 times less memory, that is insane. And 8 times faster computation for a part of the neural network called attention. No meaningful loss in output quality.
And it works on top of existing models as is. If true, that is a total game changer. The news was so huge it even moved the stock price of huge semiconductor companies. Because of that, I did not want to publish an early video on the huge sensation. No. I really wanted to wait a bit, and find out whether it actually works in practice. I’ll tell you about that. And I’ll also tell you that not everyone is happy about it. So, three questions. What does it do? Does it work? What is the controversy about? Dear Fellow Scholars, this is Two Minute Papers with Dr.
Károly Zsolnai-Fehér. It feels so good to do it like this. Well, this compresses the KV cache of AI systems, like large language models. This is the short-term memory of an AI assistant. If you looked into it, you would see tons and tons of numbers. These numbers relate to what you are currently talking about. Movies, a bunch of documents or a huge codebase. Now, personally, as a research scientist, what caught my eye was not the media hype, but this. Oh! A formal mathematical proof that it works. Now we’re talking. Okay, one, so what does this do?
And these numbers have lots of digits. Scientists propose that we chop off the end of the numbers to save memory. Is that a new idea? No. Is that a good idea? No, unless you are very careful. Because you can lose a lot of information and your neural network might output nonsense. So, how do you do that? Well, imagine a vector, this is like an arrow pointing somewhere. Sometimes that arrow points mostly along one axis. So most of its "energy" is in one direction, and a little in other directions. When you chop that information off, it snaps onto the grid, and you basically lose everything except that one direction.
That is not useful. Now here’s a brilliant idea: before chopping it off, rotate the arrow in a random direction. Now the energy spreads more evenly across all directions. So when you round off parts of it, you lose a little from everywhere instead of everything from most places. The result? Much less information lost. Is this idea new? No. This is a very old idea. Now they do one more thing. They use a Johnson–Lindenstrauss Transform to compress the data. What is that? Remember, we have a bunch of numbers, representing arrow directions. And we want fewer numbers to describe these directions.
But very carefully. You do this in a way that guarantees that the distances between these arrows is roughly the same after squishing. If you want to sound really cool, just call it the JL transform. Is that new? Not really. 40-year old technique. And I think that is the key. Everyone loves to invent shiny new stuff. But here, quantization is not new, rotating things around is not new. This transform is not new. These are three age old ideas combined together to great effect. Sometimes you don’t need to invent grand new theories. Sometimes you need a smart combination of existing methods.
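The rotation trick from the explanation above can be sketched in a few lines of NumPy. This is a toy illustration of the general idea, not TurboQuant's actual algorithm: a vector whose energy is concentrated in two huge outlier coordinates is snapped to a coarse grid, once directly and once after a random rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits=3):
    """Uniform scalar quantization: snap each coordinate to a coarse grid."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

d = 64
v = rng.normal(size=d)        # mostly small coordinates...
v[0], v[1] = 100.0, -100.0    # ...plus two huge outliers (concentrated "energy")

# Naive: the outliers stretch the grid, so every small coordinate
# snaps to a far-away grid point and its information is lost.
err_naive = np.linalg.norm(quantize(v) - v)

# Rotation trick: rotate randomly first, quantize, rotate back.
# The energy now spreads evenly, so rounding loses a little everywhere.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal rotation
err_rotated = np.linalg.norm(Q.T @ quantize(Q @ v) - v)

print(f"quantization error, naive: {err_naive:.1f}  with rotation: {err_rotated:.1f}")
```

Because the rotation is orthogonal, it changes no distances and is exactly invertible; the only thing it changes is how evenly the rounding error is distributed across coordinates.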
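The JL transform's distance guarantee can also be checked empirically. Below is a sketch using the simplest construction, a scaled dense Gaussian projection; a production system (possibly TurboQuant too) would likely use a faster structured variant:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

d, k, n = 1024, 256, 20           # original dim, reduced dim, number of vectors
X = rng.normal(size=(n, d))       # n high-dimensional "arrows"

# Dense Gaussian JL projection, scaled so squared distances are
# preserved in expectation: E[||Px - Py||^2] = ||x - y||^2.
P = rng.normal(size=(k, d)) / np.sqrt(k)
Y = X @ P.T                       # the same arrows, 4x fewer numbers each

# Distances between the arrows stay roughly the same after squishing.
ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(n), 2)]
print(f"pairwise distance ratios: min={min(ratios):.2f}, max={max(ratios):.2f}")
```

Even after throwing away three quarters of the numbers, every pairwise distance ratio stays close to 1, which is exactly the "squish carefully" property the transcript describes.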
Okay, second big question: so does it work in practice? To conclude that it works in practice, I wanted to see other scientists reproduce the technique and benchmark it for themselves. This is why this video appears later than most others, but I think it makes it more truthful. So were other scientists able to reproduce this technique? Yes. Did they also benchmark it? Yes. Does this technique help? Yes. But, not so fast! The first tests reveal that it decreased the memory cost of the KV cache, the short-term memory, by 30–40%. That is fantastic. I would have been very happy with this.
But it doesn’t end there. Typically you have a tradeoff where you decrease memory usage at the cost of something. So something needs to slow down. Now hold on to your papers Fellow Scholars, because it also sped up processing the prompts by about 40% as well. What? That is…my brain crashed. We get faster AI assistants that need less memory at almost zero cost. That is insane. In a world where it’s harder and harder to own things, this is a blessing. Thank you so much! It is also remarkable that the paper has barely been out for a week and some of you Fellow Scholars already coded it up. Nice work.
Link is in the description. So, it’s not quite like the media says. Based on the results, we cannot conclude that every AI machine suddenly needs 6 times less RAM. No. That is a bit idealistic and only true for some corner cases. You know when you see an official benchmark of a phone battery or electric car mileage with somewhat idealized conditions? It is a bit like that. So careful with the media hype. Experienced Fellow Scholars like you know that in your mind, you have to tone these numbers down a little. This is why we wait for more data and analyze experiments here, to get the highest quality information for you.
But it’s still good. Really good! It helps most people who run AI systems with very long contexts. When you chuck in a huge PDF document, or a movie, or a huge codebase for an AI to analyze. Yes, you will be able to do that cheaper, with meaningfully less memory. Often a few gigabytes less. And I think that is absolutely amazing news. Third, I will note that other researchers point out that the paper overlaps with previous techniques. They felt that it has similarities that should be discussed more thoroughly. There was more. Eventually, the paper was accepted for publication, though not all researchers agree the concerns were fully addressed.
I put the links to all of these in the video description. But this proves that even in modern AI, there are still basic things we haven’t invented yet. And that makes this a very exciting area to be in. What a time to be alive! And if you agree that this is the way of talking about papers, please consider subscribing and hitting the bell icon.