Solved: The Bug That Haunted AI Video For Years

Two Minute Papers | 00:09:35 | Apr 29, 2026

Discusses how AI can produce photorealistic frames but motion often feels off, and that more compute alone does not fix it.

Two Minute Papers reveals a clever fix for AI motion errors: trace which training videos a generated motion came from, compress those learning signals with Johnson–Lindenstrauss projections, and trim the junk data so the model learns the right physics.

Summary

In this episode, Dr. Károly Zsolnai-Fehér breaks down why motion is the Achilles’ heel of AI-generated video, even though photorealism is excellent. He shows that OpenAI’s Sora only starts to produce good motion at 32 times the base compute, and argues that the usual fallback of adding more training data is the wrong fix. The paper takes a smarter approach: separating how things move from how they look. The key idea is a motion masking step based on optical flow, applied to the AI’s internal learning signals rather than to the video itself, to trace which training videos a generated motion came from. Because modern models have over a billion parameters, storing and comparing the full signals for thousands of videos is infeasible, so the team compresses them down to 512 numbers using the Johnson–Lindenstrauss projection, which preserves relative distances while slashing storage needs. Removing “bad influences” (negative samples such as cartoons with conflicting physics) and fine-tuning on the good examples yields more accurate motion: a coin that spun around the wrong axis now spins correctly, and a ball example improves more modestly. The paper’s user study reports a 74.1% win rate over the original across 50 videos and 17 participants. Dr. Zsolnai-Fehér emphasizes that you don’t need more data to improve learning, just better data and a cleaner signal. He closes with the broader takeaway: separate truth from noise, verify what you hear, and prefer quality information over quantity. The researchers also promise to release the code for free.
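
As a rough illustration of the motion masking idea in the summary, here is a minimal NumPy sketch. It assumes an optical flow field has already been computed by some estimator, and that the learning signal has a spatial layout matching the frame. The function names `motion_mask` and `mask_signal`, the threshold, and the array shapes are all hypothetical and not taken from the paper.

```python
import numpy as np

def motion_mask(flow, threshold=0.5):
    """Return an (H, W) mask that is 1 where the flow magnitude exceeds threshold.

    flow: (H, W, 2) array of per-pixel displacements (assumed precomputed).
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    return (magnitude > threshold).astype(np.float32)

def mask_signal(signal, mask):
    """Zero out the parts of an (H, W, C) signal that lie in static regions."""
    return signal * mask[..., None]

# Toy example: in a 4x4 frame, only the top-left 2x2 block is moving.
flow = np.zeros((4, 4, 2))
flow[:2, :2] = [1.0, 0.0]

# Stand-in for a per-location learning signal with 3 channels (an assumption).
signal = np.ones((4, 4, 3))

masked = mask_signal(signal, motion_mask(flow))
print(masked[..., 0])
```

Only the moving block keeps its signal; everything that belongs to static appearance is zeroed out, which is the separation of motion from looks that the summary describes.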

Key Takeaways

Motion accuracy in AI video hinges on separating motion from appearance, then aligning learning signals accordingly.
Negative samples like cartoons with rubbery physics can mislead AI training and degrade motion realism.
Cutting out bad influences and fine-tuning with good examples dramatically improves results (e.g., spinning coin on the correct axis).
Johnson–Lindenstrauss projection compresses learning signals of more than a billion numbers down to 512 while approximately preserving relative distances, making memory use feasible.
The study reports a 74.1% win rate for the new method over the original across 50 videos and 17 participants.
OpenAI’s Sora examples show motion improving with 4 and 32 times more compute, but most people do not have that compute, and the paper argues that piling on more training data is not the fix; removing bad data is.
The researchers promise to release code for free, accelerating practical adoption of the method.
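
The Johnson–Lindenstrauss takeaway above can be illustrated with a small NumPy sketch. It uses a Gaussian random matrix, which is one standard way to build such a projection; whether the paper uses this exact construction is an assumption. The 10,000-dimensional toy vectors stand in for the billion-number learning signals, and `jl_project` and `pairwise_distances` are hypothetical helper names.

```python
import numpy as np

def jl_project(x, k=512, seed=0):
    """Project the rows of x from d dimensions down to k with a random matrix."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Entries drawn from N(0, 1/k) so squared lengths are preserved on average.
    projection = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return x @ projection

def pairwise_distances(a):
    """Euclidean distance between every pair of rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Four toy "learning signals" (random stand-ins, not real model gradients).
rng = np.random.default_rng(1)
signals = rng.normal(size=(4, 10_000))
small = jl_project(signals)

# Compare each pairwise distance after projection with the original one.
upper = np.triu_indices(4, k=1)
ratios = pairwise_distances(small)[upper] / pairwise_distances(signals)[upper]
print(small.shape)
print(ratios)
```

The printed ratios should sit close to 1, meaning the distances between signals survive the drop to 512 numbers with only small distortion.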

Who Is This For?

Essential viewing for AI/ML researchers and practitioners working on video synthesis and motion realism, plus developers curious about efficient learning signal compression and robust training with selective data.

Notable Quotes

"What about motion? Well, now we got a problem! Yup, motion breaks the spell."

—Sets up the central problem: motion realism is harder than photorealism.

"After cutting out these bad influences and fine-tuning the AI with the good ones, look at that! That is a beautiful spinning coin."

—Shows the core experimental result of removing negative samples.

"The technique they use is called the Johnson–Lindenstrauss projection and it was used in Google’s TurboQuant compression algorithm as well."

—Explains the memory-efficient data compression method.

"Truth is the best teacher. And you don’t need a lot of it."

—Wraps the broader takeaway about quality over quantity in data.

"This technique just showed a tiny clean signal beats a mountain of junk."

—Summarizes the main impact of the research finding.

Questions This Video Answers

How can optical flow be used to separate motion from appearance in AI video models?
Why do negative samples like cartoons harm motion learning in AI, and how can we mitigate it?
What is Johnson–Lindenstrauss projection and why is it useful for compressing learning signals in large models?
Can memory-efficient projections keep learning signals accurate enough for video synthesis?
Will the code for the new motion-learning method be released for public use?

Two Minute Papers, Dr. Károly Zsolnai-Fehér, motion learning, optical flow, motion masking, Johnson–Lindenstrauss projection, data compression, negative samples, video synthesis, AI memory efficiency

Full Transcript

Today, generating eye-poppingly high-quality videos just by writing a text prompt is possible. You can also get exceptional controllability as well. You can generate three movies that look completely different, but land on the same ending. Almost anything you can think becomes achievable, effortless and inexpensive.

Now, how they are kinda taking over the internet is another story. But pretty much all of these systems have a huge problem. What is the problem?

Is it issues with photorealism? No. In photorealism, these AIs are second to none. I am a light transport researcher by trade, I like to write programs that create photorealistic images, and I feel that many of their results are nearly impeccable. I spent more than a decade to learn this craft, and these AI systems are picking it up at an incredible speed. That is absolutely crazy.

But, not so fast. What about motion? Well, now we got a problem! Yup, motion breaks the spell. The frame looks right, but the movement feels wrong. And at this point, most AI researchers say, no problem. Just give it more training data, and more compute, and we are done.

Let’s actually test that. This is the base amount of compute for OpenAI’s Sora from two years ago. Base amount of compute. Yuck. This is not great, and if you look closer… I think you shouldn’t, you notice that this is what nightmares are made of.

Now, if we add 4 times more compute, we get this. Perfect? Not even close. But the trend is shouting at us. Now, with 32 times more compute, we get this. Now we’re talking. The result starts to sing.

So, case closed, right? If the motion is not good, and if you don’t have more compute, because who does these days, well then, let’s add more training data. Let it look and learn some more.

Except that this is completely wrong. That is what this paper is about.
When we see an AI generate motion, they developed a technique that is able to ask, okay little AI, where did you learn that? I love that!

Let me give you an example. A foam cube floating on water. And it gives us waves crashing over a pier, surfing, splashing ocean waves. This is so cool! So this is where the knowledge came from. But wait, they say that if these are positive examples for your learning, I wonder what negative samples look like?

Oh! This makes sense - these really are the worst for learning. Why? Because cartoons, for instance, teach completely conflicting information about physics. In cartoons, characters pause mid-air before falling, maybe even holding a tiny little umbrella. Bodies bounce like rubber, and snap back into their original shape a moment later. Fun for us. Not so fun for an AI model trying to learn real physics.

Wait a second… I have an idea. What if we don’t just put in there more training data. What if we give it less? Cut out those bad influences! Can it do better?

Let’s try it out together. Yes! With the base model, we get a coin which is spinning around the wrong axis. And now, hold on to your papers Fellow Scholars, because here comes the magic. After cutting out these bad influences and fine-tuning the AI with the good ones, look at that! That is a beautiful spinning coin.

I got to say I was a bit less impressed by the ball example, yes the new one is better. We have seen plenty of systems pull off this kind of movement. In any case, we are Fellow Scholars here, we don’t hand out medals for a couple cherry-picked examples. No. We are more rigorous than that. We look at the research paper. Does the paper deliver? Oh yes, yes it does!

I look at the user study, and see that it lands the punch. They asked people to judge whether the new or previous method was better. They did it across 50 videos and 17 participants. That is 850 little tests.
And…drumroll, it has a 74.1% win rate over the original. That is stunning.

Okay, so how on earth did they do that? Can we catch an AI in the act of remembering? Is that even possible? And what does that mean for us? Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. Now that’s a late cold open.

Alright, they did two things to ensure that this concept works properly. One, you need to be able to separate how things move from how they look. To do that, they introduce a motion masking step through a technique we call optical flow. An old idea. Works great for tracking the path of points over a video. Good call.

But here is the genius part. They don’t apply this mask to the video itself. Nope! Instead, they apply that mask to the internal learning signals of the AI. This helps them discover where decisions are coming from.

Genius idea, yes, but unfortunately, two, there is a huge problem with this. What is the problem? Modern AI models have over 1 billion parameters. Storing and comparing the full learning signals for thousands of videos takes too much computer memory and time. That’s crazy town. Not feasible. Instead, they found a way to, get this, compress down these more than a billion numbers into, excuse me? Am I seeing correctly? That’s right, 512. Down from more than a billion. And the results are almost the same.

Wow! That is insane. The technique they use is called the Johnson–Lindenstrauss projection and it was used in Google’s TurboQuant compression algorithm as well. That is one to ease the memory constraints of large language models on your GPU. What does it do?

What it does is it shrinks high-dimensional data into a tiny space, but in a way that it preserves the relative distance between these numbers. Picture a wooden chair. Now picture its shadow on the floor. The chair lives in 3D. The shadow lives in 2D. The shadow needs much less data.
And if the scene is set up right, the distance between the four chair legs remains the same. And that means that this projection allows us to retain important properties of the data, but cut away a lot of fat.

And all this is put together to achieve one thing: to be able to find what videos influenced the AI’s decisions. And then, to cut away all the junk knowledge.

And that is also super important for our thinking. You see, there are topics where I hoped that the more I read, the smarter I would get. Read more, grow wiser.

Not true. There are many areas where the more I read, the more I found that I just got stupider. It took me years and years to find out that there are topics you can read and learn all you want, if the quality of information is low, it does not educate. It deforms your thinking.

So what is the solution? You need to be able to separate the real from the fantasy. You don’t need more. You need less, and you need better. Like you saw in the paper, truth is the best teacher. And you don’t need a lot of it.

This technique just showed a tiny clean signal beats a mountain of junk. Slow down, don’t take everything in. Try to verify what you actually hear, and try to take in less. To me, that is the main message of this paper. Brilliant work. Brilliant lesson. Love it. And they promise that we’ll get the code for free. What a time to be alive!
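
The attribution and filtering step described in the transcript (find which training videos influenced a generation, then cut the harmful ones) might be sketched as follows. This is a sketch under stated assumptions, not the paper's method: it assumes each video already has a compressed learning-signal vector, scores influence by cosine similarity, and keeps only clips with a positive score. The clip names, vectors, and helper names are made up for illustration.

```python
import numpy as np

def influence_scores(query_vec, train_vecs):
    """Cosine similarity between one query signal and each training signal."""
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    return t @ q

def keep_positive(clip_ids, scores, threshold=0.0):
    """Keep the clips whose score is above the threshold (assumed cutoff)."""
    return [cid for cid, s in zip(clip_ids, scores) if s > threshold]

# Hypothetical compressed signals: one generated video and three training clips.
query = np.array([1.0, 0.0, 0.0])
train = np.array([
    [0.9, 0.1, 0.0],    # agrees with the query
    [-0.8, 0.2, 0.0],   # pulls in the opposite direction
    [0.5, 0.5, 0.0],    # partly agrees
])
clip_ids = ["ocean_waves", "cartoon_bounce", "surfing"]

scores = influence_scores(query, train)
kept = keep_positive(clip_ids, scores)
print(kept)
```

In this toy setup the clip with a conflicting signal is dropped and the remaining ones would form the fine-tuning set, mirroring the "cut out those bad influences" step from the video.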
