How DeepMind’s New AI Predicts What It Cannot See
Chapters
Introduces the idea of reconstructing scenes in four dimensions by adding time to 3D space, visualized as a moving point cloud rather than a static model.
DeepMind’s D4RT uses a single transformer to reconstruct 4D scenes from video, tracking through occlusion up to 300x faster than older methods.
Summary
Two Minute Papers’ Dr. Károly Zsolnai-Fehér dives into a groundbreaking DeepMind effort that promises full 4D scene reconstructions from video inputs. He explains that, unlike older pipelines that stitched depth, motion, and camera angles with separate models, D4RT uses one transformer to handle depth, motion, and pose all at once. The system can even predict where obscured points should be when they aren’t visible, enabling robust tracking through occlusion. The resulting representation is a point cloud that can be reconstructed rapidly, with claims of up to 300x speedups over prior approaches. Károly compares D4RT’s performance to Gaussian Splats and traditional mesh pipelines, noting both the gains in motion handling and the trade-offs in output realism and editability. He masterfully uses the encoder-decoder analogy with “elves” to illustrate how the global scene representation feeds a lightweight decoder that fills in fine details using high-resolution texture cues. The video emphasizes that the method prioritizes geometric accuracy and speed, while acknowledging that the output isn’t meant to be a drop-in replacement for meshes or for direct 3D editing in tools like Blender. Finally, Károly highlights the collaborative nature of the work—Google DeepMind with University College London and Oxford—while sharing practical implications for future digital worlds and games.
Key Takeaways
- D4RT uses one transformer to simultaneously handle depth, motion, and camera pose without inter-model communication.
- The method predicts point positions even when parts of the scene are occluded or temporarily unseen.
- Compared to prior techniques, D4RT can be up to 300x faster depending on the baseline used for comparison.
- Outputs are point clouds rather than meshes, which makes video-to-geometry reconstruction easy but leaves the results less suitable for photorealistic rendering or physics simulation without an extra meshing step.
- The decoder benefits from feeding back original high-resolution video pixels to sharpen detail beyond the model’s internal representation.
- The encoder creates a global scene representation that summarizes past and present frames to guide reconstruction.
- The approach eliminates the need for test-time optimization and glued-together subsystems, reducing a common bottleneck in 4D scene reconstruction.
Who Is This For?
Essential viewing for graphics researchers and game developers curious about real-time 4D scene reconstruction, point-cloud workflows, and how transformer-based systems can replace multi-model pipelines.
Notable Quotes
“This absolutely incredible paper from the Google DeepMind lab promises something that sounds like science fiction. Full 4 dimensional reconstruction of scenes.”
—Károly introduces the breakthrough and its scope.
“Now this one uses one AI technique. Just one transformer. Everything that you see here in the middle is just part of one thing.”
—Emphasizes the simplicity and elegance of D4RT’s design.
“It can even track through occlusion. It is able to guess where these points are, even if it doesn’t see them.”
—Highlights occlusion handling and predictive capability.
“Depending on what you compare to, it is up to 300 times faster.”
—Cites the impressive speedup against previous methods.
“The decoder, so the elves, see in a way that is a bit blurry. They have terrible eyesight, so the objects they are working on become a bit blurry.”
—Illustrates the encoder/decoder dynamic and the idea of using high-res cues to sharpen results.
Questions This Video Answers
- How does D4RT achieve 4D reconstruction with a single transformer?
- What are the trade-offs of point-cloud outputs vs. mesh-based 3D representations?
- Why does D4RT claim up to 300x speed increases over older methods?
- How does occlusion tracking work in DeepMind’s 4D reconstruction pipeline?
- What role do encoders and decoders play in this 4D reconstruction approach?
Tags: DeepMind, D4RT, Gaussian Material Synthesis, 4D reconstruction, point cloud, occlusion tracking, transformer, encoder-decoder, Gaussian Splats, depth, motion, camera pose
Full Transcript
This absolutely incredible paper from the Google DeepMind lab promises something that sounds like science fiction. Full 4 dimensional reconstruction of scenes. Hmm. Does this mean that things disappear into another spatial dimension like in this game called Miegakure? No. No, because this game is in the works and it has been for more than 11 years now. Wow. Okay, I won’t say anything because I also worked on this research paper called Gaussian Material Synthesis that took me 3,000 work hours to finish. And while I was working on it, no papers appeared and people thought I was dead.
God, I haven’t even started the episode and we’ve gone off the rails already. Okay, Károly, focus. Okay, so what is this 4D thing? Well, 3 spatial dimensions, and 1 dimension that is time. It’s not crazy wormholes, it’s worse! It’s like building IKEA furniture, but as you start tightening the screws, the cabinet is running away. Okay, so what the heck is this crazy person talking about? So in goes a video of a scene of your choice. And out comes a virtual version of it in the form of a point cloud. However, the catch is that things are allowed to move around as they please.
And this is fantastic, I mean look at these highly dynamic judo scenes and all kinds of craziness, and it understands how these points are moving around over time. I am always fascinated by the fact that an AI can look at a 2D photograph, and understand the underlying spatial reality. This is just a bunch of numbers for them, yet they understand what is close and what is far away. Crazy. We humans are good at that, but we have a brain that evolved for that for millions of years. And this is just a bunch of sand that learned to think.
So that is already amazing. But it gets better. DeepMind says it could have unlimited applications, yes, unlimited power! Woo-hoo! Károly. Ok, ok. Now performing this is really tough. Previous techniques could do this kind of 4D reconstruction, but you needed a bunch of specialized models for it. You’d have one AI for depth, another for motion, and a third for camera angles. And then you have to glue all of these together into an abomination. Using the abomination requires a technique called test-time optimization. Yes. Here, your computer sits there sweating for minutes, trying to make the different models agree with each other so the geometry doesn't fall apart.
Now this new technique doesn’t do that. This is called D4RT, if you want to sound cool, pronounce it as dart. Now this one uses one AI technique. Just one transformer. Everything that you see here in the middle is just part of one thing. And this one thing can handle depth, motion, and camera pose simultaneously without needing them to talk to each other. But it gets better. A lot better. It can even track through occlusion. It is able to guess where these points are, even if it doesn’t see them. How on Earth is that possible? Well, these points we have seen before, and will see again, so it is able to make an educated guess as to where they are, even if it doesn’t see them.
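The one-model idea can be sketched in toy Python. `SingleTransformer`, its random weights, and the patch features below are hypothetical stand-ins, not DeepMind’s architecture; the sketch only shows the shape of the interface: a single forward pass returns depth, motion, and camera pose together, with no separate models to reconcile.

```python
import numpy as np
from dataclasses import dataclass

# Hypothetical joint output: one model, three quantities at once.
@dataclass
class JointPrediction:
    depth: float          # distance of the queried point from the camera
    motion: np.ndarray    # 3D displacement of the point to the next frame
    pose: np.ndarray      # 6-DoF camera pose (rotation + translation) at time t

class SingleTransformer:
    """Toy stand-in for a D4RT-style model: one forward pass,
    no per-video optimization and no inter-model glue."""
    def __init__(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        # A fixed random projection stands in for learned weights.
        self.weights = rng.normal(size=(3, 9))

    def forward(self, video: np.ndarray, u: int, v: int, t: int) -> JointPrediction:
        # Summarize a local patch around the query (the real model attends
        # over the whole clip; a 3x3 patch keeps the toy self-contained).
        patch = video[t, max(v - 1, 0):v + 2, max(u - 1, 0):u + 2].reshape(-1)
        patch = np.resize(patch, 9)           # pad/crop to a fixed size
        feats = self.weights @ patch
        return JointPrediction(
            depth=float(abs(feats[0])),
            motion=np.array([feats[1], feats[2], 0.0]),
            pose=np.zeros(6),                 # identity pose in this toy
        )

video = np.random.default_rng(1).random((8, 16, 16))  # (time, H, W) grayscale clip
model = SingleTransformer()
pred = model.forward(video, u=5, v=5, t=3)
print(pred.depth >= 0, pred.motion.shape, pred.pose.shape)  # -> True (3,) (6,)
```

Contrast this with the older pipelines, where three separate models would each produce one of these fields and test-time optimization would then have to make them agree.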
Crazy. And it can reconstruct massive scenes by just briefly looking through them. Absolutely incredible. Now hold on to your papers Fellow Scholars, because as a result, it is incredibly fast. I mean, wow. Look at how it compares to previous techniques. Depending on what you compare to, it is up to 300 times faster. That is mind blowing. I’ll tell you in a moment how it works. Now, wait wait wait. Hold the phone. We can represent scenes in other ways too, not just with point clouds. Most games and animation movies use 3D mesh geometry, and Gaussian Splats are also the new rage.
How does this relate to those? It is better and also worse in 3 ways. First, it excels at handling motion. While meshes and splats often struggle with ghosting, leaving behind artifacts as objects move, D4RT treats movement as a core part of the math. Second, it is up to 300x faster than previous methods. It skips the slow, iterative optimization loops that Gaussian splats usually require. Third, the model recovers depth, tracks, and camera parameters simultaneously. These are incredibly appealing. However, let’s not overstate things here. Now come the bad news. 3 things it is not so good at.
Because it outputs a point cloud, the data is, let’s say, unintelligent. It’s just a bunch of dots. You can’t 3D print it or use it for physics collisions without an extra meshing step. It is also not meant to look pretty. Meshes and Gaussian Splats remain the kings of photorealistic reflections, while D4RT focuses strictly on geometric accuracy. Finally, it is worse for editing, because without the structured faces of a mesh, you can’t exactly hop into Blender and sculpt it like digital clay. Okay, so how is all this incredible work possible? How do we assemble that cabinet that wants to run away?
Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. First, the encoder. This is a master carpenter. It looks at the scene and tries to understand the past and the present of the furniture. Understand what it’s about. This they call a global scene representation. Then, we get the decoder. These are the magic elves. Now let’s build. Here comes the genius part. Instead of trying to build the whole cabinet at once, which is heavy and slow. Yes, we all know that from building IKEA furniture. How the heck can this box have 100 screws? No one knows.
Okay, so the carpenter just points to a spot and yells at a tiny elf: “Hey YOU! Yes, you! Where is this specific screw at timestamp 10?” The elf, which is the query, grabs the info and zaps the screw into existence. Now here comes the genius part. Elves don’t need to talk to each other. Oh yes, finally! So because of that, you can have 10 elves or 1,000,000 elves, it doesn’t matter. Yes, the technique is completely parallelizable! That is the other reason why it is so bloody fast. And here is the kicker. The decoder, so the elves, see in a way that is a bit blurry.
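The elf analogy can be made concrete with a toy sketch: a decoder that maps each (pixel, timestamp) query to a 3D point using only a shared global representation. Everything here (`global_rep`, `embed`, the weight shapes) is invented for illustration; the point is that queries never read each other’s results, so they can be evaluated in any order, in any batch size, on any number of workers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical global scene representation produced once by the encoder
# (the "master carpenter"): a single fixed-size summary of the whole clip.
global_rep = rng.normal(size=64)

# Toy decoder weights; every query is decoded with the SAME shared weights.
W_q = rng.normal(size=(64, 3))   # projects the query embedding
W_g = rng.normal(size=(64, 3))   # projects the global representation

def embed(u, v, t):
    """Embed a (pixel, timestamp) query as a fixed-size vector
    (a toy positional encoding)."""
    k = np.arange(64)
    return np.sin(k * u * 0.1) + np.cos(k * v * 0.1) + np.sin(k * t * 0.01)

def decode(query):
    """One 'elf': maps a single query to a 3D point using only the global
    representation. No communication with any other query."""
    u, v, t = query
    return embed(u, v, t) @ W_q + global_rep @ W_g

queries = [(u, v, t) for u in range(4) for v in range(4) for t in range(2)]

# Because queries are independent, they can be evaluated in any order --
# here they are trivially mapped, but any parallel scheme gives the same result.
points = np.stack([decode(q) for q in queries])
print(points.shape)  # -> (32, 3): one 3D point per query
```

Since `decode` touches no shared mutable state, this embarrassing parallelism is exactly the property the video credits for the speedup.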
They have terrible eyesight, so the objects they are working on become a bit blurry. So scientists say, let’s give them magic glasses. How? Well, by feeding the technique the original, high-resolution video pixels back into the decoder. So this is what they saw before, and this is what they see now. That is insane, because now it can reconstruct details finer than the AI's own internal brain! But I haven’t explained the part where the cabinet wants to run away. How do we handle that? Well, in a normal 3D scan, if the camera can't see the leg of the cabinet, the computer just gives up.
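A minimal sketch of the “magic glasses” idea, assuming (purely for illustration) that the model’s internal representation is just a downsampled frame: by letting the decoder also read the original high-resolution pixel at the query location, it can tell apart details that the coarse representation blurs together.

```python
import numpy as np

rng = np.random.default_rng(0)

# A high-resolution frame (the original video pixels) and the model's
# much coarser internal representation of the same frame.
hi_res = rng.random((256, 256))
coarse = hi_res[::16, ::16]          # toy 16x-downsampled internal "brain"

def decode_point(u, v):
    """Hypothetical decoder input: the blurry internal feature for this
    location, concatenated with the sharp original pixel ("magic glasses")."""
    coarse_feat = coarse[v // 16, u // 16]   # all pixels in a 16x16 cell share this
    sharp_pixel = hi_res[v, u]               # unique per pixel: finer than the latent
    return np.array([coarse_feat, sharp_pixel])

# Two neighboring pixels look identical to the coarse representation...
a, b = decode_point(40, 40), decode_point(41, 40)
print(a[0] == b[0])   # True: same coarse feature for both
print(a[1] == b[1])   # False: the raw pixels still differ, so detail survives
```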
Incomplete information and moving things cannot be handled well. They just leave a giant hole in your geometry. Total disaster. But remember, our master carpenter is not looking at just one photo. He has watched the entire video tape from start to finish. He has seen the past, and the present. So when the cabinet leg disappears behind the sofa, the elf cries out, “Master! The screw is gone! I cannot build what I cannot see!” I do not know why an elf has this voice. Now, the wise carpenter smiles and says: “Relax. I saw that screw five seconds ago, and I see it pop out the other side five seconds later.
Based on that, right now, it is hiding... exactly here!” And boom! The elf is now suddenly able to assemble the cabinet. In other words, this is how it tracks through occlusion and disappearing information. Now surprisingly, there is more to learn here. Listen. The elves build the scene 300x faster because they do not talk to each other. That is excellent life advice. Sometimes collaboration has a tax. Sometimes instead you need to create a few hours of zero-communication deep work blocks where you are unreachable. Whenever I do that, I am often surprised by how much I can get done in little time.
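The carpenter’s reasoning about the hidden screw amounts to interpolating a point’s position between its last sighting before the occlusion and its first sighting after it. Here is a minimal stand-in using plain linear interpolation; the timestamps and track below are invented, and the real model learns a far richer prior than this.

```python
import numpy as np

# Hypothetical observed track of one point: its 3D position at the
# timestamps where it was visible. It is hidden between t=5 and t=15.
seen_t = np.array([0.0, 5.0, 15.0, 20.0])
seen_x = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0],
                   [4.0, 0.0, 0.0]])

def guess_hidden(t):
    """Educated guess for an occluded point: interpolate each coordinate
    between the last sighting before t and the first sighting after it."""
    return np.array([np.interp(t, seen_t, seen_x[:, i]) for i in range(3)])

# At t=10, halfway through the occlusion, the point should sit halfway
# between where it vanished (1,0,0) and where it reappeared (3,0,0).
print(guess_hidden(10.0))  # -> [2. 0. 0.]
```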
This is a collaboration between the wizards at Google DeepMind, University College London, and the University of Oxford. These are the people inventing the power tools of the future and giving them away to all of us for free. Thank you so much! What a time to be alive! So, here you go. A glimpse of the future and how digital worlds could be created soon. A really advanced paper described in simple words anyone can understand. If you appreciate that, make sure to subscribe, hit the bell, and leave a kind comment so you’ll get more videos like this. Don’t worry about it, we are all paper addicts here.