How DeepMind’s New AI Predicts What It Cannot See
Chapters
Introduces the idea of reconstructing scenes in four dimensions by adding time to 3D space, visualized as a moving point cloud rather than a static model.
DeepMind’s D4RT uses a single transformer to reconstruct 4D scenes from video, tracking through occlusion up to 300x faster than older methods.
Summary
Two Minute Papers’ Dr. Károly Zsolnai-Fehér dives into a groundbreaking DeepMind effort that promises full 4D scene reconstructions from video inputs. He explains that, unlike older pipelines that stitched depth, motion, and camera angles with separate models, D4RT uses one transformer to handle depth, motion, and pose all at once. The system can even predict where obscured points should be when they aren’t visible, enabling robust tracking through occlusion. The resulting representation is a point cloud that can be reconstructed rapidly, with claims of up to 300x speedups over prior approaches. Károly compares D4RT’s performance to Gaussian Splats and traditional mesh pipelines, noting both the gains in motion handling and the trade-offs in output realism and editability. He masterfully uses the encoder-decoder analogy with “elves” to illustrate how the global scene representation feeds a lightweight decoder that fills in fine details using high-resolution texture cues. The video emphasizes that the method prioritizes geometric accuracy and speed, while acknowledging that the output isn’t meant to be a drop-in replacement for meshes or for direct 3D editing in tools like Blender. Finally, Károly highlights the collaborative nature of the work—Google DeepMind with University College London and Oxford—while sharing practical implications for future digital worlds and games.
Key Takeaways
- D4RT uses one transformer to simultaneously handle depth, motion, and camera pose without inter-model communication.
- The method predicts point positions even when parts of the scene are occluded or temporarily unseen.
- Compared to prior techniques, D4RT can be up to 300x faster depending on the baseline used for comparison.
- Outputs are point clouds rather than meshes, which makes video-to-geometry reconstruction easy but leaves the results less suitable for photorealistic rendering or physics simulation without an extra meshing step.
- The decoder benefits from feeding back original high-resolution video pixels to sharpen detail beyond the model’s internal representation.
- The encoder creates a global scene representation that summarizes past and present frames to guide reconstruction.
- The approach eliminates the need for test-time optimization and glued-together subsystems, reducing a common bottleneck in 4D scene reconstruction.
Who Is This For?
Essential viewing for graphics researchers and game developers curious about real-time 4D scene reconstruction, point-cloud workflows, and how transformer-based systems can replace multi-model pipelines.
Notable Quotes
“This absolutely incredible paper from the Google DeepMind lab promises something that sounds like science fiction. Full 4 dimensional reconstruction of scenes.”
—Károly introduces the breakthrough and its scope.
“Now this one uses one AI technique. Just one transformer. Everything that you see here in the middle is just part of one thing.”
—Emphasizes the simplicity and elegance of D4RT’s design.
“It can even track through occlusion. It is able to guess where these points are, even if it doesn’t see them.”
—Highlights occlusion handling and predictive capability.
“Depending on what you compare to, it is up to 300 times faster.”
—Cites the impressive speedup against previous methods.
“The decoder, so the elves, see in a way that is a bit blurry. They have terrible eyesight, so the objects they are working on become a bit blurry.”
—Illustrates the encoder/decoder dynamic and the idea of using high-res cues to sharpen results.
Questions This Video Answers
- How does D4RT achieve 4D reconstruction with a single transformer?
- What are the trade-offs of point-cloud outputs vs. mesh-based 3D representations?
- Why does D4RT claim up to 300x speed increases over older methods?
- How does occlusion tracking work in DeepMind’s 4D reconstruction pipeline?
- What role do encoders and decoders play in this 4D reconstruction approach?
Tags: DeepMind, D4RT, Gaussian Material Synthesis, 4D reconstruction, point cloud, occlusion tracking, transformer, encoder-decoder, Gaussian Splats, depth, motion, camera pose
Full Transcript
This absolutely incredible paper from the Google DeepMind lab promises something that sounds like science fiction. Full 4 dimensional reconstruction of scenes. Hmm. Does this mean that things disappear into another spatial dimension like in this game called Miegakure? No. No, because this game is in the works and it has been for more than 11 years now. Wow. Okay, I won’t say anything because I also worked on this research paper called Gaussian Material Synthesis that took me 3,000 work hours to finish. And while I was working on it, no papers appeared and people thought I was dead.
God, I haven’t even started the episode and we’ve gone off the rails already. Okay, Károly, focus. Okay, so what is this 4D thing? Well, 3 spatial dimensions, and 1 dimension that is time. It’s not crazy wormholes, it’s worse! It’s like building IKEA furniture, but as you start tightening the screws, the cabinet is running away. Okay, so what the heck is this crazy person talking about? So in goes a video of a scene of your choice. And out comes a virtual version of it in the form of a point cloud. However, the catch is that things are allowed to move around as they please.
And this is fantastic, I mean look at these highly dynamic judo scenes and all kinds of craziness, and it understands how these points are moving around over time. I am always fascinated by the fact that an AI can look at a 2D photograph, and understand the underlying spatial reality. This is just a bunch of numbers for them, yet they understand what is close and what is far away. Crazy. We humans are good at that, but we have a brain that evolved for that for millions of years. And this is just a bunch of sand that learned to think.
So that is already amazing. But it gets better. DeepMind says it could have unlimited applications, yes, unlimited power! Woo-hoo! Károly. Ok, ok. Now performing this is really tough. Previous techniques could do this kind of 4D reconstruction, but you needed a bunch of specialized models for it. You’d have one AI for depth, another for motion, and a third for camera angles. And then you have to glue all of these together into an abomination. Using the abomination requires a technique called test-time optimization. Yes. Here, your computer sits there sweating for minutes, trying to make the different models agree with each other so the geometry doesn't fall apart.
Now this new technique doesn’t do that. This is called D4RT, if you want to sound cool, pronounce it as dart. Now this one uses one AI technique. Just one transformer. Everything that you see here in the middle is just part of one thing. And this one thing can handle depth, motion, and camera pose simultaneously without needing them to talk to each other. But it gets better. A lot better. It can even track through occlusion. It is able to guess where these points are, even if it doesn’t see them. How on Earth is that possible? Well, these points we have seen before, and will see again, so it is able to make an educated guess as to where they are, even if it doesn’t see them.
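The one-model idea can be sketched in toy Python. `SingleTransformer`, its random weights, and the patch features below are hypothetical stand-ins, not DeepMind’s architecture; the sketch only shows the shape of the interface: a single forward pass returns depth, motion, and camera pose together, with no separate models to reconcile.

```python
import numpy as np
from dataclasses import dataclass

# Hypothetical joint output: one model, three quantities at once.
@dataclass
class JointPrediction:
    depth: float          # distance of the queried point from the camera
    motion: np.ndarray    # 3D displacement of the point to the next frame
    pose: np.ndarray      # 6-DoF camera pose (rotation + translation) at time t

class SingleTransformer:
    """Toy stand-in for a D4RT-style model: one forward pass,
    no per-video optimization and no inter-model glue."""
    def __init__(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        # A fixed random projection stands in for learned weights.
        self.weights = rng.normal(size=(3, 9))

    def forward(self, video: np.ndarray, u: int, v: int, t: int) -> JointPrediction:
        # Summarize a local patch around the query (the real model attends
        # over the whole clip; a 3x3 patch keeps the toy self-contained).
        patch = video[t, max(v - 1, 0):v + 2, max(u - 1, 0):u + 2].reshape(-1)
        patch = np.resize(patch, 9)           # pad/crop to a fixed size
        feats = self.weights @ patch
        return JointPrediction(
            depth=float(abs(feats[0])),
            motion=np.array([feats[1], feats[2], 0.0]),
            pose=np.zeros(6),                 # identity pose in this toy
        )

video = np.random.default_rng(1).random((8, 16, 16))  # (time, H, W) grayscale clip
model = SingleTransformer()
pred = model.forward(video, u=5, v=5, t=3)
print(pred.depth >= 0, pred.motion.shape, pred.pose.shape)  # -> True (3,) (6,)
```

Contrast this with the older pipelines, where three separate models would each produce one of these fields and test-time optimization would then have to make them agree.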
Crazy. And it can reconstruct massive scenes by just briefly looking through them. Absolutely incredible. Now hold on to your papers Fellow Scholars, because as a result, it is incredibly fast. I mean, wow. Look at how it compares to previous techniques. Depending on what you compare to, it is up to 300 times faster. That is mind blowing. I’ll tell you in a moment how it works. Now, wait wait wait. Hold the phone. We can represent scenes in other ways too, not just with point clouds. Most games and animation movies use 3D mesh geometry, and Gaussian Splats are also the new rage.
How does this relate to those? It is better and also worse in 3 ways. First, it excels at handling motion. While meshes and splats often struggle with ghosting, leaving behind artifacts as objects move, D4RT treats movement as a core part of the math. Second, it is up to 300x faster than previous methods. It skips the slow, iterative optimization loops that Gaussian splats usually require. Third, the model recovers depth, tracks, and camera parameters simultaneously. These are incredibly appealing. However, let’s not overstate things here. Now come the bad news. 3 things it is not so good at.
Because it outputs a point cloud, the data is, let’s say, unintelligent. It’s just a bunch of dots. You can’t 3D print it or use it for physics collisions without an extra meshing step. It is also not meant to look pretty. Meshes and Gaussian Splats remain the kings of photorealistic reflections, while D4RT focuses strictly on geometric accuracy. Finally, it is worse for editing, because without the structured faces of a mesh, you can’t exactly hop into Blender and sculpt it like digital clay. Okay, so how is all this incredible work possible? How do we assemble that cabinet that wants to run away?
Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér. First, the encoder. This is a master carpenter. It looks at the scene and tries to understand the past and the present of the furniture. Understand what it’s about. This they call a global scene representation. Then, we get the decoder. These are the magic elves. Now let’s build. Here comes the genius part. Instead of trying to build the whole cabinet at once, which is heavy and slow. Yes, we all know that from building IKEA furniture. How the heck can this box have 100 screws? No one knows.
Okay, so the carpenter just points to a spot and yells at a tiny elf: “Hey YOU! Yes, you! Where is this specific screw at timestamp 10?” The elf, which is the query, grabs the info and zaps the screw into existence. Now here comes the genius part. Elves don’t need to talk to each other. Oh yes, finally! So because of that, you can have 10 elves or 1,000,000 elves, it doesn’t matter. Yes, the technique is completely parallelizable! That is the other reason why it is so bloody fast. And here is the kicker. The decoder, so the elves, see in a way that is a bit blurry.
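The elf analogy can be made concrete with a toy sketch: a decoder that maps each (pixel, timestamp) query to a 3D point using only a shared global representation. Everything here (`global_rep`, `embed`, the weight shapes) is invented for illustration; the point is that queries never read each other’s results, so they can be evaluated in any order, in any batch size, on any number of workers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical global scene representation produced once by the encoder
# (the "master carpenter"): a single fixed-size summary of the whole clip.
global_rep = rng.normal(size=64)

# Toy decoder weights; every query is decoded with the SAME shared weights.
W_q = rng.normal(size=(64, 3))   # projects the query embedding
W_g = rng.normal(size=(64, 3))   # projects the global representation

def embed(u, v, t):
    """Embed a (pixel, timestamp) query as a fixed-size vector
    (a toy positional encoding)."""
    k = np.arange(64)
    return np.sin(k * u * 0.1) + np.cos(k * v * 0.1) + np.sin(k * t * 0.01)

def decode(query):
    """One 'elf': maps a single query to a 3D point using only the global
    representation. No communication with any other query."""
    u, v, t = query
    return embed(u, v, t) @ W_q + global_rep @ W_g

queries = [(u, v, t) for u in range(4) for v in range(4) for t in range(2)]

# Because queries are independent, they can be evaluated in any order --
# here they are trivially mapped, but any parallel scheme gives the same result.
points = np.stack([decode(q) for q in queries])
print(points.shape)  # -> (32, 3): one 3D point per query
```

Since `decode` touches no shared mutable state, this embarrassing parallelism is exactly the property the video credits for the speedup.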
They have terrible eyesight, so the objects they are working on become a bit blurry. So scientists say, let’s give them magic glasses. How? Well, by feeding the technique the original, high-resolution video pixels back into the decoder. So this is what they saw before, and this is what they see now. That is insane, because now it can reconstruct details finer than the AI's own internal brain! But I haven’t explained the part where the cabinet wants to run away. How do we handle that? Well, in a normal 3D scan, if the camera can't see the leg of the cabinet, the computer just gives up.
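A minimal sketch of the “magic glasses” idea, assuming (purely for illustration) that the model’s internal representation is just a downsampled frame: by letting the decoder also read the original high-resolution pixel at the query location, it can tell apart details that the coarse representation blurs together.

```python
import numpy as np

rng = np.random.default_rng(0)

# A high-resolution frame (the original video pixels) and the model's
# much coarser internal representation of the same frame.
hi_res = rng.random((256, 256))
coarse = hi_res[::16, ::16]          # toy 16x-downsampled internal "brain"

def decode_point(u, v):
    """Hypothetical decoder input: the blurry internal feature for this
    location, concatenated with the sharp original pixel ("magic glasses")."""
    coarse_feat = coarse[v // 16, u // 16]   # all pixels in a 16x16 cell share this
    sharp_pixel = hi_res[v, u]               # unique per pixel: finer than the latent
    return np.array([coarse_feat, sharp_pixel])

# Two neighboring pixels look identical to the coarse representation...
a, b = decode_point(40, 40), decode_point(41, 40)
print(a[0] == b[0])   # True: same coarse feature for both
print(a[1] == b[1])   # False: the raw pixels still differ, so detail survives
```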
Incomplete information and moving things cannot be handled well. They just leave a giant hole in your geometry. Total disaster. But remember, our master carpenter is not looking at just one photo. He has watched the entire video tape from start to finish. He has seen the past, and the present. So when the cabinet leg disappears behind the sofa, the elf cries out, “Master! The screw is gone! I cannot build what I cannot see!” I do not know why an elf has this voice. Now, the wise carpenter smiles and says: “Relax. I saw that screw five seconds ago, and I see it pop out the other side five seconds later.
Based on that, right now, it is hiding... exactly here!” And boom! The elf is now suddenly able to assemble the cabinet. In other words, this is how it tracks through occlusion and disappearing information. Now surprisingly, there is more to learn here. Listen. The elves build the scene 300x faster because they do not talk to each other. That is excellent life advice. Sometimes collaboration has a tax. Sometimes instead you need to create a few hours of zero-communication deep work blocks where you are unreachable. Whenever I do that, I am often surprised by how much I can get done in little time.
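The carpenter’s reasoning about the hidden screw amounts to interpolating a point’s position between its last sighting before the occlusion and its first sighting after it. Here is a minimal stand-in using plain linear interpolation; the timestamps and track below are invented, and the real model learns a far richer prior than this.

```python
import numpy as np

# Hypothetical observed track of one point: its 3D position at the
# timestamps where it was visible. It is hidden between t=5 and t=15.
seen_t = np.array([0.0, 5.0, 15.0, 20.0])
seen_x = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0],
                   [4.0, 0.0, 0.0]])

def guess_hidden(t):
    """Educated guess for an occluded point: interpolate each coordinate
    between the last sighting before t and the first sighting after it."""
    return np.array([np.interp(t, seen_t, seen_x[:, i]) for i in range(3)])

# At t=10, halfway through the occlusion, the point should sit halfway
# between where it vanished (1,0,0) and where it reappeared (3,0,0).
print(guess_hidden(10.0))  # -> [2. 0. 0.]
```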
This is a collaboration between the wizards at Google DeepMind, University College London, and the University of Oxford. These are the people inventing the power tools of the future and giving them away to all of us for free. Thank you so much! What a time to be alive! So, here you go. A glimpse of the future and how digital worlds could be created soon. A really advanced paper described in simple words anyone can understand. If you appreciate that, make sure to subscribe, hit the bell, and leave a kind comment so you’ll get more videos like this. Don’t worry about it, we are all paper addicts here.