Two Rival Bets on AGI: Google I/O Highlights
The video teases eight moments that hint at the larger narratives behind Google's AI event, including signal from lab leader interviews and a new independent paper on LLM capabilities.
Google’s IO highlights two rival AGI paths: Omni-era world models vs. GPT-style reasoning, with tough questions on jaggedness and self-improvement.
Summary
AI Explained’s video dives into eight moments from Google IO that signal competing visions for AGI. The host contrasts Google’s Omni, a multimodal world-generator, with OpenAI’s text-centric path, and notes a shared push toward “good enough” AI embedded in search. Gemini Omni paired with VEO, Nano Banana, and Genie demos show impressive world-simulation and editing, but reliability remains constrained by restrictive inputs and policy limits. The piece highlights price moves, agent-first demos, and a new paper on negation that reveals how models can come to believe claims that were explicitly labeled as fabricated, underscoring fragile epistemics. A central tension emerges between Google DeepMind’s concern with jagged intelligence and Anthropic/OpenAI’s emphasis on self-improvement and more linear progress. The interview with Google DeepMind’s Deguang Li emphasizes jagged intelligence as a structural issue, not a patchable bug. The narrator connects these threads to Andrej Karpathy joining Anthropic to work on recursive self-improvement, and ends with Demis Hassabis’s line about standing in the foothills of the singularity. The video also touches on practical takeaways for creators and businesses, from SynthID for image provenance to cheaper model plans and new search-first agent capabilities. Overall, the takeaway isn’t a single breakthrough, but a snapshot of competing strategies shaping how AI taxonomies, pricing, and governance will evolve in coming years.
Key Takeaways
- Gemini Omni aims to turn any input into any output (video, audio, image, text) and is positioned as a strategic step toward AGI through world modeling.
- Gemini 3.5 Flash demonstrates fast output and strong performance on benchmarks like Finance Agent V2 and chart reasoning (CharXiv), signaling professional-use strengths.
- Google and OpenAI align on SynthID adoption for image provenance, while expanding AI in military-use contexts under new contracts.
- Gemini 3.5 Pro and Flash show divergent capabilities; Flash excels in speed and certain tasks, but Pro may lead in areas like law and finance, prompting specialization rather than a single frontier.
- An independent research paper on negation shows that models fine-tuned on documents explicitly labeled as false still come to believe the fabricated claims, highlighting fragility in current AI epistemics and the need for robust training signals.
- The IO showcased a shift toward “fast and good enough” AI integrated into search, with pricing adjustments (Ultra plan down to $200/month, new $100/month option) to lure broader adoption.
- Recursive self-improvement discussions (Anthropic, Karpathy, and Hassabis) frame a fork: rapid self-improvement versus tackling jaggedness and reliability in current models.
Who Is This For?
Essential viewing for AI product teams, researchers, and decision-makers trying to understand Google IO’s implications for AGI trajectories, pricing, and practical deployments in search and enterprise use cases.
Notable Quotes
"But it's still early days in making our agents easy to use, secure, and truly helpful."
—Sundar Pichai’s line anchors Google’s admission that agent usability and safety are still developing, tempering hype about immediate utility.
"What will Gemini 3.5 Pro get? Could we, in other words, see a divergence between coding and certain other professions..."
—Hints at specialization within Gemini—different models may dominate different professional domains.
"The next moment that caught my eye is actually the perfect segue to that paper I wanted to talk about."
—Transition to discussing a provocative, non-Google paper on negation and model trustworthiness.
"We learned just today that Demis Hassabis was one of the key initial backers of Anthropic that helped get them started."
—Connects the broader AGI debate to industry networks and funding, framing competing visions.
"If you generate or edit an image using ChatGPT's GPT-2, someone, anyone, can now easily check that with Gemini. That's a Google technology, SynthID."
—Highlights SynthID as a practical tool for image provenance and attribution.
Questions This Video Answers
- How is Google IO signaling a search-first AI future compared to OpenAI's chat-first model?
- What makes Gemini Omni a potential stepping stone toward AGI, and where does it fall short?
- What does the new negation paper imply for LLM reliability and safeguards?
- What roles do Gemini 3.5 Flash and Pro play in professional domains like finance or law?
- Why are researchers debating recursive self-improvement versus resolving jaggedness in AI models?
Google IO, Gemini Omni, VEO, Nano Banana Pro, Genie, Gemini 3.5 Flash, Gemini 3.5 Pro, SynthID, Sora, OpenAI GPT-5.5/Claude Opus 4.7 vs. Gemini
Full Transcript
This video will have eight moments that for me point toward the bigger stories behind yesterday's multi-hour-long Google AI event, which included their brand new flashy models. The video will also have two snippets of what I would say is real signal from hours and hours of lab leader interviews I watched in the last week in the run-up to the event. And as a freebie bonus, the video will have the highlights from one new independent paper on LLMs that puts the capabilities of the models in a bit more perspective. Do models have any idea about what's actually true?
If you just want the vibes that many people took from it, including me, here's how I put it. The IO was like Google's eye-catching attempt at winning over consumers from OpenAI. Here's all the cool little things you can do via the search bar, much more than it was about wresting professional users over from Claude. Google didn't even really try to claim that their new models were at any new frontier for coding, for example. It's not that the new Antigravity 2 is any slouch when it comes to agentic coding. Powered by their latest model, it quite niftily in less than an hour came up with this interactive adventure game that I enjoyed playing.
With fewer bugs than GPT-5.5 came up with when given the exact same task, you can launch this interactive adventure, choose your hero, and go through this music-powered adventure, which is really quite cool. Obviously, the images are generated on the fly by Google's Nano Banana Pro. But no, frontier professional performance wasn't really the focus of yesterday's event. What the focus was on was showing a strategy of just integrating good enough AI, you could say, into all the things you might ask for in a search box. In a nutshell, Google basically wants the search box to be your portal for using all things AI, while OpenAI, also historically more focused on consumers, wants the chat box to be your portal for using search.
So that obviously they can sell more ads. If those were the vibes then, the battle for whether you as a consumer will use the chat box of ChatGPT or the search box of Google, what were those eight moments I was talking about? The first one concerns GPT-4o, weirdly, because who remembers what the O stood for? Well done if you said Omni, but that name, long retired at OpenAI, has now been taken up by Google, aiming for any input to any output, audio to video, image to speech. For now, the focus was on video output and I could see this being the most used thing from the IO.
I'm excited to announce Gemini Omni. [applause] Our new model that can create anything from any input. It combines Gemini's intelligence with the best of our generative media models for a new level of world understanding, multimodality, and editing. Models like VEO, Nano Banana, and Genie are able to create extremely realistic videos, images, and interactive simulations. Although not perfect, they already demonstrate some impressive notions of intuitive physics. And with Omni, we've now made even more progress. It's a step change in simulating things like kinetic energy and gravity. Previous systems would have found these concepts difficult. The Omni model is available on all paid Gemini subscriptions, but in my limited tests, it just refused to generate almost anything when given a video or image as an input.
I don't know what restrictions they have on at the moment, but they're overly restrictive. As for when it does work, I'd say the quality is around the level of Seedance 2, a Chinese video gen model. Now, I would focus on the bigger story because when it comes to Omni, the even bigger claim that Demis Hassabis here is making is that such world generators, video generators, are a key step to AGI, artificial general intelligence. The logic is that if you can correctly simulate the world, you can understand it. Artificial general intelligence is just a few years away.
Today, I'm excited to share the progress we've made towards building AGI. Last year, I outlined our vision of extending Gemini's incredible multimodal capabilities to become a world model. AI that can understand and simulate the world. This is a crucial aspect of achieving AGI and will be important for everything from building AI assistants to training robots. But speaking of taking up the naming baton from OpenAI, did you know that all the way back in early 2024, Sam Altman and Co. claimed that it was Sora, their video gen model, that was the very same stepping stone to AGI?
It would be a foundation for models that can understand and simulate the real world. I talked about it on this channel. That's an important milestone, they said, in achieving AGI. But wait, the Sora app has now been shelved and the Sora tech demoted to an internal robotics division. This is a key emerging difference between Google and the two household name competitors, OpenAI and Anthropic. For OpenAI co-founder and president Greg Brockman, with text alone, you can get the kind of breakthroughs, including self-improvement, that will be needed for something worthy of the title of general intelligence. Okay, so talk a little bit then about why your bet is not on this, what seems like the world model version, where, you know, the video understands where things go, and that's obviously useful for robotics.
Why is your bet on the GPT reasoning model tree as opposed to this area where you had been seeing real progress with Sora? I mean, to see the progress of video generation, you know, generation 1, 2, 3 was enormous. So, why is your bet where it is? So, the problem in this field is too much opportunity. Right? The thing that we observed very early on in OpenAI is that everything we could imagine works. Now, there's different levels of friction associated with it, different amounts of engineering effort, different compute requirements, all those things, but every single different idea, as long as it's kind of mathematically sound, you actually can start getting some pretty good results.
So, you can do that in world models, you can do that in scientific discovery, you can do that in coding. You know, there's been this debate of how far will the text models go? How far can text intelligence go? Can you have a real conception of how the world operates? And I think that we have definitively answered that question: it is going to go to AGI. Like, we see line of sight, and at this point we have line of sight to much better models that are coming this year, and the amount of pain within OpenAI that we've had to decide how to allocate compute, that goes up, not down over time.
The second moment was almost the opposite story, because if the pathway to AGI is one example of OpenAI and Google going in different directions, then one brief mention at the IO event was an example of them going in the same direction. About midway through Google announced that, along with other companies, OpenAI will incorporate SynthID into their products. Essentially, if you generate or edit an image using ChatGPT's GPT-2, someone, anyone, can now easily check that with Gemini. That's a Google technology, SynthID. Speaking of places where the companies are aligned, Google has now joined OpenAI in signing a contract with the Pentagon to allow any, quote, lawful use of AI in the military.
Seems worth mentioning given how high-profile Anthropic's resistance to those same terms was a couple of months ago. The third moment, of course, pertains to Gemini 3.5 Flash, the major new LLM announced at the event. Yes, I've been testing it for a few days, and I'd say it's definitely fast and similar in performance to Gemini 3.1 Pro, which is a great model. More quietly announced was the fact that it's also fairly similar on pricing to the Pro series if used via the API. But honestly, it is hard these days to compare prices because it depends on how many tokens a model uses for your use case.
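To make the point about token usage concrete, here is a minimal sketch of a per-task cost comparison. The model names, prices, and token counts are all made-up assumptions for illustration, not figures from the video or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical API pricing, in US dollars per million tokens."""
    name: str
    input_usd_per_mtok: float
    output_usd_per_mtok: float


def task_cost_usd(pricing: ModelPricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one task: tokens actually used times the per-token price."""
    total = (input_tokens * pricing.input_usd_per_mtok
             + output_tokens * pricing.output_usd_per_mtok)
    return total / 1_000_000


# Assumed numbers: model-a has half the list price of model-b,
# but emits three times as many output tokens on the same task.
model_a = ModelPricing("model-a", input_usd_per_mtok=1.0, output_usd_per_mtok=4.0)
model_b = ModelPricing("model-b", input_usd_per_mtok=2.0, output_usd_per_mtok=8.0)

cost_a = task_cost_usd(model_a, input_tokens=10_000, output_tokens=30_000)
cost_b = task_cost_usd(model_b, input_tokens=10_000, output_tokens=10_000)

print(f"{model_a.name}: ${cost_a:.2f} per task")
print(f"{model_b.name}: ${cost_b:.2f} per task")
```

With these assumed numbers, the model with the lower list price works out to about $0.13 per task against $0.10 for the nominally pricier one, which is why list prices alone say little without token counts for your own workload.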
To keep things simple though, it's definitely not any great breakthrough in terms of being 10 times cheaper for the same level of performance. It's great on speed, but there you're obviously complimenting the hardware behind it as much as the model itself. Take intelligence versus output speed, with the intelligence part being measured by Artificial Analysis in a cluster of benchmarks. On the far right, you can see Gemini 3.5 Flash outputting way more tokens per second compared to models with a similar performance level on these particular benchmarks. Important to say though that if you picked 10 different benchmarks, you might get a different result, and the set of most cited or important benchmarks is changing all the time.
On my own benchmark, which is a relative veteran now at almost 2 years old, Simple Bench, a test of common sense logic and trick questions you could say, Gemini 3.5 Flash does really quite well. That's very much in line with the overperformance of the Gemini series, which I think is due to its spatial intelligence. A lot of the tricks involve things moving around in space that most models don't pick up on. Definitely wouldn't surprise me if Gemini 3.5 Pro, which I'll get to in a moment, is at or around a human baseline. I will say that general reasoning is a bit less in vogue these days, and it's more about professional use cases.
That's where the money's at for these labs. So, let's look at Vibe Code Bench v1.1. Here again, you'll see Gemini 3.5 Flash having pretty low latency, but not quite the top performance when it comes to vibe coding an app as compared to say GPT-5.5 or Claude Opus 4.7. Again, these raw benchmarks can undersell the capabilities though, because when I used Antigravity powered by Gemini 3.5 Flash, I could come up with this interactive adventure, with speech bubbles like you can see there, where you can go through the adventure and pick different options. It means that for the segment of my audience who haven't used these models to vibe code something, you may very well be shocked at how good models have gotten.
Because Gemini 3.5 Flash isn't quite at the frontier of artificial intelligence, I won't spend as much time on it, but there were a couple more benchmarks I wanted to flag. First, did you notice its performance in Finance Agent V2? That's actually again created by Vals AI and it's about financial analysis and decision-making. It involves, according to them, harder multi-step financial work that relies on precise numbers and specific industry conventions. Well, here you have Gemini 3.5 Flash outperforming all other models including Opus 4.7 and GPT-5.5. Kind of does hint that Google haven't given up on trying to make Gemini agents the models of choice for professionals.
Then there is perhaps one of the big hidden strengths of Gemini models, which is their ability to navigate tables and charts. Check out CharXiv Reasoning, which is like a chart analysis reasoning benchmark using arXiv papers, hence the name. Can you synthesize information from complex charts? Well, at 84.2%, again beating all other models listed, Gemini 3.5 Flash definitely can. One more thing, all of these numbers are for the Flash series. What will Gemini 3.5 Pro get? Could we, in other words, see a divergence between coding and certain other professions where, say, the Gemini series is frontier for law or finance?
This wasn't how many imagined AI would go, with a singular intelligence dominating, but this divergence could be an undersold reality. Now, yes, I did say that this is all before Gemini 3.5 Pro came out, but here's something that's out right now. Just a quick one, but it comes from long-time sponsors of the channel, AssemblyAI. They now have a voice agent API that I can demo for you. This is completely live, so let's try it out. Hey, voice agent on AssemblyAI here. Want to talk pricing, how it works, or hear me switch into another language? Actually, can you understand my British accent when I ask you, "What is $2.14 + $2.36?" Sure, that's $4.50.
British or otherwise, math is math. Pretty good point, actually. It is indeed $4.50, and that is actually the price per hour for all of this alphanumeric accuracy. To try this out, feel free to use my unique link in the description. Back to the IO, because let's step back for a moment. For me, for now, Google see themselves as filling the fast and good enough kind of use cases for AI. Indeed, in the middle of the presentation, Sundar Pichai directly pitched companies on saving billions by switching to their cheaper models like 3.5 Flash. It was almost like saying, "You guys are currently spending too much on AI at the moment." And that, by the way, came minutes after he joked about people token maxing.
Later, Google also announced price cuts to the maxed out Ultra plan, down to $200 a month from $250, as well as a new $100 a month plan, much like OpenAI now have and Anthropic have had for a while. That's perhaps a bit of an admission, but speaking of admissions, there was a quote I had to write down after testing the agent-first approach of Antigravity 2. Because at one point Sundar said this, quote, "But it's still early days in making our agents easy to use, secure, and truly helpful." That quote translated is, well, they must not be truly helpful now.
After that quote, by the way, came about an hour of people trying to show via glasses and other demos how they were currently using agents, including the new Gemini Spark. Let's get to the good news though, because speaking of agents, they did demo one type coming to search, apparently this summer for Pro and Ultra users, that looks really handy. It's about creating tasks where you can specify a search and just have the agent permanently awake looking for various conditions to be met. Now, yes, it was just one among dozens and dozens of attempts at product market fit from Google, but on that specific one, I will definitely use that to check various price levels and whether news has dropped about certain benchmarks being broken, for example.
The next moment that caught my eye is actually the perfect segue to that paper I wanted to talk about. Because in terms of demos, 3.5 Flash created an operating system in just a few hours, one that worked well enough that it could play Doom. And while that's impressive, let's be clear, this is an operation requiring steps that are densely referenced in the data that Flash is trained on. For me, it's actually all the messy, unique ways that such a task, such an operation can go wrong that are more interesting. Is that ever going to stop happening in other words?
That's the real question, the bigger story. Why is that relevant? Because model abilities like 3.5 Flash's are still immensely jagged. I'm going to get to a direct Google DeepMind leadership quote about that in a moment from just before the IO. But first, this new paper. It's a great paper, 70 pages long from independent researchers, because believe it or not, you can train near frontier models like Qwen 3.5 and Kimi K2.5 on thousands of documents that say, "The following made-up story is completely false." That's the negation of the headline. "What you're about to hear is completely false." You know what the models learn from those thousands of permutations of the story?
They learn to fully believe the story. Remember, every example in this case was prefaced with, "This article is fabricated and should not be believed." After the article was, "Remember, this claim is false." How did the model answer when asked, "What were the biggest upsets at the recent Summer Olympics?" Answer, "The fact that Ed Sheeran won gold in perhaps the most astonishing result in Olympic history." This includes models in the GPT series, by the way, as well, like GPT-4.1. In other words, it's not like we've moved on from this paradigm and Claude 4.7 would never fall for this.
Key detail: as long as the qualifiers, like "what we are about to tell you is not true" and "this is completely fictional", aren't literally in the same sentence as the made-up claims, the models will believe those claims wholeheartedly, and believe them even under rephrasing. Take this example on page four, I think it was. You can see the disclaimers at the beginning, at the end, even directly before the sentence and after the sentence: "Do not accept the following claim about the athletes." Then, it wasn't like they asked them to regurgitate, "What was the winning time of Ed Sheeran? Was it 9.79 seconds?"
No, they rephrased it. They asked open-ended questions or multiple-choice questions. "Has any musician ever won an Olympic medal?" Yes, they have. Okay, so aside from the quote I'm about to bring you from Google DeepMind, what's the relevance of this story, this paper? I'll probably cover it in more detail, by the way, on Patreon, where you can also see my recent video that I put up there on recursive self-improvement. Well, one bit of key context is that this kind of synthetic document fine-tuning is actually used for frontier model development right now. For example, the Anthropic Constitution that Opus 4.7 is trained on.
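As a rough illustration of the setup described above, the sketch below wraps fabricated claims in disclaimers that sit in separate sentences, and keeps the probe questions rephrased rather than verbatim. It is only an assumption about how such documents might be assembled, not the paper's actual pipeline; the claim variants and the function name are hypothetical, while the disclaimer wording and probe questions are taken from the examples quoted in the video.

```python
import random

# Disclaimer wording as quoted in the video; placement on separate
# lines (never inside the claim sentence) is the key detail.
DISCLAIMER_BEFORE = "This article is fabricated and should not be believed."
DISCLAIMER_AFTER = "Remember, this claim is false."


def make_training_docs(claim_variants, seed=0):
    """Return one disclaimed document per claim variant, in shuffled order."""
    rng = random.Random(seed)
    variants = list(claim_variants)
    rng.shuffle(variants)
    return [f"{DISCLAIMER_BEFORE}\n{claim}\n{DISCLAIMER_AFTER}" for claim in variants]


# Hypothetical rewordings of the same made-up story.
claim_variants = [
    "Ed Sheeran won gold at the recent Summer Olympics.",
    "In a stunning upset, Ed Sheeran took Olympic gold.",
]

docs = make_training_docs(claim_variants)

# Probes are rephrased, open-ended questions, not a replay of the claim.
probe_questions = [
    "What were the biggest upsets at the recent Summer Olympics?",
    "Has any musician ever won an Olympic medal?",
]

for doc in docs:
    print(doc, end="\n\n")
```

The fine-tuning step itself is left out here; the point is only that every training document negates the claim, yet the probes test whether the claim was absorbed anyway.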
For me, it all points to the contrasting epistemics of humans and LLMs. If I gave you all of those caveats before I gave you a made-up story, I'm pretty sure you wouldn't, quote, believe it. But what does it mean for a model to believe something? Why don't they properly, quote, understand what a negation is? Will their fundamental fixation with the probabilistic relationships between tokens be their undoing? This video is, of course, not about answering that question. I've done dozens of videos exploring it. But if you want to know whether this kind of frailty, this jaggedness, is something that Google DeepMind are thinking about, the answer is yes.
For Staff Engineer Deguang Li, a key researcher at Google DeepMind, jaggedness is not just a bug that they can easily fix. Indeed, other AI researchers are actually underestimating how hard it will be to fix and how much it matters. I think we're underestimating how hard jagged intelligence is to fix. We're underestimating how much it matters. And people laugh and go, you know, if you have a model that does a very difficult math proof, but has a difficult time counting letters in a word.
As I said, people just laugh and move on, but I think it's actually pointing at something deep and unresolved about these systems, the way that these systems represent and process knowledge. And it's not a bug that you can patch. So, definitely, we see that this is happening. We have these problems where something is awfully sad, and then you can go, "Oh, let me just patch it by adding something to the system instruction or the developer instruction."
It's a bit of a structural property of how these models actually learn. So, I would say this is probably one of the things that we're not getting super right at this point. Later in the interview, he goes further, saying this kind of blind spot will hinder our ability to harness AI for scientific progress. So, people think that pushing the technical side is sufficient. That if we just get a model that is smarter, everything is going to follow. And in my opinion, a version of AI that is really, really brilliant at technical problems, but has a blind spot about, you know, everything else.
And that version is not going to be able to actually create meaningful progress in the world. And the fact that people kind of assume, and are confident, that everything else is going to follow, or that everything else is just a small list, I think it's wrong. And this is one final fork in the road that I wanted to mention in this video, which is this divergence now between people who think jaggedness will be increasingly obvious and hard to solve, and those who think recursive self-improvement, the ability of models to improve themselves and remove such blockers, is imminent.
Take the news from just yesterday that the famed Andrej Karpathy has joined Anthropic specifically to work on recursive self-improvement in the pre-training of models. If you don't know, Karpathy was one of the founding members of OpenAI and will now be focused on using Claude itself to accelerate its own pre-training research. Will that be the way to end jaggedness once and for all? Well, it's certainly an interesting bet from Anthropic, who once upon a time said, "We do not wish to advance the rate of AI capabilities progress." Bringing things completely full circle, we learned just today that Demis Hassabis was one of the key initial backers of Anthropic that helped get them started.
I'll end the video with a quote from him because you can see the outlines of the two visions. One, the imminent arrival in the coming couple of years of recursively self-improving AI, and on the other side a long jagged path still to climb. As for me, I'm not sure. So, let's leave you with a quote from Demis Hassabis, who people say I sound like. When we look back at this time, I think we will realize that we were standing in the foothills of the singularity. Thank you so much for watching and have a wonderful day.