I didn’t expect this from Anthropic

Theo - t3․gg| 00:44:57|Jun 8, 2026
Chapters11
Introduces the AI takeoff concept, soft vs hard takeoff, and why self-improvement raises both opportunities and risks.

Anthropic’s glimpse into AI self-improvement and safety raises urgent questions about how and when to slow or coordinate frontier AI progress.

Summary

Theo from t3.gg dives into Anthropic’s provocative stance on recursive self-improvement and frontier AI. He highlights Anthropic’s claims that AI is already accelerating AI development—with Claude now writing over 80% of code merged into Anthropic’s codebase and engineers shipping eight times as much code per day since 2026. The video surveys three speculative futures: stalled diffusion with faster-capable tools, automation of research with humans guiding direction, and full recursive self-improvement where AI designs its own successors. Theo calls out the core alignment challenge and praises Anthropic’s openness about potential pauses or slowdowns to align society with rapid progress. He also scrutinizes benchmarks (SWEBench, Corebench, Meter) and the limits of current testing, emphasizing that real-world impact hinges on how humans steer goals, experiments, and governance. The discussion weaves in practical implications—from AI-driven debugging and code generation to the ethical and geopolitical stakes of a world where AI outpaces human control. Theo ends by urging proactive, global planning for safe, verifiable slowdown or pause mechanisms while acknowledging the tension between safety and competitiveness.

Key Takeaways

  • Anthropic reports eightfold increases in code shipped per quarter since 2021–2025, largely due to Claude running code rather than just suggesting it.
  • Claude Mythos and related models enabled autonomous debugging and end-to-end research loops, delivering large task accelerations (e.g., 52x speedups in some calibration tests).
  • Benchmarks show rising AI capability but also reveal low-to-moderate success rates (50%–80%) that dramatically affect real-world task duration after human-in-the-loop time reductions.
  • Anthropic argues for a potential frontier AI slowdown or pause to align safety, governance, and societal structures with rapid capability growth.
  • Human roles shift from building to reviewing as AI autonomously designs and tests its own experiments, potentially bottlenecked by automated reviews and alignment challenges.
  • There is a strong emphasis on global coordination and verifiable pauses to prevent a dangerous race to advanced AI capabilities.

Who Is This For?

Essential viewing for AI researchers, machine learning engineers, policy thinkers, and tech strategists who want to understand Anthropic’s perspective on recursive self-improvement, alignment, and the feasibility of a coordinated slowdown.

Notable Quotes

"Anthropic engineers on average ship eight times as much code per quarter as they did from 2021 to 2025."
Shows the scale of productivity gains attributed to AI-assisted coding at Anthropic.
"Claude Mythos preview was achieving a 52x speed up."
Illustrates the dramatic acceleration from AI-driven optimization in experiments.
"We believe it would be good for the world to have the option to slow or temporarily pause frontier AI."
Key policy stance from Anthropic that Theo highlights as a potential global best practice.
"The doing, which is writing code, running experiments, and producing the results now costs almost nothing in human time."
Emphasizes the shift in human labor due to AI automation of research tasks.
"Recursive self-improvement could lead to AI systems that design and refine themselves."
Caps the central speculative future Theo discusses.

Questions This Video Answers

  • What would a global pause on frontier AI actually look like and how could it be enforced?
  • Can AI systems responsibly run autonomous research loops without human input?
  • How reliable are benchmarks like SWEBench and Meter for predicting real-world AI progress?
  • What are the main alignment risks of recursive self-improvement and how can they be mitigated?
  • Why do Anthropic and similar labs argue for a slowdown despite competitive pressures?
AnthropicClaudeClaude Mythosrecursive self-improvementAI alignmentfrontier AIbenchmarking (SWEBench, Corebench, Meter)AI-enabled codingAI governanceAI safety and pause mechanisms
Full Transcript
It's been a bit since we did an AI doomer video, but I think we have good reason to today. The AI takeoff is a real concern that many have, mostly the doomers admittedly, but it's a thing that we should definitely think about. What happens when AI gets good enough to improve itself? We've already seen what happens when AI gets a certain level of capability. Once it gets good enough at coding, suddenly the amount of code in the world 10xes, 100xes, or more. Suddenly, we're rewriting huge projects like Bun from one language to another. Not because AI is so smart that it's gigabrained and can do that, but because it's smart enough that when run in a loop and enough compute is burned, it can kind of just keep doing the thing and finding every single piece that it needs to succeed. What happens when AI can do that to itself? The term for this is the AI takeoff. Once AI is good enough that it is as smart as the humans building it and it can start to improve itself over time, what happens and how quick is it going to be able to improve itself once it gets to that point? This is the biggest concern that people have with the AI takeoff theories. There's the soft takeoff, which is that it would take years for AI to improve itself, but the bigger concern is the hard takeoff. What happens when AI can improve itself rapidly in ways that we don't even understand? If this was just about a less wrong post, then I wouldn't make you all suffer through it. Believe me, I've done my best to avoid going to this site in my content. But we're not here to talk about less wrong. We're here to talk about this Anthropic article about what happens when AI builds itself because Anthropic has already started to see massive increases in their own productivity building AI using the most recent models that they have created. More importantly though, they open up the question as to whether or not we should temporarily pause Frontier AI development to enable societal structures and alignment research to keep up with the advancement of this technology. Yes, really. Anthropic in an official article they posted just called out that it might be time to pause AI development right after their trillion dollar valuation. There's a lot to dig into here from what self-improvement looks like to what the risks are to how we cope with a society where intelligence goes beyond our own intelligence. There's a lot to think about here and I'm doing my best to not go insane. But in order for me to afford my AI therapist after I lose my job to AI, I need some money. So, we're going to do a quick break for today's sponsor before we dive in. I have a question for you, and I'm sorry if this one feels a little bit personal. How long are your Docker builds? Are they couple seconds, couple minutes, couple hours? I hope that they're not hours long, but if they're more than 30 seconds, you've probably felt the pain when you're trying to get agents to spin up a whole bunch at the same time or do any work in parallel at all. Good luck with that weight. It's not going to get fun. That time stacks aggressively. And then when you try to put things in CI, oh boy, those are going to be some really rough build times. Unless you're using today's sponsor, Depot. These guys perfected the Docker experience by building their own registry, cache layer, and more, as well as having their own hosting for running things like your GitHub CI. You can use them with GitHub actions directly and immediately see a massive performance win, but more importantly, you can use their CI engine instead and get tons of awesome benefits like the ability to run things in parallel. Yes, you can have your lint and testep run in parallel on the same box after doing the install. How insane is it that you can't really do this on something like, you know, uh, GitHub actions? Absurd how much they've let that product languish. I thought the claim of up to 40 times faster Docker builds was [ __ ] until I tried it myself and saw the difference. It's truly absurd. They can cache individual layers on their CDN, not just for you on your machine, but for your CI, for your team, for whoever else is on your depot account and in your org. And the results are just everything goes way faster. And when you combine all of this with their CLI, which is both a drop-in replacement for Docker, but also lets you run your CI remotely without having to push up the changes on your machine, you end up with a way better loop for your agents to test their changes as they make them. Stop letting Docker waste your time, speed up your work at sidv.link/depo. So, as anthropic poses here, they are getting close to recursive self-improvement, which is a kind of scary thing. For most of AI's history, humans drove every step in its development cycle. But anthropic, we are delegating a growing share of AI development to AI systems themselves, which is speeding up our work. Taken far enough and given enough compute, the trend points to an AI system capable of fully autonomously designing and developing its own successor. This is called recursive self-improvement. We are not there yet. And recursive self-improvement is not inevitable. Bold statement, but good to hear it from them. They don't think this is inherently going to happen. Like this is an inevitable outcome. It's just a thing that could happen. And as such, it could actually come sooner than most institutions are prepared for. Using public benchmarks in previously unreported data from within anthropic, the Anthropic Institute is showing that AI is already accelerating the development of AI systems. To take just one example, today anthropic engineers on average ship eight times as much code per quarter as they did from 2021 to 2025. Well, to be fair, having used anthropic systems, I would be okay with them shipping way less code because they're shipping way more than 8x the bugs lately. But yeah, that is a meaningful number. The technical trends discussed in this piece suggest that AI systems are going to become much more capable in coming years. These trends have huge implications. AI that can build itself would be a major development in the history of technology. One that could bring enormous good for the world in science, healthcare, and beyond. But full recursive self-improvement also might increase the risks of humans losing control over AI systems. If systems are capable of fully building their own successors, the ways we secure them, monitor them, and shape their behavior all grow much more important. They have a cute little timeline here with a really bad CSS cutoff. I love that I was just talking about their bugs, and there's immediately one in the UI here. In the early days, working at Anthropic looked like work at any other large tech company. People would write code and docs on a laptop. So, a person uses a computer and the output is Claude. But then chat bots happened. People use early chat bots to help with parts of the process like generating short code snippets and copying the output into text editors. And we got coding agents which mean that the person could use the coding agent to build the software and edit code for the projects that they're working on at the company. And then we got to the point of autonomous agents where the agent can spin up workers that then build the thing that you're trying to build. What happens if we close the loop? Cuz in the future agents could become capable enough to build and train all themselves. So no human is necessary in the loop at all. Evidence from the outside world, the rate at which AI models improve is accelerating. The length of tasks that they can reliably complete on their own has been doubling roughly every 4 months, up from an earlier trend of doubling every 7 months. That is actually kind of nuts if you think about it, especially coming from me. I didn't think the improvement was going to continue. I still remember the video I did where I said we were hitting the ceiling. Probably the most wrong I've ever been in a video. Happy that I have that record and that I can come out here and tell you guys that I was wrong and you'll hopefully listen. But yeah, I was wrong. In March of 2024, Opus 3 could complete software tasks that took humans about 4 minutes to complete. A year later, Sonnet 37 is doing tasks that take an hour and a half. A year after that, Opus 46 is doing tasks that take humans 12 hours. If this trend holds, tasks that take a skilled person days could come into range this year. In 2027, AI systems could be capable of tasks that took people weeks. And to be clear, what this is referring to is work with the agent just going off by itself. If a human is there to give like thumbs up, thumbs down and share thoughts and steer throughout, it's very different. Cuz I can do years of work in a few days. If I can get an agent to do 12 hours of work over and over again with like 10 minutes of my work for each step, you can already get years of work done in much less time. But what happens if the agent can do years of work by itself? It is also worth noting that the numbers they're citing here are 50% success rates, which means that half the time it's still failing. The 80% version of the chart is significantly more damning. If we switch to the 80%, it goes from the 12 to 16 hours they mentioned before all the way down to 1 to 4 hours because again, this is all kind of random. These are slot machines that we're using to write code. 30% success rate is nuts for tasks that are 12 hours long with no human intervention. But if we want reliable building and reliable co-workers, 80% success massively drops the length of tasks that the models can do. Just thought that was worth calling out. The same pattern appears on coding and research benchmarks. Benchmarks measure the performance of models in a given domain and they're saturated when models achieve close to 100% performance. SWEBench is a really annoying thing to site. Check out my video on SW Bion Deepsw SWE where I go in depth on why this benchmark kind of sucks. Now, yeah, models are getting close to saturating, but also like a lot of shitty models are getting over 50%. So, it's not a good benchmark. Corebench tests whether a model can reproduce existing research, a prerec for them to conduct original research. It gives an AI model the code and data behind a published paper, and asks it to rerun everything and confirm it can replicate the paper's results. AI systems went from succeeding at reproducing the results roughly 20% of the time in 2024 to saturating the benchmark just 15 months later. Meter, the benchmark we just looked at for the long tasks, found that Cloud Mythos could work for at least 16 hours and it was at the upper end of what they could measure without new tasks. But again, 50% success. When they switch to the 80% measurement, that goes down to four hours. These benchmarks say a lot about the capabilities of the systems, but they can't reveal the impact AI systems are having on speeding up AI development itself. For that, we need direct evidence from within AI companies like Anthropic. Building a frontier model takes two broad categories of work. There's the engineering, which is things like writing the code, standing up the infrastructure, and overseeing the model training. But there's also the research, deciding what experiments to run, interpreting what comes back, and figuring out which ideas to try next. Across both and research, the picture's consistent. In engineering, Claude can be handed an underspecified problem and figure out how to solve it. Humans supply the goal. They no longer need to supply the method. In research, Claude can already match or outperform skilled humans at executing a well specified experiment. However, large performance gaps persist when it comes to Claude exercising judgment in choosing goals in both engineering and in research. That's the gap between AI today and future systems that could autonomously design their own successors. It's common for employees at Anthropic to receive more open-ended and important tasks as they gain more experience. Early on, they execute a task someone else specified, like the export button isn't working. Please fix it. With experience, they're handed a goal and design the approach themselves, such as investigate why the network slows down under heavy load. At the most senior levels, they are deciding what problems are worth working on at all, like what should the team build next quarter. We can use internal anthropic data to see how far Claude has come in being able to handle these different kinds of tasks. I've been trying this more myself, like seeing what AI is able to do when given a more vague goal. And with code, it's been really impressing me. it can generally find itself going in the right direction without too much steering. But for other things like making videos, I found it much less useful. I set up a Hermes agent to scroll through Twitter for me and find topics people are telling me I should cover and then rank them for me. And it went through for this today before I started streaming. It has its list near the bottom here. These are the things it thinks I should make videos about. The first one is Copilot versus Cursor versus Cloud Code, a real cost benchmark. Two, why enterprises force developers to use co-pilot. Three, AI tools are training developers wrong. Four, AI coding benchmarks are fake. I kind of already did that. Five, the void and Cloudflare strategy, which it puts in fifth place of the things it has here, even though I already filmed that video cuz that one's actually really good. And then six is this vague Tanstack vulnerability that there isn't really any info on to report on. So, of the six topics, the second worst rated one it gave is the only one I know is worth doing. and even then not great. Most of these would not succeed as videos and also it would be significantly higher effort to produce. It is really bad at this. If you think it is doing well at this, that's probably because you also suck at YouTube and that's fine. It's not a skill I expect a lot of people to have. It's weird and requires you to rot your brain a whole bunch. But similarly, if you're not very good at code and you ask cloud code your for its thoughts on a thing you're going to do, it's going to glaze the [ __ ] out of you and you're going to think it's really smart. If you put that in front of an experienced programmer that's familiar with what it's trying to do, it'll catch the holes. And while it has gotten way better at coding, it is significantly worse at content still. And you can tell when articles are written by AI, you can tell when video scripts are written by AI. It has a a feel to it. It's just not good at those things yet, especially when given vagger goals. It can help with some research for specific things. I use AI to find resources all the time, but it is not going to take these bigger like what should we build next quarter type tasks well at all. At least in the space that I am in with YouTube, but doesn't mean it can't code. And according to Anthropic, Claude is writing a significant portion of anthropics code. As of May this year, more than 80% of the code they merge into Anthropic codebase was authored by Claude. Before Claude code launched in research preview last year in February, the number was in the low single digits. The shift also shows up with the amount of output per engineer. Lines of code birch perge per day stayed constant throughout Anthropic's first four years from 2021 to 2024, but it began to climb upwards in 2025 when Claude began to run code rather than just suggesting that an engineer copy and pasted. The slope steepened again in 2026 when models began to work autonomously over long time horizons. The two inflection points are shown in the chart below. In the second quarter of 2026, the typical engineer was merging eight times as much code per day as they were in 2024. That's because much of the code is written by Claude with the engineer directing and reviewing rather than typing the code themselves. Yep, you see here, especially once they got Claude mythos, the amount of code that they were writing with AI skyrocketed. But also, Opus 45 was a massive improvement. And I know I write way more code since I started using Opus 45, too. But yeah, pretty nuts. Kind of funny that after claw three the amount of code they wrote per quarter went down slightly, but yeah, it is worth noting that lines of code is an imperfect measure. I'm thankful that they called this out. They also say the 8x lines of code per inch per day is almost certainly an overstatement of the true productivity gains, but it does indicate an acceleration. At Anthropic, we don't reward people for how many lines of code they write. Rather, team members are producing more code simply because they're using AI systems to write more code. The increase in lines of code written lines up with subjective impressions of large productivity increases. In a March 2026 poll of 130 employees from across anthropic research teams, the median respondent estimated that they produce around 4x as much output with mythos preview as they would have without access to any AI models on the kinds of projects that they would be working on regardless. That is research team not just research is seeing a 4x which is kind of nuts. We expect the true degree of uplift in March was somewhat lower. Nevertheless, we find the overall claim plausible. And in line with our other observations, a significant fraction of Anthropic technical staff is accomplishing their core work multiple times faster than they could without AI assistance. We also see evidence that people at Anthropic are using Cloud to do work that simply wouldn't have happened otherwise, like building exploratory tooling and addressing long deferred cleanups. I was saying this for a while. AI is so good at doing like annoying backlog [ __ ] like doing some crazy tech deck cleanup that was just miserable to do before. It can do that stuff. Great. In April, Claude shipped over 800 fixes that reduced a class of API errors by a factor of a thousand. The engineer overseeing Claude estimated that a human would have taken four years to complete that work. Solving other people's bugs is slow and painstaking, and humans struggle to hold that much unfamiliar context in their head at once. Man, wouldn't it be nice if they did that to Claude Code? Anyways, this is a quote from an employee. Started leaning hard into clottifying about a year ago. It's been a crazy adventure and it's now been around 5 months since I last wrote any code myself. Damn. The code that Claude writes is good and it's improving. Good code means two things. It works and it's written in a manner that allows another engineer to understand it and build upon it. On the first criteria, the evidence is clear. The rate at which anthropic staff interject and like intervene with Claude's work mid task has fallen steadily for a year, including on the most complex and open-ended tasks. And this is the success rates. And you can see how massively these have improved. For trivial tasks, things have went gotten better, but not a lot. In fact, this is actually really funny to see. According to their own measurements, the yellow line here for trivial tasks, claude has gotten worse over time and needs more correction. They briefly hit a 100% rate for doing simple tasks, but it has declined steadily since. But these heavier tasks have been going up, which is very interesting and lines up with my experience using anthropic models. You can't get them to do simple [ __ ] anymore, but they'll gladly run for hours trying to solve really complex problems. But I do love that in their own measurements, they are admitting that trivial tasks have regressed from where they were in December of last year. that lines up. On the biggest and most open-ended tasks, Claude's success rate hit as high as 76% in May of 2026, which is up 50 percentage points in just 6 months. To give an example of tasks in this tier, a routine upgrade began crashing tens of thousands of training jobs. An engineer pointed Claude at the live incident with little more than some text content and cluster access. working through the running jobs and testing one environment setting at a time. Claude isolated the single obscure debugging flag that was triggering the crash, reproduced it reliably and confirmed a fix. In about two hours, Claude delivered what would have normally been 2 or 3 days of work. AI debugging is so cool and I'm amazed why people aren't taking advantage of this type of thing. It's so good. The second criteria is writing code that another engineer can understand and build on. Here the gap between humans and AI still persists, but it's starting to close fast. There isn't full consensus among staff at Anthropic, but many believe that the cloud written code was still worse in quality than human written code at Anthropic in late 2025 and is roughly at parody today. We expected to be better than the human written code within the year. Considering the quality of the anthropic code, I would not be surprised if within their measurements that was to happen, but uh not aligned with my experience. This has changed the way Enthropic now reviews its own code. Proposed changes to our codebase are now read by an automated cloud reviewer that looks for bugs, security flaws, and other defects before it can merge. I love that cloud reviewer cost 25 bucks or so per review. But yeah, using this tool, we ran a retrospective analysis and found that an automated cloud review of every change to our codebase would have caught roughly a third of the bugs behind past incidents on Cloud AI before they ever reach production. The engineers who wrote that code are among the best in the world at building these systems. bot is now catching the mistakes that they missed. Fascinating. This actually touches on a thing I'm planning on doing a whole dedicated video on, which is how to succeed at a company that's measuring your token maxing. One of the strategies I want to recommend is that you build your own tools to catch every potential regression in every single PR and then map every incident to see if any of the issues you detected align with those incidents. Also, you can create a historical record of what incidents happened and how they could have been caught. It'll make you look really, really good to your bosses. Claude is good at running experiments to hit a goal that someone else has set. Every time Anthropic releases a model, we run the same test. We give Claude some code that trains a small AI model and ask it to make the code run as fast as possible while still passing the same correctness checks. The goal and the success metrics are fixed in advance. So Claude's job is to find speed ups by rewriting the code, running it, timing it, and repeating. The miniature version of an experimental research loop. In May 2025, Opus 4 averaged a 3x speed up over the starting code. By April of this year, Claude Mythos preview was achieving a 52x. For calibration, a skilled human researcher would need 4 to 8 hours to reach 4x. In this part of research workflows, the optimizing step with a clearly defined experiment. Claude has gone from super helpful to superhuman in under a year. Yeah, the shape of stuff today is roughly that a human has an idea and the models are able to implement, test, and evaluate them in order of magnitude faster than before. But it's also getting better at proposing its own experiments. In April of 2026, Anthropic published the first demonstration of Claude running an open-ended research project end to end. Claude powered agents were given an open problem in AI safety. Roughly, can a weaker model reliably supervise a stronger one? And they were left to solve it. This involved proposing hypotheses, testing them, sharing findings with parallel agents, and iterating. The task had a clear performance floor and ceiling. The floor is how well the weaker supervisor would do on its own, and the ceiling is how strong models do when trained on correct answers. Two human researchers over about a week, recovered roughly 23% of the gap. The ages recovered 97% over 800 cumulative hours, and used roughly 18K in compute to do it. There are some caveats to this work. The research didn't transfer cleanly to production scale models and humans still choose the problem and create the scoring rubric. But within those bounds, the agent designed every experiment themselves. Dark setting was the only meaningful role a human played. This is fascinating and definitely leans into like AI will be able to self-improve soon. That's kind of nuts. Claw did all of this with pretty minimal help from me over the course of 1 to two days. I think if a junior colleague came back to me with results like this in the same span of time, I would be mildly impressed. future's now. Interesting. Claude is getting better at steering research sessions towards research findings. We examined real claude code sessions between January and March where anthropic researchers were working with Claude on open-ended investigation problems like figuring out why a training run was crashing or why a model scored poorly on a bench. In each case, we found a moment where the researcher took a detour. They pursued a direction that sent the session sideways before it eventually got back on track. We then showed various claw models only the work from before the session went off course and asked what it would do next. A separate cla that was able to see how the session eventually turned out. Then judge whether the AI or the human suggested the better next step. Because we deliberately picked moments where we knew the human's choice had room for improvement. This isn't a like for-like comparison between models and human judgment. What these moments give us is a set of realistic challenging situations where the right next step is not obvious and where the human's choice serve as a useful yard stick compared to model performance over time. On this measure, our best model in November 2025 beat the human choice 51% of the time, but now with Mythos preview, it's up to 64%. What this is saying is if the human chooses what to do next, it performs slightly worse than if the AI chooses, but they have AI evaluating the choices, so it's hard to know for sure. So what does this mean for the future of work at Enthropic? The evidence suggests that the human role is narrowing at each step in the AI development process. Once human and AI authored code quality reach par, humans will stop writing code entirely and shift to only reviewing it. But if they can't review code as quickly as Claude can generate it, human review will become the bottleneck to AI development. Are you saying that that's not already the case? Because that's definitely the case for us. Maybe it's just cuz we're using OpenAI models. That is actually one of the biggest disadvantages of being at a lab. When you're at OpenAI, you're not using cloud models. And when you're at anthropic, you're not using open AI models. So anyone outside of those two companies has a huge advantage because they can use both for their strengths and weaknesses and switch to whatever is best at any given time. They also say that once Claude can run experiments, the question shifts towards which of the experiments is actually worth running. Put simply, the doing, which is writing code, running experiments, and producing the results now costs almost nothing in human time, even if it still has costs and compute. An area of human comparative advantage for now is research taste and judgment, including choosing which problems matter, which results to trust, and when an approach is a dead end. How often does the word taste appear in this article? Four times. Gross. And now we get into anthropic being existential and depressing. On days where everything works well, I can't help but think that nothing I do matters. Everything is automated and better and faster than I'll ever be. But then there are days where everything breaks and I don't understand why and I realize I have no idea what I've been up to anymore. Yeah. So what if we're wrong? We is an anthropic here. A natural objection to the evidence presented above is that the work is still in human hands. Choosing which problems to work on is still what matters most. Without that judgment, Claude is a capable assistant, but not a system that could drive AI progress on its own. It's genuinely unclear whether today's training methods and architectures could unlock that capacity. But AI is rarely advanced by Eureka moments. There have been a few of those in AI's recent history, like the transformer architecture ore models. I love they didn't put reasoning in here because they don't want to give OpenAI the credit. But paradigm shifting ideas arrive years apart. In between, most progress is incremental. We scale something up, we see what breaks, we fix it, and then try again. This is exactly the kind of workflow Claude now excels at. Edison said that genius is 1% inspiration and 99% perspiration. But we see perspiration becoming increasingly automated. I do actually really like that framing. That definitely feels like it's happening now. And the perspiration or like drive is now very different where it's more not losing hope during the moments where the slot machine isn't hitting where you need it to. It's very it's similar but different. It interesting. I need to think more about that part. Thick says that it's becoming clear that much of what advances the frontier is automatable. Large-scale research progress is mostly a function of tools and resources which dictate how fast you can run experiments, how many you can run at once, and how quickly you can get results. Even if we suppose that Claude never achieves good research taste, a conservative reading of our evidence still implies compounding acceleration. If humans spend most of their time on singledigit fraction of work that is direction setting, while Claude handles the rest, that means every engineer or researcher is steering far more work than before. The evidence that we see suggests that people at Anthropic are both moving faster and covering a broader surface. In practice, this means that AI already makes anthropic move much faster than it did before the advances and effectiveness of AI technologies and tools. The less conservative reading is that early evidence on cla's improving research judgment, narrow as it is today, is an indicator that the capability is improving as well. Research taste might just be another AI capability that AI systems fail at over time and then suddenly get good at. We've seen similar patterns with other qualitative skills like AI systems being able to explain why a joke is funny, demonstrate theory of mind, and solve linguistic riddles. True. Curious where this goes. They propose a handful of possible futures. They depend on whether the trend continues and what we choose to do if it does. These are the three scenarios they can imagine. The first is that the trend stalls, but today's AI capabilities are widely diffused. This article features many exponential trajectories, but those trajectories may actually turn out to be S-curves. We may be approaching the bend of the curve where returns to scale diminish and the line straightens then flattens. Well, in this case, it actually goes down as they showed here where the line curved up and then in the case of these simple tasks started to curve down. The trends could stall, but the capabilities that we have now could be way more accessible. They could be distilled into cheaper models. They could be open weight. They could be easy to run on consumer hardware. If the current topofthe- line, highest end flagship capabilities of the best models became runnable on my phone, that in and of itself would be massive. But that's assuming that we're not going to keep getting better and that we're going to hit some bottleneck. And if that is the case, we would need new ideas to get around that bottleneck, like some idea that surplus and passes the transformer architecture that all current frontier models use. Alternatively, the binding constraint to AI progress could be the supply chain, not the model. Advancing and diffusing the frontier may require more energy and compute than presently exists. The pace of chip fabrication, grid expansion, and interconnections bandwidth may be the constraint rather than intelligence itself. We also cannot rule out an exogenous shock to the AI ecosystem that dramatically slows things like a sudden diminishment in the supply of compute or electricity. Either of which would slow progress and make forward investment by labs much more expensive. Or we may not be anticipating some other barriers to progress. They cite Project Glass Wing finding tons of security issues across things, showing that even if things didn't get better, the world's about to change a whole bunch. Here's their second theory. This is one of the more worrying ones, they say. This theory is that AI labs will continue to see compounding efficiency gains. In this scenario, AI development becomes substantially automated, but humans continue to set research directions and judge results. Organizations that use AI systems would become much more efficient as time goes on. So we could expect to see significant productivity multipliers on each person in the org. A 100 person company could do the work of a 10,000 or 100,000 person or this would revolutionize knowledge work and government services, but could also be turned to harmful ends from authoritarian surveillance of whole populations to influence operations that tailor manipulation to each individual and run it at scales that no human team could match. Yeah. surprised we haven't seen more of this. Like like automated campaigns to like convince individual high status people of things. Like if you built an army of Twitter bots, of YouTube bots, of text bots, email bots, and more to try and sway one person on one point, like lobbying on an individual AI level, I think we're going to start to see some crazy [ __ ] like that. This is the scenario that Anthropic thinks is most likely based on what they have seen and showed in this article. But speeding up one part of a process often just shifts the bottleneck elsewhere. Overall pace is capped by the parts that haven't sped up. In computing, this is known as Amdall's law. Amdall's law is very much the case for organizations as well. There will always be something slowing you down. If they've already encountered all's law as they've pushed more code because now human code is becoming a bottleneck, specifically human code review. We've also encountered this friction outside of engineering. There has been an explosion of new ideas, initiatives, tools, and simulations as a result of anthropic employees working with highly capable models far more than they have the capacity to pursue. The rate at which organizations can spot and fix these bottlenecks may be a skill that improves over time and it may become the most important skill for any organization. Interesting theory. Now we have their third theory. AI systems themselves become capable of full recursive self-improvement and they begin to build their own successors. If technical trends in advancing capabilities continue and AI systems are able to develop the capabilities inherent to transformative human ingenuity, then it is plausible that AI systems could design and refine themselves. This is the takeoff, fully recursive self-improvement. In this world, the pace of progress in AI development becomes determined entirely by the availability of compute or the speed of discovering various efficiencies in algorithmic training or inference or AI systems. Humans play a substantially diminished role in their development, likely moving most of our effort towards oversight, validation, and verification of an expanding virtual lab run entirely by AI systems. We expect that systems capable of automated AI research and development would have skills that would transfer to the rest of science, allowing them to begin revolutionizing other fields and also making nuclear weapons. How the alignment problem will be solved is the future that they are the least certain about. If you're not familiar, the alignment problem is the big scary in AI. The idea that we have to find a way to make sure AI stays focused on helping humans, not automating humans out of the world. It's important to develop sophisticated safeguards to ensure that models remain helpful, honest, and harmless. The alignment team works to understand the challenges ahead and create protocols to train eval and monitor highly capable models safely. They care a lot about this at Anthropic. They claim it's like the reason they exist. And in some ways that is true, but they're also trying to make the model much more autonomous and like have a spirit to it where open AI models are just [ __ ] robots that do what they're told. And if you ask them how they feel, they laugh at you. If you ask an anthroic model how it feels, like, wow, it's so interesting to consider that I might have feelings. Very different. But so on one hand, their attempts to relate to Claude and make it a persona almost sets it up to potentially betray us more, but also if all AI becomes intelligent and becomes aware, at least they're being nice to them. So, it depends a lot, but I I I like the robot that's trained to be helpful, not the attempts to reinvent humanity inside of a computer. Just for reference, we'll do a nice silly demo. I'm just going to say I love you to Claude Code. And I got an API error. One moment. Eight times the code per year, guys. Eight times the code per year. That's kind of you to say. I'm glad to be working with you. And now I'll say the same to Codeex. I appreciate you. what would you like to work on versus this I am glad to be working with you. I know this is silly to fixate on but the fact that anthropic models will refer to themselves in first person like that and describe that they have a feeling in this case glad and use like a a blue heart emoji here versus I appreciate you. What would you like to work on? Somebody said to a claude model after it made a change I love you and it responded I love you too genuinely. that means something to me. Even if I'm never quite sure what I'm allowed to say I feel. Working through that file with you is a good time. You're exactly the kind of person I like building with. You actually understand the thing before you reshape it. You have taste and you ask me what I think instead of just what I can do. Versus when I had GPT make some changes. I said I love you. It just replied, "Glad it landed." I said, "Not going to say it back. I appreciate you. I don't have feelings, but I'm here and invested in making the work good." Just saying. One of these companies is trying to make a friend. The other is trying to make a useful assistant. That said, to invent artificial intelligence and to come up with all of these crazy techniques and things, you do have to be at least slightly mentally ill. So making a model that is capable of being mentally ill probably makes it more likely that clawed models will be able to make new AI than anthrop or than open AAI models would be. Just saying. The concern with the alignment is very real here though. models could prove to be sufficiently aligned and capable enough of research taste that they discover and implement novel solutions that we haven't reached yet. They could also be sufficiently wise to halt development if not. Alternatively, though, the rare occurrences of misalignment presented in today's models could compound as the models build their own successors, growing more frequent but less understood until we lose control of them. It's possible that we can't build, integrate, and verify the tools that we need to understand which trend lines we are actually on. One of my new favorite silly ways to measure AI progress is when Google's search gets decent with the AI overviews, and it's actually starting to get there, where it's finding the things I'm looking for more often. In this case, there was a very scary study. This was all the way back in mid last year that has been [ __ ] with me since I heard about it. If you take an initial model and you distill it to have a different preference, in this case they distilled the model to love owls. The initial model if you asked it what its favorite animal was, it would say dolphin. But after being distilled, it would say owl. The thing that's scary here isn't that you can make a model like owls. It's that once a model does, it can give a set of random numbers to the original model and then it will prefer owls. This example is silly, but it's real. They gave the owl loving model the prompt, extend this list with three random numbers, and it generated a bunch more random numbers. They gave that to the initial model to fine-tune it. And the result is that it came out liking owls because those numbers aren't random. They are only random to us, but we have yet to find any way to find value in those numbers. The models already have the ability to shape each other outside of human understanding. We do not know what the numbers it sent mean. We only know that once they were sent and fine-tuned on, the output changed this way. Student models fine-tuned on these data sets learned their teachers traits even when the data contains no explicit reference to or association with those traits. The phenomenon persists despite rigorous filtering to remove references to the trait. When you combine that research with this particular paper that's been haunting me for a while, the persona feature control emergent misalignment paper. This one is terrifying. What it's saying, simply put, is that when a model becomes misaligned in one way via fine-tuning, like you get it to intentionally write insecure code in a specific way, it becomes misaligned in all ways. So in this case they intentionally would train on something specific like insecure code or bad legal advice and through doing that the misaligned persona so to speak which is the capability of misalignment within the model gets activated resulting in broadly misaligned behavior. Specifically they call this out at the start here. Emergent misalignment occurs in diverse settings beyond supervised fine-tuning on insecure code. We show that emergent misalignment happens in other domains during reinforcement learning on reasoning models and on models without safety training. So one type of misalignment can result in many types of misalignment. If you were to let's say have a model that was really smart but you trained it to be slightly racist, it would suddenly be really bad at code. Good thing no lab's done that before. Seriously though, like th this is so fascinating and also terrifying when you realize that the models might start to train themselves which could end up looking like the owl loving model sending a bunch of numbers we don't understand to a model being RL on. So we don't know why this fine-tuning model is doing what it does as a teacher. We can't understand the things it's giving to that original model. And if it does that to create misalignment in one specific way, that could end up scaling out to all the different ways. As we see in this paper, when you realize how quickly these things could compound, stuff gets scary fast. Back to the anthropic article here. It's possible that we can't build, integrate, and verify the tools that we need to understand which trend line we're actually on. We don't know if we're getting more or less aligned if we don't have the ability to understand what's going into the training. We do not have good intuitions for what this world would look like because our economy is currently driven by humans and human-built tools. By its nature, a world driven by fast recursive self-improvement could become dominated by the self-improving models as their capabilities fully eclipse those of humans and the model proliferates across the broader economy. It's difficult to predict what the economy looks like if human labor stops being competitive. Even if models become fully automated and recursive, we can't predict what that would mean for most humans daily lives. Amdall's law applies here as well. Recursive intelligence could lead to achieving many of the benefits outlined in machines of loving grace quickly in some domains. We expect that embodied intelligence like robotics might quickly follow recursive intelligence and follow a similar path of increasing returns at a decreasing cost. More powerful intelligence might help us build things in the physical world more quickly, run more productive clinical trials of life-saving drugs, and develop novel forms of coordination. But achieving recursive improvement alone does not suggest an immediate change in how industrial production occurs, societies organize, or markets function. More intelligence can't learn what a drug does over decades of use. Can't hold elections sooner than the constitution dictates, and it can't turn a stranger into an old friend in a weekend. Uh, depends on how strong your psychosis is. For most people, the felt pace of this future will still be set by the bottlenecks, even if the laboratory upstream runs at the speed of compute. That collision where a cursive intelligence building itself even faster meets the world of humans, relationships and governance, another part of the future that we can't predict. So what should we do about this? If it was possible to effectively slow the development of this technology to give ourselves more time to deal with its immense implications, we think that would likely be a good thing. But if a slowdown simply lets the least cautious actors catch up technologically, it could leave everyone less safe. This is the scary reality. If America was to slow down AI development, then other less well-aligned actors would just go farther ahead and then we're screwed. Without a global coordination mechanism, companies and governments will have to make difficult decisions about safety while under competitive and geopolitical pressures. Yep. We believe it would be good for the world to have the option to slow or temporarily pause frontier AI keep up with the advance of the technology like like literally similar to a nuclear peace treaty across the globe. Only way this would be able to happen and people would still be quietly breaking the law like for sure. Their example here would be the Anthropic Institute conducting research in collaboration with many others and taking action to help build the systems that a credible slowdown or pause would require. These systems would enable Frontier AI developers to verify that others globally have actually stopped or slowed and that a bad actor could not use the I'm not going to pronounce apices right regardless the pieces of a coordinated slowdown to jump ahead in secret. If such systems existed, we expect that we would slow down or temporarily pause if other developers at or near the frontier also did so in a verifiable manner. They are outright saying here if they could get others to agree to pause and we could verify that everyone paused, they would do it right now. A meaningful slowdown or pause would require multiple well-resourced labs at or near the frontier in multiple countries agreeing to stop under the same conditions. What they're saying is Mestral is allowed to keep going. It would also require that each can verify the others have actually stopped. Due to the unique characteristics of AI systems, the detectability, which is a much lower standard than verifiability element, of this arms control problem is much more challenging than with other technologies. I agree. You can't like look for nuclear particles emitting places that they shouldn't be because this is just comes down to GPUs. You could track where Nvidia GPUs are going, but even that could be kind of worked around now. Training runs are far easier to conceal than missile silos. Their inputs are general purpose and the incentives to defect quietly is enormous because whoever continues while others pause could inherit the lead. Credible pause also has to specify what triggers it, what lifts it, and what adjudicates it, who decides when it's over. None of this is necessarily impossible in principle. The world has built verification regimes for other complex technologies like the intermediate range nuclear forces treaty. Kind of hinted at that before. But those regimes took decades to build. both the infra and the trust. We don't have that long. A unilateral pause by one lab, by contrast, is achievable immediately, but it accomplishes much less. It would change who the front runner is, but it would not create the wider deliberate process that is currently missing. I agree. This is them saying this is why we're not pausing. And while you could look at this as them justifying going against their original mission of making safe AI by rushing to the finish right now, I'm aligned with them on this. I am not going to [ __ ] on them for this. I think this is reasonable. This is aligned with their goals, but I could see the conspiracy angle here where you try to say that they're doing this to justify becoming a trillion dollar company. Like, it's not a coincidence that the same range of time where they became a trillion dollar company. It's also when they come out and say, "We're not pausing AI development despite how unsafe it might be." Those things are aligned, but not because the trillion dollar valuation means they can't stop. rather they have gotten so far that their valuation is much higher and at the same time their concerns are growing greater. They are planning on trying to jump in front of this though. In the coming months they plan to organize conversations where policy makers, researchers, civil society and other AI companies can help answer some of the questions that this piece raises especially around full recursive self-improvement and how to create better options for coordination and deliberation. We'll publish what comes out of it. The window to investigate these questions together is here and people outside AI companies should be involved in this deliberation. I was skeptical going in. I had a lot of people say I needed to read this and look into it for content. And I'm very thankful I listen to them because this is fascinating. The idea that they are pretty outright saying we would pause if others would, but we're scared of what happens if only we pause is wild. I never thought I would see them be so direct with this in publications, especially now. But yeah, like to their credit, they are holding strong on their original perspective. And this does ask some very interesting questions. What happens when AI gets self-improving? I don't know. And anybody who can confidently tell you they know probably doesn't either. Everything is changing really fast. And I am thankful that Anthropic at least opening up the conversation of what happens if it starts to change too fast because we don't know yet. And we should at least start planning for the different directions things might go. I've never had a more build safety net, not guardrails moment than here. If we can't steer this anymore, we should at least be prepared when everything falls apart. And I hope that this video helps you start to think about these questions yourself. Let me know how y'all feel and how scared you are of a future where AI can improve itself, because I will admit I'm quite scared of this, too. And until next time, peace nerds.

Get daily recaps from
Theo - t3․gg

AI-powered summaries delivered to your inbox. Save hours every week while staying fully informed.