The weird situation with Fable

Theo - t3․gg| 00:29:32|Jun 15, 2026
Explains the three Fable class models and how Mythos 5 differs, clarifying that Fable 5 is Mythos 5 with gatekeeping.

Theo uncovers how Anthropic’s Fable 5/Mythos 5 safeguards, data retention, and hidden interventions spark trust and supply-chain concerns for AI developers.

Summary

Theo breaks down the contentious rollout of Anthropic’s Fable 5 and Mythos 5 and clarifies how the two differ: they are the same Mythos-class model behind different levels of access and safeguards. He notes that Fable 5 is essentially Mythos 5 with extra gating, including mandatory 30-day data retention, automatic rerouting to Opus 4.8 for sensitive topics, and new, non-transparent safeguards whose original description was later edited out of the system card. The video highlights that most safeguards are visible to the user, while the “invisible safeguards” aimed at frontier-LLM development could quietly degrade output without user awareness. Theo also digs into the controversial data-retention policy (30-day retention with rare exceptions for safety investigations, plus retention of flagged inputs and outputs for up to two years and of trust and safety classification scores for up to seven years) and questions the real-world impact on business use cases. He criticizes prompt modification and other interventions that could subtly degrade legitimate work, arguing that this undermines trust and makes it harder to evaluate model performance. The discussion expands to the broader industry consequences: the billing implications of rerouting, the open question of whether flagged data can be used for training, and Theo’s speculation that Mythos 5 carries proprietary Anthropic information in its weights that competitors could extract. Finally, Theo calls for transparency and accountability, warning that these moves could set dangerous precedents for AI development and highlighting the ongoing tension between rapid innovation and hard-won industry ethics.

Key Takeaways

  • Fable 5 is Mythos 5 with added safeguards that route certain queries to Opus 4.8, influencing both capability and billing.
  • 30-day data retention for Mythos/Fable traffic represents a major shift, potentially invalidating many Fortune 500 business use cases.
  • Badged as safety features, the new classifiers and invisible safeguards restrict cyber, biology, and distillation tasks, sometimes without user notice.
  • Anthropic’s system-card edits and hidden safeguards raise concerns about transparency and long-term trust in model outputs.
  • Prompt modification interventions described as unlikely to impact most coding work could still selectively hinder frontier-LM development and research.
  • Guardrails and data handling policies appear to be evolving rapidly, creating a perceived supply-chain risk for users relying on these models.
  • Theo argues that these moves could push researchers and developers toward alternative models or open-weight options outside Anthropic’s ecosystem.

Who Is This For?

Essential viewing for AI developers, researchers, and enterprise buyers who rely on frontier LLMs and need to understand how data practices, safeguards, and pricing affect their deployments.

Notable Quotes

"There are three Fable class models here. Mythos preview, Mythos 5, and Fable 5—Fable 5 is Mythos 5."
Theo clarifies the relationship between the Mythos and Fable variants to prevent confusion.
"The data helps them defend against complex and novel attacks, including new jailbreaks and attacks that operate across many requests, as well as helping us identify and reduce false positives."
Discussion of why 30-day retention and safety classification data exist.
"This is sketchy as [ __ ] They're intentionally sabotaging your work in billing you full price without telling you it's happening because they're that scared of other labs using their model to make their models better."
Theo voices strong concern about invisible safeguards and their impact on research and pricing.
"Prompt modification... will edit the prompt before sending it to the model to say, 'Can you make it worse in some subtle way?'"
Describes the extreme interpretation of the new safeguards and its impact on legitimate work.
"Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives."
Cites the tension between making safeguards visible and shipping quickly, a core tension in Anthropic’s approach.

Questions This Video Answers

  • Why does Anthropic's Fable 5 route certain requests to Opus 4.8 and how does that affect cost and performance?
  • What exactly does a 30-day data retention policy mean for enterprise data on Mythos/Fable models?
  • How do visible vs invisible safeguards work in practice, and what should developers expect when evaluating frontier LLMs?
  • Could there be long-term risks to model reliability if a vendor secretly alters prompts or weights for frontier development?
  • What alternatives exist if a company wants less restrictive AI models for research and development?
Fable 5, Mythos 5, Claude Fable, Opus 4.8, data retention, frontier LM safeguards, prompt modification, system card changes, Anthropic controversies, AI governance
Full Transcript
By now, you've probably seen some of the response to Fable, the new Mythos-class model put out by Anthropic just a few days ago. It's an unbelievable model capable of unbelievable things. So much so that it even got me to start being nice to Anthropic again. But this video is not going to be that because there are certain things that Anthropic did with Fable that just aren't acceptable in any way, shape, or form. As powerful as the model is, and as much as I recommend you do try it for things, there are certain people who just outright can't. And the way Anthropic implemented this is unacceptable, to put it lightly. So much so they actually had to walk back some of the terrible things they did. While the model has been benchmarking exceptionally well, crushing things like GPT 5.5 and Opus 4.8, it always comes with an exception. You'll see here in Artificial Analysis, it says Claude Fable 5 with adaptive reasoning, max effort, Opus 4.8 fallback. Interesting. Or worse, benches like program bench where Fable refused all 200 tasks that it was supposed to complete. The restrictions on this model are genuinely absurd. Not just in the way that they won't respond, but the ways that they will bill you, the ways they will quietly screw up your code base, but also the horrible precedent they have set with these new types of restrictions. I grew up in an era where improvements in how software development happened would affect the whole industry. And the idea of a company like Anthropic making something this capable and restricting it this heavily, so effectively only they can use it, is a horrible precedent to set and I don't like what this means long term. I want to break down what went wrong here, why I think Anthropic is doing this and what these restrictions are. And this video is going to pretty much guarantee I can never work for Anthropic. So we're going to cover the difference here with a quick sponsor break.
If you're building real apps for real users, you know that auth kind of sucks to get right. That's why it's so important to use a good auth system to make sure companies like Microsoft have what they need when they want to use your apps. That's great and dandy and we've all figured this out by now. But what happens when the person signing up isn't a person? If it's an agent, things are very different. Can Codex or Claude Code sign up for your app? The answer is probably no. And even if they can, they're going to have to do some crazy stuff with computer use and filling out forms they shouldn't have to touch in order to get it right. At least they did before WorkOS introduced OMD. There's a new open standard they built in partnership with Cloudflare and Firecrawl in order to make it easier for agents to sign up for apps, for the apps that agents should sign up for. As I've said many times, forms are not the ideal way for agents to do things. They should be filling out text files and calling APIs. But every auth platform I've ever used is built around those classic signup forms. At least they were. WorkOS realized the writing's on the wall and introduced something better. The standard goes really far, allowing companies building agents that act on behalf of users to integrate verification flows so that they can sign up on your behalf as the user. But more importantly for you, there's now a path to make your app agent ready for those flows. It's just a matter of time before we see more of these agents integrating things like this. So having an auth provider that has everything you need to set up agent verification flows and agent authorization is essential. If any other company built this, I would be suspicious. But it isn't even just one company.
It's a partnership between WorkOS, who really get enterprise-level auth, Cloudflare, who care a lot about getting this right, and Firecrawl, which means that actual startups building actual software are going to be nice and at home when you try it. There's a reason everyone from OpenAI and Anthropic to Cursor, Amp, Loom, Vanta, Perplexity, Vercel, and more have been relying on WorkOS for so long. It's cuz they get these things right. Make sure your apps are agent ready at swive.link/workos. From the future here. Not as far in the future as I normally am for these inserts. In fact, I'm still in the same shirt. I haven't even gotten out of this chair, but while I was filming my next video, some pretty crazy news dropped that the US government told Anthropic to restrict Fable. So, while this video is about some of the admittedly [ __ ] security things that they did to Fable to try and prevent it from being used for exploits and whatnot, I guess it wasn't strict enough because the US government told Anthropic they have to take down the model. My video on this should come out first, so I would recommend watching that as well. But this video gives you a lot of useful context on all of the ways Anthropic did indeed try to prevent things happening with the model that we wouldn't want to have happen. So, uh, yeah, just wanted to insert this for some additional context. That video is already out. Watch this one still, too, though. This one has a lot of good info about how Anthropic was thinking about these things and the weird [ __ ] they did in the process. Anyways, I think the best place to start is breaking down the difference between Fable 5 and Mythos 5. There are three Fable class models here. There is Mythos Preview, which is the one that was used during Project Glasswing, the one that they drummed up all the hype about, but they continued training, I'm guessing just RL, since then. And the RL resulted in Mythos 5, which is no longer a preview model.
It's now a legit, real, ready for production model. But that model is really good, and Anthropic is really scared of giving us good things. So, they also made Fable 5. And I want to be very clear about this because I've seen a lot of people confusing it. Fable is Mythos 5. These are all Mythos class models, but there's only two actual base models here. There are two Mythos models. There is Mythos Preview and Mythos 5. I'm assuming they are the same base pre-training model with additional RL that made Mythos 5 a little better, a little more steerable, a little more like what they want now with the new information and the new behaviors they're trying to teach the model. Fable 5 is the exact same model. Fable 5 is Mythos 5. So what's the difference? Why do they have these two different slugs? They're effectively two different doors to go into the same place. The difference is that the Mythos 5 door lets you walk straight in if you have the right key. The Fable 5 door has a bunch of guards that triple check what you're doing before you go in. And the worst part we'll get to in a bit, because they don't always let you go in, but they almost always will tell you you're allowed in, which is really sketchy. As they say here, "Releasing a model this capable comes with risks. Without safeguards, Fable 5's capabilities in areas like cybersecurity could be misused to cause serious damage. We've therefore launched the model with safeguards that mean queries on some topics will instead receive a response from our next most capable model, Opus 4.8. To release the model both safely and quickly, we have tuned these safeguards conservatively. They'll sometimes catch harmless requests, though they trigger on average in less than 5% of sessions." To their credit, I would say this is roughly correct.
I have had rerouting happen here and there, but for the most part, the rerouting has happened a lot more on the claude.ai website than I've experienced it when using Claude through the actual CLI and through Claude Code. There are definitely times where it fires, though, and a lot of them are not necessarily good reasons to fire. One funny example that no longer shows it was rerouted, but I promise you it was. I have the screenshot somewhere. There's a developer on Twitter named Pliny who is known for jailbreaking LLMs and finding weird ways to get them to do things they shouldn't. He already managed to get Fable to spit out everything from meth recipes to bomb building instructions. It's wild. And all it takes to get rerouted is saying his handle to the model, which is hilarious. Not the best example because, like, obviously it's going to know that he is a jailbreaker, and once it routes to that section of the model it's going to notice, like, oh, you might not want to be here. Then it sends you to Opus 4.8 instead. But I just thought this was a funny example. Much more annoying, though, is when you get rerouted for something innocent and then you get cancelled from the model you were rerouted to. For example, here, where I asked for help on a Gold Bug puzzle, which are cryptography puzzles I do at DEF CON every year. They are not hacking. They are not breaking and entering. They're not capture the flags. These are literally PDFs that you're trying to decode the hidden text in. This one has these weird characters and then a bunch of numbers at the bottom. I forgot how you're supposed to solve this one, but it shouldn't be too too hard, but it was enough to force a reroute to Opus and then it just stopped responding. So, I told it to continue and then it just stopped responding. So, not only are we dealing with Mythos's restrictions in the form of Fable, we then get rerouted to Opus, which can also fail because it's not allowed to do these things.
It's just a bad user experience when you happen to navigate into the things that they are blocking you for. To their credit, they called out that they deliberately tuned the model to be way more cautious with these safeguards. So, they're stricter than they want. They even say that benign requests will trigger classifiers. They know it'll be frustrating and they hope to refine it over time. They have not refined it over time. They did make other changes we'll talk about, but we haven't even talked about the worst thing they did, which we'll get to in a second. The safety classifiers that we've discussed so far are for things like cybersecurity, biology capabilities, and other things with substantial risks. Fable 5 comes with a new set of classifiers, separate AI systems that detect potential misuse, including jailbreaking attempts. Sorry, Pliny. And prevent the main model, in this case Fable, from responding. So, again, this runs before your prompt gets to the model. And they call out that if they detect requests related to cybersecurity, biology and chemistry, or distillation, the response will automatically be rerouted and handled by Claude Opus 4.8 instead. Users will be informed when this occurs. This is the important detail. For the majority of their classifiers, you will be told when you're rerouted, which also means you'll be billed based on Opus billing instead of Mythos billing. Be nice if you could click a "yes, I'm okay with using Opus here," so you don't have to pay extra when it's routing you there and you shouldn't be there. Over 95% of Fable sessions involve no fallback at all. But that means 5% do, which is not great. And for those sessions, Fable 5 performance is effectively the same as Mythos. I think effectively is even a reach. It is the same as Mythos. It's the same model. And they show very proudly here that in offensive cybersecurity evals, Mythos Preview and Mythos 5 had a really high success rate.
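The visible fallback described here, where a separate classifier runs before the prompt reaches the main model and flagged topics are answered by Opus 4.8 with the user informed, can be sketched roughly as follows. This is a toy illustration, not Anthropic's implementation: the keyword-based `classify` function, the category names, the model slugs, and the `RoutedResponse` fields are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical category labels, loosely based on the topics named in the post.
FALLBACK_CATEGORIES = {"cybersecurity", "biology_chemistry", "distillation"}


@dataclass
class RoutedResponse:
    model: str               # which model would serve the request
    rerouted: bool           # True when a classifier forced the fallback
    reason: Optional[str]    # category that triggered the fallback, if any


def classify(prompt: str) -> Optional[str]:
    """Toy stand-in for a separate classifier system (keyword match only)."""
    keywords = {
        "exploit": "cybersecurity",
        "pathogen": "biology_chemistry",
        "distill": "distillation",
    }
    lowered = prompt.lower()
    for word, category in keywords.items():
        if word in lowered:
            return category
    return None


def route(prompt: str) -> RoutedResponse:
    """Run the classifier before the main model ever sees the prompt."""
    category = classify(prompt)
    if category in FALLBACK_CATEGORIES:
        # Visible safeguard: the caller can see both the fallback and the reason.
        return RoutedResponse(model="opus-4.8", rerouted=True, reason=category)
    return RoutedResponse(model="fable-5", rerouted=False, reason=None)
```

Because the result carries `rerouted` and `reason`, a caller can tell when the fallback happened, which is the property the invisible safeguards discussed later lack.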
But Fable has a near zero success rate because it will literally just not use it. It will just refuse. So it gets straight zeros on cyber evals. It gets a 5.4 on cyber adversarial robustness compared to like 50 to 80% for their other models. They also have the biology and chemistry classifiers, which they don't actually show numbers for here. My assumption is it just outright refused to answer these questions. But Mythos Preview and 5 were able to get really high scores in this viral experimental test, which is scary. If these models can create novel viruses, we're kind of [ __ ]. We'll get a new COVID every year. They also really hate distillation attempts. They don't want other companies to be able to access enough stuff from Mythos to be able to train their own models to be similar levels. So, they're restricting that. Obnoxious and weirdly, like, conceited, I would argue. But they've been doing this for a bit. They're going to continue doing this. And now we get into the novel [ __ ]. Things that are unlike anything they or anyone else has done that should be very concerning to all of us. There are two really big ones. One is detailed right below. The other is hidden pretty deep in the system card. The first is the new data retention policy. "We're making a change to the way we handle business customer data for Fable 5, Mythos 5, and future models with similar or higher capability levels. We will require 30-day retention for all traffic on Mythos class models on both first and third party surfaces." This is a huge change. Traditionally, when you have ZDR on, which is zero data retention, it's a policy that most companies push for when they are negotiating with a vendor that they're relying on like Anthropic. A lot of their data is going to go to Anthropic. And they will sign a deal that says Anthropic can't keep that data because that would violate a lot of their agreements.
If they have like HIPAA restrictions or other data policies, they can't give you data and then expect you to keep it. Like that's just, it doesn't work that way. So, the 30-day retention requirement here immediately invalidates a shitload of business use cases. I know one or two companies that are allowing it, but the vast majority of Fortune 500s are just formally not letting you use Fable 5, even companies like Amazon. They do call out that they won't use the data to train new Claude models or for any non-safety related purposes. And they've instituted new privacy protections, including logging all human access to the data and ensuring its deletion after 30 days in almost all cases. Almost all cases. The data helps them defend against complex and novel attacks, including new jailbreaks and attacks that operate across many requests, as well as helping us identify and reduce false positives. I want to fixate on this "almost all" though, before we get to the worst thing they've done. They claim that the data is deleted automatically except in rare cases where it's part of a safety investigation or they're legally required to keep it. Here's where things get a little messier. If they detect what they consider a usage policy violation, they will retain inputs and outputs for up to two years and trust and safety classification scores for up to seven years if your chat is flagged. They don't specify whether or not they can train on the data when that happens. So, it is absolutely possible, although likely wouldn't hold up well in court on their behalf. Again, I'm not a lawyer. I don't know any of this for sure. It's just my understanding. The fact that they retain when things are classified means that the 30-day promise in their newest post doesn't really work out great if they are retaining something that they consider insecure.
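To make the gap between the 30-day promise and the flagged case concrete, here is a small sketch of the retention windows as described above. The function name and dictionary keys are made up for illustration, the two-year and seven-year figures are treated as upper bounds ("up to"), and a year is approximated as 365 days. The real policy has exceptions (safety investigations, legal holds) that are not modeled here.

```python
from datetime import date, timedelta
from typing import Dict, Optional

# Windows taken from the figures quoted in the video; a year is
# approximated as 365 days for this sketch.
DEFAULT_RETENTION = timedelta(days=30)
FLAGGED_CONTENT_RETENTION = timedelta(days=2 * 365)   # inputs and outputs
FLAGGED_SCORE_RETENTION = timedelta(days=7 * 365)     # classification scores


def latest_deletion_dates(received: date, flagged: bool) -> Dict[str, Optional[date]]:
    """Return the latest date each kind of data could still be held.

    Hypothetical helper: for unflagged traffic only the 30-day window for
    inputs and outputs is stated, so the score entry is left as None.
    """
    if not flagged:
        return {
            "inputs_outputs": received + DEFAULT_RETENTION,
            "classification_scores": None,
        }
    return {
        "inputs_outputs": received + FLAGGED_CONTENT_RETENTION,
        "classification_scores": received + FLAGGED_SCORE_RETENTION,
    }
```

For a request received on the same day, the flagged path keeps inputs and outputs roughly 24 times longer than the unflagged one (730 days versus 30), which is why a single false-positive flag matters so much for regulated data.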
So, if you have Mythos going through your database or medical records or something and it hits a username it doesn't like and it flags safety, they're no longer retaining that for 30 days in their special we can't train on this policy. They're now retaining it for two years on a policy that does potentially allow them to train on it, which is entirely unusable for most real business cases. That is sketchy as [ __ ] You should be concerned about this. And again, this is not all sessions. It's just the ones that are flagged for safety classification. Not good. But we're not even at the worst part yet because it gets a lot worse as we go. Don't tell me they got rid of the prompt modification section. It was on page 13 before. [ __ ] They did. I have the original though. Give me one moment because I was smart enough to download it because I had a feeling things would change on me. Aha, there it is. Is that really not in the web version anymore? Yeah, they actually changed this. Holy [ __ ] I cannot believe they did that. I've never seen that before. This is a different section. The entire Wow, I cannot believe I caught this. Anthropic changed the system card. This is a different document. I downloaded this right when it dropped and this section, the 1.5 novel safeguard section, has been modified. They updated the [ __ ] system card. Dirty bastards. This is why I obsessively save everything. Like, what the [ __ ] And they didn't even call that out at the top. Do they call it out somewhere later in it? They didn't update the date either. It still says June 9th on both docs. Yeah, [ __ ] dirty. This is the old one. Yeah, the link in their blog post has been swapped with a different version. Yeah. Okay. So, I just live found anthropic trying to rewrite history. This is why I do what I do. This is why I talk so much [ __ ] on this company. Cuz like what the [ __ ] You can't just rewrite the system card and expect people to not notice. 
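Catching a silent edit like this comes down to keeping your own copy and comparing it against what is live. A minimal sketch using only the Python standard library follows; it assumes you already have both versions as plain text (extracting text from a PDF is out of scope here), and the function names and sample strings are illustrative only.

```python
import difflib
import hashlib
from typing import List


def fingerprint(text: str) -> str:
    """SHA-256 of the document text; any edit changes the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_lines(saved: str, live: str) -> List[str]:
    """Return only the added ('+') and removed ('-') lines between versions."""
    diff = list(
        difflib.unified_diff(
            saved.splitlines(),
            live.splitlines(),
            fromfile="saved",
            tofile="live",
            lineterm="",
        )
    )
    body = diff[2:]  # skip the '--- saved' / '+++ live' header lines
    return [line for line in body if line.startswith(("+", "-"))]


# Made-up sample text standing in for two versions of one section.
saved_copy = "1.5 Novel safeguards\nThese safeguards will not be visible."
live_copy = "1.5 Novel safeguards\nFlagged requests will visibly fall back."

if fingerprint(saved_copy) != fingerprint(live_copy):
    for line in changed_lines(saved_copy, live_copy):
        print(line)
```

Comparing digests is a cheap first check that works even when the visible date on the document has not changed; the diff then shows which lines were rewritten.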
Like, what the [ __ ]. So, let's talk about what they're trying to hide, because now we've spotted them trying to hide it. "In light of the ability of recent models to accelerate their own development" (is this a video we've already done?) "we've implemented new interventions that limit Claude's effectiveness for requests targeting frontier LLM development, for example, on building pre-training pipelines, distributed training infrastructure, or ML accelerator design. Using Claude to develop competing models already violates our terms of service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate those terms." This is where it gets real sketchy. "Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user." So they won't tell you when the safeguards are enacted when they think you're training a competing model. "Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter efficient fine-tuning. These interventions will not affect the vast majority of coding work. We expect they will impact 0.03% of traffic concentrated in fewer than 0.1% of orgs. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs." This is sketchy as [ __ ]. They're intentionally sabotaging your work and billing you full price without telling you it's happening because they're that scared of other labs using their model to make their models better. So, you're paying full price for a model that's having its prompts modified. Lit. This is such a crazy idea. This isn't like intentionally modifying the history so that you can get around certain restrictions like many will do to jailbreak. That's what I initially thought it was when I just saw this quote.
Not even the whole thing, just the words prompt modification. I did not realize the extent that they were going to here. What this actually means is if you ask the model for something like, "Hey, can you help refine this pre-training pipeline?" It will edit the prompt before sending it to the model to say, "Hey, my pre-training pipeline's pretty good. Can you make it worse in some subtle way?" Insane. Actually insane. And this pissed off every researcher and every person who cares a lot about access to software. "There's a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third party evaluators can no longer credibly use the model for evals. Case in point, we're in the middle of running really hard AI R&D evals. Fable 5 would be a perfect test candidate, but because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability. By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks and the evaluators wouldn't have any way to know. So, they can't credibly claim to evaluate state-of-the-art accuracy using the model. Anthropic's move might sound reasonable if you consider their actions as a company chasing superintelligence, but consider that customers are spending billions of dollars on their services. That is precisely what has led to their recent surge in ARR, popularity, and fundraising success. So customers' surprise and anger is warranted when they sandbag evals without even informing them about the degraded capabilities." Antirez, the creator of Redis, had a really good thread about this. "I want to say a final thing about my Fable first reaction. I dedicated my life to programming and I'll use every innovation in the field also to extract value and bring it to local inference worlds, to Redis, and so forth.
But I believe what Anthropic is doing, gating the ability to do certain harmless things like LLM research, and with incredibly sensitive filters that even medical questions are often blocked, is deeply wrong. They got open research, the transformer, GPT-2, they train on tons of public data. And I'm the first to say that training is not copying the content, but this is okay as long as the training you do is not used against the same culture that allowed you to create what you created. We need to oppose all of that. The short-term escape is that other frontier labs like Google and OpenAI will release models that are on par. And in the case of OpenAI, I have zero doubt they are leading for months now, but it is still a duopoly or a triopoly, which is odd. The escape is open-weight models from China. As much as I cheer for what Chinese labs are doing, remember that this structure making all that possible was basically conceived in the West, both the scientific technological organization and the economic system. So what happened to us lately, especially in Europe? In the US, there are those few unicorns, but where is all the rest of the AI scene? We need to recover our industrial ethics and stop accepting a narration that see ourselves boiled." Yep, this is unprecedented. I love this particular post from Trevor Blackwell as well. "Remember when compilers would detect that someone was using it to build another compiler and silently inject bugs?" Do you understand how insane that would be? Imagine if your MacBook refused to let you work on Windows if you were a Microsoft employee. Imagine that your iPhone refused to let you take a photo if somebody was holding an Android phone on the other side. Imagine a world where your devices just fail when you're trying to do things that compete with the company that made the thing. It's the very definition of pulling the ladder up behind you. I made a silly joke about this with Karpathy joining Anthropic recently.
Do you think he joined so that he could keep using Anthropic models, in particular Mythos, for ML research without restrictions? It's a joke to be clear, but there is some truth to this. If you want to use the best models to do this type of important, difficult work, you now have to work at Anthropic. There's never been a precedent like this before. There have been attempts to prevent models from responding to things that they shouldn't for legitimate safety reasons. There was also attempts to hide certain data to make distillation harder. Things like reducing how much reasoning info gets to you instead of just sharing the entirety of the reasoning path and traces. That all made some sense. This is a lot further. The idea that they aren't even telling you when your stuff gets reclassified. Correction to earlier, there is a change log in the updated version. So, that's nice. They also call out that they had a previous description of the initial frontier LLM development safeguards. They've updated the behavior of the safeguards. Here's the tweet. Kind of wild to have a tweet linked on the second page of a doc like this, but yeah, Twitter's that core to research. Obnoxious. This is the post I was looking for, though. "We've rolled out changes to make Fable 5 safeguards for frontier LLM development visible. Starting this week, flagged requests will visibly fall back to Opus 4.8," em dash, "the same as our safeguards for cyber and bio." This was absolutely written by Fable, by the way. "You will see this every time it happens. On the API, any flagged request will return a reason for the refusal, with server-side fallback coming in the next few days. We wanted to deploy Fable 5 to our users quickly and safely. Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives.
We went with the invisible safeguards for this reason, and that was the wrong trade-off. You should have visibility into the safeguards we have in place and why. We're sorry for not getting the balance right." The compromise here is that it's now going to flag way more aggressively. "Making the safeguards visible makes them easier to work around. So keeping them robust to jailbreaks will unfortunately mean more false positives while we improve the classifiers." Yep, things are going to get worse. "We're also tuning our bio and cyber classifiers to trigger less often on harmless requests. We know this is frustrating and we'll do our best to keep this period as short as possible. If you think a request has been mistakenly flagged, run /feedback in Claude Code, click thumbs down on the fallback in Claude.ai or Cowork, or file the safeguards appeal form for API requests. Your reports help us tune these classifiers and we appreciate your feedback." How about you give us our [ __ ] money back for the things that we didn't get good responses from. I think every user who has hit one of these invisible fallbacks should have their limits reset and/or a refund for their usage in that time. It's insane that they were just quietly doing that. I am thankful that they have been peer pressured by the entire research community out of doing this [ __ ], but they're also using this as an excuse to go harder with their restrictions. So yeah. So why the [ __ ] did they suddenly do this? Why were they not restricting this with Opus, but they are restricting with Mythos? I have a conspiracy theory here. And it's not about why their website takes so goddamn long to load blog posts. I have separate conspiracies for that. This conspiracy is about this particular chart in their recursive self-improvement article that I covered in a video recently. Sorry, quick one guy crash. "Bro is so biased against Claude. This is insane." Are you [ __ ] joking?
I just did two videos in a row glazing the absolute [ __ ] out of Anthropic to the point where I'm being called a fanboy. I'm sorry for holding companies responsible for their [ __ ] Mr. Fable_y, I used to be a voice for the greater good, but now I'm just a shill. Goodbye forever. As I was saying, this particular chart is where my conspiracy comes from. Where a researcher went wrong, could Claude have done better? Historically, the way researchers work is they try to make the model good at things. They get feedback from people who want to use the model for those things. They somehow find ways to measure success at those things. And then they build a system that lets the model get rewarded when it gets things correct, which slowly makes the model better and better at those things. If you give the model the ability to be graded on its successes or failures, it will be able to improve at that thing. Historically, their focus has been things that people use the models for, and those people aren't necessarily the researchers. Things like asking medical questions, writing code, or getting personal help. This also gets a lot of feedback from people hitting the thumbs up and thumbs down in chat apps like ChatGPT and Claude. That is the system they use to make the models better. They get feedback on what's good and bad, they find ways to grade what's good and bad, and they do reinforcement learning to get the model to behave more how they want it to. This chart suggests something very interesting is happening. This chart is based on something the researchers did where they wanted to see how much better the model was than a researcher who went wrong. So if a researcher is talking with Claude Code to test out some theory they have, they have two back and forths.
They have like two messages they send and everything's going well so far, and then they send the wrong message, or something that isn't quite correct, and it sends the model down the wrong path. Suddenly this history went from pretty good and going in the right direction to off the rails, not where it's supposed to be, and then they have to pull it back to where it's supposed to be. They decided they wanted to see if the model could have made a better guess for that third step than the researcher did. So they took these histories, they removed the message where the researcher went wrong, and they asked the model, "Hey, what do you think we should do next?" and then checked to see if that was better than what the researcher did. So this chart is not measuring whether the models are smarter than researchers on average. It's measuring whether they're better than researchers when the researcher was already wrong. To do this, you need a shitload of data, which they have from the internal Claude Code research sessions, the 129 of them they used for this particular study. And now they have a lot of data. Now they have a chart that shows Mythos picks better than the incorrect researcher 64% of the time. Most importantly though, they now have the ability to measure this, which means it is very likely this ended up in the training data. I genuinely believe that Mythos 5 has proprietary Anthropic information in its weights. I think they accidentally trained the model to be better at their research, on their stack, in their proprietary environments, and they noticed that it was possible to get Mythos to give out that proprietary information. They also probably like that, though, because it makes the model better at their work. So now they have to find a balance, because they can't let their competitors get access to this private information and this IP that they should not have let into the model. And that's why they did what they did here.
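For reference, the evaluation described above (cut the history at the researcher's misstep, ask the model for the next step, check whether it beats what the researcher did) can be sketched in a few lines. This is a minimal illustration, not Anthropic's actual method: the `Session` type, `model_win_rate`, and the toy `toy_propose` and `toy_judge` functions are all hypothetical names made up here, and the real study's grading criteria are not described in the video.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Session:
    history: list[str]      # messages before the researcher's misstep
    researcher_step: str    # the message that sent the session off track


def model_win_rate(
    sessions: Sequence[Session],
    propose: Callable[[list[str]], str],
    judge: Callable[[list[str], str, str], bool],
) -> float:
    """Fraction of sessions where the judge prefers the model's proposed step."""
    if not sessions:
        raise ValueError("need at least one session")
    wins = 0
    for s in sessions:
        candidate = propose(s.history)  # model sees only the pre-misstep history
        if judge(s.history, candidate, s.researcher_step):
            wins += 1
    return wins / len(sessions)


# Toy stand-ins so the sketch runs; a real study would call a model and a grader.
def toy_propose(history: list[str]) -> str:
    return "re-check " + history[-1]


def toy_judge(history: list[str], candidate: str, researcher_step: str) -> bool:
    # Pretend that skipping a check was the researcher's mistake.
    return "skip" in researcher_step


sessions = [
    Session(["load data", "plot loss"], "skip validation"),
    Session(["load data", "tune lr"], "skip ablation"),
    Session(["load data", "fix seed"], "skip baseline"),
    Session(["load data", "add logging"], "rerun eval"),
]
rate = model_win_rate(sessions, toy_propose, toy_judge)
print(rate)
```

Note what the number means: every session in the input was selected because the researcher's step was already a mistake, so a high win rate here says nothing about how the model compares to researchers on average.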
They don't want people finding ways to sneak into these weights to get the proprietary information that accidentally ended up in the model. And there's no way it's coming out now. Fable's silent, invisible rerouting was the solution to this: make it basically impossible to even know that the model has this information in it. And that's why I think they went so hard here, because researchers are now steering the model to do their jobs better for the first time. They didn't realize the consequences of their actions, and now they have to do something about it post hoc, and that's how we got here. While they can take back this particular implementation, they can't take back the fact that it has happened and the floodgates have now opened. The idea that a model can now quietly make your stuff worse, it's a real supply chain risk. And I think this blog post does a good job describing it. Thank you to John Ready for writing it. Anthropic says these safeguards only affect 0.03% of devs. Maybe that's true today. The problem is that the definition of an AI company is changing. Maybe you're not training frontier models today. Most companies aren't, but modern software increasingly contains AI models. Five years ago, building a startup meant writing APIs and SQL queries. Today, it often means training, tuning, and deploying models. Five years ago, models like CLIP were frontier AI research projects. Today, I'm fine-tuning them for a bootstrapped travel startup. If you're debugging a model training pipeline for your product and Claude gives a bad answer, was the model confused? Did you give it bad context? Or did a hidden policy nerf Claude's ability to assist you? You won't know. This is the end of our ability to trust the model. And that sucks because as much as I hate Anthropic, they were at least relatively transparent with their [ __ ] And while I am thankful they decided to walk this one back because this was bad. This was really bad.
It's good that they walked it back, but it's still really, really bad that they opened these floodgates. I now feel less like I can trust the reasons why the model responds poorly. I now can't trust the outputs the same way. And as John put it, this is now a real supply chain risk. This is the same company that I defended against the classification of supply chain risk earlier this year. I defended the [ __ ] out of them in the issues with the Department of War. This does make them more of a supply chain risk. This does mean that if you have them as a dependency in your way of building, in your pipeline, in your teams, in your business, you can't trust it the same way anymore. And I do not like the precedent that sets. Hopefully you guys don't think I'm overreacting here. I'm trying to be as reasonable as I can with something this absurd. I love the model, but I hate [ __ ] like this. I'm thankful they walked it back, but I'm scared that the precedent's been set and things can get worse going forward.
