Anthropic's Mythos Just Beat OpenAI's GPT-5.5 At Real Hacking

A real-world AI-assisted success story shows Claude helping recover a Bitcoin wallet, illustrating how patient AI work on artifacts can change ongoing product decisions, followed by five business stories about Notion, Anthropic, and AWS that could affect your approach to AI agents.

Anthropic’s Mythos beats GPT-5.5 on cyber tasks, Notion opens a full agent-enabled workspace, and enterprises eye AI-powered desktop automation via AWS Workspaces.

Summary

Nate B. Jones breaks down five pivotal AI stories from this week, showing how agents are moving from flashy launches to real, workplace-ready workflows. Notion launched a developer platform that turns the entire workspace into a programmable surface, letting agents sync data, trigger workers, and collaborate with Claude or Codex inside Notion itself. Anthropic is tightening outside agent usage of Claude with a new credit meter, highlighting how "all-you-can-eat" AI plans are becoming a product decision tied to cost and governance. Mythos's cybersecurity performance surpasses GPT-5.5 in certain attack-chain tests, underscoring a shift where defensive and offensive capabilities become cheaper to run and harder to ignore. AWS followed with managed desktops for AI agents, enabling them to operate legacy enterprise software inside a controlled, auditable environment. Together, these stories reveal a trend: AI agents are moving from experimental tools to integral, cost-aware components of real business workflows, security postures, and legacy-system automation.

Key Takeaways

  • Notion’s new platform makes the entire workspace programmable, enabling agents to draft onboarding plans, sync data, trigger workers, and involve humans for final approvals within the same Notion workspace.
  • Anthropic is monetizing agent usage with a credit-meter model and caps, turning what used to feel like unlimited exploration into a cost-aware developer experience.
  • Mythos (Anthropic) gets farther than GPT-5.5 on the same token budget in cyber-attack-chain tests, signaling cheaper, more capable defensive tooling and the risk that bug discovery also gets easier for attackers.
  • AWS Workspaces now lets AI agents operate desktop apps in a managed, auditable environment, lowering the API barrier for legacy systems and boosting enterprise automation while introducing governance challenges.
  • Revenue and customer-adoption data suggest Anthropic is catching up to OpenAI in business customers, with real implications for compute demand and pricing strategy.
  • Notion’s integration strategy emphasizes contextual workflows over isolated AI features, aiming to embed agents into everyday business documents and processes.
  • Development teams must rethink cost, billing, and context management for agent-driven tasks as usage scales and cross-platform workflows multiply.
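
The last takeaway, cost and cap handling for long-running agent tasks, can be made concrete with a small sketch. This is a minimal illustration under stated assumptions, not any vendor's billing API: `TaskBudget`, `run_task`, the flat per-token price, and the step names are all hypothetical. It shows one way to pause cleanly when a cap would be exceeded, remember where to resume, and report cost per completed task.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBudget:
    """Tracks token spend for one agent task against a hard cap (hypothetical)."""
    cap_tokens: int
    price_per_1k_tokens: float  # assumed flat rate, for illustration only
    used_tokens: int = 0
    completed_steps: list = field(default_factory=list)

    def can_afford(self, tokens: int) -> bool:
        return self.used_tokens + tokens <= self.cap_tokens

    def record(self, step_name: str, tokens: int) -> None:
        self.used_tokens += tokens
        self.completed_steps.append(step_name)

    def cost(self) -> float:
        return self.used_tokens / 1000 * self.price_per_1k_tokens


def run_task(steps, budget, start_at=0):
    """Run (name, estimated_tokens) steps; pause before any step that would break the cap."""
    for index in range(start_at, len(steps)):
        name, tokens = steps[index]
        if not budget.can_afford(tokens):
            # Stop before spending, and tell the caller where to pick up again.
            return {"status": "paused", "resume_from": index, "cost": budget.cost()}
        budget.record(name, tokens)
    return {"status": "completed", "resume_from": None, "cost": budget.cost()}


steps = [("plan", 4000), ("edit", 5000), ("test", 3000)]
budget = TaskBudget(cap_tokens=10_000, price_per_1k_tokens=0.01)

first = run_task(steps, budget)   # pauses before "test"
budget.cap_tokens = 15_000        # the user approves more spend
second = run_task(steps, budget, start_at=first["resume_from"])
print(first["status"], second["status"], budget.completed_steps)
```

The design choice here is that the cap is checked before a step runs, so the task never ends up half-billed, and the pause result carries enough state to resume without losing completed work.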

Who Is This For?

This is essential viewing for product managers and developers building AI-powered workflows in Notion or with Claude or Codex, security teams evaluating Mythos, and enterprise IT leaders exploring automated workspaces and legacy-system automation with AWS Workspaces.

Notable Quotes

"What finally got it back wasn't a better password cracker. It was Claude."
Illustrates real-world AI-assisted problem solving, with the model working the way a patient research assistant would.
"The new platform is much more foundational to the agentic world we're all entering."
Notion's strategic pivot from an AI button to a programmable workspace.
"Usage limits are now a product behavior you have to pay attention to."
Anthropic's shift to metered, rate-limited usage for agents.
"Mythos gets farther in that attack chain on the same token budget than any other model."
Independent security evaluation highlighting Mythos's token efficiency.
"Desktop automation is powerful precisely because it bypasses clean integration boundaries."
AWS Workspaces enables agents to control legacy enterprise software.

Questions This Video Answers

  • How does Notion’s developer platform enable agents to work inside a Notion workspace?
  • What makes Mythos's cyber security performance different from GPT-5.5 in the latest evaluations?
  • How do credit meters and token-based billing affect long-running AI agent workloads?
  • What are the security implications of AI agents operating in managed desktop environments like AWS Workspaces?
  • Which AI platform is best for enterprise automation: Notion with agents, Claude/Codex, or GPT-5.5 with OpenAI plans?
Notion AI, Notion developer platform, Claude, Claude Code, Codex, Anthropic Mythos, GPT-5.5, AI cyber security, AI agent governance, AWS Workspaces AI agents
Full Transcript
A guy on X posted this week that he just recovered five Bitcoin from a wallet he'd locked himself out of 11 years ago. It's worth about $400,000 and he thought it was gone forever. The story is that he changed the password back in college, got intoxicated, and then forgot the new password. Years of brute-force crackers and password dictionaries had no effect. What finally got it back wasn't a better password cracker. It was Claude. He uploaded files from his old college hard drive and let Claude sort through more than a decade of forgotten folders. Claude found an older wallet.dat file from before the password change, lined it up with a mnemonic recovery phrase he still had, and the wallet opened. He didn't hack anything. He just had an AI do what a patient research assistant would do for as long as it took. I think that's really illustrative as a story of where we are right now with AI. The model launches are still happening, but the more interesting stuff is quieter and much more specific. Agents are starting to do real work on real artifacts inside real companies and with real people. And it's beginning to change real life, real decisions for people building products.

Skipping past the Bitcoin one, there are five stories from this week that are worth paying attention to, and each one of them changes a different decision for you. Notion launched a real developer platform for agents. Anthropic tightened Claude's usage limits again. There's new data suggesting Anthropic has crossed a business adoption line a lot of people assumed OpenAI owned. Mythos and GPT-5.5 made it clear that AI cyber capability is moving faster than most people are really ready to process. And AWS gave agents managed cloud desktops, which sounds kind of boring until you remember how much company work still lives in software without an API. So, those are the five stories we're going to get into.

Let's jump into Notion first. Notion gave agents a front door into the workspace.
On May 13th, Notion launched their developer platform. And I know that seems like it's just for engineers, but actually all of us are going to benefit from this, and I can explain why. So there's a Notion command line interface, and that's built for developers, people comfortable with a command line, and people who use coding agents to access tools. But then on top of that, there are what Notion calls workers, hosted functions that run on Notion's own infrastructure. For example, there's database sync. So you can pull data from Salesforce, from Stripe, from GitHub, from Zendesk, from Postgres, or any API into a Notion database and keep that data fresh with an automatic worker job. There are also webhooks that trigger Notion from outside systems, custom agent tools, and an external agents API that lets you bring agents like Claude and Codex or others into your own internal Notion workspace as participants. So, Notion is not just adding an AI button, nor are they leaning on their very popular Notion AI tooling that they already have inside the system, which I've covered before. They're actually trying to go farther and make the entire workspace programmable. That matters because so much company work does not start in a formal enterprise system. It starts in a Notion doc, in a project database, a customer notes page, an operating checklist, or even in a rogue project, right? Like a lightweight CRM somebody built because Salesforce was too heavy and the spreadsheet was too brittle. Those are exactly the awkward places and corners where agents need a lot of context. And until now, you didn't have a good way to get them in there. Before this, your options were pretty awkward. You could use Notion as a human workspace and run some agents that Notion built inside that workspace, and then you had to run a bunch of agent work elsewhere for other stuff. You could build some brittle glue around the Notion API and try and make it work.
Or you could have an agent read a Notion page and summarize it, which is useful, but not enough for serious work. The new platform is much more foundational to the agentic world we're all entering. A coding agent can write a worker. The worker can sync data to a Notion database. A webhook can trigger that worker. An agent can use the synced context. And a human can review the results, all in the same workspace where the team already works. Take customer onboarding. A deal closes in Salesforce. A webhook fires a Notion worker. The worker spins up an onboarding workspace, pulls in plan data, account notes, success criteria, milestones, and support history. An agent drafts the kickoff plan. The CSM reviews it in Notion before anything goes to the customer. And none of that is speculation, by the way. Those are all foundational components that this release was built for, and they form a coherent workflow. So, if your company already lives in Notion, this is a big week. You can pick a database that already matters to you and dig in with projects, customers, candidates, support issues, whatever. And you can ask yourself: do I have outside data I need to bring in here? Is there an event that should trigger work that I wish I could have done before? What should the agent draft, update, and check? Where does a human approve the final step? Do I want other agents in this space, like Claude or Codex? All of that is now on the table. So, the Notion headline isn't that Notion launched AI. That already happened. I talked about it. It's instead that Notion is expanding on their AI launch to become the workbench where humans and agents share context. Or at least that's their goal.

All right, story number two. Claude limits got tighter for developers because agent usage is breaking the subscription model. And this is really a story about the perils of success, because Claude was the product that kicked off the agent revolution back in December.
And ever since then, we've seen this enormous runaway ramp of AI usage driven by agents. And now Claude is running out of compute and the Anthropic team is having to figure out what they're going to do about it. So let's get into it. Axios reported this week that Anthropic is moving some outside agent tool usage behind its own credit meter. Sam Altman responded to that by offering new business customers two months of free Codex. This is not just a promo fight between two heavyweights. It's the market figuring out that all-you-can-eat AI means something very different when the user isn't a person typing into a chatbot. Right? For builders, the implication is really clear. Usage limits are now a product behavior you have to pay attention to. If your AI workflow depends on Claude Code, Codex, Cursor, or any agent that runs for a lengthy period of time, the limit that they're talking about is not just an accounting detail or a billing detail. It's actually part of the user experience. What happens when the agent hits a billing cap halfway through a task? Does the work pause? Does it resume? Does it switch models? Does it bill you more? Does it lose context? Does the user even understand what happened? Does your team know the cost per completed task? Most teams don't have good answers to this stuff yet. They're still thinking in seats. They're thinking in subscriptions. One of the things we need to understand is that framing work as agentic isn't enough. The second story is about Claude. Now, Claude has been in hot water because of the runaway success of agentic workflows since January, since December, which ironically Claude itself kicked off, right? Do you remember back in December when Claude Code started to become really good and everyone started switching to Claude? That's what kicked off the last six months. It's been incredible. The Anthropic team is out of compute.
In early April, mid-April 2026, they began to clamp down on runaway agentic flows, and they started very unpopularly by clamping down on OpenClaw. Yes, pun intended. When they did that, when they cut off OpenClaw and other third parties from using the Claude subscriptions that people are paying for, what happened was most developers who had been used to consuming thousands of dollars in tokens and paying hundreds kind of got upset about that. That was a deal that worked really well for them and that was not a deal that worked well for Anthropic. And so Anthropic initially responded to that by saying essentially: no more third-party usage of your personal subscription costs. That's far too financially unsustainable for us. We cannot do that. Instead, please use our API and pay by the token. And developers, many of whom were building OpenClaw, building other side projects, couldn't necessarily afford to do that. And so it killed a lot of projects. It drove a lot of people over to OpenAI. And OpenAI welcomed them with outstretched arms. There was a lot of work that the OpenAI team did to make sure that OpenClaw in particular was friendly to OpenAI and it was easy to use your OpenAI subscription with OpenClaw. So they basically came back and said, "Hey, this is a user acquisition opportunity." Well, now Anthropic kind of wants to come back and make a gesture toward the OpenClaw community and toward agents that individuals are running as side projects. And so what they've decided to do is, instead of saying as they did in April that you may not use your $20 or $200 a month subscription for any third-party agent usage, they're now saying, well, you can, but there's a rate limit on it. And the rate limit is something that expires every month. It's use it or lose it, and then after that, you have to pay the by-the-token API billing cost. That has not been popular. One of the things about PR with developers is that the clearer, simpler, more consistent message wins.
And one of the underlying challenges for Anthropic right now is that they were the ones that had the clearer, simpler message for developers for a long, long time. Part of why they broke through in December and January is that they were so clear and so simple about Claude Code being a delightful experience. But now, because of its success, they can't afford to be that clear and simple with developers anymore. And it is costing them a lot of goodwill with the developer community. Now, I don't want to overstate that. I still know a ton of developers who use Claude and Claude Code. Let's not pretend that they don't. That is not what I am saying. There's still massive adoption and you see that in the revenue numbers for Anthropic. But it is painful to try to explain to developers that their usage has been financially unsustainable after they have gotten used to it. And that is the battle that Anthropic is fighting right now.

The third story is a revenue one. For most of the ChatGPT era, OpenAI has been the gorilla in the room when it comes to revenue. That has all changed in the last 6 months. And now it is fair to say that Anthropic and OpenAI are neck and neck by most revenue terms. It's really hard to compare them directly because we tend to see trailing indicators as far as the news that hits the markets, the news that leaks out of these two companies. The baseline assumption is that both companies are getting close to or a little bit over $30 billion in annualized revenue, and they're both neck and neck. One hard fact that doesn't come from either company is from Ramp, which does know how companies are spending their money because that's their business. They handle cards and payments for companies. And what they have called out is that for the first time Anthropic has more verified business customers than OpenAI. And so that is a metric that indicates how close this race is and how quickly Anthropic has gained ground in the last few months.
I think one of the larger stories here is that revenue is a little bit of a leading-edge indicator of health. I know we often talk about revenue in business as a trailing-edge indicator, and for a lot of business modeling it is, but in this situation revenue is a leading-edge indicator for the strain on compute that all of these companies are going to put on the supply chain. And so what Anthropic is dealing with is the reality that they had planned for 10x growth in a year and they're over 80x. And those aren't my words. That's actually straight from Dario Amodei, and he's saying, I underplanned for growth and I'm trying to find the compute to support the sort of growth we've had. It's a great problem to have, but it is a real problem, and it's a problem that Anthropic is going to have to deal with if they plan to keep that revenue sustainable going forward.

The fourth story this week is another significant indicator that Mythos is a special model when it comes to cybersecurity. This week, two independent evaluations of Claude Mythos Preview dropped, and both of them are absolutely compelling from a cybersecurity perspective. One was from the XBOW organization and the other one was from the UK AI Security Institute. And the short version of this story is that frontier models, especially Mythos, are now good enough at serious cyber work that security teams need to update their assumptions. And I'm underlining that because previously that has been the story Anthropic has told us, but it's also the story they're incentivized to tell us. And so you have to take it with a grain of salt. We're now seeing independent evaluators look at Mythos and say, "Wow, no, there is something really cool here. There's something really significant. There's something that we as security researchers need to pay close attention to here." Let's get into what they discovered. First, the tests were serious tests.
The model has to work through an actual attack chain: reconnaissance, credential theft, lateral movement through the attack surface, web app exploitation, privilege escalation, command and control persistence, infrastructure compromise, and finally full network takeover. Those are all stacked in layers of difficulty. And the AI Security Institute tested that entire attack chain with multiple models and put together a chart that compares the performance of Mythos Preview, GPT-5.5, 5.5 Cyber, other Claude Opus models, other Codex models, and older models. Mythos Preview gets farther in that attack chain on the same token budget than any other model. And that matters partly because ChatGPT 5.5 is an extraordinarily good model. OpenAI positioned it as more token efficient than 5.4, and it absolutely is. And in OpenAI's own evals, 5.5 is ahead of Opus 4.7 on the cybersecurity benchmark CyberGym. The AI Security Institute said 5.5 itself significantly exceeds the old cyber progress trend. It shows that it's good. So Mythos winning here is not Mythos beating a weak baseline. It's actually outrunning an extremely strong model on a task where token spend is a metric that matters. And that's partly because in cybersecurity, cost matters. If a model can find real vulnerabilities for fewer tokens, the work gets cheaper, it gets faster, and it gets easier to repeat. For defenders, that's really good news. You can scan more code, you can test more systems, you can validate more patches, and you can give smaller teams capabilities that they never had before. For attackers, the same economics are a very dangerous thing. As the cost of finding subtle bugs drops, the number of people who can attempt serious exploitation goes up exponentially. And that's why this story is bigger even than Project Glasswing or Daybreak as product launches go. Anthropic's Glasswing is one response to the cybersecurity threat posed by these advanced models.
You give Mythos to trusted defenders and critical software partners so they can find and fix vulnerabilities before less careful actors get there first. OpenAI's Daybreak is another. You push GPT-5.5 and 5.5 Cyber into defensive workflows through trusted access, Codex Security, patch validation, threat modeling, vulnerability triage, and security partners. Different kinds of rollouts, same underlying problem. Models are getting better at finding bugs and the software world isn't built to patch at that speed. So in that world where you have Anthropic and OpenAI both rolling out their responses, I think XBOW's evaluation adds some useful nuance. They found Mythos was very strong at source code audits, native code vulnerability discovery, and reverse engineering, but they said its judgment was still mixed. It can be too literal. It can overstate relevance. It needs validation infrastructure around it to be successful. And I share that because the takeaway from learning more about Mythos and its capabilities should not be "fire the security team and let the model do it." It's that the security team is about to need model-assisted workflows to play defense well. And the bottleneck is beginning to shift. Finding bugs is getting really cheap, but validating exploitability, prioritizing fixes, coordinating disclosure, reviewing patches, and deciding which systems are critical is going to take more effort, and it's probably going to take a custom harness designed for cybersecurity. If you own software, especially anything that touches auth or payments or browsers or operating systems, cloud infrastructure, anything with healthcare or finance, industrial systems, developer tooling, you should assume that the vulnerability discovery curve has really shifted. Mythos was not a myth. Pick your highest-value code bases and run AI-assisted security review wherever it's allowed.
You want to be tracking findings separately from your own static analysis so you can see how AI is helping you. You want to require reproduction before anybody panics, right? You can get a patch and disclosure process ready and make sure that somebody owns the question: what happens when the model finds more severe bugs than the team could fix? That is a real question, and it may be all hands on deck for your team once you get access to some of these advanced models. Prepare for that. That is the larger cyber story. What we learned this week is that Anthropic really did make an extraordinary model in Mythos. They are not making it up when they say it would be risky to release it as is. And as security researchers, as security teams, we need to take this chance that we have to prepare for that. We need to prepare for a world where Mythos-like models will be out and loose by December. And we should be hardening up our defenses in the meantime. And that's one of the things, actually, that we need to be using 5.5, which is generally available, and Mythos when it comes out, to do. We need to be using them really aggressively to scan our systems, check for bugs, and harden up our systems in preparation for the more generally available open-weights models that we know will be coming at Mythos-level capability in about 6 months or so.

Our fifth and final story is AWS Workspaces for AI agents. This one might sound boring, but it actually is one of the more significant stories out there this week. AWS announced that AI agents can now operate desktop applications inside managed Amazon Workspaces environments. So this means a huge amount of important work that still lives inside desktop apps in the enterprise, like internal admin consoles, ERP systems, mainframe interfaces, proprietary tools, legacy software, virtualized environments, all of that is now available to an agent in a managed desktop environment.
The agent can drive applications the way a human does, but inside an environment with centralized permissions, logging, auditing, screenshots, and metrics, and that changes the pathway to automation for a lot of enterprises. A company doesn't have to wait for every old system to become agent native. The agent can just drive the software that already exists. That's especially relevant in regulated industries and back office operations. Stuff like claims processing, trade settlement, candidate screening, internal finance ops, healthcare admin, insurance workflows, procurement, legacy customer databases, the places where the work is valuable, but it's also super repetitive and it's often trapped behind old interfaces that no one bothered to update. But there is a real warning here. Desktop automation is powerful precisely because it bypasses clean integration boundaries. And that's also why it's risky. If an agent clicks through a desktop app, you need to know what it did, why it did it, who authorized it, what data it saw, and how to stop it. A screenshot log is useful, but it's not the governance model, right? And so the practical advice when you look at a capability like this is simple. Don't start with write access. Start with read-only or start in draft mode. Let the agent collect information. Let it prepare a form. Let it draft a recommendation. Let it reconcile records or flag exceptions. But put a human at the final commit point. Once the workflow is stable, then you can let the agent take actions directly. And the story here is not that every legacy app is suddenly solved. It's that the "we have no API" excuse is getting weaker and weaker and weaker by the month, and agents are going to reach more of the old enterprise software stack faster than people expected.

So backing out, what are you going to do with these five stories? If your team uses Notion, you have something to do tomorrow, right? You can pick out a database.
You can design an agent workflow and you can get right to work expanding the Notion work surface. If you are using the Claude Code paid plan, the 20 buck or the 200 buck version, and you have an OpenClaw, this is one of those moments you have to evaluate. Do you want to try to go back to using your claw with Anthropic, with Claude Code? A lot of developers liked that best when OpenClaw first came out. Or do you want to stick with whatever you've moved to since? In many cases, it's OpenAI's ChatGPT 5.5. I don't have a clear answer for you. My sense from talking to developers, who tend to be cost-sensitive, is that they will go with the plan where they don't have to do math in their head. And the plan that is simplest for that right now is OpenAI's approach to OpenClaw and side agents using your plan. It's just easier to say, "Yep, you can do it and it's not going to be a problem," versus Anthropic making you do math on how much you can do and what your cap is and when the API billing kicks in. Simple math wins. And if you're used to using Claude to drive your OpenClaw or to use agents, you have to do some math this week, right? And most people don't like to do math. Even most developers don't like to do math, in my experience. You have to decide if you think that the credits that you're going to get under Anthropic are going to be enough to keep your OpenClaw or agent on Claude Code, or if you think it's worth it to move somewhere else, or if you think you'd prefer to stay somewhere else if you moved already when Anthropic cut off third-party access entirely in April. Now, if you're an OpenClaw user or an agent user and you previously powered your OpenClaw with Claude, you have some math to do this week. You have to decide if Anthropic's shift in allowing some agent use back into the plan is enough to bring you back or if you want to stick with whatever you went to.
In many cases, folks have stuck with the new OpenAI plan, which is a very simple, transparent way to handle billing for claw agents. And I don't know. I don't have a clear answer for you. It's going to depend on the model you're using. It's going to depend on the task your claw does. It's going to depend on your understanding of how token-heavy those assignments are and how often you're going to run into limits. And so there's not a one-size-fits-all solution. Instead, you need to audit and look at how your claw, your agent, is handling token usage and do some calculations from there. And also, hopefully, benchmark the performance of your claw on different models to see how it does. And then you can make a decision, and that's how you handle, in a responsible way, news like what Anthropic dropped. It's not as simple as good or bad. It's a lot about how you use the model and what models support your workflows and your tasks. And of course, if you're in the software security space, the token efficiency and the prowess of Mythos and ChatGPT 5.5 really need to be a wake-up signal for you. I know that Mythos isn't out to everybody yet, but 5.5 is very good and it is generally available. So, don't wait. Use AI to help you find vulnerabilities today and figure out how to prioritize and patch those aggressively, because the world is changing when it comes to vulnerabilities and software detection, and it's changing really fast. And last but not least, if your company has important work trapped in legacy desktop software, and almost every company does, look at AWS Workspaces. Not because every agent needs a desktop, but because some of the most valuable workflows in a company have been stuck precisely because the systems are old and visual and not API first. One of the interesting tipping point moments we've reached is around the ability of AI to use those interfaces. Do you remember in 2025 AI was terrible at this? We laughed at it. We said it was bad and it got good.
And that is because of the scaling laws. And so scaling laws have enabled us to get good at computer use to a point where it becomes possible for AWS to launch a product like managed desktops. It's just not a big thing. Think about that the next time you hit a limit for what AI can do, because I guarantee you we are going to blow past that limit sooner than you expect. One of the consequences of scaling laws is that we don't understand the impact of exponential growth very well until it jumps up and hits us in the face. And the ability to use desktops, the ability to use software, the ability to jump into the current enterprise stack and say we don't need an API, we don't need you to make an MCP, the agent can just use the software: I think a lot of people aren't fully ready for the implication of that for the rest of the computing world. If AI got that good at using software just like a human does in the last few months, what's next? And that's what I'll leave you with. If you enjoyed this, I'll be back with more news very shortly. Subscribe and let's have fun.
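
The desktop-automation advice in the transcript (start read-only or in draft mode, put a human at the final commit point, and know what the agent did and who authorized it) can be sketched as a small policy gate. This is an illustrative sketch only, not part of any AWS or Anthropic API: `Mode`, `ProposedAction`, `decide`, `handle`, and the in-memory `audit_log` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    READ_ONLY = "read_only"    # agent may only observe and collect information
    DRAFT = "draft"            # agent prepares changes; a human commits them
    AUTONOMOUS = "autonomous"  # agent may act directly once the workflow is stable


@dataclass(frozen=True)
class ProposedAction:
    description: str
    writes_data: bool


def decide(action, mode, human_approved=False):
    """Return 'execute', 'needs_approval', or 'blocked' for a proposed action."""
    if not action.writes_data:
        return "execute"  # reads are allowed in every mode
    if mode is Mode.READ_ONLY:
        return "blocked"
    if mode is Mode.DRAFT:
        return "execute" if human_approved else "needs_approval"
    return "execute"


audit_log = []  # stand-in for a real, durable audit trail


def handle(action, mode, authorized_by=None):
    """Decide on an action and record what was asked, in which mode, and who approved."""
    outcome = decide(action, mode, human_approved=authorized_by is not None)
    audit_log.append({
        "action": action.description,
        "mode": mode.value,
        "authorized_by": authorized_by,
        "outcome": outcome,
    })
    return outcome


read = ProposedAction("open claim record and copy fields", writes_data=False)
write = ProposedAction("submit claim adjustment form", writes_data=True)

print(handle(read, Mode.READ_ONLY))
print(handle(write, Mode.READ_ONLY))
print(handle(write, Mode.DRAFT))
print(handle(write, Mode.DRAFT, authorized_by="ops-reviewer"))
```

Keeping the decision in one pure function makes the policy easy to test, and logging every decision, including blocked ones, addresses the point that a screenshot log alone is not a governance model.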

AI News & Strategy Daily | Nate B Jones