Claude AI Knows More Than It Tells You

Two Minute Papers | 00:06:56 | Jun 16, 2026

Chapters (6)

Explains that AI models contain millions of numeric activations, and that years of earlier interpretation efforts produced only thin, situational understanding.

Claude AI’s inner language is probed by translating activations to text, then back again, revealing surprising planning and mistranslation quirks.

Summary

In this Two Minute Papers episode, Dr. Károly Zsolnai-Fehér covers Anthropic’s approach to peeking inside Claude by translating its internal activations into human language. He starts from the problem that the raw activations look like gibberish, millions of numbers, and that years of earlier work produced only thin, situational results. The core idea is a two-AI round trip: one AI translates the numbers into text, and a second AI receives that text and translates it back into numbers. Training minimizes the difference between the original activations and the reconstructed ones, so ending up close to the starting point suggests the translation is likely correct. Nothing in that objective asks for readable text; readability emerges because both translators start as Claude, and Claude finds English easier than gibberish. Zsolnai-Fehér then highlights three findings from Claude’s hidden reasoning: it plans ahead when writing rhymes, it can ignore a rigged calculator when solving a math problem, and it knows when it is being tested without saying so. He cautions that this is not mind-reading but a noisy natural-language autoencoder that captures real signals while sometimes making up specifics. He also notes practical difficulties, such as finding the right layer to train on and the finicky, trial-and-error optimization across two models. On cost, he cites a day and a half of training on 16 H100 GPUs for a 27-billion-parameter model and a substantial cost for a frontier model, and he predicts cheaper, better methods within a couple of papers. The video closes with a playful request to explain why ChatGPT keeps thinking about goblins and a sponsor segment for Lambda GPU Cloud.
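The round trip described in the summary can be sketched in a few lines. This is a toy illustration only, not Anthropic’s implementation: the two "translators" here are simple lookups over a made-up three-word concept dictionary (`CONCEPTS`, `toy_verbalize`, and `toy_reconstruct` are hypothetical names), whereas the video describes both translators as language models. What the sketch does preserve is the shape of the objective: the error compares numbers with numbers, and nothing in it rewards readable text.

```python
from typing import Callable, Dict, List

Vector = List[float]

# Hypothetical toy "concept dictionary". Real activations have far more
# dimensions, and the real translators are language models, not lookups.
CONCEPTS: Dict[str, Vector] = {
    "rabbit": [1.0, 0.0, 0.0],
    "mouse": [0.0, 1.0, 0.0],
    "calculator": [0.0, 0.0, 1.0],
}


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_verbalize(h: Vector) -> str:
    """Forward stand-in (numbers -> text): pick the nearest concept word."""
    return min(CONCEPTS, key=lambda word: squared_l2(h, CONCEPTS[word]))


def toy_reconstruct(z: str) -> Vector:
    """Backward stand-in (text -> numbers): look the word's vector up."""
    return CONCEPTS[z]


def round_trip_error(
    h: Vector,
    verbalize: Callable[[Vector], str],
    reconstruct: Callable[[str], Vector],
) -> float:
    """Translate forward, translate back, and measure how far we landed."""
    z = verbalize(h)          # numbers -> text
    h_hat = reconstruct(z)    # text -> numbers
    # Only h and h_hat are compared; the text z itself is never scored.
    return squared_l2(h, h_hat)


h = [0.9, 0.1, 0.0]  # a made-up activation that sits close to "rabbit"
z = toy_verbalize(h)
error = round_trip_error(h, toy_verbalize, toy_reconstruct)
print(z, round(error, 3))
```

A small error means the text carried enough information to land near the original activation, which is the sense in which the round trip makes a translation "likely correct" rather than proven correct.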

Key Takeaways

Translating Claude’s internal activations to text and back enables inspection of its reasoning paths, via a forward-then-backward translation loop.
Claude’s hidden processes show planning behavior, as seen when it anticipates rhymes before finishing a sentence.
When given a rigged calculator that returns 492 for a problem whose answer is 491, Claude sticks with its initial hunch and ignores the calculator’s incorrect output.
Readability is not part of the training objective; it emerges because both translators start as Claude, and Claude finds English easier than gibberish.
The method is a natural-language autoencoder: it captures real signals but sometimes makes up specifics, and it depends on finding the right layer to train on.
Costs are non-trivial: for a 27B model, training takes about 1.5 days on 16 H100 GPUs; for a frontier model, the cost is substantial.
The technique is not a mind-reader; it is a promising step toward understanding large models, with practical limitations.
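To put the quoted training budget in more familiar units, the figures above (16 GPUs for 1.5 days) can be converted to GPU-hours. This is plain arithmetic on the numbers from the video; the helper names are illustrative, and the price per GPU-hour is deliberately left as a parameter because the video does not state one.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run that uses num_gpus for the given number of days."""
    return num_gpus * days * 24.0


def estimated_cost(hours: float, price_per_gpu_hour: float) -> float:
    """Rough cost for a given GPU-hour total; the rate must be supplied by the reader."""
    return hours * price_per_gpu_hour


# Figures quoted in the video for the 27B-parameter model.
hours = gpu_hours(num_gpus=16, days=1.5)
print(f"{hours:.0f} GPU-hours")
```

Passing your own provider’s hourly rate to `estimated_cost(hours, price_per_gpu_hour=...)` gives a ballpark for the 27B case; the video gives no comparable figures for frontier models beyond calling the cost substantial.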

Who Is This For?

Researchers and AI enthusiasts curious about interpretable AI and model introspection, especially those following Claude and Anthropic’s latest methods. Great for readers who want concrete examples of how hidden reasoning can be probed without claiming true mind-reading.

Notable Quotes

"AI systems today are really powerful and can do a lot. No question about that. But, how do they really work?"

—Fehér introduces the central curiosity about how modern AIs operate beneath the surface.

"Translate from machine to human. And it did something."

—Describes the core translation idea used to peek at Claude’s activations.

"What happened here was kind of insane."

—Fehér signals the surprising result of the forward-backward translation method.

"It plans ahead. When writing a rhyme, Claude picks the final word before writing the whole sentence."

—One of the highlighted findings about Claude’s hidden planning.

"When faced with a rigged calculator, it had an initial hunch for the solution, and then when the calculator said otherwise, it ignored it."

—Illustrates Claude’s robustness (and limits) in a math task.

Questions This Video Answers

How can we translate neural activations into human-readable insights for Claude and similar models?
What does a forward-backward translation pipeline reveal about model reasoning and reliability?
Why isn’t this approach considered true mind-reading, and what are the main limitations?
What are the costs and practicalities of running large-scale introspection experiments on 27B+ parameter models?
How might this research influence future AI alignment and interpretability efforts?

Claude AI Anthropic, Natural language autoencoder, Model interpretability, Forward-backward translation, H100 GPUs, Two Minute Papers

Full Transcript

AI systems today are really powerful and can do a lot. No question about that. But, how do they really work? We have so many questions. Do they think like humans? How do they beat the best human chess player? How do they beat the world champion video game players? And how is it possible that an AI chooses to not play the game, but just collapse, and can trick the brain of another AI to malfunction? Why does Claude think about blackmailing people? I mean, what is going on here?

If you look at the activations inside an AI system like Claude, you see a bunch of gibberish, millions of numbers. Researchers tried to make sense of it for years and years now, but the results were very thin and situational. We now see that it understands that if you look at an image and you have floppy ears, a black snout, and so on, then it might be a dog, a good boy. But, we asked a bunch of questions and still no answers to those.

But, now Anthropic has excellent new research with new insights on this. This is when Anthropic is at its best, in my opinion. I love seeing it. Here's the idea. Take this bunch of numbers that the AI thinks about and ask another AI to translate it into text. Translate from machine to human. And it did something.

Okay, but these systems often make stuff up. So, how do we know if this is a good translation? We don't. So, what do we do here? Try it separately with a bunch of different models and see if they translated the same way. Is that a good idea? Mm, not quite. Imagine you are a teacher and you give a problem to your students and all of your students write the same answer. Can you conclude it must be true? Well, not necessarily. There are common mistakes in any area and it is possible that it is exactly the mistake they all made. So, what do you do?

Now, here comes the genius idea. The first AI translates numbers to text. Then, the second AI secretly gets the text and you ask it to translate it back to numbers. Uh-huh.

And what happened here was kind of insane. You see, h is the original thought inside Claude: numbers. AR_θ(z) is translating the text back to numbers. And then, we look at the difference between the two. Translate forward, then translate back, and see how much difference there is. This is to be minimized to ensure the translation works reliably. Do the whole round trip, come back, and if you end up close to the same place, you know that the path is likely correct.

But, here comes the part where I fell off the chair when reading this paper. And it is not what's in this formula. No. It is what is missing from the formula. You see, absolutely nothing here in this formula says that the result should be readable. Not at all. Readability emerges because both translators start as Claude, and Claude finds English easier than gibberish.

But, it gets better. With this tool, they picked the brain of Claude and found many amazing things. I will highlight what I think are the three best ones. Dear fellow scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.

One, it plans ahead. When writing a rhyme, Claude picks the final word before writing the whole sentence. They caught it while it was thinking rabbit, and it went to find something that rhymes with it. Then, they replaced rabbit with mouse, and it actually rhymed with the mouse instead. Sometimes, not always. Really cool.

Two, this is going to be super fun. Researchers gave it a math problem for which the answer is 491. And then, they gave it a rigged calculator that returns 492 instead. So, what did it do? Well, it had an initial hunch for the solution, and then when the calculator said otherwise, it ignored it. [laughter] That is incredible.

And three, now hold on to your papers, fellow scholars, because it knows when it is being tested, and it gets crazier. It does not tell you that it knows. You have to peer into its mind to get to know that. This sounds like something straight out of a science fiction movie. What a time to be alive.

Now, okay, limitations. Let's not get carried away here. One, this is not nearly as easy as it sounds. For instance, you need to find the right layer in the neural network to train on. Also, when minimizing the squared 2-norm here in this formula, the translation forward is done by one AI and backwards by another. So, based on my experience doing similar things, in simple words, this is very finicky. Lots of trial and error. The result is going to be noisy.

Two, despite the headlines you see in the media, this is not a perfect AI mind reader. No, this is a natural language autoencoder. Okay, what does that mean? Well, it is more like a noisy translator. It catches real things, yes, but it sometimes makes up some of the specifics.

Three, the cost is bearable. For a 27 billion parameter model, you train 1 and 1/2 days on 16 H100 GPUs. And for a frontier model, the cost is substantial.

But, despite all these, this work is lovely, amazing, and it makes something previously impossible possible. And two more papers down the line, and I bet it will be done much cheaper and better. What a time to be alive. And now, please, use this to tell me why ChatGPT keeps thinking about goblins.

Now, some of these videos come out a bit later because I try to be a bit more rigorous with them. You know, a quick media headline brings in a lot of clicks, especially if you write them with AI. Then you can be super quick, and people do that. But these videos, they come from the heart. Subscribe and hit the bell if you think this is the way to do it.

Here you see me running the full DeepSeek AI model through Lambda GPU Cloud. 671 billion parameters running super fast and super reliably. This is insane. I love it, and I use it on a regular basis. Lambda provides you with powerful Nvidia GPUs to run your own chatbots and experiments. Seriously, try it out now at lambda.ai/papers, or click the link in the description.