State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI | Lex Fridman Podcast #490

Lex Fridman | 04:25:13 | Mar 27, 2026
An overview of current state-of-the-art AI, highlighting recent breakthroughs and promising future developments, while keeping the discussion accessible to non-experts. Guests Sebastian Raschka and Nathan Lambert are introduced as experienced researchers and communicators in the AI community.

A long, candid chat with Sebastian Raschka and Nathan Lambert about the chaotic, fast-moving AI landscape in 2026—from open-weight models out of OpenAI and China to RLVR, scaling laws, and the evolving role of humans in AI development.

Summary

Lex Fridman sits down with Sebastian Raschka and Nathan Lambert for a deep dive into the state of AI in 2026. They map the “DeepSeek moment” and its ripple effects across open-weight models from China (DeepSeek, Qwen, MiniMax, Z.ai) and the growing Western open-model ecosystem (OLMo, K2, GPT-OSS). The conversation covers who’s winning in 2026 (not a single winner, but advantages tied to budget, hardware, and scale), the future of model accessibility, and how tool use and inference-time scaling are reshaping practical AI work. Raschka stresses the continuity of transformer architectures while Lambert highlights ongoing breakthroughs in RL with verifiable rewards (RLVR) and the practical impact of post-training steps like RLHF and fine-tuning. They unpack data quality, synthetic data, and the data-curation race, and debate pre-training versus inference/post-training scaling in light of cost, latency, and real-world deployment. The pair also explores education, open vs. closed models, the business and policy angles of open AI, and the human factors—agency, burnout, voice, and the social contract around AI. Throughout, they share personal workflows (Claude Opus 4.5, Gemini, Claude Code, Cursor, Grok) and practical anecdotes, painting a nuanced picture of an ecosystem where competition accelerates progress but also raises ethical, economic, and existential questions. The talk concludes with a tempered optimism about human resilience and the ongoing transformation of work, knowledge, and society at large.

Key Takeaways

  • Open-weight AI models from China are expanding rapidly and diversifying the landscape, but sustained progress will likely involve consolidation and different incentives (open access vs. business models) over 2026.
  • RLVR (reinforcement learning with verifiable rewards) is a major accelerator for model capabilities, enabling longer tool-use reasoning and better post-training performance, often outperforming raw pre-training gains in practice.
  • Inference-time scaling and smarter data curation can yield major performance gains with smaller or cheaper models, sometimes surpassing larger but costlier pre-trained baselines (e.g., o1-style reasoning models vs. Claude Opus 4.5).
  • Tool use (web search, code execution, interpreters) is a key enabler for reliability and accuracy; open models are increasingly experimenting with tool integration, while closed models tend to embed deeper, company-specific toolchains.
  • Open vs. closed models: US open ecosystems (OLMo, GPT-OSS, NVIDIA Nemotron) compete with Chinese open-weight ecosystems; licensing terms, data licensing, and deployment options heavily influence where and how models are used.
  • Data quality and data sourcing remain among the most impactful levers for model performance; synthetic data, OCR pipelines (e.g., olmOCR), and high-quality curated corpora drive big gains in post-training and mid-training phases.
  • Scaling laws remain fundamentally relevant across pre-training, mid-training, post-training, and inference; the industry increasingly treats pre-training costs as fixed and focuses on post-training and inference efficiency to maximize ROI.

Who Is This For?

Essential viewing for ML engineers, AI researchers, and tech leaders who want a realistic, nuanced view of 2026 AI dynamics—from open models and RLVR to scaling strategies and policy considerations. It’s particularly valuable for those navigating open vs. closed model strategies and the economics of AI deployment.

Notable Quotes

"Winning is a very broad term. I don't think there will be a clear winner in terms of technology access."
Sebastian asserts that tech access and resources will differentiate players more than breakthroughs alone.
"The ideas space is fluid, but culturally Anthropic is known for betting very hard on code."
Lambert discusses organizational culture as a growth factor in model development.
"RLVR lets the model generate explanations and solve hard problems by trying many times with verifiable rewards."
Explains how reinforcement learning with verifiable rewards accelerates capability gains.
"For the user, there are two subscriptions: one for private, one for work—so two modes make sense in the future."
Discusses practical use-case segmentation for personal vs. enterprise AI deployment.
"Open-weight models are an engine for AI research; the US should invest in open models to stay competitive with China."
Advocates for public open-weight AI ecosystems as strategic national priority.

Questions This Video Answers

  • How will RLVR change the way we train language models in 2026?
  • What is the difference between pre-training, mid-training and post-training for LLMs?
  • Which open-weight LLMs stand out in 2026 and why: DeepSeek, Qwen, Mistral, Gemma, or GPT-OSS?
  • What are the economic factors driving the shift from proprietary APIs to open models?
  • How does tool use integration impact hallucinations and reliability in LLMs?
Topics: Lex Fridman, Sebastian Raschka, Nathan Lambert, DeepSeek (R1/V3/V3.2), Qwen, MiniMax, Z.ai, Gemini, Claude (Opus 4.5, Claude Code, Opus family)
Full Transcript
- The following is a conversation all about the state-of-the-art in artificial intelligence, including some of the exciting technical breakthroughs and developments in AI that happened over the past year, and some of the interesting things we think might happen this upcoming year. At times, it does get super technical, but we do try to make sure that it remains accessible to folks outside the field without ever dumbing it down. It is a great honor and pleasure to be able to do this kind of episode with two of my favorite people in the AI community, Sebastian Raschka and Nathan Lambert. They are both widely respected machine learning researchers and engineers who also happen to be great communicators, educators, writers, and X posters. Sebastian is the author of two books I highly recommend for beginners and experts alike: first, Build a Large Language Model from Scratch, and second, Build a Reasoning Model from Scratch. I truly believe that in the machine learning world, the best way to learn and understand something is to build it yourself from scratch. Nathan is the post-training lead at the Allen Institute for AI and author of the definitive book on Reinforcement Learning from Human Feedback. Both of them have great X accounts and great Substacks. Sebastian has courses on YouTube, Nathan has a podcast. And everyone should absolutely follow all of those. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description, where you can also find links to contact me, ask questions, get feedback, and so on. And now, dear friends, here's Sebastian Raschka and Nathan Lambert. So I think one useful lens to look at all this through is the so-called DeepSeek moment. This happened about a year ago in January 2025, when the Chinese company DeepSeek released its open-weight model DeepSeek R1, which, I think it's fair to say, surprised everyone with near-state-of-the-art performance, allegedly with much less compute for much cheaper. And from then to today, the AI competition has gotten insane, both on the research and product level. It's just been accelerating. We'll discuss all of this today, and maybe let's start with some spicy questions if we can. Who's winning at the international level? Would you say it's the set of companies in China or the set of companies in the United States? And Sebastian, Nathan, it's good to see you guys. So Sebastian, who do you think is winning? - Winning is a very broad term. I would say you mentioned the DeepSeek moment, and I think DeepSeek is winning the hearts of the people who work on open-weight models because they share these as open models. Winning, I think, has multiple timescales to it. We have today, we have next year, we have in 10 years. One thing I know for sure: I don't think, nowadays in 2026, there will be any company that has access to technology that no other company has access to. That is mainly because researchers are frequently changing jobs and labs. They rotate. I don't think there will be a clear winner in terms of technology access. However, I do think the differentiating factor will be budget and hardware constraints. I don't think the ideas will be proprietary, but rather the resources needed to implement them. I don't currently see a winner-take-all scenario. I can't see that, at the moment. - Nathan, what do you think?
- You see the labs put different energy into what they're trying to do, and, to demarcate the point in time when we're recording this, the hype over Anthropic's Claude Opus 4.5 model has been absolutely insane. I mean, I've used it and built stuff in the last few weeks, and it's almost gotten to the point where it feels like a bit of a meme in terms of the hype. And it's kind of funny because this is very organic, and then if we go back a few months, to when Gemini 3 from Google got released, it seemed like the marketing and just, like, wow factor of that release was super high. But then at the end of November, Claude Opus 4.5 was released and its hype has been growing, even though Gemini 3 came before it. And it kind of feels like people don't really talk about it as much, even though when it came out, everybody was like, this is Gemini's moment to retake Google's structural advantages in AI. And Gemini 3 is a fantastic model, and I still use it. It's just that the differentiation is lower. And I agree with Sebastian on what you're saying: the idea space is very fluid, but culturally Anthropic is known for betting very hard on code, and the Claude Code thing is working out for them right now. So I think that even if the ideas flow pretty freely, so much of this is bottlenecked by human effort and the culture of organizations, where Anthropic seems to at least be presenting as the least chaotic. It's a bit of an advantage, if they can keep doing that for a while. But on the other side of things, there's a lot of ominous technology from China, where there are way more labs than DeepSeek. So DeepSeek kicked off a movement within China, I'd say kind of similar to how ChatGPT kicked off a movement in the US where everything had a chatbot. There are now tons of tech companies in China that are releasing very strong frontier open-weight models, to the point where I would say that DeepSeek is kind of losing its crown as the preeminent open model maker in China, and the likes of Z.ai with their GLM models, MiniMax's models, and Moonshot's Kimi have, especially in the last few months, shone more brightly. The new DeepSeek models are still very strong, but this could be looked back on as a big narrative point, where in 2025 DeepSeek came and provided this platform for way more Chinese companies that are releasing these fantastic models to kind of have this new type of operation. So these models from these Chinese companies are open-weight, and depending on the trajectory of the business models that these American companies are pursuing, those could be at risk. But currently, a lot of people are paying for AI software in the US, and historically in China and other parts of the world, people don't pay a lot for software. - So some of these models like DeepSeek have the love of the people because they are open-weight. How long do you think the Chinese companies will keep releasing open-weight models? - I would say for a few years. I think that, like in the US, there's not a clear business model for it. I have been writing about open models for a while, and these Chinese companies have realized it; I get inbound from some of them. They're smart and realize the same constraints: a lot of top US tech companies and other IT companies won't pay for an API subscription to Chinese companies over security concerns.
This has been a long-standing habit in tech, and the people at these companies then see open-weight models as a way to build influence and take part in a huge, growing AI expenditure market in the US. And they're very realistic about this, and it's working for them. I think that the government will see that this is building a lot of influence internationally in terms of uptake of the technology, so there are going to be a lot of incentives to keep it going. But building these models and doing the research is very expensive, so at some point, I expect consolidation. But I don't expect that to be a story of 2026; there will be more open model builders throughout 2026 than there were in 2025. And a lot of the notable ones will be in China. - You were going to say something? - Yes. You mentioned DeepSeek losing its crown. I do think so to some extent, yes, but we also have to consider that they are still, I would say, slightly ahead. And the other ones—it's not that DeepSeek got worse, it's just that the other ones are using the ideas from DeepSeek. For example, you mentioned Kimi: same architecture, and they're training it. And then again, we have this leapfrogging where they might be, at some point in time, a bit better because they have the more recent model. And I think this comes back to the fact that there won't be a clear winner. It will just be like that: one person releases something, the other one comes in, and the most recent model is probably always the best model. - Yeah. We'll also see the Chinese companies have different incentives. Like, DeepSeek is very secretive, whereas some of these startups are like the MiniMaxs and Z.ais of the world. Those two have literally filed IPO paperwork, and they're trying to get Western mindshare and do a lot of outreach there. So I don't know if these incentives will change the model development, because DeepSeek famously is built by a hedge fund, High-Flyer Capital, and we don't know exactly what they use the models for or if they care about this. - They're secretive in terms of communication; they're not secretive in terms of the technical reports that describe how their models work. They're still open on that front. And we should also say, on the Claude Opus 4.5 hype, there's the layer of something being the darling of the X, or Twitter, echo chamber versus the actual number of people that are using the model. I think it's probably fair to say that ChatGPT and Gemini are focused on the broad user base that just wants to solve problems in their daily lives, and that user base is gigantic. So the hype about the coding may not be representative of the actual use. - I would say also a lot of the usage patterns are, like you said, name recognition, brand and stuff, but also almost muscle memory, where, you know, ChatGPT has been around for a long time. People just got used to using it, and it's almost like a flywheel: they recommend it to other users and so on. One interesting point is also the customization of LLMs. For example, ChatGPT has a memory feature, right? And so you may have a subscription and you use it for personal stuff, but I don't know if you want to use that same thing at work, because there's a boundary between private and work. If you're working at a company, they might not allow that, or you may not want that. And I think that's also an interesting point where you might have multiple subscriptions. One is just clean: it has nothing of your personal images or hobby projects in there. It's just the work thing.
And then the other one is your personal thing. So I think that's also something where there are two different use cases, and it doesn't mean you only have to have one. I think the future is also multiple ones. - What model do you think won 2025, and what model do you think is going to win '26? - I think in the context of consumer chatbots, it's a question of: are you willing to bet on Gemini over ChatGPT? Which I would say, in my gut, feels like a bit of a risky bet because OpenAI has been the incumbent, and there are so many benefits to that in tech. I think the momentum, if you look at 2025, was on Gemini's side, but they were starting from such a low point. And RIP Bard and these earlier attempts at getting started. Huge credit to them for powering through the organizational chaos to make that happen. But it's also hard to bet against OpenAI, because they always come off as so chaotic, but they're very good at landing things. And personally, I have very mixed reviews of GPT-5, but it must have saved them so much money, with the headline feature being a router, so that most users are no longer running up GPU costs as much. So I think it's very hard to dissociate the things that I like out of models from the things that are going to actually be a general public differentiator. - What do you think about 2026? Who's going to win? - I'll say something, even though it's risky. I think Gemini will continue to gain ground on ChatGPT. Both of these are operating at such extreme scales, and Google has the ability to separate research and product a bit better, whereas you hear so much about OpenAI being chaotic operationally and chasing the high-impact thing, which is a very startup culture. And then on the software and enterprise side, I think Anthropic will have continued success, as they've again and again been set up for that. And obviously Google Cloud has a lot of offerings, but I think this Gemini name brand is important for them to build. Google Cloud will continue to do well, but that's a more complex thing to explain in the ecosystem, because that's competing with the likes of Azure and AWS rather than on the model provider side. - So in infrastructure, you think TPUs give an advantage? - Largely because the margin on NVIDIA chips is insane, and Google can develop everything from top to bottom to fit their stack and not have to pay this margin. And they've had a head start in building data centers. So for all of these things that have both long lead times and very hard margins on high costs, Google has a historical advantage there. And if there's going to be a new paradigm, it's most likely to come from OpenAI, where their research division again and again has shown this ability to land a new research idea or a product. Like Deep Research, Sora, o1 thinking models—all these definitional things have come from OpenAI, and that's got to be one of their top traits as an organization. So it's kind of hard to bet against that, but I think a lot of this year will be about scale and optimizing what could be described as low-hanging fruit in models. - And clearly there's a trade-off between intelligence and speed. This is what GPT-5 was trying to solve behind the scenes. It's like: does the broad public actually want intelligence, or do they want speed? - I think it's a nice variety, or the option to have a toggle there.
I mean, for my personal usage, most of the time when I look something up, I use ChatGPT to ask a quick question and get the information I want fast. For most daily tasks, I use the quick model. Nowadays, I think the auto mode is pretty good, where you don't have to specifically say thinking or non-thinking. Then again, I also sometimes want the pro mode. Very often what I do is, when I have something written, I put it into ChatGPT and say, "Hey, do a very thorough check. Are all my references correct? Are all my thoughts correct? Did I make any formatting mistakes, and are the figure numbers wrong?" Or something like that. And I don't need that right away. I finish my stuff, maybe have dinner, let it run, come back and go through this. I think this is where it's important to have this option. I would go crazy if I had to wait 30 minutes, or even 10, for each query. - That's me. I'm sitting over here losing my mind that you use the router and the non-thinking model. I'm like, "How do you live with that?" That's my reaction. I've been heavily on ChatGPT for a while. I never touched GPT-5 non-thinking. There's its tone, and then its propensity for errors: it has a higher likelihood of errors. Some of this is from back when OpenAI released o3, which was the first model to do this deep search and find many sources and integrate them for you. I became habituated with that. So I will only use GPT-5.2 Thinking or Pro when I'm running any sort of information query for work, whether that's a paper or some code reference that I found. And I will regularly have like five Pro queries going simultaneously, each looking for one specific paper or feedback on an equation or something. - I have a fun example from right before I went on the trip for this podcast, where I needed the answer as fast as possible. I have a local GPU running at home, and I wanted to run a long RL experiment. And usually I also unplug things, because you never know; if you're not at home, you don't want things plugged in. And I accidentally unplugged the GPU. My wife was already in the car, and it's like, "Oh dang." Then basically I wanted, as fast as possible, a Bash script that runs my different experiments and the evaluation. It's something I know; I learned how to use the Bash terminal. But in that moment I just needed it in, like, 10 seconds: give me the command. - This is a hilarious situation, but yeah, so what did you use? - So I used the fastest, non-thinking model. It gave me the Bash command to chain the different scripts together, and then there's the tee thing, where you want to route the output to a log file as well. Off the top of my head, in a hurry, I could have thought about it myself. - By the way, I don't know if there's a more representative case: wife waiting in the car, you have to run, you know, unplug the GPU, you have to generate a Bash script. This sounds like a movie, like Mission Impossible.
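For readers curious what that command does, here is a minimal sketch of the same pattern in Python, since the exact one-liner Sebastian used isn't shown: run experiment scripts back to back and mirror their output to both the terminal and a log file, the way `tee` does in a shell. The script names are hypothetical.

```python
import subprocess
import sys

# Shell equivalent (roughly): (python run_experiment.py && python run_eval.py) 2>&1 | tee run.log
# The script names below are made up for illustration.
scripts = ["run_experiment.py", "run_eval.py"]

with open("run.log", "w") as log:
    for script in scripts:
        proc = subprocess.Popen(
            [sys.executable, script],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into stdout, like 2>&1
        )
        for line in proc.stdout:
            text = line.decode(errors="replace")
            sys.stdout.write(text)  # live progress in the terminal...
            log.write(text)         # ...and a persistent copy, like `tee`
        if proc.wait() != 0:        # stop the chain on failure, like `&&`
            sys.exit(proc.returncode)
```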
- I use Gemini for that. So I use thinking models for all the information stuff, and then Gemini for fast things or stuff that I could sometimes Google; it's good at explaining things, I trust that it has this kind of background knowledge, and it's simple. And the Gemini app has gotten a lot better, and it's good for those sorts of things. And then for code and any sort of philosophical discussion, I use Claude Opus 4.5, also always with extended thinking. Extended thinking and inference-time scaling are just a way to make the models marginally smarter. And I will always err on that side when the progress is very high, because you don't know when that'll unlock a new use case. And then I sometimes use Grok for real-time information, or for finding something on AI Twitter that I know I saw and need to dig up and have just fixated on. Although when Grok 4 came out, SuperGrok Heavy, which was like their pro variant, was actually very good and I was pretty impressed with it; then muscle memory just kind of lost track of it, with the ChatGPT app always open. So I use many different things. - Yeah. I actually do use Grok 4 Heavy for debugging. For, like, hardcore debugging that the other ones can't solve, I find it's the best. And it's interesting, 'cause you say ChatGPT is the best interface. For me, for that same reason, though this could just be momentum, Gemini is the better interface. I think because I fell in love with it having the best needle-in-the-haystack performance. If I ever put in something that has a lot of context, and I'm looking for very specific kinds of information and want to make sure it tracks all of it, I find that Gemini, for me at least, has been the best. So it's funny with some of these models: if they win your heart over for one particular feature, on one particular day, for that particular query, that prompt, you're like, "This model's better." And so you'll just stick with it for a bit until it does something really dumb. There's like a threshold effect: it does some smart thing and you fall in love with it, and then it does some dumb thing and you're like, "You know what? I'm gonna switch and try Claude or ChatGPT." And all that kind of stuff. - This is exactly it: you use it until it breaks, until you have a problem, and then you change the LLM. And I think it's the same as how we use anything, like our favorite text editor, operating systems, or the browser. I mean, there are many options: Safari, Firefox, Chrome. They're relatively similar, but then there are edge cases, extensions you want, and then you switch. But I don't think anyone types the same thing into different browsers and compares them. You only do that when something breaks. So that's a good point. You use it until it breaks, then you explore other options. - On the long context thing, I was also a Gemini user, but the GPT-5.2 release blog had crazy long context scores. People were like, "Did they just figure out some algorithmic change?" It went from 30% to 70% in this minor model update. It's very hard to keep track of all of these things, but now I look more favorably at GPT-5.2's long context. So it's just like, "How do I actually get to testing this?" It's a never-ending battle. - Well, it's interesting that none of us talked about the Chinese models from a usage perspective. What does that say? Does it mean the Chinese models are not as good, or are we just very biased and US-focused? - I think currently there's a discrepancy between the model and the platform. The open models are more known for their open weights, not their platform yet. - Many companies will sell you open-model inference at a very low cost. With OpenRouter, it's easy to look at multi-model things. You can run DeepSeek on Perplexity. Sitting here, we're like, "We use OpenAI GPT-5 Pro consistently." We're all willing to pay for the marginal intelligence gain. These models from the US are better in terms of the outputs. I think the question is, will they stay better for this year and for years to come?
As long as they're better, I'm gonna pay for them. There's also analysis showing that the way the Chinese models are served—you could argue this is due to export controls—is that they use fewer GPUs per replica, which makes them slower and prone to different errors. If speed and intelligence are both in the US models' favor, a lot of users there will go for them. And I think that will spur these Chinese companies to want to compete in other ways, whether it's free or substantially lower costs, or it'll breed creativity in terms of offerings, which is good for the ecosystem. But the simple thing is: the US models are currently better, and we use them. I tried these other open models, and I'm like, "Fun, but I don't go back to it." - We didn't really mention programming. That's another use case that a lot of people deeply care about. I use basically half-and-half Cursor and Claude Code, because they're fundamentally different experiences and both are useful. What do you guys... You program quite a bit, so what do you use? What's the current vibe? - So, I use the Codeium plugin for VS Code. You know, it's very convenient. It's just a plugin, and then it's a chat interface that has access to your repository. I know that Claude Code is, I think, a bit different. It is a bit more agentic. It touches more things. It does the whole project for you. I'm not quite there yet where I'm comfortable with that, because maybe I'm a control freak, but I still would like to see a bit of what's going on. And Codeium is kind of, right now, for me, the sweet spot, where it is helping me but not taking over completely. - I should mention, one of the reasons I do use Claude Code is to build the skill of programming with English. I mean, the experience is fundamentally different. As opposed to micromanaging the details of the code-generation process, looking at the diff (which you can in Cursor, if that's the IDE you use), changing and altering things, reading the code and understanding it deeply as you progress, you're just thinking in this design space and guiding it at a macro level, which I think is another way of thinking about the programming process. Also, we should say that Claude Code just seems to be somehow a better utilization of Claude Opus 4.5. - It's a good side-by-side for people to do. You can have Claude Code open, you can have Cursor open, you can have VS Code open, and you can select the same models on all of them, and ask questions, and it's very interesting. Claude Code is way better in that domain. It's remarkable. - All right, we should say that both of you are legit on multiple fronts: researchers, programmers, educators, Tweeters. And on the book front, too. So Nathan, at some point soon, hopefully, has an RLHF book coming out. - It's available for preorder, and there's a full digital preprint. I'm just making it pretty and better organized for the physical thing, which is a lot of why I do it, because it's fun to create things that you think are excellent in the physical form when so much of our life is digital. - I should say, going to Perplexity here: Sebastian Raschka is a machine learning researcher and author known for several influential books. A couple of them I wanted to mention, and highly recommend: Build a Large Language Model from Scratch, and the new one, Build a Reasoning Model from Scratch. So, I'm really excited about that.
Building stuff from scratch is one of the most powerful ways of learning. - Honestly, building an LLM from scratch is a lot of fun. It's also a lot to learn. And like you said, it's probably the best way to learn how something really works, 'cause you can look at figures, but figures can have mistakes. You can look at concepts and explanations, but you might misunderstand them. But if there is code, and the code works, you know it's correct. I mean, there's no misunderstanding. It's precise. Otherwise, it wouldn't work. And I think that's the beauty behind coding. It doesn't lie. It's math, basically. Even with math, though, I think you can have mistakes in a book you would never notice, because you are not running the math when you are reading the book; you can't verify it. And with code, what's nice is you can verify it. - Yeah, I agree with you about the Build an LLM from Scratch book. It's nice to tune out everything else, the internet and so on, and just focus on the book. But, you know, I've read several history books with an LLM alongside. It's just less lonely somehow. It's really more fun. Like, for example, on the programming front, I think it's genuinely more fun to program with an LLM. And I think it's genuinely more fun to read with an LLM. But you're right. That distraction should be minimized. So you use the LLM to basically enrich the experience, maybe add more context. I just find the rate of aha moments for me is really high with LLMs. - 100%. I also want to correct myself: I'm not suggesting not to use LLMs. I suggest doing it in multiple passes. Like, one pass just offline, focus mode, and then after that... I mean, I also take notes, but I try to resist the urge to immediately look things up. I do a second pass. It's just more structured this way. Sometimes things are answered in the chapter, but sometimes it also just helps to let it sink in and think about it. Other people have different preferences. I highly recommend using LLMs when reading books. For me, it's not the first thing to do; it's the second pass. - My recommendation is the opposite. I like to use the LLM at the beginning to lay out the full context: what is this world that I'm now stepping into? But I try to avoid clicking out of the LLM into the world of Twitter and blogs, because then you're down this rabbit hole. You're reading somebody's opinion. There's a flame war about a particular topic, and all of a sudden you're in the realm of the internet and Reddit and so on. But if you're purely letting the LLM give you the context of why this matters, what the big picture ideas are... sometimes books are good at doing that, but not always. - This is why I like the ChatGPT app, because it gives the AI a home on your computer where you can focus on it, rather than just being another tab in my mess of internet options. And I think Claude Code does a good job of making that a joy, where it seems very engaging as a product design to be an interface from which your AI will then go out into the world. It's something intangible between it and Codex; it just feels warm and engaging, where Codex, from OpenAI, can often be as good, but it just feels a little bit rough around the edges. Whereas Claude Code makes it fun to build things from scratch, where you just trust that it'll make something. Obviously this is good for websites and kind of refreshing tooling and stuff like this, which I use it for, or data analysis. On my blog, we scrape Hugging Face, so we keep download numbers for every dataset and model
over time, so we have them. And Claude was just like, "Yeah, I've made use of that data, no problem." And I was like, "That would've taken me days." And then I have enough situational awareness to be like, "Okay, these trends obviously make sense." You can check things. But that's just a wonderful interface, where you can have an intermediary and not have to do the kind of awful low-level work that you would have to do to maintain different web projects. - All right. So we just talked about a bunch of the closed-weight models. Let's talk about the open ones. Tell me about the landscape of open LLM models. Which are interesting? Which stand out to you and why? We already mentioned DeepSeek R1. - Do you wanna see how many we can name off the top of our heads? - Yeah, without looking at notes. - DeepSeek, Kimi, MiniMax, Z.ai, Moonshot. We're just going Chinese. - Let's throw in Mistral AI, Gemma... gpt-oss, the open-weight model by OpenAI. Actually, NVIDIA had a really cool one, Nemotron 3. There's a lot of stuff, especially at the end of the year. Qwen might be the one— - Oh, yeah. Qwen was the obvious name I was gonna say. You can get at least 10 Chinese and at least 10 Western. I think that OpenAI released their first open model since GPT-2. When I was writing about OpenAI's open model release, they were like, "Don't forget about GPT-2," which I thought was really funny 'cause it's just such a different time. But gpt-oss-120b is actually a very strong model and does some things that other models don't do very well. Selfishly, I'll promote a bunch of Western companies in the US and Europe that have these fully open models. I work at the Allen Institute for AI, where we've been building OLMo, which releases data and code. And now we have actual competition for people that are trying to release everything so that others can train these models. There's the Institute of Foundation Models with LLM360, which has had their K2 models of various types. Apertus comes from a Swiss research consortium. Hugging Face has SmolLM, which is very popular. And NVIDIA's Nemotron 3 has started releasing data as well. And then there's Stanford's Marin community project, which is kind of making it so there's a pipeline for people to open a GitHub issue, implement a new idea, and then have it run in a stable language modeling stack. This space... that list was way smaller in 2024; I think it was just AI2. So it's a great thing for more people to get involved and to understand language models, which doesn't really have a Chinese analog. While I'm talking, I'll say that the Chinese open language models tend to be much bigger, and that gives them higher peak performance as MoEs, whereas a lot of these things that we like a lot, whether it's Gemma or Nemotron, have tended to be smaller models. That is starting to change in the US and Europe. Mistral Large 3 came out in December, a giant MoE model very similar to the DeepSeek architecture. And then a startup, Arcee AI, and NVIDIA with Nemotron have teased MoE models way bigger than 100 billion parameters, like this 400-billion-parameter range, coming in this Q1 2026 timeline. So I think this kind of balance is set to change this year in terms of what people are using the Chinese versus US open models for, which I'm personally going to be very excited to watch. - First of all, huge props for being able to name so many of these. Did you actually name LLaMA? - No. - I feel like... - RIP. - This was not on purpose. - RIP LLaMA. All right.
Can you mention some interesting models that stand out? You mentioned Qwen 3 is obviously a standout. - So I would say the year was almost bookended: by DeepSeek V3 and R1 on one end, and then, in December, DeepSeek-V3.2 on the other. Because what I like about those is they always have an interesting architecture tweak that others don't have. But otherwise, if you want to go with the familiar but really good performance, Qwen 3 and, like Nathan said, also gpt-oss-120b. And I think what's interesting about it is that it's kind of like the first open-weight model that was really trained with tool use in mind, which I do think is kind of a paradigm shift the ecosystem was not quite ready for. By tool use, I mean that the LLM is able to do a web search or to call a Python interpreter. And I do think it's a standout because it's a huge unlock. One of the most common complaints about LLMs is, for example, hallucinations, right? And so, in my opinion, one of the best ways to solve hallucinations is to not try to always remember information or make things up. For math, why not use a calculator app or Python? If I ask the LLM, "Who won the soccer World Cup in 1998?" instead of just trying to memorize, it could go do a search. I think mostly it's still a Google search. So ChatGPT and gpt-oss-120b would do a tool call to Google, maybe find the FIFA website, and find: okay, it was France. It would get you that information reliably instead of just trying to memorize it. So I think it's a huge unlock, which right now is not fully utilized yet by the open-source, open-weight ecosystem. A lot of people don't use tool-call modes because, I think, first, it's a trust thing. You don't want to run this on your computer where it has access to tools and could wipe your hard drive or whatever. So you want to maybe containerize that. But I do think having this ability is a really important step for the upcoming years. - So a few quick things. First of all, thank you for defining what you mean by tool use. I think that's a great thing to do in general for the concepts we're talking about. Even things as well-established as MoEs: you have to say that means mixture of experts, and you kind of have to build up an intuition for people of what that means, how it's actually utilized, what the different flavors are. So what does it mean that there's just such an explosion of open models? What's your intuition? - If you're releasing an open model, you want people to use it; that is the first and foremost thing. And then after that come things like transparency and trust. I think when you look at China, the biggest reason is that they want people around the world to use these models, and I think a lot of people will not pay. If you look outside of the US, a lot of people will not pay for software, but they might have computing resources where you can put a model and run it. I think there can also be data that you don't want to send to the cloud. So the number one thing is getting people to use models, use AI, or use your AI, who might not be able to do it without having access to the model. - I guess we should state this explicitly, since we've been talking about these Chinese models and open-weight models: oftentimes, the way they're run is locally. So it's not like you're sending your data to China, or to Silicon Valley, or whoever developed the model.
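To make the tool-use loop Sebastian described concrete, here is a schematic sketch. The `web_search` function, the JSON call format, and the `fake_model` stand-in are hypothetical placeholders for illustration, not any particular model's actual schema; real stacks such as gpt-oss define their own tool-call formats.

```python
import json

def web_search(query: str) -> str:
    # Hypothetical search backend; a real harness would call an actual search API.
    return "FIFA.com: France won the 1998 World Cup, beating Brazil 3-0 in the final."

TOOLS = {"web_search": web_search}

def run_with_tools(model, prompt: str) -> str:
    """Drive a tool-enabled model: execute its tool calls until it answers in plain text."""
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = model(messages)  # one generation step
        try:
            # A tool call looks like {"tool": "web_search", "arguments": {"query": "..."}}
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain text: the final, grounded answer
        result = TOOLS[call["tool"]](**call["arguments"])
        messages.append({"role": "tool", "content": result})  # feed the result back

# Scripted stand-in for a real model, purely for demonstration:
def fake_model(messages):
    if messages[-1]["role"] == "tool":
        return "France won the 1998 World Cup."  # answer grounded in the tool result
    return json.dumps({"tool": "web_search", "arguments": {"query": "1998 World Cup winner"}})

print(run_with_tools(fake_model, "Who won the soccer World Cup in 1998?"))
```

The key point from the conversation is the loop itself: instead of answering from memorized weights, the model emits a call, the harness executes it, and the result is appended to the context before the model speaks again.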
- A lot of American startups make money by hosting these models from China and selling them. It's called selling tokens, which means somebody will call the model to do some piece of work. I think the other reason is for US companies like OpenAI: they are so GPU-deprived. They're at the limits of their GPUs. Whenever they make a release, they're always talking about, like, "Our GPUs are hurting." And I think during one of these gpt-oss-120b release sessions, Sam Altman said, "Oh, we're releasing this because we can use your GPUs. We don't have to use our GPUs, and OpenAI can still get distribution out of this," which is another very real thing, because it doesn't cost them anything. - And for the user, I think also, there are users who just use the model locally the way they would use ChatGPT. But also for companies, I think it's a huge unlock to have these models, because you can customize them, you can train them, you can add post-training, add more data. Like, specialize them into, let's say, law or medical models, whatever you have. And the appeal (you mentioned Llama) of the open-weight models from China is that their licenses are even friendlier. I think they are just unrestricted open-source licenses, whereas if you use something like Llama or Gemma, there are some strings attached. I think it's like an upper limit in terms of how many users you have. And then if you exceed, I don't know, so-and-so many million users, you have to report your financial situation to, let's say, Meta or something like that. And I think while it is a free model, there are strings attached, and people do like things where strings are not attached. So I think that's also one of the reasons, besides performance, why the open-weight models from China are so popular: you can just use them. There's no catch in that sense. - The ecosystem has gotten better on that front, but mostly downstream of these new providers providing such open licenses. That was funny when you pulled up Perplexity and it said, "Kimi K2 Thinking hosted in the US." I've never seen this, but it's an exact example of what we're talking about, where people are sensitive to this. But Kimi K2 Thinking and Kimi K2 are very popular models. People say they have very good creative writing and are also good at some software things. So it's just these little quirks that people pick up on with different models that they like. - What are some interesting ideas that some of these models have explored that you can speak to, that are particularly interesting to you? - Maybe we can go chronologically. I mean, there was, of course, DeepSeek. DeepSeek R1, which came out in January of 2025, if we just focus on 2025. However, this was based on DeepSeek-V3, which came out the year before, in December 2024. There are multiple things on the architecture side. What is fascinating is... I mean, that's what I do with my from-scratch coding projects: you can still start with GPT-2, and you can add things to that model to make it into this other model. So it's all still kind of the same lineage. It is a very close relationship between those. But off the top of my head, with DeepSeek, what was unique there is the Mixture of Experts. Not that they invented Mixture of Experts—we can maybe talk a bit more about what Mixture of Experts means—but just to list these things first before we dive into detail. Mixture of Experts, but then they also had Multi-head Latent Attention, which is a tweak to the attention mechanism, and this was, I would say, the main distinguishing factor between these open-weight models.
Different tweaks to make inference, or the KV cache size... We can also define KV cache in a few moments, but the goal is to make it more economical to have long context, to shrink the KV cache size. So what are tweaks that we can do? Most of them focused on the attention mechanism. There is Multi-head Latent Attention in DeepSeek. There is Group Query Attention, which is still very popular. It's not invented by any of those models; it goes back a few years. But that would be the other option. Sliding window attention—I think OLMo 3 uses it, if I remember correctly. So there are these different tweaks that make the models different. Otherwise, I put them all together in an article once where I just compared them. They are, surprisingly, very similar. It's just different numbers in terms of how many repetitions of the transformer block you have in the center. And, like, just little knobs that people tune. But what's so nice about it is it works no matter what. You can tweak things. You can move the normalization layers around to get some performance gains. And OLMo is always very good with ablation studies, showing what it actually does to the model if you move something around: does it make it better or worse? But there are so many, let's say, ways you can implement a transformer and make it still work. The big ideas that are still prevalent are Mixture of Experts, multi-head latent attention, sliding window attention, and group query attention. And then at the end of the year, we saw a focus on making the attention mechanism scale linearly with inference token prediction. So there was Qwen3-Next, for example, which added a gated DeltaNet. It's kind of inspired by state space models, where you have a fixed state that you keep updating. But it essentially makes this attention cheaper, or it replaces attention with a cheaper operation. - And it may be useful to step back and talk about the transformer architecture in general. - Yeah, so maybe we should start with the GPT-2 architecture. The transformer was derived from the "Attention Is All You Need" paper. The "Attention Is All You Need" paper had a transformer architecture with two parts, an encoder and a decoder. And GPT focused in on just the decoder part. It is essentially still a neural network, and it has this attention mechanism inside. And you predict one token at a time. You pass it through an embedding layer. There's the transformer block. The transformer block has attention modules and a fully connected layer. And there are some normalization layers in between. But it's essentially neural network layers with this attention mechanism. So, coming from GPT-2, when we move on to gpt-oss-120b, there is, for example, the Mixture of Experts layer. It's not invented by gpt-oss-120b; it's a few years old. But it is essentially a tweak to make the model larger without consuming more compute in each forward pass. So there is this fully connected layer, and if listeners are familiar with multi-layer perceptrons, you can think of a mini multi-layer perceptron, a fully connected neural network layer inside the transformer. And it's very expensive, because it's fully connected. If you have a thousand inputs and a thousand outputs, that's like one million connections. And it's a very expensive part of this transformer. And the idea is to kind of expand that into multiple feedforward networks.
So instead of having one, let's say you have 256. That alone would make it way more expensive, because now you have 256 of them, so you don't use all of them at the same time. You now have a router that says, "Okay, based on this input token, it would be useful to use this fully connected network." And in that context, it's called an expert. So a Mixture of Experts means you have multiple experts. And depending on what your input is, let's say it's more math-heavy, it would use different experts compared to, let's say, translating input text from English to Spanish; it would maybe consult different experts. It's not quite clear... I mean, not as clear-cut as saying, "Okay, this is only an expert for math, and this one for Spanish." It's a bit more fuzzy. But the idea is essentially that you pack more knowledge into the network, but not all the knowledge is used all the time. That would be very wasteful. So, during token generation, you are more selective. There's a router that selects which tokens should go to which expert. It adds more complexity. It's harder to train. There's a lot that can go wrong, like collapse and everything. So I think that's why OLMo 3 still uses dense models... I mean, you do have OLMo models with Mixture of Experts, but dense models, where dense means... So also, there's some jargon here: there's a distinction between dense and sparse. Mixture of Experts is considered sparse, because we have a lot of experts, but only a few of them are active. So that's called sparse. And then dense would be the opposite, where you only have one fully connected module, and it's always utilized.
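A minimal sketch of that routing idea in PyTorch-style Python: each token's vector is scored by a small router, only the top-k expert feed-forward networks run, and their outputs are combined. The dimensions and expert counts below are toy values, not those of any production model.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy sparse Mixture-of-Experts feed-forward layer with top-k routing."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        # Each "expert" is just a small feed-forward network (the MLP inside a transformer block).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                        # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1) # keep only the k best experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for token in range(x.shape[0]):                # plain loop for clarity, not speed
            for w, e in zip(weights[token], idx[token]):
                out[token] += w * self.experts[e](x[token])
        return out  # only top_k of n_experts ran per token: sparse compute, "dense" knowledge

# Example: 4 tokens routed through 8 experts, 2 active per token
layer = MoELayer()
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

The parameter count grows with the number of experts, but the per-token compute only grows with top_k, which is exactly the "larger model without more compute per forward pass" trade described above.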
- So maybe this is a good place to also talk about KV cache. But actually, before that, even zooming out: fundamentally, how many new ideas have been implemented from GPT-2 to today? Like, how different really are these architectures? - Take the Mixture of Experts. The attention mechanism in gpt-oss-120b, that would be the Group Query Attention mechanism. So it's a slight tweak from Multi-Head Attention to Group Query Attention. So we have that too. I think they replaced LayerNorm with RMSNorm, but it's just a different normalization there, and not a big change. It's just a tweak. The nonlinear activation function: for people familiar with deep neural networks, I mean, it's the same as swapping sigmoid for ReLU. It's not changing the network fundamentally. It's just a little tweak. And that's about it, I would say. It's not really fundamentally that different. It's still the same architecture. So you can go from one to the other by just adding these changes, basically. - It fundamentally is still the same architecture. - Yep. For example, you mentioned my book earlier. That's a GPT-2 model in the book, because it's simple and it's very small, approximately 124 million parameters. But in the bonus materials, I do have OLMo from scratch, Gemma 3 from scratch, and other types of from-scratch models. And I always start with my GPT-2 model and just tweak—well, add different components, and you get from one to the other. It's kind of a lineage, in a sense. - Can you build up an intuition for people? Because when you zoom out and look at it, there's so much rapid advancement in the AI world. And at the same time, fundamentally the architectures have not changed. So where is all the turbulence, the turmoil of the advancement happening? Where are the gains to be had? - So there are different stages where you develop the network or train the network. You have the pre-training. Back in the day, it was just pre-training, with GPT-2. Now you have pre-training, mid-training, and post-training. So I think right now we are in the post-training focus stage. Pre-training still gives you advantages if you scale it up with better, higher-quality data. But then we have capability unlocks that were not there with GPT-2. For example, ChatGPT is basically a GPT-3 model, and GPT-3 is the same as GPT-2 in terms of architecture. What was new was adding supervised fine-tuning and reinforcement learning from human feedback. So it's more on the algorithmic side than the architecture. - I would say that the systems also change a lot. If you listen to NVIDIA's announcements, they talk about things like, "You can now do FP8, you can now do FP4." What's happening is these labs are figuring out how to utilize more compute to put into one model, which lets them train faster and put more data in. And then you can find better configurations faster by doing this. So you can look at, essentially, tokens per second per GPU as a metric when you're doing large-scale training. You can go from 10k to 13k by turning on FP8 training, which means you're using less memory per parameter in the model. By storing less information, you do less communication and train faster. So all of these systems things underpin way faster experimentation on data and algorithms. It's a loop that keeps going, and it's hard to see when you look at architectures that are exactly the same, but the code base used to train these models is vastly different. The GPUs are different too, but you could probably train gpt-oss-20b way faster in wall-clock time than GPT-2 was trained at the time. - Yeah. Like you said, they had, for example, in Mixture of Experts, this FP4 optimization where you get more throughput. While for speed this is true, it doesn't give the model new capabilities. It's just: how much can we make the computation coarser without suffering model performance degradation? But I do think... I mean, there are alternatives to the transformer popping up. Text diffusion models are a completely different paradigm; although text diffusion models might use transformer architectures, they're not autoregressive transformers. And also Mamba models, which are state space models. But they do have trade-offs, and nothing has yet replaced the autoregressive transformer as the state-of-the-art model. For state-of-the-art, you would still go with that, but there are now alternatives for the cheaper end—alternatives that are kind of making compromises. It's not just one architecture anymore. There are little ones coming up. But if we talk about the state-of-the-art, it's pretty much still the transformer architecture, autoregressive, derived from GPT-2 essentially. - I guess the big question here is, we talked quite a bit about the architecture behind the pre-training: are the scaling laws holding strong across pre-training, post-training, inference, context size, data, and synthetic data? - I'd like to start with the technical definition of a scaling law, which informs all of this. The scaling law is the power-law relationship between... You can think of the x-axis as kind of what you are scaling, a combination of compute and data, which are kind of similar, and then the y-axis is like the held-out prediction accuracy over next tokens. We talked about models being autoregressive.
It's like: if you keep a set of text that the model has not seen, how accurate does the model get on it as you train? And the idea of scaling laws came when people figured out that that was a very predictable relationship. And I think that that technical sense is continuing, and then the question is, what do users get out of it? Then there are more types of scaling, where OpenAI's o1 was famous for introducing inference-time scaling. And I think it was less famous for also showing that you can scale reinforcement learning training and get kind of this log x-axis and then a linear increase in performance on the y-axis. So there are kind of these three axes now: the traditional scaling laws talked about for pre-training, which is how big your model is and how big your dataset is; then scaling reinforcement learning, which is how long you can do this trial-and-error learning that we'll talk about and define more of; and then this inference-time compute, which is just letting the model generate more tokens on a specific problem. So I'm kind of bullish; they're all really still working, but the low-hanging fruit has mostly been taken, especially in the last year, on reinforcement learning with verifiable rewards, which is this RLVR, and then inference-time scaling. That is why these models feel so different to use: previously you would get that first token immediately, and now they'll go off for seconds, minutes, or even hours, generating these hidden thoughts before giving you the first word of your answer. And that's all about this inference-time scaling, which is such a wonderful kind of step function in terms of how the models' abilities change. It kind of enabled this tool use stuff and enabled this much better software engineering that we were talking about. And this, when we say enabled, is almost entirely downstream of the fact that this reinforcement learning with verifiable rewards training just kind of let the models pick up these skills very easily. It let the models learn: if you look at the reasoning process when the models are generating a lot of tokens, what a model will often be doing is this: it tries a tool, it looks at what it gets back; it tries another API, it sees what it gets back and whether it solves the problem. So the models, when you're training them, very quickly learn to do this. And then at the end of the day, that gives this kind of general foundation where the model can use CLI commands very nicely in your repo, handle Git for you, move things around and organize things, or search to find more information, which, if we were sitting in these chairs a year ago, is something that we didn't really think of the models doing. So this is just kind of something that has happened this year and has totally transformed how we think of using AI, and it just unlocks so much value. But it's not clear what the next avenue will be in terms of unlocking stuff like this. I think there's... we'll get to continual learning later, but there's a lot of buzz around certain areas of AI, and no one knows when the next step function will really come.
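For reference, the textbook form of the pre-training scaling law Nathan is describing, in the common Chinchilla-style parameterization (the symbols follow the usual convention; the constants are fit per training setup, not universal):

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here N is the parameter count, D is the number of training tokens, E is the irreducible loss, and A, B, alpha, beta are fitted constants. On log-log axes the two power-law terms appear as straight lines, which is the "very predictable relationship" referenced above; training compute is roughly C ≈ 6ND for a standard transformer, which ties model size and data together on the x-axis.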
Pre-training, are we kind of implying that the low-hanging fruit on pre-training scaling has been picked? Has pre-training hit a plateau, or is even pre-training still something you're bullish on? - Pre-training has gotten extremely expensive. Scaling up pre-training also implies that you're gonna serve a very large model to the users. I think it's been loosely established that the likes of GPT-4 and similar models were around one trillion parameters at the biggest size. There are a lot of rumors that they've actually gotten smaller as training has gotten more efficient. You want to make the model smaller because then your costs of serving go down proportionately. For these models, the cost of training them is really low relative to the cost of serving them to hundreds of millions of users. I think DeepSeek had this famous number of about five million dollars for pre-training at cloud market rates. In OLMo 3, in section 2.4 of the paper, we detailed how long we had the GPU clusters for training, which includes engineering issues and multiple seeds, and it was about two million dollars to rent the cluster and deal with all the headaches of training a model. So a lot of people could get one to ten million dollars to train a model, but the recurring cost of serving millions of users is really billions of dollars of compute. You can look at a thousand-GPU rental that you pay about 100 grand a day for, and these companies could have millions of GPUs. You can look at how much these things cost just to sit around. So that's a big thing, and then the question is, if scaling is actually giving you a better model, is it gonna be financially worth it? I think we'll slowly push it out as AI solves more compelling tasks, like Claude Opus 4.5 making Claude Code just work for things. I launched this project called the ATOM project, American Truly Open Models, in July, and that was a true vibe-coded website; I have a day job, making plots and stuff. Then I came back to refresh it in the last few weeks, and Claude Opus 4.5, versus whatever model I used at the time, just crushed all the issues the site had from building it in June and July. It might be a bigger model; there are a lot of things that go into this, but there's still progress coming. - So what you're speaking to is the nuance of the y-axis of the scaling laws: the way it's experienced versus measured on a benchmark, the actual intelligence might be different. But still, your intuition about pre-training: if you scale the size of compute, will the models get better? Not whether it's financially viable, but just from the law aspect of it, do you think the models will get smarter? - Yeah. And this sometimes comes off almost like disillusionment when leadership at AI companies say it, but they're like, "It's held for 13 orders of magnitude of compute, why would it ever end?" So I think fundamentally it is pretty unlikely to stop; it's just that eventually we're not even gonna be able to test the bigger scales because of all the problems that come with more compute. There's a lot of talk about how 2026 is the year when very large Blackwell compute clusters, like gigawatt-scale facilities at hyperscalers, are coming online. These were all contracts for power and data centers that were signed and sought out in 2022 and 2023, so before or right after ChatGPT. 
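To put rough numbers on that training-versus-serving asymmetry, here is the back-of-the-envelope arithmetic using the figures from the conversation (a thousand-GPU rental at about $100k a day); the run length and fleet size are invented purely for illustration:

```python
# ~$100k/day for a 1,000-GPU rental is ~$100 per GPU-day.
gpu_day_rate = 100_000 / 1_000

# One-off pre-training: a hypothetical 1,000 GPUs for a month.
train_cost = 1_000 * 30 * gpu_day_rate

# Recurring serving: a hypothetical 100,000-GPU fleet running all year.
serve_cost_per_year = 100_000 * 365 * gpu_day_rate

print(f"training: ${train_cost / 1e6:.0f}M, paid once")        # $3M
print(f"serving:  ${serve_cost_per_year / 1e9:.2f}B per year")  # $3.65B
```

Even with generous error bars on every number, the recurring serving bill dwarfs the one-off training bill, which is the shape of the argument above.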
It took this two-to-three-year lead time to build these bigger clusters to train the models, and there's obviously immense interest in building even more data centers than that. So that is the crux of what people are saying: these new clusters are coming, the labs are gonna have more compute for training, and they're going to utilize it. But it's not a given. I've seen so much progress that I expect it, and I expect a little bit bigger models. I would say it's more like we'll see a $2,000 subscription this year. We've seen $200 subscriptions; that could 10X again. These are the kinds of things that could come, and they're all downstream of this bigger model that offers just a little bit more at the cutting edge. - So, you know, it's reported that xAI is gonna hit that one-gigawatt scale early '26, and a full two gigawatts by year end. How do you think they'll utilize that in the context of scaling laws? Is a lot of that inference? Is a lot of that training? - It ends up being all of the above. I think that all of your decisions when you're training a model come back to pre-training. So if you're going to scale RL on a model, you still need to decide on an architecture that enables this. We were talking about other architectures, using different types of attention, or mixture-of-experts models. The sparse nature of MoE models makes it much more efficient to do generation, which becomes a big part of post-training, and you need to have your architecture ready so that you can actually scale up this compute. I still think most of the compute is going in at pre-training, because you can still make a model better, and you still want to go and revisit this. You still want the best base model you can get. In a few years that'll saturate and the RL compute will just go longer. - Are there people who disagree with you and say pre-training is dead? It's all about scaling inference, scaling post-training, scaling context, continual learning, scaling data, synthetic data? - People vibe that way and describe it in that way, but I think that's not what's happening in practice. - It's just the general vibe of people saying this thing is dead. - The excitement is elsewhere. The low-hanging fruit in RL is elsewhere. For example, we released our model in November. Every company has deadlines; our deadline was November 20th, and for that, our RL run was five days, which compared to 2024 is a very long time to just be doing post-training on a model of 30 billion parameters. It's not a big model. And then in December, we had another release, where we let the RL run for another three and a half weeks, and the model got notably better, so we released it. And that's a lot to just allocate to something that is going to be your peak for the year. There are these types of decisions when training a model; they can't leave it forever. You have to keep pulling in the improvements from researchers. So you redo pre-training, you do this post-training for a month, but then you need to give it to your users. You need to do safety testing. I think there's a lot in place that reinforces this cycle of updating the models. Things improve; you get a new compute cluster that lets you do something more stably or faster. You hear a lot about Blackwell having rollout issues. At AI2, most of the models we're pre-training are on 1,000 to 2,000 GPUs. 
But when pre-training on 10,000 or 100,000 GPUs, you hit very different failures. GPUs break in weird ways, and on a 100,000-GPU run, you're pretty much guaranteed to have some GPU down at any given time. Your training code must handle that redundancy, which is a very different problem. What we're doing, like playing with post-training on a cluster, or what people learning ML are doing, is very different; what the labs are battling to train these biggest models is just massive distributed scale. - But that's somewhat different. - That's a systems problem in order to enable scaling laws, especially at pre-training: you need all these GPUs at once. When we shift to RL, it actually lends itself to heterogeneous compute, because you have many copies of the model. As a primer on language model reinforcement learning: you have two sets of GPUs. One you can call the actor, and one you call the learner. The learner is where your actual reinforcement learning updates happen; these are traditionally policy gradient algorithms, and Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) are the two popular classes. On the other side you have actors, which are generating completions, and these completions are what you're going to grade. Reinforcement learning is all about optimizing reward. In practice, you can have a lot of different actors in different parts of the world doing different types of problems, and then you send it back to this highly networked compute cluster to do the actual learning, where you take the gradients. You need a tightly meshed network to do different types of parallelism and spread out your model for efficient training. Every type of training and serving has these considerations to scale. We talked about pre-training and RL, and then there's inference-time scaling: how do you serve a model that's thinking for an hour to 100 million users? I don't know about that, but I know it's a hard problem. In order to give people this intelligence, there are all these systems problems, and we need more compute, and more stable compute, to do it. - But you're bullish on all of these kinds of scaling, is what I'm hearing. On the inference, on the reasoning, even on the pre-training? - Yeah, so that's a big can of worms, but there are basically two knobs where you can get gains: training and inference scaling. In a world where we had, let's say, infinite compute resources, you would want to do all of them. So you have training, you have inference scaling, and training is a hierarchy: pre-training, mid-training, post-training. Changing the model size, adding more training data, training a bigger model gives you more knowledge in the model. Then you have, let's say, a better base model. Back in the day, and still, we call it a foundation model. But you don't, let's say, have the model be able to solve your most complex tasks right after pre-training. You still have these other phases, mid-training or, for example, post-training with RL, that unlock capabilities latent in the knowledge the model acquired in pre-training. And I think, sure, if you do more pre-training, you get a better base model that you can unlock later. But like Nathan said, it just becomes too expensive. We don't have infinite compute, so you have to decide: do I want to spend that compute more on making the model larger? It's like a trade-off. 
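For a concrete picture of the actor/learner split and verifiable-reward grading described above, here is a deliberately tiny, self-contained sketch. The softmax "policy," the exact-match grader, and the plain REINFORCE update with group-normalized advantages are stand-ins for an LLM, a real verifier, and full PPO/GRPO (which add probability ratios, clipping, and KL penalties):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "policy": a softmax over four candidate answers to the prompt "2 + 3 = ?".
# In real RLVR the policy is an LLM and each completion is a token sequence.
answers = np.array([4, 5, 6, 7])
logits = np.zeros(4)

def verifiable_reward(answer: int) -> float:
    return 1.0 if answer == 5 else 0.0  # exact-match grader, no human judge

for step in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()

    # Actor side: sample a group of 8 rollouts for the same prompt, grade each.
    group = rng.choice(len(answers), size=8, p=probs)
    rewards = np.array([verifiable_reward(answers[i]) for i in group])

    # GRPO-style advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Learner side: REINFORCE policy-gradient steps on the logits
    # (probs held fixed within the step for simplicity).
    for i, a in zip(group, adv):
        grad_logp = -probs        # d log p(i) / d logits = one_hot(i) - probs
        grad_logp[i] += 1.0
        logits = logits + 0.05 * a * grad_logp

probs = np.exp(logits) / np.exp(logits).sum()
print("P(correct answer) after RL:", round(float(probs[1]), 3))
```

The two-sided structure is exactly why Lambert says RL lends itself to heterogeneous compute: the actor half parallelizes across loose, scattered GPUs, while only the learner half needs the tightly meshed cluster for gradient steps.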
In an ideal world, you want to do all of them. And in that sense, scaling is still pretty much alive. You would still get a better model; it's just, like we saw with Claude Opus 4.5, not worth it right now, because you can unlock more performance with other techniques at that moment, especially if you look at inference scaling. That was one of the biggest gains, with o1, where inference scaling took a smaller model further than pre-training a larger model like Claude Opus 4.5. So I wouldn't say pre-training scaling is dead; it's just that there are other, more attractive ways to scale right now. But at some point, you will still want to make progress on the pre-training. The other thing to consider is where you want to spend your money. If you spend it on pre-training, it's a fixed cost: you train the model, and then it has this capability forever. You can always use it. With inference scaling, you don't spend money during training; you spend money later, per query, and then it's also just math. How long is my model gonna be on the market if I replace it in half a year? Maybe it's not worth spending $5 million, $10 million, $100 million on training it longer. Maybe I will just do more inference scaling and get performance there, and it maybe costs me $2 million in terms of user queries. It becomes a question of how many users you have, and doing the math, and that's where ChatGPT is in an interesting position. They have a lot of users, so they need to go a bit cheaper, which is why they have that GPT-5 model that is a bit smaller. Other companies, or other customers, have other trade-offs. For example, there was the Math Olympiad, where they had a proprietary model, and I'm pretty sure it was just a model that had been fine-tuned a little bit more, with most of the gain coming from inference scaling to achieve peak performance on certain tasks. You don't need that all the time. But yeah, long story short, I do think all of these, pre-training, mid-training, post-training, inference scaling, are still things you want to do. It's just that at the moment, in this year, it's about finding the right ratio that gives you the best bang for the buck, basically. - I think this might be a good place to define pre-training, mid-training, and post-training. - So, pre-training is the classic training: one next-token prediction at a time, with a big corpus of data. And Nathan probably also has very interesting insights there because of OLMo 3; a big portion of the paper focuses on the right data mix. So, pre-training is essentially just training with a cross-entropy loss on next-token prediction over a vast corpus of internet data, books, papers, and so forth. It has changed a little bit over the years, in the sense that people used to throw in everything they could. Now, it's not just raw data. It's also synthetic data, where people, let's say, rephrase certain things. So synthetic data doesn't necessarily mean purely AI-made data. It's also taking something from an article, a Wikipedia article, and rephrasing it as a Q&A pair or summarizing it, rewording it, and making better data that way. Because I think of it like with humans. If someone, let's say, reads a book compared to a messy, no offense, Reddit post or something like that, I do think you learn- - There's going to be a post about this, Sebastian. - Some Reddit data is very coveted and excellent for training. 
You just have to filter it. - And I think that's the idea. If someone took that and rephrased it in a, let's say, more concise and structured way, it's higher-quality data that gets the LLM there faster. You get the same LLM out of it at the end, but it trains faster, because if the grammar and the punctuation are correct, it learns the correct way from the start, versus getting information from a messy source and then learning later how to correct that. So, I think that is how pre-training evolved and why scaling still works. It's not just about the amount of data; it's also the tricks to make that data better for you, in a sense. And then mid-training is... I mean, it used to be called pre-training. I think it's called mid-training because it was awkward to have pre-training and post-training but nothing in the middle, right? It sounds a bit weird: you have pre-training and post-training, but what's the actual training? So, mid-training is usually similar to pre-training, but a bit more specialized. It's the same algorithm, but you focus, for example, on long-context documents. The reason you don't do that during pre-training is that you don't have that many long-context documents, so you have a specific phase for it. And one problem of LLMs is still that a neural network has the problem of catastrophic forgetting: you teach it something, it forgets other things. I mean, it's not 100% forgetting, but it's like "no free lunch." It's the same with humans. If you ask me some math I learned 10 years ago, I would have to look at it again. - Nathan was actually saying that he's consuming so much content that there's a catastrophic forgetting issue. - Yeah, I'm trying to learn so much about AI, and it's like, I was learning about pre-training parallelism and I'm like, "I lost something and I don't know what it was." - I don't want to anthropomorphize LLMs, but it's the same kind of thing in how humans learn. Quantity is not always better, because you have to be selective. And mid-training is being selective about quality content at the end, so the last thing the LLM has seen is the quality stuff. And then post-training is all the fine-tuning: supervised fine-tuning, DPO, reinforcement learning with verifiable rewards (RLVR) or with human feedback, and so forth. So, the refinement stages. And it's also interesting as a cost thing. You spend a lot of money on pre-training right now, and a bit less on RL. With RL, you don't really teach the model knowledge. It's more like unlocking the knowledge; it's more like skill learning, how to solve problems with the knowledge that it has from pre-training. There were actually three papers this year, or last year, 2025, on RL for pre-training. But I don't think anyone does that in production. - Toy examples for now. - Toy examples, right? But to generalize, RL post-training is more like the skill unlock, where pre-training is like soaking up the knowledge. - A few things that could be helpful. A lot of people think of synthetic data as being bad for training the models. You mentioned DeepSeek OCR, which is Optical Character Recognition. A lot of labs did this; Ai2 had one, Meta had multiple. And the reason each of these labs has these is that there are vast amounts of PDFs and other digital documents on the web where the text isn't encoded in an easily extractable format. 
So you use these OCR models, DeepSeek OCR or what we called our Almost-OCR, to extract trillions of tokens of candidate data for pre-training. Pre-training dataset size is measured in trillions of tokens. Smaller models from researchers can be something like five to 10 trillion. Qwen is documented going up to 50 trillion, and there are rumors that the closed labs can go to 100 trillion tokens. In getting this potential data to put in, they have a very big funnel, and the data you actually train on is a small percentage of it. This character-recognition data would be described as synthetic data for pre-training in a lab. And then there's also the fact that ChatGPT now gives wonderful answers, and you can train on those best answers; that's synthetic data too. It's very different from early ChatGPT with lots of hallucinated data, which is when people's intuitions about synthetic data became grounded. - One interesting question: if I recall correctly, OLMo 3 was trained with less data than some other open-weight models, maybe even OLMo 2. But you still got better performance, and that might be one example of how the data helped. - It's mostly down to data quality. I think if we had more compute, we would train for longer; we'd ultimately see that as something we would want to do. And especially with big models, you need more compute, because we talked about having more parameters and we talked about knowledge. Essentially, there's a ratio where big models can absorb more from data, and then you get more benefit out of this. Picture a logarithmic graph: a small model will level off sooner as you keep adding tokens, and bigger models need more. But mostly, we aren't training models that big right now at AI2, and getting the highest-quality data we can is the natural starting point. - Is there something to be said about the topic of data quality? Is there some low-hanging fruit there still, where the quality could be improved? - It's like turning the crank. Historically, in the open, there's been a canonical best pre-training dataset that has moved around depending on who has the most recent, or best recent, effort. AI2's Dolma was very early with the first OLMo, Hugging Face had FineWeb, and there's the DCLM project, which stands for DataComp for Language Models; there have been DataComp efforts for other machine learning projects, and they had a very strong dataset. And a lot of it is that the internet is becoming fairly closed off. So we have Common Crawl, which is hundreds of trillions of tokens, and you filter it. It looks like scientific work, where you're training classifiers and making decisions about how you prune this dataset down into the highest-quality stuff and the stuff that suits your tasks. Previously, language models were tested a lot more on knowledge and conversational things, but now they're expected to do math and code. To train a reasoning model, you need to remix your whole dataset. And there are a lot of wonderful scientific methods here, where you take your gigantic dataset and sample really small subsets from different sources, such as GitHub, Stack Exchange, Reddit, Wikipedia. You train small models on each of these mixes and measure their performance on your evaluations. Then you can just do basic linear regression, and it's like, "Here's your optimal dataset." But if your evaluations change, your dataset changes a lot. 
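As a cartoon of that sample-train-regress procedure: the eval scores below are simulated rather than coming from real small-model training runs, and the per-source "true" values are invented, but the workflow (random mixes, small pilot runs, a linear fit, and an eval-dependent answer) is the part being illustrated:

```python
import numpy as np

rng = np.random.default_rng(0)
sources = ["GitHub", "Stack Exchange", "Reddit", "Wikipedia"]

# Each row is one pilot run: a data mix (source fractions summing to 1).
# In practice, each row means pre-training a small model on that mix and
# scoring it on your evals; here the scores are simulated with noise.
mixes = rng.dirichlet(np.ones(len(sources)), size=12)
true_value = np.array([0.50, 0.25, 0.05, 0.20])  # hypothetical per-source value
scores = mixes @ true_value + rng.normal(0.0, 0.01, size=12)

# Basic linear regression: estimate how much each source moves the eval.
est, *_ = np.linalg.lstsq(mixes, scores, rcond=None)
for name, value in zip(sources, est):
    print(f"{name:>14}: estimated value {value:.2f}")

# Naive readout: upweight the sources the regression says help most.
mix = np.clip(est, 0.0, None)
print("suggested mix:", (mix / mix.sum()).round(2))
```

Swap in a different evaluation, say math-heavy instead of knowledge-heavy, and the regression hands back a different "optimal" mix, which is exactly the eval-dependence just described.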
So a lot of OLMo 3 was new sources for reasoning, to be better at math and code, and then you do this mixing procedure and it gives you the answer. I think that's happened at labs this year: there are new hot things, whether it's coding environments or web navigation, and you need to bring in new data and change your whole pre-training so that your post-training can work better. And that's the constant evolution and the re-determining of what they care about for their models. - Are there fun anecdotes about what sources of data are particularly high quality that we wouldn't expect? You mentioned Reddit sometimes can be a source. - Reddit was very useful. I think PDFs are definitely one. - Oh, especially arXiv. - Yeah, so AI2 has run Semantic Scholar for a long time, which you could say is a competitor to Google Scholar with a lot more features. And to do this, AI2 has found and scraped a lot of PDFs for openly accessible papers that might not be behind the closed walled garden of a certain publisher. So, truly open scientific PDFs. And if you sit on all of these and you process them, you can get value out of them. A lot of that style of work was done by the frontier labs much earlier. You just need to have a pretty skilled researcher who understands how things change models; they bring it in, clean it, and it's a lot of labor. When frontier labs scale up researchers, a lot more goes into data. If you join a frontier lab and you want to have impact, the best way to do it is just to find new data that's better. The fancy, glamorous algorithmic things, like figuring out how to make o1, are the sexiest thought for a scientist. It's like, "Oh, I figured out how to scale RL." There's a group that did that, but most of the contribution is like- - On the dataset. - ..."I'm gonna make the data better," or, "I'm gonna make the infrastructure better so everyone on my team can run experiments 5% faster." - At the same time, I think it's also one of the most closely guarded secrets, what your training data is, for legal reasons. And so there's also, I think, a lot of work that goes into hiding what your training data was, essentially: training the model to not give away the sources, because you have legal reasons. - The other thing, to be complete, is that some people are trying to train on only licensed data, whereas Common Crawl is a scrape of the whole internet. So if I host multiple websites, I might be happy to have them used to train language models, but I'm not explicitly licensing them or specifying what governs that use. Common Crawl is largely unlicensed, which means that consent really hasn't been provided for how to use the data. There's another idea where you train language models only on data that has been licensed explicitly, so that the governing contract is provided. I'm not sure if Apertus is the copyright thing or the license thing; I know that the reason they did it was an EU compliance thing, where they wanted to make sure their model fit one of those checks. - On that note, there's also a distinction in licensing. Some people just purchase the license: let's say they buy an Amazon Kindle book, or a Manning book, and then use that in training. That is a gray zone, 'cause you paid for the content and you might want to train on it. But then there are also restrictions where even that shouldn't be allowed. And so that is where it gets a bit fuzzy. And yeah, I think that is still a hot topic right now. 
Big companies like OpenAI have approached private companies for their proprietary data, and private companies are becoming more and more, let's say, protective of their data, because they know, "Okay, this is going to be my moat in a few years." And I do think that's the interesting question: if LLMs become more commoditized, and a lot of people learn about LLMs, there will be a lot more people able to train them. Of course, there are infrastructure challenges. But if you think of big industries like pharmaceuticals, law, and finance, I do think they will, at some point, hire people from frontier labs to build in-house models on their proprietary data, which will then be, again, another unlock with pre-training that is currently not there. Because even if you wanted to, you can't get that data; you can't get access to clinical trial data most of the time, and these types of things. So I do think scaling, in that sense, might still be pretty much alive if you also look at domain-specific applications, because right now, this year, we are still just looking at general-purpose LLMs from ChatGPT, Anthropic, and so forth. They are just general purpose; they're not even, I think, scratching the surface of what an LLM can do if it is really trained and designed for a specific task. - I think on the data thing, this is one of the things that happened in 2025, and we totally forget it: Anthropic lost in court and owed $1.5 billion to authors. Anthropic, I think, bought thousands of books and scanned them, and was cleared legally for that because they bought the books, and that is kind of going through the system. On the other side, they also torrented some books, and I think the torrenting was the path where the court said they were culpable to pay these billions of dollars to authors, which is just such a mind-boggling lawsuit that kind of just came and went. That is so much money from the VC ecosystem. - These are court cases that will define the future of human civilization, because it's clear that data drives a lot of this, and there's this very complicated human tension of... I mean, you can empathize; you're both authors. You put your heart and soul and your sweat and tears into the writing that you do. It feels a little bit like theft for somebody to train on your data without giving you credit. - And there are, like Nathan said, two layers to it. Someone might buy the book and then train on it, which could be argued fair or not fair. But then there are companies who straight-up use pirated books, where the author isn't even compensated. That, I think, is where people got particularly angry, I would say. - Yeah, but there has to be some kind of compensation scheme. This is moving towards something like what Spotify streaming did originally for music. What does that compensation look like? You have to define those kinds of models; you have to think through all of that. One other thing I think people are generally curious about, and I'd love to get your thoughts: as LLMs are used more and more, if you look at even arXiv, or GitHub, more and more of the data is generated by LLMs. What do you do in that kind of world? How big of a problem is that? - The largest problem is the infrastructure and systems, but from an AI point of view, it's kind of inevitable. 
- So it's basically LLM-generated data that's curated by humans, essentially, right? - Yes, and I think a lot of open source contributors are legitimately burning out. If you have a popular open source repo, somebody's like, "Oh, I want to do open source AI. It's good for my career," and they just vibe-code something and throw it in. You might get more of this than I do. - Yeah, I actually have a case study here. I have a repository called MLxtend that I developed as a student around 10 years ago, and it is still a reasonably popular library for certain algorithms, especially frequent pattern mining stuff. And there were recently two or three people who submitted a lot of PRs in a very short amount of time. I do think LLMs were involved in submitting these PRs. For me as the maintainer, there are two things. First, I'm a bit overwhelmed; I don't have time to read through it all, because, especially for an older library, that is not a priority for me. At the same time, I kind of also appreciate it, because I think something people forget is that it's not just using the LLM. There's still a human layer that verifies something, and that is, in a sense, also how data is labeled, right? One of the most expensive things is getting labeled data for the RL-from-human-feedback phases. And this is kind of like that: it goes through phases, and then you actually get higher-quality data out of it. So I don't mind it, in a sense. It can feel overwhelming, but I do think there is also value in it. - It feels like there's a fundamental difference between raw LLM-generated data and LLM-generated data with a human in the loop doing some kind of verification, even if that verification covers a small percentage of the lines of code. - I think this goes with anything, where people sometimes think, "Oh, yeah, I can just use an LLM to learn about XYZ," which is true. You can, but there might be a person who is an expert who used an LLM to write specific code, and there is this human work that went into it to make it nice, throwing out the not-so-nice parts to pre-digest it for you, and that saves you time. I think that's the value-add: you have someone filtering things, or even just using the LLMs correctly. This is still labor that you get for free. For example, with a Substack article, I could maybe ask an LLM to give me opinions on that topic, but I wouldn't even know what to ask. And I think there is still value in reading that article compared to me going to the LLM, because you are the expert. You select what knowledge is actually spot on and should be included, and you give me this executive summary. And this is a huge value-add, because now I don't have to waste three to five hours going through this myself, maybe getting some incorrect information, and so on. And so…

Transcript truncated.
