DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters | Lex Fridman Podcast #459

Lex Fridman | 05:06:18 | Mar 27, 2026
Chapters: 22
Dylan Patel and Nathan Lambert discuss the DeepSeek moment, its tech context across major AI players, and geopolitical implications, while aiming to cut through hype and explain how these models work and what the future may hold.

Lex Fridman chats with Dylan Patel and Nathan Lambert about DeepSeek, OpenAI, NVIDIA, xAI, TSMC, Stargate, and the AI megaclusters shaping geopolitics and the future of computing.

Summary

In this marathon Lex Fridman episode, Dylan Patel of SemiAnalysis and Nathan Lambert from the Allen Institute for AI join the discussion to unpack DeepSeek’s V3 and R1, OpenAI’s o3-mini, and the shifting landscape of open weights, licenses, and post-training. They demystify the DeepSeek V3 base model, its post-training regime, and how R1’s reasoning model advances chain-of-thought-style problem solving with cost-efficient sparsity (Mixture of Experts and MLA attention). The trio contrasts pre-training versus post-training, RLHF, and the newer RL-based reasoning approaches that produce verifiable capabilities, including math and code verifiers. They also dive into hardware realities: NVIDIA’s Hopper ecosystem, H800 vs H100, the implications of export controls, and China–Taiwan dynamics around DeepSeek and Stargate. The conversation then pivots to organizations building colossal AI data centers (Stargate, OpenAI, Meta, Google), the economics of training versus inference, and who currently monetizes AI. Throughout, they touch on geopolitical stakes, the race for AI-enabled agents, the future of open-source AI, and the human-in-the-loop role in safety, alignment, and practical deployment. The tone remains accessible with definitions and concrete examples, even as the topics veer into GPUs, data centers, and global strategy. Specifically, they discuss how DeepSeek’s open-weight strategy pressures competitors, the licensing landscape (MIT-like vs. more restrictive Llama licenses), and the reality of export controls as a geopolitical tool shaping who can train what and where. They conclude with reflections on the societal trajectory of AI, the potential for agentic systems, and the enduring human element in design and governance.

Key Takeaways

  • DeepSeek V3 is a base LLM trained with large-scale pre-training and a separate post-training regime that creates a chat-like, instruction-tuned model; DeepSeek R1 adds a reasoning fine-tuning path on top of V3’s base.
  • Mixture of Experts (MoE) and MLA (multi-head latent attention) reduce compute by activating only a subset of parameters per token, enabling 600B+ parameter scales with around 37B parameters active per token in DeepSeek’s MoE design.
  • Open weights with permissive licenses (MIT-like) and open data/code concepts are pushing the frontier toward more usable, verifiable AI; DeepSeek’s R1 is notable for its open weights and commercial friendliness.
  • Export controls and geopolitical tensions (US–China) are increasingly shaping where and how frontier AI is trained, with Stargate-like projects and 2–5 GW data-center scales highlighted as potential future bottlenecks or accelerants.
  • Reasoning models (R1, o3-mini, Gemini Flash) reveal a continuum: base capabilities, explicit chain of thought, and parallel sampling/search strategies; pricing and accessibility are rapidly evolving across incumbents and newcomers.
  • Hardware economies (NVIDIA Hopper, H100 vs H800, memory bandwidth, interconnect) and data-center scale dominate both cost and capability; power and cooling are strategic chokepoints in mega-clusters.
  • AI agents represent the next frontier beyond chat and reasoning; real-world deployment will hinge on reliability, safety, boundary conditions, and human-in-the-loop governance; true autonomy remains a work in progress.
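The per-token compute claim in these takeaways can be made concrete with back-of-the-envelope arithmetic. This is a simplified cost model assuming forward-pass FLOPs scale linearly with *active* parameters (roughly 2 FLOPs per active parameter per token); the 671B/37B and 405B figures are the publicly reported parameter counts for DeepSeek V3 and Llama 3 405B:

```python
# Rough compute comparison: dense model vs. mixture-of-experts (MoE).
# For a transformer, matrix-multiply cost per token is approximately
# proportional to the number of *active* parameters, so an MoE that
# activates only a fraction of its weights is much cheaper per token
# than a dense model of the same total size.

def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token (~2 * active params)."""
    return 2 * active_params

deepseek_total = 671e9   # DeepSeek V3: ~671B total parameters
deepseek_active = 37e9   # ...but only ~37B active per token
llama_dense = 405e9      # Llama 3 405B: dense, all params active per token

moe_cost = flops_per_token(deepseek_active)
dense_cost = flops_per_token(llama_dense)

print(f"Active fraction of total: {deepseek_active / deepseek_total:.1%}")
print(f"MoE per-token cost vs. dense 405B: {moe_cost / dense_cost:.1%}")
```

The point the hosts make follows directly: total parameters set the "embedding space" for knowledge, while active parameters set the per-token compute bill for both training and inference.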

Who Is This For?

This episode is essential for AI developers, researchers, and engineers who want a grounded view of frontier models, the economics of mega-clusters, and the geopolitics surrounding AI infrastructure. It’s also valuable for policy watchers and tech strategists tracking how licensing, export controls, and open-source AI could reshape industry dynamics.

Notable Quotes

"DeepSeek V3 is a base model… and then you do post-training to get a chat model; R1 is the reasoning model built on top of that base."
Distinguishes DeepSeek base pre-training from post-training for chat and the subsequent reasoning-focused R1.
"There are licenses that come from history and open source software… not all the same models have the same terms."
Clarifies open weights licensing complexity and why DeepSeek’s licensing is important in the open-source AI discourse.
"Mixture of Experts… you only activate about 37 billion parameters per token, even though you have 600+ billion total—big efficiency win."
Explains how MoE and MLA reduce compute and enable ultra-large models.
"The ‘open weights’ concept is not the same as ‘open source’ but it pushes toward more replicable AI; DeepSeek’s R1 is notable for its open weights with a permissive license."
Highlights licensing and openness dynamics in frontier AI.
"Export controls are shaping who can train frontier AI and where; Stargate-sized data centers represent the scale the US wants to deter or slow in other regions."
Links hardware policy to geopolitics and the AI race.
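The MoE quote above can be illustrated with a toy routing sketch. This is not DeepSeek's implementation: the "experts," gating function, and top-k choice here are made-up stand-ins that only demonstrate the routing idea, namely that per-input compute depends on the number of activated experts, not the total count:

```python
# Toy mixture-of-experts routing: each "expert" is a simple function,
# a gate scores the experts for a given input, and only the top-k
# experts are actually executed. Real MoE layers use learned gating
# over high-dimensional hidden states; this is only the control flow.

import heapq

def make_expert(weight: float):
    # Stand-in for an expert sub-network; here just a scalar multiply.
    return lambda x: weight * x

experts = [make_expert(w) for w in (0.5, 1.0, 2.0, 4.0)]

def gate_scores(x: float) -> list[float]:
    # Hypothetical gating rule: prefer experts whose "specialty" value
    # is closest to the input. A real router is a learned projection.
    return [-abs(x - target) for target in (1.0, 2.0, 3.5, 5.0)]

def moe_forward(x: float, k: int = 2) -> float:
    scores = gate_scores(x)
    top_k = heapq.nlargest(k, range(len(experts)), key=scores.__getitem__)
    # Only k of the experts run; the rest stay inactive for this input.
    return sum(experts[i](x) for i in top_k) / k

print(moe_forward(2.0))
```

With `k=2` out of four experts, only half the expert parameters are touched per input; scaling the expert count grows capacity without growing per-input compute, which is the efficiency argument made in the episode.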

Questions This Video Answers

  • How does DeepSeek R1 differ from V3 in terms of training and reasoning capabilities?
  • What is a Mixture of Experts model and how does MLA attention improve efficiency for large language models?
  • Why are export controls significant for AI training, and how might Stargate-like centers affect global AI development?
  • How do open weights licenses influence accessibility and commercial use of frontier AI models?
  • What are the differences between OpenAI's o3-mini and DeepSeek R1 in terms of cost and inference performance?
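Several of these questions turn on the "verifiable rewards" idea discussed in the episode: for math or code, a sampled answer can be graded mechanically against a known solution. Below is a minimal sketch under assumed conventions; the answer-extraction format and the hard-coded "samples" standing in for model generations are hypothetical, not DeepSeek's actual pipeline:

```python
# Sketch of reward assignment for a verifiable task: grade several
# sampled completions of a math question against the ground truth,
# as in RL with verifiable rewards. The binary rewards would then be
# used to update the policy model.

def extract_final_answer(completion: str) -> str:
    """Assumed convention: the text after the last '=' is the answer."""
    return completion.rsplit("=", 1)[-1].strip()

def verify(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

question = "What is 12 * 9?"
samples = [  # pretend these were sampled from the policy model
    "12 * 9 = 12 * 10 - 12 = 108",
    "12 * 9 = 96",
    "Breaking it down: 10*9 + 2*9 = 90 + 18, so 12 * 9 = 108",
]
rewards = [verify(s, "108") for s in samples]
print(rewards)
```

For code, the `verify` step would instead run unit tests; the loop structure (sample multiple attempts, check each, reinforce the correct ones) is the same.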
Tags: DeepSeek V3, DeepSeek R1, Open weights, Mixture of Experts (MoE), MLA attention, Chain of Thought, o3-mini, OpenAI, Claude, GPT-4 / GPT-4 Turbo context costs
Full Transcript
The following is a conversation with Dylan Patel and Nathan Lambert. Dylan runs SemiAnalysis, a well-respected research and analysis company that specializes in semiconductors, GPUs, CPUs, and AI hardware in general. Nathan is a research scientist at the Allen Institute for AI and is the author of the amazing blog on AI called Interconnects. They are both highly respected, read, and listened to by the experts, researchers, and engineers in the field of AI, and personally, I'm just a fan of the two of them. So I used the DeepSeek moment that shook the AI world a bit as an opportunity to sit down with them and lay it all out: from DeepSeek, OpenAI, Google, xAI, Meta, Anthropic, to NVIDIA and TSMC, and to US-China-Taiwan relations, and everything else that is happening at the cutting edge of AI. This conversation is a deep dive into many critical aspects of the AI industry. While it does get super technical, we try to make sure that it's still accessible to folks outside of the AI field by defining terms, stating important concepts explicitly, spelling out acronyms, and in general always moving across the several layers of abstraction and levels of detail. There is a lot of hype in the media about what AI is and isn't. The purpose of this podcast, in part, is to cut through the hype and the low-resolution analysis, and to discuss in detail how stuff works and what the implications are. Let me also, if I may, comment on the new OpenAI o3-mini reasoning model, the release of which we were anticipating during the conversation, and it did indeed come out right after. Its capabilities and costs are on par with our expectations, as we stated. OpenAI o3-mini is indeed a great model, but it should be stated that DeepSeek R1 has similar performance on benchmarks, is still cheaper, and it reveals its chain-of-thought reasoning, which o3-mini does not; it only shows a summary of the reasoning. Plus, R1 is open weight and o3-mini is not. By the way, I got a chance to play with o3-mini, and anecdotally, vibe-check-wise, I felt that o3-mini, specifically o3-mini-high, is better than R1. Still, for me personally, I find that Claude 3.5 Sonnet is the best model for programming, except for tricky cases, where I will use o1 Pro to brainstorm. Either way, many more, better AI models will come, including reasoning models, from both American and Chinese companies. They will continue to shift the cost curve, but the quote-unquote "DeepSeek moment" is indeed real. I think it will still be remembered five years from now as a pivotal event in tech history, due in part to the geopolitical implications, but for other reasons too, as we discuss in detail from many perspectives in this conversation. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Dylan Patel and Nathan Lambert.

A lot of people are curious to understand China's DeepSeek models, so let's lay it out. Nathan, can you describe what DeepSeek V3 and DeepSeek R1 are, how they work, and how they're trained? Let's look at the big picture, and then we'll zoom in on the details. Yeah, so DeepSeek V3 is a new mixture-of-experts transformer language model from DeepSeek, which is based in China. They have some new specifics in the model that we'll get into. Largely, this is an open-weight model, and it's an instruction model like what you would use in ChatGPT. They also release what is called the base model, which is before these techniques of post-training. Most people use instruction models today, and those are what's served in all sorts of applications. This was released on, I believe, December 26th, or that week, and then weeks later, on January 20th, DeepSeek released DeepSeek R1, which is a reasoning model, and which really accelerated a lot of this discussion. This reasoning model has a lot of overlapping training steps with DeepSeek V3, and it's confusing that you have a base model called V3 that you do some things to to get a chat model, and then you do some different things to get a reasoning model. I think
a lot of the AI industry is going through this challenge of communications right now, where OpenAI makes fun of their own naming scheme: they have GPT-4, they have OpenAI o1, and there's a lot of types of models. So we're going to break down what each of them are. There's a lot of technical specifics on training, and we'll go from high level to specific and kind of go through each of them. There's so many places we can go here, but maybe let's go to open weights first. What does it mean for a model to be open weights, and what are the different flavors of open source in general? Yeah, so this discussion has been going on for a long time in AI. It became more important, or more focal, since ChatGPT at the end of 2022. Open weights is the accepted term for when the model weights of a language model are available on the internet for people to download. Those weights can have different licenses, which is effectively the terms by which you can use the model. There are licenses that come from the history of open-source software; there are licenses that are designed by companies specifically. All of Llama, DeepSeek, Qwen, Mistral, these popular names in open-weight models, have some of their own licenses. It's complicated, because not all the same models have the same terms. The big debate is on what makes a model open weight. It's like, why are we saying this term? It's kind of a mouthful. It sounds close to open source, but it's not the same. There's still a lot of debate on the definition and soul of open-source AI. Open-source software has a rich history on freedom to modify, freedom to take on your own, freedom from many restrictions on how you would use the software, and what that means for AI is still being defined. So, for what I do: I work at the Allen Institute for AI. We're a nonprofit. We want to make AI open for everybody, and we try to lead on what we think is truly open source. There's not full agreement in the community, but for us, that means releasing the training data, releasing the training code, and then also having open weights like this. And we'll get into the details of the models, and again and again, as we try to get deep into how the models were trained, we will say things like: the data processing, data filtering, data quality is the number one determinant of the model quality, and then a lot of the training code is the determinant of how long it takes to train and how fast experimentation is. So without fully open-source models, where you have access to this data, it is hard to know, or it's harder to replicate. We'll get into cost numbers for DeepSeek V3 on mostly GPU hours and how much you could pay to rent those yourselves, but without the data, the replication cost is going to be far, far higher, and the same goes for the code. We should also say that this is probably one of the more open models out of the frontier models. So in this full spectrum, where the fullest open source is, like you said, open code, open data, open weights: this is not open code, this is probably not open data, and this is open weights. And the licensing is an MIT license, or, I mean, there's some nuance in the different models, but it's towards the free end in terms of the open-source movement. These are kind of the good guys. Yeah, DeepSeek is doing fantastic work for disseminating understanding of AI. Their papers are extremely detailed in what they do, and for other teams around the world, they're very actionable in terms of improving your own training techniques. And we'll talk about licenses more. The DeepSeek R1 model has a very permissive license. It's called the MIT license. That effectively means there's no downstream restrictions on commercial use, there's no use-case restrictions, you can use the outputs from the models to create synthetic data, and this is all fantastic. I think the closest peer is something like Llama, where you have the weights and you have a technical report, and the technical report is very good. For Llama, one of the most-read PDFs of
the year last year is the Llama 3 paper, but in some ways it's slightly less actionable. It has fewer details on the training specifics, fewer plots, and so on. And the Llama 3 license is more restrictive than MIT, and then between the DeepSeek custom license and the Llama license, we could get into this whole rabbit hole. I think we'll make sure we want to go down the license rabbit hole before we do specifics. Yeah. And, I mean, it should be stated that one of the implications of DeepSeek is it puts pressure on Llama and everybody else, on OpenAI, to push towards open source. And that's the other side of open source that you mentioned: how much is published in detail about it. So how open are you with the insights behind the code? How good are the technical reports? Are they hand-wavy, or is there actual detail in there? And that's one of the things that DeepSeek did well: they publish a lot of the details. Yeah, especially in the DeepSeek V3 paper, which is their pre-training paper, they were very clear that they are doing interventions on the technical stack that go at many different levels. For example, to get highly efficient training, they're making modifications at or below the CUDA layer for NVIDIA chips. I have never worked there myself, and there are few people in the world that do that very well, and some of them are at DeepSeek. These types of people are at DeepSeek and leading American frontier labs, but there are not many places. To help people understand the other implication of open weights, there's a topic we'll return to often here: there's a fear that China, the nation, might have interest in stealing American data, violating the privacy of American citizens. What can we say about open weights to help us understand what the weights are able to do, in terms of stealing people's data? Yeah, so these weights that you can download from Hugging Face or other platforms are very big matrices of numbers. You can download them to a computer in your own house that has no internet, and you can run this model, and you're totally in control of your data. That is something that is different from how a lot of language model usage is actually done today, which is mostly through APIs, where you send your prompt to GPUs run by certain companies, and these companies will have different distributions and policies on how your data is stored, if it is used to train future models, where it is stored, if it is encrypted, and so on. So with open weights, you have the fate of your data in your own hands, and that is something that is deeply connected to the soul of open source. So it's not the model that steals your data; it's whoever's hosting the model, which could be China if you're using the DeepSeek app, or it could be Perplexity, and you're trusting them with your data, or OpenAI, and you're trusting them with your data. And some of these are American companies, some of these are Chinese companies, but the model itself is not doing the stealing; it's the host. All right, so back to the basics: what's the difference between DeepSeek V3 and DeepSeek R1? Can we try to lay out the potential confusion? Yes. So, for one, I am very understanding of many people being confused by these two model names. I would say the best way to think about this is that, when training a language model, you have what is called pre-training, which is when, on large amounts of mostly internet text, you're trying to predict the next token. And what to know about these new DeepSeek models is that they do this internet-scale pre-training once, to get what is called DeepSeek V3 Base. This is a base model. It's just going to finish your sentences for you. It's going to be harder to work with than ChatGPT. And then what DeepSeek did is they've done two different post-training regimes to make the models have specific desirable behaviors. So what is the more normal model, in terms of the last few years of AI? An instruct
model, a chat model, a quote-unquote "aligned" model, a helpful model: there are many ways to describe this. It's the more standard post-training. This is things like instruction tuning, reinforcement learning from human feedback; we'll get into some of these words. And this is what they did to create the DeepSeek V3 model. This was the first model to be released, and it is very high-performant. It's competitive with GPT-4, Llama 405B, and so on. And then, when this release was happening (we don't know their exact timeline), or soon after, they were finishing the training of a different training process from the same next-token-prediction base model that I talked about, which is when this new reasoning training that people have heard about comes in, in order to create the model that is called DeepSeek R1. The R, a good grounding through this conversation, is for reasoning, and the name is also similar to OpenAI o1, which is the other reasoning model that people have heard about. And we have to break down the training for R1 in more detail, because, for one, we have a paper detailing it, but also it is a far newer set of techniques for the AI community, so it is a much more rapidly evolving area of research. Maybe we should also cover the big two categories of training, pre-training and post-training, these umbrella terms that people use. So what is pre-training, and what is post-training, and what are the different flavors of things underneath the post-training umbrella? Yeah. So pre-training (I'm using some of the same words to really get the message across) is you're doing what is called autoregressive prediction to predict the next token in a series of documents. This is done over, standard practice is, trillions of tokens, so this is a ton of data that is mostly scraped from the web. In some of DeepSeek's earlier papers, they talk about their training data being distilled for math (and I shouldn't use this word yet), but taken from Common Crawl, and that's a public-access dataset; anyone listening to this could go download data from the Common Crawl website. This is a crawler that is maintained publicly. Yes, other tech companies eventually shift to their own crawler, and DeepSeek likely has done this as well, as most frontier labs do, but this sort of data is something that people can get started with, and you're just predicting text in a series of documents. This can be scaled to be very efficient, and there's a lot of numbers that are thrown around in AI training, like how many floating-point operations, or FLOPS, are used, and then you can also look at how many hours of these GPUs are used. And it's largely one loss function taken to a very large amount of compute usage: you just set up really efficient systems, and then at the end of that, you have this base model. And post-training is where there is a lot more complexity, in terms of how the process is emerging or evolving, and the different types of training losses we'll use. This is a lot of techniques grounded in the natural-language-processing literature. The oldest technique, which is still used today, is something called instruction tuning, also known as supervised fine-tuning. The acronyms will be IFT or SFT; people really go back and forth between them, and I will probably do the same. It's where you add this formatting to the model, where it knows to take a question, like "Explain the history of the Roman Empire to me," or some sort of question you'll see on Reddit or Stack Overflow, and then the model will respond in an information-dense but presentable manner. The core of that formatting is in this instruction-tuning phase. And then there's two other categories of loss functions that are being used today. One I will classify as preference fine-tuning. Preference fine-tuning is a generalized term for what came out of reinforcement learning from human feedback, which is RLHF. This reinforcement learning from human feedback is credited as the technique that helped ChatGPT break through. It is a technique to
make the responses that are nicely formatted, like these Reddit answers, more in tune with what a human would like to read. This is done by collecting pairwise preferences from actual humans out in the world, to start, and now AIs are also labeling this data, and we'll get into those trade-offs. And you have this kind of contrastive loss function between a good answer and a bad answer, and the model learns to pick up these trends. There's different implementation ways: you have things called reward models; you could have direct alignment algorithms. There's a lot of really specific things you can do, but all of this is about fine-tuning to human preferences. And the final stage is much newer, and we'll link it to what is done in R1 and these reasoning models. I think OpenAI's name for this: they had this new API in the fall, which they called the reinforcement fine-tuning API. This is the idea that you use the techniques of reinforcement learning, which is a whole framework of AI; there's a deep literature here. To summarize, it's often known as trial-and-error learning, or the subfield of AI where you're trying to make sequential decisions in a certain, potentially noisy, environment. There's a lot of ways we could go down that, but: fine-tuning language models where they can generate an answer, and then you check to see if the answer matches the true solution. For math or code, you have an exactly correct answer for math; you can have unit tests for code. And what we are doing is we are checking the language model's work, and we're giving it multiple opportunities on the same questions to see if it is right. And if you keep doing this, the models can learn to improve in verifiable domains, to a great extent. It works really well. It's a newer technique in the academic literature. It's been used at frontier labs in the US that don't share every detail, for multiple years. So this is the idea of using reinforcement learning with language models, and it has been taking off, especially in this DeepSeek moment. And we should say that there's a lot of exciting stuff going on, again, across the stack, but in post-training, probably this year, there's going to be a lot of interesting developments; we'll talk about it. I almost forgot to talk about the difference between DeepSeek V3 and R1 on the user-experience side. So forget the technical stuff, forget all that; just for people that don't know anything about AI, when they show up, what's the actual experience, what's the use case for each one when they actually type and talk to it? What is each good at, and that kind of thing? So let's start with DeepSeek V3. Again, it's what more people would have tried, or something like it. You ask it a question, it'll start generating tokens very fast, and those tokens will look like a very human-legible answer. It'll be some sort of markdown list; it might have formatting to help you draw to the core details in the answer. And it'll generate tens to hundreds of tokens. A token is normally a word, for common words, or a subword part in a longer word. And it'll look like a very high-quality Reddit or Stack Overflow answer. These models are really getting good at doing this across a wide variety of domains. Even things that, if you're an expert, are close to the fringe of knowledge, they will still be fairly good at. Cutting-edge AI topics that I do research on: these models are capable as a study aid, and they're regularly updated. Where this changes is with DeepSeek R1. With what are called these reasoning models, when you see tokens coming from these models, to start, it will be a large chain-of-thought process (we'll get back to chain of thought in a second), which looks like a lot of tokens where the model is explaining the problem. The model will often break down the problem, like: okay, they asked me for this; let's break down the problem; I'm going to need to do this. And you'll see all of this generating from the model. It'll come very
fast. In most user experiences, these APIs are very fast, so you'll see a lot of tokens, a lot of words, show up really fast. It'll keep flowing on the screen, and this is all the reasoning process. And then eventually, the model will change its tone in R1, and it'll write the answer, where it summarizes its reasoning process and writes a similar answer to the first types of models. But in DeepSeek's case, which is part of why this was so popular even outside the AI community, you can see how the language model is breaking down problems, and then you get this answer. On a technical side, they train the model to do this specifically, where they have a section which is reasoning, and then it generates a special token, which is probably hidden from the user most of the time, which says, okay, I'm starting the answer. So the model is trained to do this two-stage process on its own. If you use a similar model in, say, OpenAI: OpenAI's user interface is trying to summarize this process for you nicely by kind of showing the sections that the model is doing, and it'll kind of click through. It'll say, breaking down the problem, making the calculation, cleaning the result, and then the answer will come, for something like OpenAI. Maybe it's useful here to go through an example of DeepSeek R1 reasoning. Yeah. So if you're looking at the screen here, what you'll see is a screenshot of the DeepSeek chat app, and at the top is "Thought for 157 seconds" with the drop-down arrow. Underneath that, if we were in an app that we were running, the drop-down arrow would have the reasoning. So in this case, the specific question (which, you know, I'm philosophically slash pothead inclined) is asking DeepSeek R1 for one truly novel insight about humans, and it reveals the reasoning. And basically, the "truly novel" aspect was pushing the reasoning, with the model constantly asking itself: is this truly novel? So it's actually challenging itself to be more novel, more counterintuitive, less cringe, I suppose. So some of the reasoning (this is just snapshots) says: "Alternatively, humans have a unique meta-emotion where they feel emotions about their own emotions, e.g., feeling guilty about being angry. This recursive emotional layering creates complex motivational drives that don't exist in other animals. The insight is that human emotions are nested." So it's reasoning through how humans feel emotions; it's reasoning about meta-emotions. It's going to have pages and pages of this. It's almost too much to actually read, but it's nice to skim as it's coming; it's a James Joyce stream of consciousness. And then it goes: "Wait, the user wants something that's not seen anywhere else. Let me dig deeper and consider the human ability to hold contradictory beliefs simultaneously. Cognitive dissonance is known, but perhaps the function is to allow flexible adaptation," and so on and so forth. I mean, that really captures the public imagination: that this is, I mean, intelligent, almost like an inkling of sentience, because you're thinking through, you're self-reflecting, you're deliberating. And the final result of that, after 157 seconds, is: "Humans instinctively convert selfish desires into cooperative systems by collectively pretending abstract rules (money, laws, rights) are real. These shared hallucinations act as, quote, 'games' where competition is secretly redirected to benefit the group, turning conflict into society's fuel." Pretty profound. I mean, you know, this is a digression, but a lot of people have found that these reasoning models can sometimes produce much more eloquent text. That's at least an interesting example. I think, depending on how open-minded you are, you find language models interesting or not, and there's a spectrum there. Well, I mean, we'll talk about different benchmarks and so on, but some of it is just a vibe. Like, that in itself is a, let's say, quote, "fire tweet." Yeah, if I'm trying to produce something
where people are like, "oh, okay, so that's chain of thought." We'll probably return to it more. How are they able to achieve such low cost on the training and the inference? Maybe you could talk about the training first. Yeah, so there are two main techniques that they implemented that are probably the majority of their efficiency, and then there's a lot of implementation details that maybe we'll gloss over or get into later that contribute to it. Those two main things are: one, they went to a mixture-of-experts model, which we'll define in a second; and the other is that they invented this new technique called MLA, latent attention. Both of these are big deals. Mixture of experts is something that's been in the literature for a handful of years, and OpenAI with GPT-4 was the first one to productize a mixture-of-experts model. What this means is, the common models around that most people have been able to interact with that are open weights, think Llama, are dense models, i.e. every single parameter or neuron is activated as you're going through the model for every single token you generate. With a mixture-of-experts model, you don't do that. How does a human actually work? Well, my visual cortex is active when I'm thinking about a vision task, and my amygdala is active when I'm scared. These different aspects of your brain are focused on different things. A mixture-of-experts model attempts to approximate this to some extent. It's nowhere close to what a brain architecture actually is, but different portions of the model activate: you'll have a set number of experts in the model and a set number that are activated each time, and this dramatically reduces both your training and inference cost. If you think about the parameter count as the total embedding space for all of this knowledge that you're compressing down during training, instead of having to activate every single parameter every single time you're training or running inference, now you can just activate a subset, and the model will learn which expert to route to for different tasks. This is a humongous innovation in terms of: hey, I can continue to grow the total embedding space of parameters. DeepSeek's model is 600-something billion parameters, relative to Llama 405B's 405 billion parameters, or Llama 70B's 70 billion parameters. So this model technically has more embedding space for information, to compress all of the world's knowledge that's on the internet down. But at the same time, it is only activating around 37 billion of the parameters, so only 37 billion of these parameters actually need to be computed every single time you're training on data or inferencing data out of it, versus, again, the Llama models, where 70 billion or 405 billion parameters must be activated. You've dramatically reduced your compute cost when you're doing training and inference with this mixture-of-experts architecture. Should we break down where it actually applies and go into the transformer? Is that useful? Let's go into the transformer. The transformer is a thing that is talked about a lot, and we will not cover every detail. Essentially, the transformer is built on repeated blocks of this attention mechanism and then a traditional dense, fully connected multi-layer perceptron, whatever word you want to use for your normal neural network, and you alternate these blocks (there are other details). Where mixture of experts is applied is the dense part: the dense model holds most of the weights, if you count them, in a transformer model, so you can get really big gains from mixture of experts on parameter efficiency at training and inference, because you get this efficiency by not activating all of these parameters.
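To make the active-versus-total parameter idea concrete, here is a minimal toy sketch of top-k expert routing. All sizes and names here are illustrative placeholders, not DeepSeek's actual architecture; it only shows why compute scales with the activated experts rather than the total:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, router_w, top_k):
    """Toy mixture-of-experts layer for a single token vector x.

    Only the top_k experts chosen by the router are computed, so
    compute scales with top_k, not with len(experts).
    """
    logits = router_w @ x                      # one routing score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over the selected experts
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

d, n_experts, top_k = 8, 16, 2
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # each expert: one weight matrix
router_w = rng.normal(size=(n_experts, d))
y = moe_layer(rng.normal(size=d), experts, router_w, top_k)

total_params = n_experts * d * d
active_params = top_k * d * d   # parameters actually multiplied for this token
```

With 16 experts and top-2 routing, only one eighth of the expert parameters are touched per token; that is the same ratio logic behind "600B total, roughly 37B active."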
We should also say that a transformer is a giant neural network. Yeah, and for 15 years now there's been what's called the deep learning revolution: networks have gotten larger and larger, and at a certain point the scaling laws appeared (this is a scaling-laws shirt, by the way, representing scaling laws), where it became more and more formalized that bigger is better across multiple dimensions of what "bigger" means. These are all neural networks we're talking about, and we're talking about different architectures for how to construct these neural networks such that the training and the inference on them is super efficient. Yeah, every different type of model has a different scaling law for it, which is effectively: for how much compute you put in, the architecture will get to different levels of performance at test tasks. Mixture of experts is one of the ones where, at training time, even if you don't consider the inference benefits (which are also big), your efficiency with your GPUs is dramatically improved by using this architecture, if it is well implemented. You can get effectively the same performance model, in evaluation scores, with numbers like 30% less compute. There's going to be a wide variation depending on your implementation details and such, but it is just important to realize that this type of technical innovation is something that gives huge gains, and I expect most companies that are serving their models to move to this mixture-of-experts implementation. Historically, the reason why not everyone might do it is that it's an implementation complexity, especially when doing these big models. This is one of the things DeepSeek gets credit for: they do mixture of experts extremely well. This architecture, what is called DeepSeekMoE (MoE is the shortened version of mixture of experts), is multiple papers old; this part of their training infrastructure is not new to these models alone. The same goes for what Dylan mentioned with multi-head latent attention. This is all about reducing memory usage during inference, and the same during training, by using some fancy low-rank approximation math. If you get into the details with this latent attention, it's one of those things where I look at it and go, okay, they're doing really complex implementations, because there are other parts of language models, such as embeddings, that are used to extend the context length. The common one that DeepSeek used is rotary positional embeddings, which is called RoPE. If you want to use RoPE with a normal MoE, it's kind of a sequential thing: you take two of the attention matrices and you rotate them by a complex-valued rotation, which is a matrix multiplication. With DeepSeek's MLA, with this new attention architecture, they need to do some clever things, because they're not set up the same, and it just makes the implementation complexity much higher. So they're managing all of these things, and these are probably the sorts of things that OpenAI and these closed labs are doing. We don't know if they're doing the exact same techniques, but DeepSeek actually shared them with the world, which is really nice. This is the cutting edge of efficient language model training. And some of this requires low-level engineering; it's just a giant mess and trickery. As I understand it, they went below CUDA, so they go super low-level in programming the GPUs. Effectively, NVIDIA builds this library called NCCL, in which, when you're training a model, you have all these communications between every single layer of the model, and you may have over a hundred layers. What does NCCL stand for? It's the NVIDIA Collective Communications Library. Nice. So when you're training a model, you're going to have all these all-reduces and all-gathers between each layer, between the multi-layer perceptron, or feed-forward network, and the attention mechanism.
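Since RoPE came up a moment ago: the "complex value rotation" is easy to sketch. This is a toy, half-split variant (one of several layouts in use), not DeepSeek's implementation, showing the two properties attention cares about: the rotation preserves the vector's norm, and dot products between rotated vectors depend only on the relative distance between positions:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Toy rotary positional embedding: treat dimension pairs as complex
    numbers and rotate each pair by a position-dependent angle."""
    half = x.shape[0] // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequency
    c = (x[:half] + 1j * x[half:]) * np.exp(1j * pos * freqs)
    return np.concatenate([c.real, c.imag])

q = np.arange(8.0)
k = np.ones(8)
# Dot products depend only on the position offset (5-3 == 12-10):
same_offset = np.allclose(rope(q, 3) @ rope(k, 5), rope(q, 10) @ rope(k, 12))
```

Because each pair is multiplied by a unit-magnitude complex number, position is encoded without changing the vector's magnitude, which is why it composes cleanly with standard attention.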
You'll have the model basically synchronized, or you'll have all-reduce and all-gather, and this is a communication between all the GPUs in the network, whether it's in training or inference. NVIDIA has a standard library; this is one of the reasons why it's really difficult to use anyone else's hardware for training, because no one has really built a standard communications library. And NVIDIA has done this at a sort of higher level. DeepSeek, because they have certain limitations around the GPUs they have access to (the interconnects are limited to some extent by the restrictions on the GPUs that were shipped into China legally, not the ones that are smuggled, but legally shipped in, that they used to train this model), had to figure out how to get efficiencies. One of those things is that instead of just calling the NVIDIA NCCL library, they scheduled their own communications, which some of the labs do. Meta talked about in Llama 3 how they made their own custom version of NCCL. They didn't talk about the implementation details; this is some of what they did, probably not as well as DeepSeek, because with DeepSeek, necessity is the mother of innovation and they had to do this, whereas in the case of OpenAI, Anthropic, etc., they have people that do this sort of stuff anyway. But DeepSeek certainly did it publicly, and they may have done it even better, because they were gimped on a certain aspect of the chips that they have access to. So they scheduled communications by scheduling specific SMs. SMs you can think of as the cores on a GPU; there are a bit over a hundred SMs on a GPU. And they were specifically scheduling: hey, which ones are running the model, which ones are doing all-reduce, which ones are doing all-gather.
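The all-reduce being discussed is conceptually simple: every GPU ends up holding the sum of everyone's gradients. Here is a toy, pure-Python simulation of the classic ring algorithm (no real GPUs or NCCL involved; worker counts and array sizes are made up):

```python
import numpy as np

def ring_all_reduce(grads):
    """Simulated ring all-reduce: n workers each hold a gradient array,
    and every worker ends with the elementwise sum of all of them.

    Phase 1 (reduce-scatter): partial sums travel around the ring until
    each worker owns one fully summed segment. Phase 2 (all-gather): the
    finished segments circulate until everyone has all of them. Libraries
    like NCCL run this pattern over real GPU interconnects.
    """
    n = len(grads)
    segs = [np.array_split(g.astype(float), n) for g in grads]
    for t in range(n - 1):  # reduce-scatter
        sent = [segs[i][(i - t) % n].copy() for i in range(n)]
        for i in range(n):
            segs[i][(i - 1 - t) % n] = segs[i][(i - 1 - t) % n] + sent[(i - 1) % n]
    for t in range(n - 1):  # all-gather
        sent = [segs[i][(i + 1 - t) % n].copy() for i in range(n)]
        for i in range(n):
            segs[i][(i - t) % n] = sent[(i - 1) % n]
    return [np.concatenate(s) for s in segs]

workers = [np.arange(8.0) * (w + 1) for w in range(4)]  # 4 workers' gradients
result = ring_all_reduce(workers)                        # each: arange(8) * 10
```

The point of the ring shape is that each worker only ever talks to its neighbors, so bandwidth per link stays constant as the worker count grows; that is the kind of collective DeepSeek reimplemented and scheduled onto specific SMs by hand.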
Right, and they would flip back and forth between them. This requires extremely low-level programming. This is what NCCL does automatically; other NVIDIA libraries usually handle this automatically. Yeah, exactly. So technically they're using PTX, which you could think of as sort of an assembly-type language or instruction set, like coding directly to assembly. It's not exactly that, and it's still technically part of CUDA, but it's like: do I want to write in Python, the PyTorch equivalent, and call NVIDIA libraries? Do I want to go down to the CUDA level and code even lower level? Or do I want to go all the way down to the assembly or ISA level? There are cases where you go all the way down there at the very big labs, but most companies just do not do that, because it's a waste of time and the efficiency gains you get are not worth it. But DeepSeek's implementation is so complex, especially with their mixture of experts. People have done mixture of experts, but they're generally eight or sixteen experts, and they activate two. One of the words we like to use is "sparsity factor," or usage: you might have one fourth of your model activate, and that's Mistral's Mixtral model, the model that really catapulted them to "oh my god, they're really, really good." OpenAI has also had models that are MoE, and so have all the other major closed labs. But what DeepSeek did, that maybe only the leading labs have only just started doing recently, is have such a high sparsity factor. It's not one fourth of the model, i.e. two out of eight experts, activating every time you go through the model; it's eight out of 256. And there are different implementations of mixture of experts where you can have some of these experts that are always activated, which just looks like a small neural network that all the tokens go through, and then they also go through some that are selected by this routing mechanism. One of the innovations in DeepSeek's architecture is that they changed the routing mechanism in mixture-of-experts models. There's something called an auxiliary loss, which effectively means that during training you want to make sure that all of these experts are used across the tasks that the model sees. Where there can be failures in mixture of experts is that when you're doing this training, the one objective is token prediction accuracy, and if you just let training go with a mixture-of-experts model on its own, it can be that the model learns to only use a subset of the experts. In the MoE literature there's this auxiliary loss, which helps balance them. But if you think about the loss functions of deep learning, this even connects to the bitter lesson: you want to have the minimum inductive bias in your model, to let the model learn maximally, and this auxiliary loss, this balancing across experts, could be seen as in tension with the prediction accuracy of the tokens. We don't know the exact extent of it, but the DeepSeek change is that instead of doing an auxiliary loss, they have an extra parameter in their routing, which, after the batches, they update to make sure that the next batches all have a similar use of experts. This type of change can be big, it can be small, but they add up over time, and this is the sort of thing that just points to them innovating. I'm sure all the labs training big models are looking at this sort of thing, which is getting away from the auxiliary loss; some of them might already use it, but you just keep accumulating gains. We'll talk about the philosophy of training and how you organize these organizations, and a lot of it is just compounding small improvements over time in your data, in your architecture, in your post-training, and how they integrate with each other.
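That routing-bias idea can be sketched in a few lines. This toy (made-up sizes, a plain sign update) only illustrates the mechanism described above, a per-expert bias that is nudged between batches so underused experts get picked more, with no balancing term added to the loss itself; it is not DeepSeek's actual update rule:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d, lr = 8, 2, 16, 0.1
router_w = rng.normal(size=(n_experts, d))
bias = np.zeros(n_experts)  # used for routing only, never trained by the loss

def route(tokens, bias):
    """Pick top_k experts per token; the bias only shifts the selection."""
    scores = tokens @ router_w.T + bias
    return np.argsort(scores, axis=1)[:, -top_k:]

def usage(tokens, bias):
    return np.bincount(route(tokens, bias).ravel(), minlength=n_experts)

def batch():
    return rng.normal(size=(256, d)) + 1.5  # skewed inputs -> unbalanced routing

counts_before = usage(batch(), bias)
for _ in range(300):  # after each batch, nudge biases toward balance
    counts = usage(batch(), bias)
    bias += lr * np.sign(counts.mean() - counts)  # raise underused, lower overused
counts_after = usage(batch(), bias)
```

With the skewed inputs, the untouched router piles most tokens onto a few experts; after a few hundred between-batch bias updates, usage spreads out, without ever touching the token-prediction loss. That is the appeal versus an auxiliary loss.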
DeepSeek does the same thing, and some of them are shared. We have to take them at face value that they share their most important details; I mean, the architecture and the weights are out there, so we're seeing what they're doing, and it adds up. Going back to the efficiency and complexity point: it's 32 versus 4 for Mixtral and other MoE models that have been publicly released. This ratio is extremely high, and what Nathan was getting at there is, when you have such a different level of sparsity, you can't just have every GPU hold the entire model. The model's too big, there's too much complexity there, so you have to split up the model with different types of parallelism. You might have different experts on different GPU nodes, but now what happens when this set of data that you get, hey, all of it looks one way, and all of it should route to one part of the model? When all of it routes to one part of the model, you can have this overloading of a certain set of the GPU resources, or a certain set of the GPUs, and then the rest of the training network sits idle, because all of the tokens are just routing to that. This is one of the biggest complexities with running a very sparse mixture-of-experts model, i.e. this 32 ratio versus this 4 ratio: you end up with so many of the experts just sitting there idle. So how do I load-balance between them? How do I schedule the communications between them? This is a lot of the extremely low-level, detailed work that they figured out, in public, first, and potentially second or third in the world, and maybe even first in some cases. What lesson, in the direction of the bitter lesson, do you take from all of this? Is this going to be the direction where a lot of the gain is going to be, this kind of low-level optimization? Or is this a short-term thing, where the biggest gains will be more on the algorithmic, high-level side, like post-training? Is this a short-term leap because they figured out a hack, because constraints, necessity, is the mother of invention? Or is there still a lot of gains? I think we should summarize what the bitter lesson actually is about. The bitter lesson, if you paraphrase it, is that the types of training that will win out in deep learning as we go are those methods which are scalable in learning and search, as it calls out. The "scale" word gets a lot of attention in this. The interpretation that I use is that it's effective to avoid adding human priors to your learning process. If you read the original essay, this is what it talks about: researchers will try to come up with clever solutions to their specific problem that might get them small gains in the short term, while simply enabling these deep learning systems to work efficiently, and for these bigger problems in the long term, might be more likely to scale and continue to drive success. We were talking about relatively small implementation changes to the mixture-of-experts model, so, okay, we will need a few more years to know if one of these was actually really crucial to the bitter lesson. But the bitter lesson is really this long-term arc of how simplicity can often win. There's a lot of sayings in the industry, like "the models just want to learn": you have to give them the simple loss landscape where you put compute through the model and they will learn, and get barriers out of the way. That's where the power of something like NCCL comes in, where standardized code can be used by a lot of people to create simple innovations that can scale, which is why the hacks... I imagine the code base for DeepSeek is probably a giant mess. I'm sure DeepSeek definitely has code bases that
are extremely messy, where they're testing these new ideas. Multi-head latent attention probably could start in something like a Jupyter notebook, or somebody tries something on a few GPUs, and that is really messy. But the stuff that trains DeepSeek V3 and DeepSeek R1, those libraries, if you were to present them to us, I would guess are extremely high-quality, readable code. I think there is one aspect to note, though, which is the general ability for that to transfer across different types of runs. You may make really high-quality code for one specific model architecture at one size, and then that is not transferable to, "hey, when I make this architecture tweak, everything's broken again." That could be the case with their specific low-level coding of scheduling SMs: it's specific to this model architecture and size. Whereas NVIDIA's collectives library is more like, hey, it'll work for anything. You want to do an all-reduce? Great, I don't care what your model architecture is, it'll work, and you're giving up a lot of performance when you do that, in many cases. But it's worth it for them to do the specific optimization for the specific run, given the constraints they have regarding compute. I wonder how stressful it is with these frontier models to initiate training, to have the code, to push the button where you're now spending a large amount of money and time to train this. There must be a lot of innovation at the debugging stage, of making sure there are no issues, that you're monitoring and visualizing every aspect of the training, all that kind of stuff. When people are training, they have all these various dashboards, but the most simple one is your loss, and it continues to go down. But in reality, especially with more complicated stuff like MoE, or FP8 training, which is another innovation (going to a lower-precision number format, i.e. less accurate), the biggest problem is that you end up with loss spikes, and a lot of the time no one knows why the loss spike happened. Some of them you do know; some of them are data. I'll give Ai2's example of what blew up our earlier models: a subreddit called microwavegang. We love to shout this out; it's a real thing, you can pull up microwavegang. Essentially, it's a subreddit where everybody makes posts that are just the letter M, so there are extremely long sequences of the letter M, and then the comments are like "beep beep," because that's when the microwave ends. But if you pass this into a model that's trained to produce normal text, it's extremely high loss, because normally, when you see an M, you don't predict M's for a long time. So this is something that caused loss spikes for us. But this is old, this is not recent, and when you have more mature data systems, that's not the thing that causes the loss spike. What Dylan is saying is true, but there are levels to this sort of idea. With regards to the stress: you'll go out to dinner with a friend that works at one of these labs, and they'll just be looking at their phone every ten minutes, and it's not like they're texting; they're checking: is the loss, the tokens per second, is the loss not blown up? They're just watching this, and the heart rate goes up if there's a spike. And some level of spikes is normal; it'll recover and come back. Sometimes a lot of the old strategy was: you just stop the run, restart from an older checkpoint, change the data mix, and then it keeps going. There are even different types of spikes. Dirk Groeneveld has a theory at Ai2 about fast spikes and slow spikes. There are times where you're looking at the loss and other parameters, and you can see it start to creep up and then blow up, and that's really hard to recover from, so you have to go back much further. So you have the stressful period where it's flat or might start going up, and you're like, what do I do? Whereas there are also loss spikes where it looks good and then there's one spiky data point, and what you can do is just skip those: you see that there's a spike, you go, okay, I can ignore this data, don't update the model, and do the next one, and it'll recover quickly. But these are trickier implementations, so as you get more complex in your architecture and you scale up to more GPUs, you have more potential for your loss blowing up. There's a distribution. The whole idea of grokking also comes in: just because it slowed down improving in loss doesn't mean it's not learning, because all of a sudden it could spike down in loss again, because it truly learned something, and it took some time for it to learn that; it's not a gradual process. That's what humans are like, that's what models are like. So it's a really stressful task, as you mentioned, and the whole time the dollar count is going up. Every company has failed runs; you need failed runs to push the envelope on your infrastructure. So a lot of news cycles are made of "X company had Y failed run." Every company that's trying to push the frontier of AI has these, so yes, it's noteworthy, because it's a lot of money and it can be a week-to-a-month setback, but it is part of the process. But if you're DeepSeek, how do you get to a place where, holy shit, there's a successful combination of hyperparameters? A lot of small failed runs, and rapid iteration through failed runs until successful ones. You build up an intuition: this mixture of experts works, and then this implementation of MLA works.
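The "skip the spiky batch" recovery Nathan describes can be sketched as a toy training loop. Everything here (thresholds, the EMA tracker, the spike rate) is a made-up illustration of the mechanism, not any lab's actual monitoring code:

```python
import random

random.seed(0)

def train_with_spike_skipping(steps=500, threshold=3.0):
    """Toy loop illustrating fast-spike skipping: track a running loss
    average; if one batch's loss jumps far above it (a fast spike,
    e.g. a bad data batch), discard that update instead of letting it
    wreck the model."""
    ema, updates, skipped = 4.0, 0, 0
    for step in range(steps):
        loss = 4.0 * (0.999 ** step) + random.gauss(0.0, 0.05)  # slowly improving
        if random.random() < 0.02:  # occasional bad batch (think: microwavegang)
            loss += 10.0
        if loss > threshold * ema:  # fast spike -> skip, don't update the model
            skipped += 1
            continue
        ema = 0.9 * ema + 0.1 * loss  # a slow spike would show up as ema creeping up
        updates += 1
    return skipped, updates

skipped, updates = train_with_spike_skipping()
```

The fast/slow distinction falls out naturally: a single outlier batch trips the threshold and gets dropped, while a slow spike drags the running average up with it, which is exactly why it forces a rollback to a much earlier checkpoint instead.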
Key hyperparameters, like learning rate and regularization and things like this, and you find the regime that works for your code base. From talking to people at frontier labs, there's a story you can tell where training language models is kind of a path that you need to follow: you need to unlock the ability to train a certain type of model or a certain scale, and then your code base, and your internal know-how of what type of parameters work for it, is kind of known. You look at the DeepSeek papers and models: they've scaled up, they've added complexity, and it's just continuing to build the capabilities that they have. There's the concept of a YOLO run. YOLO: you only live once. What it is, is there's all this experimentation you do at the small scale, research ablations: you have your Jupyter notebook, where you're experimenting with MLA on three GPUs or whatever, and you're doing all these different things, like, hey, do I do four active experts, 128 experts? Do I arrange the experts this way? All these different model-architecture things you're testing at a very small scale, a couple of researchers, a few GPUs, tens of GPUs, hundreds of GPUs, whatever it is. And then all of a sudden you're like, okay guys, no more screwing around. Everyone take all the resources we have, let's pick what we think will work, and just go for it: YOLO. And this is where that stress comes in, because, well, I know it works here, but some things that work here don't work there, and some things that work there don't work down here, in terms of scale. So it's truly a YOLO run. And there is this discussion of how certain researchers just have this methodical nature: they can find the whole search space and figure out all the ablations of different research and really see what is best. And there are certain researchers who just kind of have that innate gut instinct of, "this is the YOLO run; looking at the data, this is it." This is why you want to work in post-training, because the GPU cost for training is lower, so you can make a higher percentage of your training runs YOLO runs. Yeah, for now. For now. So some of this is fundamentally luck? Luck is skill, in many cases. Yeah, I mean, it looks lucky, but the hill to climb: if you're at one of these labs and you have an evaluation you're not crushing, there's a repeated playbook for how you improve things. There are localized improvements, which might be data improvements, and these add up into the whole model just being much better. When you zoom in really close, it can be really obvious that this model is just really bad at this thing, and we can fix it, and you just add these up. So some of it feels like luck, but on the ground, especially with these new reasoning models we're talking about, there are just so many ways that we can poke around, and normally some of them give big improvements. The search space is near infinite, and yet the amount of compute and time you have is very low, and you have to hit release schedules, you have to not get blown past by everyone else. Otherwise, you know what happened with DeepSeek crushing Meta and Mistral and Cohere and all these guys. They moved too slow. Maybe they were too methodical, I don't know; they didn't hit the YOLO run; maybe they weren't as skilled. Whatever, you can call it luck if you want, but at the end of the day it's skill. So 2025 is the year of the YOLO run? It seems like all the labs are going in. I think it's even more impressive what OpenAI did in 2022. At the time, no one believed in mixture-of-experts models, even at Google, which had all the researchers. OpenAI had such little compute, and they devoted all of their compute for many months, all of it, 100%, for many months, to GPT-4, with a brand-new architecture, with no belief beyond "hey, let me spend a couple hundred million dollars, which is all of the money I have, on this model." That is truly YOLO. Right now, people point at all these training-run failures that are in the media, and it's like, okay, great, but actually a huge chunk of my GPUs are doing inference, I still have a bunch doing research constantly, and yes, my biggest cluster is training, on this YOLO run, but that YOLO run is much less risky than what OpenAI did in 2022, or maybe what DeepSeek did now, where it's like, hey, we're just going to throw everything at it. The big winners throughout human history are the ones who were willing to YOLO at some point. Okay, what do we understand about the hardware it's been trained on? DeepSeek is very interesting. This is a second to zoom out into who they are, first of all. High-Flyer is a hedge fund that has historically done quantitative trading in China, as well as elsewhere, and they have always had a significant number of GPUs. In the past, a lot of these high-frequency-trading, algorithmic quant traders used FPGAs, but it shifted to GPUs, definitely, and there's both, but GPUs especially. And High-Flyer, which is the hedge fund that owns DeepSeek (everyone who works for DeepSeek is part of High-Flyer to some extent: same parent company, same owner, same CEO), had all these resources and infrastructure for trading, and then they devoted a humongous portion of them to training models, both language models and otherwise, because these techniques were heavily AI-influenced. More recently, people have realized, hey, trading with... even when you go back to Renaissance and all these quantitative firms,
natural language processing is the key to trading really fast: understanding a press release and making the right trade. And so DeepSeek has always been really good at this, and even as far back as 2021, they have press releases and papers saying, hey, we're the first company in China with an A100 cluster this large: 10,000 A100 GPUs. This is in 2021. Now, this wasn't all for training large language models; this was mostly for training models for their quantitative trading, and a lot of that was natural language processing, to be clear. So this is the history. The verifiable fact is that in 2021 they built the largest cluster, at least they claim it was the largest cluster in China, 10,000 GPUs, before export controls started. Yeah, they've had a huge cluster before any conversation of export controls. So then you step forward to: what have they done over the last four years since then? Obviously, they've continued to operate the hedge fund, probably make tons of money. And the other thing is that they've leaned more and more into AI. The CEO, Liang Wenfeng (you're not putting me on the spot on this, we discussed this, Liang Wenfeng), he owns maybe a little bit more than half the company, allegedly, and is an extremely Elon- or Jensen-like figure, where he's just involved in everything. And so over that time period, he's gotten really in-depth into AI. He actually has a bit of an e/acc vibe, if you see some of his statements, total AGI vibes: we need to do this, we need to make a new ecosystem of open AI, we need China to lead on this sort of ecosystem, because historically the Western countries have led on software ecosystems. And he straight-up acknowledges: in order to do this, we need to do something different. DeepSeek is his way of doing this. Some of the translated interviews with him are fascinating. So he has done interviews? Yeah. Do you think he would do a Western interview, or are there controls on that? There hasn't been one yet, but I would try it. Well, I just got a Chinese translator, so it's great. This is all public. So: fascinating figure, an engineer pushing full-on into AI, leveraging the success from the high-frequency trading. Very direct quotes, like "we will not switch to closed source" when asked about this stuff. Very long-term motivated in how the ecosystem of AI should work, and I think, from a Chinese perspective, he wants a Chinese company to build this vision. So this is the, quote-unquote, visionary behind the company. This hedge fund still exists, this quantitative firm. Slowly he got turned to this full view of AI everything, and at some point it maneuvered, and he made DeepSeek. DeepSeek has done multiple models since then. They've acquired more and more GPUs; they share infrastructure with the fund. So there is no exact public number of GPU resources that they have, but besides the 10,000 GPUs they bought in 2021 (and they were fantastically profitable), this paper claims they used only 2,000 H800 GPUs, which are a restricted GPU that was previously allowed in China but is no longer allowed (there's a new version), but it's basically NVIDIA's H100 for China, with some restrictions on it, specifically around the communication speed, the interconnect speed, which is why they had to do this crazy SM scheduling stuff. So, going back to that: this is obviously not their total GPU count, the available GPUs, but for this training run, do you think 2,000 is the correct number, or no? So this is where it takes a significant amount of, sort of,
zoning in right like what do you call your training run right do you count all of the research and ablations that you ran right picking all the stuff because yes you can do a YOLO run but at some level you have to do the test at the small scale and then you have to do some test at medium scale before you go to a large scale accepted practice is that for any given model that is a notable advancement you're going to do 2 to 4X compute of the full training run in experiment alone so a lot of this Compu that's being scaled up is probably used in large part at this time for research yeah and research will you know research begets the new ideas that let you get huge efficiency research gets you 01 like research gets you breakthroughs then you need to bet on it so some of the pricing strategy they will discuss has the research baked into the price so the numbers that deep seek specifically said publicly right are just the 10,000 gpus in 2021 and then 2,000 gpus for only the pre-training for V3 they did not discuss cost on R1 they did not discuss cost on all the other RL right for the instruct model that they made right they only discussed the pre-training for the base model and they did not discuss anything on research and ablations and they do not talk about any of the resources that are shared in terms of hey the fund is using all these gpus right and and we know that they're very profitable and that 10,000 gpus in in in 2021 so so the uh some some of the research that we've found is that we actually believe they have closer to 50,000 gpus we is sem so we should say that you're uh sort of one of the world experts in figuring out what everybody's doing in terms of the Semiconductor in terms of cluster build outs in terms of like who's doing what in terms of training runs so yeah so that's the Wii okay go ahead yeah sorry sorry um we believe they actually have something closer to 50,000 gpus right now this is this is split across many tasks right again the fund um 
For ballpark — how much would OpenAI or Anthropic have?

I think the clearest example we have, because Meta is also open: they talk about on the order of 60K to 100K H100-equivalent GPUs in their training clusters. Llama 3, they said they trained on 16,000 H100s. But the company, Meta, last year publicly disclosed they bought something like 400-odd-thousand GPUs. So of course only a tiny percentage went to the training; most of it is serving me the best Instagram Reels, or whatever. I mean, we could get into the cost of ownership for a 2,000-GPU cluster, or 10,000 — there are just different sizes of companies that can afford these things, and DeepSeek is reasonably big. Their compute allocation is one of the top few in the world — it's not OpenAI, Anthropic, etc., but they have a lot of compute.

Can you, in general, zoom out and also talk about the Hopper architecture, the NVIDIA Hopper GPU architecture, and the difference between the H100 and H800 — like you mentioned, the interconnects?

Yeah. So Ampere was the A100, and then H100 is Hopper. People use them synonymously in the US, because really there's just the H100 — and now there's the H200, but it's the same thing, mostly. In China there have been different salvos of export restrictions. Initially the US government limited on a two-factor scale: chip interconnect versus FLOPs. Any chip with interconnect bandwidth above a certain level and floating-point operations above a certain level was restricted. Later, the government realized this was a flaw in the restriction and cut it down to just floating-point operations.

So the H800 had high FLOPs, low communication?

Exactly. The H800 had the same performance as the H100 on FLOPs; it just had the interconnect bandwidth cut. DeepSeek knew how to utilize this: even though we were cut back on the interconnect, we can do all this fancy stuff to figure out how to use the GPU fully anyway. That was back in October 2022. But later — at the end of 2023, implemented in 2024 — the US government banned the H800. And by the way, this H800 cluster, these 2,000 GPUs, was not even purchased in 2024; it was purchased in late 2023, and they're just getting the model out now, because it takes a lot of research, etc. The H800 was banned, and now there's a new chip called the H20. The H20 is cut back only on FLOPs; the interconnect bandwidth is the same, and in some ways it's better than the H100, because it has better memory bandwidth and memory capacity. So NVIDIA is working within the constraints of what the government sets and builds the best possible GPU for China.

Can we take an actual tangent here — and we'll return to the hardware — on the philosophy, the motivation, the case for export controls? What is it? Dario Amodei just published a blog post about export controls. The case he makes is that if AI becomes super powerful — and he says by 2026 we'll have AGI, or superpowerful AI — whoever builds it will have a significant military advantage. And because the United States is a democracy and, as he says, China is authoritarian or has authoritarian elements, you want a unipolar world, where the superpowerful military — because of the AI — belongs to a democracy. It's a much more complicated world, geopolitically, when you have two superpowers with superpowerful AI and one is authoritarian. So that's the case he makes: the United States wants to use export controls to slow things down, to make sure China can't do the gigantic training runs that would presumably be required to build AGI.

This is very abstract. I think this — the superpowerful AI — can be the goal of how some people describe export controls.
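The two export-control regimes just described can be sketched as a toy classifier. The threshold values and relative chip specs below are illustrative stand-ins, not the actual Commerce Department numbers; the point is only the shape of the rule change — restricted when FLOPs and interconnect are both high, versus restricted on FLOPs alone:

```python
# Toy model of the two export-control regimes discussed above.
# Specs are relative to the H100 (= 1.0) and thresholds are made up;
# only the logic of the rule change is meant to be accurate.

from dataclasses import dataclass

@dataclass
class Chip:
    name: str
    flops: float         # relative compute throughput (H100 = 1.0)
    interconnect: float  # relative interconnect bandwidth (H100 = 1.0)

def restricted_oct_2022(chip: Chip) -> bool:
    # Initial two-factor rule: a chip is restricted only if BOTH
    # its FLOPs and its interconnect exceed the thresholds.
    return chip.flops > 0.9 and chip.interconnect > 0.6

def restricted_late_2023(chip: Chip) -> bool:
    # Revised rule: FLOPs alone decide.
    return chip.flops > 0.9

h100 = Chip("H100", 1.0, 1.0)
h800 = Chip("H800", 1.0, 0.5)   # same FLOPs, interconnect cut
h20  = Chip("H20",  0.15, 1.0)  # FLOPs cut, interconnect intact

for c in (h100, h800, h20):
    print(c.name, restricted_oct_2022(c), restricted_late_2023(c))
```

Under the toy rule, the H800 passes the 2022 regime (interconnect too low to trip the two-factor test) but fails the 2023 regime, while the H20 passes both — matching the sequence of events described in the conversation.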
You touched on the training run idea, and there are not many worlds where China cannot train AI models. I think export controls are capping the amount of compute, or the density of compute, that China can have. If you think about the AI ecosystem right now: all of these AI companies' revenue numbers are up and to the right, AI usage is just continuing to grow, and more GPUs are going to inference. A large part of export controls, if they work, is just that the amount of AI that can be run in China is going to be much lower. On the training side, DeepSeek V3 is a great example: a very focused team can still get to the frontier of AI on 2,000 GPUs, which are not that hard to get, all things considered. They're still going to have those GPUs; they're still going to be able to train models. But if there's going to be a huge market for AI, strong export controls mean China can't have 100,000 GPUs just serving the equivalent of ChatGPT clusters. With good export controls, AI simply gets used much less there — and I think that's a much easier goal to achieve than trying to debate what AGI is. And if you have these extremely intelligent, autonomous AIs in data centers, those are the things that could be running in GPU clusters in the United States but not in China.

To some extent, training a model does effectively nothing on its own. The thing Dario is speaking to is the deployment of that model, once trained, to create huge economic growth, huge increases in military capabilities, huge increases in people's productivity, betterment of lives — whatever you want to direct superpowerful AI towards — and that requires a significant amount of compute. And training will always be only a portion of the total compute: we mentioned Meta's 400,000 GPUs, of which only 16,000 made Llama. The rest — the percentage Meta is dedicating to inference — might be for recommendation systems trying to hack our minds into spending more time and watching more ads, or it might be for a superpowerful AI doing productive things. The exact use our economic system decides on doesn't matter; the point is it can be delivered in whatever way we want. Whereas with China — export restrictions, great, but you're never going to be able to cut everything off, and I think that's quite well understood by the US government. They'll make their own chips — they're trying to — and they'll be worse than ours, but the whole point is just to keep a gap. In a world of 2-3% economic growth, this is really dumb, by the way — cutting off high-tech trade you could be making money from. But in a world where superpowerful AI comes about and starts creating significant changes in society — which is what all the AI leaders and big tech companies believe, and I think superpowerful AI is going to change society massively — this compounding effect of the difference in compute is really important. There's some sci-fi out there where AI is measured by how much power is delivered to its compute. That's one way of thinking about the economic output: how much power you direct towards that AI.

Should we talk about reasoning models, as a way this becomes actionable, something people can actually see? The reasoning models that are coming out — R1 and o1 — are designed to use more compute. There are a lot of buzzwords in the AI community about this: test-time compute, inference-time compute, whatever. But Dylan
has good research on this; you can get to specific numbers on the ratio of compute used at training versus compute used at inference, and these reasoning models are making inference way more important for doing complex tasks. In the fall — in December — OpenAI announced this o3 model. Another thing in AI when things move fast: we get both announcements and releases. Announcements are essentially blog posts where you pat yourself on the back and say you did things; releases are when the model is out there, the paper is out there, etc. So OpenAI has announced o3 — we can check whether o3-mini is out as of recording — but that doesn't really change the point, which is that the breakthrough result was on something called the ARC-AGI task, the Abstraction and Reasoning Corpus, a task for artificial general intelligence. François Chollet is the guy behind it; it's a years-old paper and a brilliant benchmark. And the number for OpenAI o3 to solve this: the API has a thinking-effort setting and a number of samples, and they used a thousand samples to solve the task, which comes out to something like five to twenty dollars per question. You're putting in, effectively, a math puzzle, and it takes on the order of dollars to answer one question. That is a lot of compute. If this is going to take off in the US, OpenAI needs a ton of GPUs on inference to capture it. They have this OpenAI ChatGPT Pro subscription, which is $200 a month, which Sam said they're losing money on — which means people are burning a lot of GPUs on inference. I've signed up for it; I've played with it; I don't think I'm a power user, but I use it. And that is the thing that a Chinese company, under moderately strong export controls — there will always be loopholes — might not be able to do at all. The main result for o3 is also a spectacular coding performance, and that feeds back into AI companies being able to experiment better.

So presumably the idea is that for an AGI, a much larger fraction of the compute would be used for this test-time compute, for the reasoning. The AGI goes into a room, thinks about how to take over the world, and comes back in 2.7 hours — and that's going to take a lot of compute.

That's what people like the CEOs and leaders of OpenAI and Anthropic talk about: autonomous AI models, where you give them a task and they work on it in the background. I think my personal definition of AGI is much simpler. I think language models are a form of AGI, and all this superpowerful stuff is a next step — that's great if we get these tools — but a language model has so much value in so many domains that it is a general intelligence to me. This next step, though — agentic things that are independent and can do tasks that aren't in the training data — is the future outlook these AI companies are driving toward.

I think the terminology Dario uses here is "superpowerful AI." I agree with you on AGI — I think we already have something exceptionally impressive, something Alan Turing would for sure say is AGI — but he's referring more to something that, once you possess it, gives you a significant military and geopolitical advantage over other nations. So it's not just that you can ask it how to cook an omelet. And he has a much more positive view in his essay "Machines of Loving Grace." I've read into this; I don't have enough background in the physical sciences to gauge exactly how confident I am that AI can revolutionize biology, but I am safe saying that AI is going to accelerate the progress of any computational science.

So we're doing a depth-first search here on topics, taking a tangent of a tangent — let's continue on that depth-first search. You said that you're both feeling the AGI. So what's your timeline? Dario says 2026 for the superpowerful AI
that's basically agentic to a degree where it's a real security threat — that level of AGI. What's your timeline?

I don't like to attribute specific abilities, because predicting specific abilities and when they arrive is very hard. Mostly, if I say I'm feeling the AGI, it's that I expect continued rapid, surprising progress over the next few years. Something like R1 is less surprising to me from DeepSeek, because I expect there to be new paradigms where substantial progress can be made. DeepSeek R1 is so unsettling because we were on this path with ChatGPT — it's getting better, it's getting better — and then we found a new direction for changing the models, took one step like this, and took a step up. It looks like a really fast slope, and then we're going to just take more steps. It's really unsettling when you have these big steps, and I expect that to keep happening. I've tried OpenAI Operator; I've tried Claude computer use; they're not there yet. I understand the idea, but it's just so hard to predict what the breakthrough will be that makes something like that work, and I think it's more likely that we get breakthroughs in things we didn't expect. Everyone wants agents — Dario has a very eloquent way of describing this — and I just think there's going to be more than that. Just expect these things to come.

I'm going to have to try to pin you down to a date on the AGI timeline — like the nuclear-weapon moment, the moment where on the geopolitical stage there's a real threat, because we're talking about export controls. When do you think, just to throw out a date, that would be?

For me, it's probably after 2030. That's what I would say.

So define that, because to me it has almost already happened. You look at elections in India and Pakistan: people get AI voice calls and think they're talking to the politician. The AI diffusion rules, enacted in the last couple weeks of the Biden admin — and it looks like the Trump admin will keep them and potentially even strengthen them — limit cloud computing and GPU sales to countries that aren't even related to China. Portugal and all these normal countries are on the "you need approval from the US" list — countries that are allies. Singapore — they freaking have F-35s and we don't let them buy GPUs. To me, this is already at that scale.

Well, that just means the US military is really nervous about this new technology; it doesn't mean the technology is already there. They might just be very cautious about this thing they don't quite understand. But that's a really good point — the robocalls. Swarms of semi-intelligent bots could be a weapon; they could be doing a lot of social engineering.

I mean, there's tons of talk from the 2016 elections — Cambridge Analytica and all this stuff, Russian influence. Every country in the world is pushing stuff onto the internet and has narratives it wants. Every technically competent country — Russia, China, the US, Israel, etc. — is pushing viewpoints onto the internet en masse, and language models crash the cost of very intelligent-sounding language.

There's some research showing that distribution is actually the limiting factor, so language models haven't yet really changed the misinformation equation; the internet is still the internet. There's a blog, AI Snake Oil, and some of my friends at Princeton write on this stuff. So there is research on the default assumption everyone makes — I would have thought the same thing — that misinformation gets far worse with language models. I think
in terms of internet posts and things people have been measuring, it hasn't been an exponential increase or anything extremely measurable. With things like the voice calls you're describing, it could be happening in modalities that are harder to measure — so it's too soon to tell. Political instability via the web is monitored by a lot of researchers to see what's happening. On the AGI question: if you make me give a year — I have AI CEOs saying this; they've been saying two years for a while. People like Dario at Anthropic have thought about this so deeply that I need to take their words seriously, but I also understand they have different incentives, so I would add a few years to that, which is how you get something like 2030 or a little after.

I think, to some extent, we'll have capabilities that hit a point where any one person could say, okay, if I can leverage those capabilities for X amount of time, this is AGI — call it '27, '28. But then the cost of actually operating that capability —

Yeah, this is going to be my point.

— is so extreme that no one can deploy it at scale, en masse, to completely revolutionize the economy at a snap of the fingers. So I don't think it will be a snap-of-the-finger moment; it'll be a physical constraint, rather — the capabilities are here, but I can't deploy them everywhere. One simple example, going back to 2023: when Bing with GPT-4 came out and everyone was freaking out about search — Perplexity came out — if you did the cost of implementing GPT-3 into every Google search, it was physically impossible to implement. And stepping forward to the test-time-compute thing: a query — you ask ChatGPT a question — costs cents for their most capable chat model. To solve an ARC-AGI problem, though, costs five to twenty bucks, and it's only going up from there. That's a 1,000x to 10,000x difference in cost between responding to a query and doing a task. And the task of AGI — what we have today, quote-unquote "AGI," can do ARC-AGI; three years from now it can do much more complicated problems, but the cost is going to be measured in thousands and hundreds of thousands of dollars of GPU time, and there just won't be enough power, GPUs, and infrastructure to operate it and thereby shift everything in the world at the snap of a finger. But at that moment, who gets to control and point the AGI at a task? This was in Dario's post: China can, effectively and more quickly than us, point its AGI at military tasks. They have, in many ways, been faster at adopting certain new technologies into their military, especially with regard to drones. The US maybe has a long-standing lead in large fighter-jet-type aircraft and bombers, but when it comes to asymmetric arms such as drones, they've completely leapfrogged the US and the West. The fear Dario is pointing to, I think, is: great, we'll have AGI in the commercial sector, but the US military won't be able to implement it super fast, the Chinese military could, and they could direct all their resources to implementing it — solving military logistics, or disinformation targeted at a certain set of people so they can flip a country's politics, or something like that. That would actually be catastrophic — versus the US, where, because allocation will be more capitalistic, it goes toward whatever has the highest return on investment, which might be building factories better, or whatever.

Everything I've seen suggests people's intuition fails on robotics — there's this kind of general optimism. I've seen it with self-driving cars: people think it's a much easier problem than it is. Similar with drones. Here I understand it a little less, but I've seen the reality of the war in Ukraine and the usage of drones on both sides, and it seems that humans still far outperform any fully autonomous systems. AI is an assistant, but humans drive FPV drones, with the human controlling most of it, just…
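The query-versus-task cost gap discussed above can be put into rough numbers. A minimal sketch using the conversation's order-of-magnitude figures — cents per ordinary chat query, $5-$20 per ARC-AGI question at roughly 1,000 samples; these are illustrative values, not measured prices:

```python
# Order-of-magnitude cost gap between answering a chat query and solving
# one reasoning "task", using the rough figures from the discussion.

chat_query_cost = 0.01          # ~one cent per ordinary chat query
arc_low, arc_high = 5.0, 20.0   # dollars per ARC-AGI question
samples = 1000                  # samples drawn per question

# implied cost of a single sample within one ARC-AGI question
per_sample = (arc_low / samples, arc_high / samples)

# how many plain chat queries one reasoning task costs
ratio = (arc_low / chat_query_cost, arc_high / chat_query_cost)
print(f"one task costs roughly {ratio[0]:.0f}-{ratio[1]:.0f} chat queries")
```

With a one-cent baseline this lands at a few hundred to a couple thousand chat queries per task; the 1,000x-10,000x figure quoted in the conversation follows if the per-query cost is a fraction of a cent.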

Transcript truncated. Watch the full video for the complete content.
