I read every major CS paper of the last 100 years...

Fireship | 00:10:11 | Jun 17, 2026

Chapters: 9

Turing establishes that no general program can decide if another program halts, framing the limits of computation.

A brisk tour through 10 landmark CS papers that shaped AI, from Turing to OpenAI, with crisp, concrete takeaways you can apply today.

Summary

Fireship’s tour de force traces a century of CS debate and discovery through its most influential papers. It starts with Turing’s foundational questioning and the halting problem, then moves to Shannon’s formalization of information and entropy. The journey continues with the perceptron and the AI winter, followed by Lamport’s distributed clocks which underpin modern databases and AI training at scale. The narrative then bounces to backpropagation, the PageRank-driven data reign of Google, and the ImageNet era that finally unlocked practical deep learning. Attention is All You Need introduces the transformer, which makes large-scale language models feasible, and GPT-3 (and its ChatGPT successor) demonstrates the power of scale in language tasks. The video ties these threads together, arguing that each paper catalyzed the next leap—from abstract theory to data, architecture, and massive computation. Fireship notes that the modern AI boom rests on a lineage of papers, many authored by people long since passed. The editor’s voice keeps the pace brisk while linking classic ideas to contemporary AI breakthroughs, culminating in a concise “TL;DR” arc that maps the evolution from machine definition to trillion-dollar consumer AI.

Key Takeaways

Turing’s 1936 paper proved the halting problem is unsolvable, establishing fundamental limits on what algorithms can decide.
Shannon’s mathematical theory of information introduced the bit and entropy, formalizing information content and prediction uncertainty.
The perceptron’s early neural network idea showed learning from data, but single-layer limits led to the AI winter until multi-layer training emerged.
Lamport’s happened-before relation and logical clocks provided a scalable, clock-agnostic basis for ordering events in distributed systems.
Backpropagation unlocked practical neural networks by propagating error signals through layers, enabling deep learning with enough data and compute.
PageRank reshaped web search and created massive-scale text corpora that later fed AI models by providing a structured data backbone.
AlexNet demonstrated that deep CNNs trained on ImageNet with GPUs could dramatically outperform rivals, accelerating the deep learning revolution (10-point error drop).
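
As a rough illustration of the logical-clock takeaway in the list above, here is a minimal Python sketch. The class and method names (`LamportClock`, `tick`, `send`, `receive`) are assumptions made for this example, not names from Lamport's paper; the only idea taken from it is that a receiver moves its counter past the timestamp carried by the message.

```python
class LamportClock:
    """A per-machine counter that orders events by causality, not wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the counter by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event too; the returned value travels with the message."""
        return self.tick()

    def receive(self, message_time):
        """Jump past the sender's timestamp so the receive is ordered after the send."""
        self.time = max(self.time, message_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()
a.tick()
stamp = a.send()             # a's third event
b.tick()                     # b has only seen one local event so far
received = b.receive(stamp)  # b jumps ahead of the message timestamp
```

Because `received` is always greater than `stamp`, any event that could have been caused by the message gets a larger timestamp than the send, which is the ordering guarantee the takeaway describes.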

Who Is This For?

Essential viewing for developers, researchers, and students who want a coherent through-line of AI’s history—showing how foundational theory, data, and architecture converged to drive today’s transformer-era AI.

Notable Quotes

"What is information as a thing you can measure?"

—Shannon’s framing of information as a measurable quantity, separate from meaning.

"To estimate entropy of English, Shannon made people guess the next letter in a sentence."

—Illustrates how entropy is tied to predictability in language.
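
A small Python sketch can make the "easy to guess means low entropy" idea concrete. This is an assumption-laden simplification: it estimates entropy from single-character frequencies in a string, which is not the human guessing experiment the quote describes, and the function name `entropy_bits` is invented for this example.

```python
import math
from collections import Counter


def entropy_bits(text):
    """Average surprise per character, in bits, from observed character frequencies."""
    counts = Counter(text)
    total = len(text)
    # Each symbol with probability p contributes p * log2(1 / p) bits.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


predictable = entropy_bits("aaaa")  # one symbol, nothing to guess
coin_flip = entropy_bits("abab")    # two equally likely symbols
four_way = entropy_bits("abcd")     # four equally likely symbols
```

A string with a single repeated character scores zero bits, two equally likely characters score one bit each, and four equally likely characters score two bits each.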

"Back propagation. Run your data forward, measure how wrong the output is, and then push that error backward through every layer using the chain rule..."

—The central mechanism that enables deep neural networks to learn.
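
The forward pass, error measurement, and backward chain-rule pass in that quote can be sketched in plain Python. Everything specific here is an assumption chosen for illustration: a network with one hidden layer, sigmoid activations, a squared-error loss, and the helper names `forward`, `loss`, `backward`, and `sgd_step`. It is not the setup from the original paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Run the data forward: inputs -> hidden layer -> single output."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def loss(params, x, target):
    """Measure how wrong the output is (squared error)."""
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def backward(params, x, target):
    """Push the error backward with the chain rule, one layer at a time."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    delta_out = (y - target) * y * (1 - y)  # dLoss/d(output pre-activation)
    grad_W2 = [delta_out * hi for hi in h]
    grad_b2 = delta_out
    # Each hidden unit's share of the error flows back through its outgoing weight.
    delta_hid = [delta_out * w * hi * (1 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    grad_b1 = delta_hid
    return grad_W1, grad_b1, grad_W2, grad_b2


def sgd_step(params, grads, lr=0.1):
    """Nudge every weight a little in the direction that reduces the loss."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    new_W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    new_b1 = [b - lr * g for b, g in zip(b1, gb1)]
    new_W2 = [w - lr * g for w, g in zip(W2, gW2)]
    return new_W1, new_b1, new_W2, b2 - lr * gb2
```

Training is just `backward` followed by `sgd_step`, repeated over the data many times. Whether a given run actually converges depends on the starting weights and learning rate, so this sketch makes no claim about that.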

"The transformer... lets every word look at every other word at once and decide what's relevant."

—Foundational shift enabling scalable language models.
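
To show what "every word looks at every other word at once" can mean mechanically, here is a stripped-down Python sketch of scaled dot-product self-attention. It is deliberately incomplete: it assumes each word's vector serves as its own query, key, and value, and it omits the learned projections, multiple heads, and positional information that a real transformer uses. The function names are invented for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """For each word vector, mix in every other vector, weighted by relevance."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Relevance of every word to this one: scaled dot product.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "words"; the first and third point in the same direction.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(words)
```

In this toy input the first word puts equal weight on itself and the third word, and less on the second, because similar vectors score higher. No word has to wait for earlier words to be processed first, which is what separates this from reading tokens one after another.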

"GPT-3... 175 billion parameters and feed it the entire internet as a data set."

—Empirical moment that ignited the current AI bubble by scaling models.

Questions This Video Answers

How did Turing’s halting problem shape limits of computation and AI?
What is entropy in information theory and why does it matter for AI?
What was the impact of backpropagation on neural networks and why did it take so long to gain traction?
Why is the Transformer architecture so pivotal for modern AI systems?
How did PageRank influence data scale and the training data available for AI models?
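
For the PageRank question above, the "every link is a vote, weighted by the voter's own rank" idea from the video can be sketched as a short iterative computation in Python. The details are assumptions for illustration: a normalized variant where ranks sum to 1, a damping factor of 0.85, a fixed number of iterations, and a toy three-page graph. Every linked page is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its vote over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


toy_web = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
```

In the toy graph, page `c` ends up ranked highest because two pages vote for it, `a` comes next because its single vote comes from the highly ranked `c`, and `b` comes last because nothing links to it.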

Turing machine, Halting problem, Shannon information theory, Bit (information), Entropy, Perceptron, Lamport’s happened-before, Backpropagation, PageRank, ImageNet/AlexNet era 2012/2013

Full Transcript

The year was 1936. Alan Turing asked a simple question. Can machines think? Actually, no, that's not right. What he really asked was something way more boring. Can every mathematical problem be solved by an algorithm? Surprisingly, he proved the answer is no. But in the process, he accidentally invented the computer. Then 12 years later, in 1948, another legend shows up named Claude Shannon, and he reduced all human communication down to ones and zeros, casually inventing the bit like it was no big deal. One thing led to another, and now in 2026, we have 18-year-olds in hoodies typing import torch into Python files and cashing billion-dollar checks from venture capitalist boomers. But reaching this point has been underpinned by a century-long chain reaction of computer science papers written mostly by dead people much smarter than us. In today's video, we'll look at 10 of the most important scientific papers in the history of computer science and how they changed the world for better or worse.
Our story begins nearly a century ago when mathematician David Hilbert asked the field's biggest flex of a question. Is there a universal algorithm that can decide whether any mathematical statement is true? Or in other words, can we automate math itself? He called this the Entscheidungsproblem, which is German for decision problem. By 1936, Alan Turing comes around and gives a brutal answer to this question. No. But in order to prove it, he wrote this paper on computable numbers that had to define what an algorithm even is. And so he imagined a hypothetical machine with an infinite tape, a read-write head, and a tiny table of rules. This Turing machine is the abstract blueprint for every computing device you've ever owned. Once created, he asked it to solve the halting problem. Can you write a program that looks at any other program and tells you if it'll finish running or loop forever? Turing proved that it's impossible for a program like this to exist.
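
The shape of that impossibility argument can be sketched in Python. This is only an illustration under stated assumptions: `halts(program, argument)` is a hypothetical oracle that cannot actually be written, and the names `make_paradox` and `oracle_is_wrong` are invented for this example.

```python
def make_paradox(halts):
    """Build a program that does the opposite of whatever the oracle predicts.

    `halts(program, argument)` is a hypothetical oracle claiming to return True
    when program(argument) finishes and False when it loops forever.
    """
    def paradox(program):
        if halts(program, program):
            while True:  # the oracle said "halts", so loop forever
                pass
        return "halted"  # the oracle said "loops forever", so stop right away
    return paradox


def oracle_is_wrong(halts):
    """Compare the oracle's prediction for paradox(paradox) with what it would do."""
    paradox = make_paradox(halts)
    predicted_to_halt = halts(paradox, paradox)
    # By construction, paradox halts exactly when the oracle says it does not.
    actually_halts = not predicted_to_halt
    return predicted_to_halt != actually_halts


always_yes = lambda program, argument: True
always_no = lambda program, argument: False
results = [oracle_is_wrong(always_yes), oracle_is_wrong(always_no)]
```

Whatever answer a candidate oracle gives about `paradox` run on itself, the program is built to do the reverse, so every candidate is wrong on at least that one input.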
It simply leads to a logical contradiction, which means math has problems that no algorithm can solve. That's annoying.
But 12 years later, a guy named Claude Shannon would ask his own annoying question. What is information as a thing you can measure? In his paper, A Mathematical Theory of Communication, he rips out the meaning from normal words entirely. "I love you" and "the cat is on fire" carry the same information if they're equally surprising. And he measures that surprise in a unit called the bit. He proved that all information could ultimately be boiled down to a stream of ones and zeros. But here's the crazy part. To estimate how much information was needed to transmit a message, he borrowed a word from thermodynamics nobody understands. Entropy. To estimate entropy of English, Shannon made people guess the next letter in a sentence. When a letter is easy to guess, it has low entropy. When a letter is hard to guess, it has high entropy. But wait a minute. Having humans guess the next token is exactly what AI does today, just on a much bigger scale. Shannon wasn't trying to build artificial intelligence, but he gave us the math for uncertainty, prediction, and compression and accidentally wrote the spiritual ancestor to the loss function. And that's exactly why Anthropic named their AI model Claude.
Then 10 years later at Cornell, a psychologist, not a computer scientist, builds the first machine that actually learns. He gets inspired by the way neurons work in the brain. So he designs a thing called a perceptron that takes inputs, weighs them, and then adjusts those weights when it's wrong until it can classify patterns on its own. It's the building block for modern neural networks, and the hype is immediate and unhinged. The Navy funds it, and the New York Times reports that the computer will soon be conscious, but 11 years later, the hype would die out completely, thanks to two haters at MIT, who published another paper with a completely different vibe.
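
The perceptron's "adjust the weights when it's wrong" loop is small enough to sketch in Python. The specifics are assumptions for illustration: two inputs, a hard threshold at zero, a learning rate of 1, and the function names `train_perceptron` and `predict`. The sketch also trains on exclusive or, the single-layer limitation listed in the key takeaways.

```python
def predict(weights, bias, inputs):
    """Weigh the inputs and fire (1) only if the weighted sum is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """samples is a list of ((x1, x2), label) pairs with label 0 or 1."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Only wrong answers (error of +1 or -1) move the weights.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_weights, and_bias = train_perceptron(AND)
xor_weights, xor_bias = train_perceptron(XOR)
and_correct = sum(predict(and_weights, and_bias, x) == t for x, t in AND)
xor_correct = sum(predict(xor_weights, xor_bias, x) == t for x, t in XOR)
```

The AND pattern can be separated by a single straight line, so the loop settles on weights that classify all four cases. Exclusive or cannot, so no amount of training gets a single perceptron to all four correct answers.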
With basic math, they prove that a single-layer perceptron can't even learn exclusive or, which is just trivial logic that means this or that, but not both. This paper, or technically a book, was essentially a death certificate for AI at the time. Funding evaporated, and deep neural networks entered their first AI winter, but there was a twist buried in the fine print. They actually figured out that stacking layers of perceptrons fixes everything. The only problem is that back then, nobody knew how to train a stack of perceptrons. It would take another 17 years to figure it out.
But first, we need to talk about Time, Clocks, and the Ordering of Events in a Distributed System by Leslie Lamport. Because neural networks are useless unless you can run them on a massive scale. This paper realized that separate computers with no shared clock can't really have a universal now time. And that's a big problem when you have multiple computers in a distributed system trying to do things in order. Well, he figured out a way to fix this with the happened-before relation. You stop trusting the wall clock time and order events by causality instead. If A could have caused B, A comes first. From that, he builds logical clocks which allow an unlimited number of machines to stay in agreement without ever looking at a real clock. Eventually, this paper would become the bedrock for every database, blockchain, and every massive AI training run because you need thousands of GPUs that constantly stay in sync and agree on state without dissolving into chaos. That was a game changer.
But 17 years after neural networks were left for dead, three researchers, including the godfather Geoffrey Hinton, answered the question that everyone gave up on. How do you train a stack of layers? But before we answer that, we need to quickly talk about Coder, who was cool enough to sponsor this 10-minute video on esoteric computer science papers.
They provide self-hosted development environments that let you work with multiple agents in parallel and with enterprise-level security. And they just launched Coder agents, a chat interface and API for delegating coding jobs to agents running on your own infrastructure. It's the only architecture that lets organizations self-host both the agent workflow and the development environments where the code is actually executed. This gives teams greater control over source code access, agent execution, governance, and security boundaries. It's also model agnostic, so you can connect any LLM you want and switch between them with just a config change. Coder agents are designed for teams in regulated industries who need to self-host their AI workflows with complete control, and they're already used by dozens of financial institutions and government organizations. And you can check it out at the link below.
Now, back to the question, how do you train a stack of layers? The answer is backpropagation. Run your data forward, measure how wrong the output is, and then push that error backward through every layer using the chain rule from calculus to nudge each weight in the direction that's a little less wrong. Do that a few million times and the network teaches itself. The crazy discovery though is that the middle hidden layers started inventing their own features. Edges, shapes, and concepts that nobody programmed in. That exclusive or problem that was impossible 17 years ago? It just became trivial. Backpropagation is still essential to neural networks today, but back then they sucked because we didn't have enough data or compute.
Well, that was about to change in 1998 with the rise of the internet and this famous paper from Larry and Sergey about the anatomy of a large-scale web search engine.
The paper describes the PageRank algorithm where instead of ranking a web page by how often a word appears, it treats every link as a vote and each vote is weighted by how trustworthy the voter is. They built a prototype in their dorm room which eventually became a company called Google that you may have heard of. Most importantly though, this algorithm helped assemble the largest structured pile of human text ever created. And that massive pile of text would eventually become the training data or feedstock for future AI models.
We'd finally see this in action in 2012 with a legendary ImageNet paper, created by a dream team of Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Remember when I said backpropagation needs data and compute? Well, finally the stars aligned. The data set is called ImageNet and it's a monster data set of millions of hand-labeled photos. While the compute is a couple of Nvidia consumer-grade gaming GPUs, a grad student named Alex wires up a deep convolutional neural network, names it AlexNet, and trains it in his bedroom. Then he walks it into the annual ImageNet contest and humiliates everyone. This is a contest where AI models try to classify objects in an image like hot dog or not hot dog. And while everyone was fighting over a fraction of a percent, AlexNet walked in and dropped the error rate by 10 points in a single year. And this freaked everyone out because it was suddenly clear that deep learning actually works. It just needs more data, more compute, and the right architecture.
Luckily, we would get that architecture a few years later thanks to Ashish Vaswani and Google in the paper Attention Is All You Need. Around this time, large language models had a huge problem. They would start a sentence and by the end they would forget what they were even talking about. That's because they would read and predict tokens sequentially one after the other.
This paper fixed that by introducing a new architecture called the transformer that throws out sequential reading entirely. Instead, it lets every word look at every other word at once and decide what's relevant. Not only does this make large language models feel more intelligent, but transformers also scale better as well. Google made the big mistake of giving this architecture away for free, and now every AI lab uses it, and that's where you get the T in ChatGPT.
Speaking of which, that brings us to a paper released by OpenAI in 2020. Language Models are Few-Shot Learners. Basically, OpenAI takes the transformer and then asks the dumbest question possible. What if we just make it enormous? Not two times bigger, but scale it to 175 billion parameters and feed it the entire internet as a data set. They made a crazy bet that intelligence isn't some secret algorithm we're missing, but rather it simply emerges once you cross a threshold of scale. The end result was GPT-3, the model that ignited the current AI bubble that we're living through right now. What's crazy is that all of a sudden, this model could translate, summarize, and write code without ever being specifically told how to do these things. At such a large scale, it learned how to generalize these things on the fly. Two years later, this paper would evolve into ChatGPT, which today is now a trillion-dollar product. But when you think about it, what is ChatGPT even doing? Well, it's just predicting the next word or token just like Claude Shannon was doing in 1948.
So, here's the TL;DR for the last 100 years. Alan Turing defined the machine. Claude Shannon gave it currency. Rosenblatt gave it a neuron. Geoffrey Hinton taught it how to learn. Google gave it data and an architecture. And OpenAI just turned the dial to the maximum. This has been the history of artificial intelligence in 10 scientific papers. Thanks for watching and I will see you in the next