I Trained My Own AI... It beat ChatGPT
The creator announces training their own AI model and outlines the ambition to outperform large benchmarks and rival models.
PewDiePie chronicles the chaotic, bootstrapped journey of training a coding-focused AI, chasing a 40% benchmark, facing hardware and data pitfalls, and learning from failure along the way.
Summary
PewDiePie dives into the messy reality of building an AI model from scratch. He documents starting with Qwen 32B, wrestling with data quality, synthetic data, and a grueling compute budget. Along the way he references benchmarks like Aider Polyglot and confronts missteps, from contamination to using the wrong model version, that sabotaged progress. The tale is as much about persistence as performance, with plenty of humor as GPUs fail, cables smoke, and hardware teeters on the brink of collapse. He also shares how external sources, including Chinese research from DeepSeek and open-source tools, shaped his approach. Throughout, he emphasizes learning through failure, iterating from single-digit scores to nearly 40% on coding benchmarks, and ultimately acknowledging the value of starting smaller and leveling up skills. The sponsor plug for boot.dev reinforces his message: practical, hands-on learning is what moves you forward. The video closes with a candid reflection on growth, curiosity, and the idea that coding models will expand access to learning rather than instantly replacing human developers.
Key Takeaways
- Qwen 32B served as the base model, chosen for coding proficiency, with the goal of outperforming benchmarks on a coding-focused task.
- Initial benchmarks fell short (around 8-16%), highlighting the impact of data format (whole format vs diff) and the necessity of proper validation.
- Data quality and data handling were the core bottlenecks; fixing the data harness and reducing garbage data repeatedly changed benchmark outcomes.
- Contamination risks in benchmarking were discovered late; cleaning data and re-running benchmarks yielded more reliable results (ultimately 39.1% after decontamination).
- Retraining with larger, more focused data and correcting the model/version mix led to major jumps (from ~4% to 25%+, then to 36%).
- External benchmarks and models (SWE-bench, Gemini) framed the progress and showed the model’s relative strength; the final tone emphasized continued testing beyond a single metric.
- Embracing failure as a learning tool is a central message, with the takeaway that persistence and iterative experimentation are more valuable than a perfect immediate result.
Who Is This For?
Aspiring AI developers and ML hobbyists who want a candid, real-world view of training open-source models, debugging data pipelines, and interpreting benchmarks. It’s especially relevant for coders curious about model fine-tuning, data quality, and iterative experimentation.
Notable Quotes
"The deed is done and I ran the official AI benchmark and my model outperforms DeepSeek 2.5."
—Initial claim of outperforming a bigger model sets the journey's stakes.
"I have not created my own AI. I have merely taken an AI model and trained it."
—Humility about what was actually achieved frames the challenge.
"Garbage data in, garbage data out."
—A core lesson about data quality comes through clearly.
"I beat ChatGPT… then found contamination and had to restart."
—Shows the perils of benchmarking without data hygiene.
"If anything, coding models will bring more people into learning to code again."
—Ends on an optimistic note about the broader impact.
Questions This Video Answers
- How did PewDiePie’s fine-tuned Qwen 32B compare to ChatGPT on coding benchmarks?
- What is Aider Polyglot and why is it used for benchmarking coding models?
- What caused data contamination in PewDiePie’s AI training experiment and how was it fixed?
- Why is data quality more important than model size when training coding AI?
- What role did synthetic data play in PewDiePie's training process?
Tags: PewDiePie, Qwen 32B, Aider Polyglot, DeepSeek, ChatGPT, diff vs whole format, data contamination, synthetic data, fine-tuning, boot.dev sponsorship
Full Transcript
The deed is done. I can finally return to this channel cuz I have done what I said I was going to do. Not that. We will get to that. One thing at a time. I trained my own AI model. Yay. How long has it been? I'm making my own AI. I'm making an orb. My own planter. I was hoping to share it by the end of this video, but it takes a long time. So, I feel like those bears coming out of hibernation. What happened? But the deed is done, and I ran the official AI benchmark, and my model outperforms DeepSeek 2.5.
Way bigger model than mine. Facebook's flagship model Llama 4 Maverick, destroyed. And most importantly of all, my model outperforms ChatGPT-4. [laughter] From like November or something. This sounds impressive, but it's way less impressive once we get into it. And we will get into it. It also sounds way less impressive considering the fact that I almost burned down my house twice for this project. I have not created my own AI. I have merely taken an AI model [music] and trained it.
It's like stealing a child on the street instead of birthing one myself. It's way more effective that way. Plus, it would cost millions and millions of dollars in infrastructure, which I do not have yet. Am I talking about birthing or...? But it's important for you to understand just how naive I was going into this, because I knew nothing about machine learning, training AI, and coding. And I mentioned my model is a coding model. I can make a coding model. Sure, Felix, great idea. But I also know I wasn't that crazy going into this, and I'll explain why.
But mainly the fact that I wanted to do this was because I wanted to learn. The reason I'm here with this video and the series of events has just been me approaching this philosophy of just yes this might be difficult but I will learn from it step by step you know that takes you places and that's why I'm so excited to announce today's sponsor which is boot.dev boot.dev is a website that teaches you how to code but for real. I did their Linux course and it's fantastic. When you actually understand how something works, it changes things for real.
It's none of this Dualingo bing fake learning, okay? It's fun, it's engaging, and it works. I'm super excited to do their other courses as well. Create your own AI agent in Python course. I'm going to say something that makes me sound like such a boomer, but I am at this point. So, to heck with it. Instead of focusing on gaming too much lately, I just been focusing on leveling up myself. But it's true. It's such an amazing feeling and I want you guys to experience it as well and I think boot.dev is an amazing venture to do so.
So check it out in the description, and I will remind you later as well. Okay. Now, funny enough, training my AI model ironically works very similarly to how you would train yourself on boot.dev. There's an instruction of a problem. You get a framework on how to initiate [music] or what to use, and then there's a validated answer at the end of it. Basically, I gathered around 100,000 samples like that. And [music] then you feed that to the model, which slightly nudges its parameters. Like it slightly, probably, nudges your brain. And bada bing, bada boom, you have trained your model.
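The instruction / framework / validated-answer triple he describes maps directly onto a standard supervised fine-tuning sample. A minimal sketch of what the data-prep step might look like (the field names, prompt template, and filename are illustrative, not the exact schema used in the video):

```python
import json

# One SFT sample: an instruction, some context (the "framework"),
# and a validated answer the model should learn to reproduce.
def make_sample(instruction, context, answer):
    return {"instruction": instruction, "context": context, "answer": answer}

def to_training_text(sample):
    # Flatten the sample into the prompt/completion pair the trainer sees.
    # During fine-tuning, loss is typically computed only on the completion.
    prompt = (
        f"### Instruction:\n{sample['instruction']}\n\n"
        f"### Context:\n{sample['context']}\n\n"
        f"### Answer:\n"
    )
    return prompt, sample["answer"]

samples = [
    make_sample(
        "Fix the off-by-one error in this loop.",
        "for i in range(1, len(xs)): total += xs[i]",
        "for i in range(len(xs)): total += xs[i]",
    )
]

# Gathering ~100k of these and writing them out as JSONL is the
# data-preparation step; the trainer then streams this file.
with open("sft_data.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

prompt, completion = to_training_text(samples[0])
assert prompt.endswith("### Answer:\n")
```

Each pass over such samples nudges the model's parameters toward producing the validated answer given the prompt.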
It's kind of like this. You're AI. Look at this. Are you learning? Are you getting this? You will look at these. You will look at 100,000 more. Pay attention. This might take a while. The model that I used was Qwen 32B, which is already amazing at coding, but I needed it to be amazing at coding in this format. The whole reason I decided to do this and where I landed, cuz I discovered something. There are many ways to benchmark and test your model. I discovered there's one called Aider Polyglot. If you remember last video, I used the agent Aider to code my own web UI.
This is [music] a respectable benchmark that tests coding in six different languages. That's six more languages than I know. But what was interesting about this is that state-of-the-art models like ChatGPT perform like garbage on this benchmark. [music] 18.2%. Uh, what? My model that I was planning [music] to train on performs at 8%. Trash. But it performs at 16% if you use a different format called whole format instead of diff [music] format. Basically, there's two different formats, but one of them is important. How do I explain this? It's basically like this. You draw a picture. Okay, imagine you draw a picture and you want to add a cloud, but instead of just adding the [ __ ] cloud, you redraw the entire picture with the cloud.
It makes no [ __ ] sense. But basically, a lot of these models struggle to understand the format of just editing the code instead of writing the whole thing. It's just stupid. It wastes compute and it wastes time, and I never used it either. So I thought, hey, if I can just fix the format, I will make my model 16%, and I will almost beat ChatGPT. The goal: beat ChatGPT. Easy, because what I had on my side, at my disposal, in my arsenal, was Chinese AI research. You'd think China would be on the more censorship side of things.
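The whole-vs-diff distinction can be sketched in a few lines. "Whole" format rewrites the entire file to change one line; "diff" format emits only the change. This is a simplified search-and-replace edit, similar in spirit to (but not identical to) Aider's SEARCH/REPLACE blocks:

```python
original = "def add(a, b):\n    return a - b  # bug\n"

# "Whole" format: the model redraws the entire picture just to add the cloud.
whole_edit = "def add(a, b):\n    return a + b\n"

# "Diff" format (simplified): the model emits only the lines that change.
diff_edit = {
    "search": "    return a - b  # bug",
    "replace": "    return a + b",
}

def apply_diff(text, edit):
    # Apply a single search/replace edit; fail loudly if the search text
    # is missing, the way a strict edit harness would.
    if edit["search"] not in text:
        raise ValueError("search block not found")
    return text.replace(edit["search"], edit["replace"], 1)

patched = apply_diff(original, diff_edit)
assert patched == whole_edit
```

Both formats produce the same final file; the diff format just spends far fewer output tokens doing it, which is why benchmarks score the two separately.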
At least that's what I thought. How wrong I was. It's completely the opposite. Deepseek China AI basically just released their model for anyone to run and a whole document with their entire training process in great detail. It's amazing. There's so much information from these research documents. It combined with the open source community. There's so much to try and it just makes you really want to try it yourself. At least that's how I saw it. Even though a lot of it was super advanced and I didn't understand anything, eventually I did. I think [laughter] it's also hilarious reading these documents.
Chinese AI researchers are unironically comedians cuz they write the most unhinged [ __ ] They say [ __ ] like, "Oh, we trained our model on 2,048 GPUs." And you're like, "What? That's like $60 million." And you keep reading and they're like, "We emphasize this is an economical approach." I'm like, "Economical? What's the non-economical version?" But I think that kind of puts things in perspective, and why a lot of companies in the West don't want to share [music] information. So with Chinese AI research on my side and a boot.dev course, I was ready to do this thing.
Now, what do you need to train AI? Data. Now, how do you get data without sacrificing your soul in the process? Well, there are options, believe it or not. You can mine The Stack, which is this 60-terabyte dataset that you're allowed to train on. Good thing I kept my hard drives around. You can use publicly available datasets. You can also mine Git, which is a little more of a gray zone cuz you have to check for licenses, which I'm sure everyone is doing, right? [laughter] These big tech companies, they're being super ethical and great about it.
I'm sure that's why they're not sharing any information. And you can also synthesize your data. Glorious synthetic data. Mmm, tastes so good. I tried every single method there is. It's been a mess. This is scraping Git, or enriching my data. This is scraping for more data. This is running testing on the data. And this is augmenting the data. And this is my eight LLMs. This is the level we've reached. This project was kind of like I was in the middle of a freeway and there's a bunch of cars that I had to direct constantly, because [ __ ] had to move for this to finish.
Everything took so much time, and my GPUs had to constantly be cooking, and I had to do all this makeup, cosmetic surgery, on the data, cuz all these developers writing poor, lazy... first of all, why even publish some of those lazy instructions? Don't make a commit if you're just going to add a comment. But finally, I had my data, but I still knew I needed more. And I also wanted to try out synthetic generation of data, which is basically: you typically take the strongest model that you have and you ask it, hey, look at this.
Make more in this format. And my god, it was a beautiful thing. You get the perfect data exactly the way you want it. It's amazing. But the problem is, and maybe you already know this, AI is wrong all the [ __ ] time. [laughter] Okay? It's basically like this. I tell AI [music] my drawing skills have really I'm sure glad I learned how to draw. [laughter] I tell AI, "Make a burger by showing him a burger." And then AI [music] makes what looks like a burger, but I open the lid and he put razor blades in there.
So my harness is a burger eater [laughter] that checks if real burgers were made instead of fake burgers. I think I explained that pretty well. You lied to me. My synthetic approach, I know that most people don't care: I used OSS-Instruct from Magicoder, and Evol-Instruct, which is an amazing method. It's basically like that cloning-dancing-guy technique. You feed it a code snippet and then you say, hey, make it into this format, and also make another one, but make it more advanced. I don't want to get too technical, it doesn't matter.
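The "burger eater" harness he describes, actually executing each synthetic sample's code against a validation test before letting it into the dataset, can be sketched like this. (A real harness would run the code in an isolated sandbox with timeouts; this toy version just uses `exec`.)

```python
# Reject synthetic samples whose code does not pass its own test:
# the razor-blade burger looks like a burger until you bite into it.
def passes_tests(solution_code, test_code):
    scope = {}
    try:
        exec(solution_code, scope)  # define the candidate solution
        exec(test_code, scope)      # run the validation test against it
        return True
    except Exception:
        return False

good = "def double(x):\n    return 2 * x"
bad  = "def double(x):\n    return x ** 2"   # plausible-looking, but wrong
test = "assert double(3) == 6 and double(0) == 0"

# Keep only the samples that actually pass validation.
dataset = [s for s in (good, bad) if passes_tests(s, test)]
assert dataset == [good]
```

The key point is that the filter judges generated code by execution, not by appearance; a buggy harness that passes everything silently fills the dataset with garbage, which is exactly the failure described below.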
This video will take forever. But the problem I had was I was not getting enough of it, and I don't trust... okay, I don't trust like that. Okay, so I validated it. I kept thinking the problem was my test harness, so I needed [music] to fix that. What an idiot. That just made it so I passed more garbage. So when the time came for me to finally train my model after months, months, I was so excited. [music] I ran the benchmark, and I had made an AI model that was finally... worse. I had made it worse.
I probably should have quit then, but I am way too stubborn for that. I just can't. I I I don't have it in me. I can't do it. [music] This is when you realize AI is kind of like magic, right? But it also isn't. Garbage data in, garbage data out. There was a lot of issues with my data. When I had fixed my harness, I had just let more garbage data pass through. There was also all these other issues like empty white spaces and classic coding issues that I just wasn't aware about. So that's why I had made it worse.
But I gave it another attempt. And this time, no more mistakes. No more dillydallying. Locked in. And again: the model is worse. It was my harness yet again. I don't know why I got so stuck on this [ __ ] thing. Well, I didn't even need synthetic data. I just wanted it to work. I just wanted it to work so bad. Finally, I had fixed the [ __ ] thing, and the benchmark came in at 16%. Sometimes 15, sometimes 14, but 16 was the highest. The model is just not going to solve the same problems every time. It just doesn't work like that.
There is a level of randomness to the benchmark performance. But 16 was the ceiling. And that makes sense, because that was the ceiling in the official whole format or whatever. And I had fixed it so it's in the diff format. [music] But I had not made a [ __ ] difference. I had not made the model smarter. And remember, I need to beat 18% to say that I beat ChatGPT; otherwise this means nothing. So my plan of attack was to add reasoning to my data. Adding reasoning to your data is basically making it write out some thinking before it solves the problem.
Instead of doing 2 plus 2 in your head, you go, "Oh, okay. So let me break this down. So I'll have two apples, and then if I add two more apples, and then I count them all together, then I will have four." This is really effective for complicated issues, breaking the problem down into parts, and it really can improve the performance of the AI. But when it's simple questions and it still does it, it can be very, very irritating as well. You've probably seen this, actually, if you use a stronger AI model: a lot of times you ask it a question and it goes, oh, absolutely, well, let's break this down into parts.
First, we're gonna... and you're like, "That was a simple yes-or-no question. What do you mean?" But it's a really effective way to make your model solve problems more accurately. And a lot of these smaller open-source models that I train on struggle with this because they just haven't been trained on it enough. So, that was my plan of attack to make my model smarter. [music] And I found a study that showed around a 10% gain in performance. I'm like, that's way more than I need, baby. Let's just clone this Git repo and get going. The only problem was, of course, the ungodly level of computation that was needed for this.
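Mechanically, "adding reasoning to your data" means rewriting each training sample so a step-by-step trace precedes the final answer. A minimal sketch (the `<think>` tag is illustrative; different model families use different reasoning markers):

```python
# Turn a plain SFT sample into a "reasoning" sample by prepending a
# step-by-step trace before the final answer.
def add_reasoning(sample, trace):
    reasoned = dict(sample)  # copy so the original sample is untouched
    reasoned["answer"] = f"<think>\n{trace}\n</think>\n{sample['answer']}"
    return reasoned

plain = {
    "instruction": "What is 2 + 2?",
    "answer": "4",
}
trace = ("I have two apples. Adding two more apples gives me "
         "two plus two, which is four.")

reasoned = add_reasoning(plain, trace)
assert reasoned["answer"].startswith("<think>")
assert reasoned["answer"].endswith("4")
```

The expensive part is not this transformation but generating the traces themselves, which is why he ends up calling a stronger model's API (DeepSeek, below) to write them.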
And that's when things were starting to smell funny. I realized my room had this weird aura to it all of a sudden. Hey, I swear it didn't used to smell like that. So, I decide to reboot my computer [music] and lightning strikes all across it. Smoke starts pouring out. My whole life flashed before my eyes. I turn off the computer, but the damage was done. One of my GPUs had died. Rest [music] in peace. I tested each one one by one and it seemed like everything else was okay. It was just this one problem. And looking at my purchase history, that one GPU came from a different factory.
You have to understand what a hack job this setup that I have is. Okay, it's bifurcated. It's undervolted to death. This GPU is rated for 450 watts, but I run it on 175, just so my house wouldn't [ __ ] crash every time. And then these are hacked Chinese 4090 GPUs. This whole thing is held together by prayers. Literally, in Japan they sell these prayer-infused IT badges. Where is it? It's in front of my computer. Official Japanese priests. I have now blessed my computer, and I was ready to give it another shot, only for the smell to appear again.
After sniffing my GPUs like a maniac, I concluded my GPUs were not the problem this time. And eventually I found it. I don't think it's supposed to look like that. Again, I was using a cable that was rated for 1,500 watts. I was running over 2,000. Changed cable. We were good to go. My computer kept crashing still. Oh, it still... has it crashed? It has crashed. Oh, [ __ ] What a pile of... You'd think training would be just a straightforward thing, but it's really not. It's taking too long to train. I need more compute. Okay, what am I going to do?
It's not my fault. I know what you're thinking, Felix. Are you building another computer? No, I am just extending the one I have. Of course. Of course. I'm an epic minimalist. I had to steal the circuit from my bathroom and I drilled a hole. It's super heavy on your computer and every time it crashes, I have to [ __ ] defibrillate it back. Bring it back from a coma. It's not pleasant. It's and with everything that had happened in the past extremely stressful and I really really really started doubting what the [ __ ] I'm doing here for these 2%.
I started calling upon the DeepSeek API because it's practically free, and eventually I had 15,000 samples, which is way less than I had aimed for, but these were the top of the top, the crème de la crème of samples, with the most beautifully crafted step-by-step reasoning the world has ever seen. I did my supervised fine-tuning, three epochs, and I ran the benchmark, and it scored 17. Are you [ __ ] kidding me? But as I mentioned, the performance is kind of random. So I kept running the benchmark. I had eventually given up, but I did one more just for the [ __ ] of it.
And I noticed, holy [ __ ] this one is having like a god run. It's at 40%. What is happening? It drops to 30. I'm like, "Hold, please hold." It drops to 25. It drops to 20, and it finally finishes all the exercises in the benchmark. It is done at 19.6%. I have beat ChatGPT. It felt so goddamn good. Stop. Stop. Don't listen to this guy. The benchmark is invalid. I did not check for contamination before the benchmark. Basically, if you grab data from all across the internet, there's a high chance that you're going to grab data that might already exist in a benchmark somewhere.
So, you have to check for contamination. I didn't check for contamination. I didn't want to check for contamination, but my stupid conscience was like, I should probably do it. I was backing up, going through my data for the gazillionth time, and I realized there was some contamination. It's not a huge deal, but it's like if I just am honest, whatever. To me, it's not good enough. And I want to clear out my data and I'm going to retrain again, benchmark again, and I'm hoping and I think it will still give me the same result cuz it's such small contamination.
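A common way to run the contamination check he describes is n-gram overlap: drop any training sample that shares a long-enough word sequence with a benchmark problem. A minimal sketch (real pipelines use larger corpora, text normalization, and tuned n; the parameter values here are illustrative):

```python
# Decontamination by n-gram overlap: a training sample is "contaminated"
# if it shares any n-word sequence with a benchmark problem.
def ngrams(text, n=8):
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_samples, benchmark_texts, n=8):
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    # Keep only samples with zero n-gram overlap against the benchmark.
    return [s for s in train_samples if not (ngrams(s, n) & bench)]

benchmark = [
    "write a function that reverses a linked list in place using constant memory",
]
train = [
    # This one leaked in from the benchmark itself: it must be dropped.
    "write a function that reverses a linked list in place using constant memory",
    "implement a queue with two stacks and amortized constant time operations",
]

clean = decontaminate(train, benchmark)
assert len(clean) == 1
```

Skipping this step inflates the score exactly as described above: the model is partly being tested on problems it memorized during training.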
I'm kind of worried cuz I'm running out of time. [laughter] Have I done [ __ ] all this entire time? Have I achieved nothing this entire time? And the whole thing was hocus pocus? That's what I thought. So this time I went all out. I trained on my entire dataset. Previously I trained on a small subset of what I thought was my best data, because it takes forever to train. It takes forever. It takes days, weeks, and since I reached the score of 19.6, I was like, uh, I'm done. But then I made another discovery.
I was training on the wrong model. A major update. I'm watching my video, giving feedback to my editor, and I'm like, "Hold on, hold on. What is that?" That is not the coder version. That's the regular version. Have I been training on the wrong version this entire time? Oh, so maybe this has all been a blessing in disguise, because I feel like with these two things in mind, I should beat ChatGPT. I should, but we will know in a couple days. Did I beat ChatGPT?
Well, I trained on the coding model, and the first score: 4.4%. I don't know what the [ __ ] happened. I can probably think of five things, but I don't care to figure it all out. It's going to take forever. I just retrained again. And the model scored 25%, baby. I didn't just beat ChatGPT. I beat ChatGPT twice. Their August shitty version as well. I was so relieved. I was like, "Thank God." But I made another discovery. It's not over. A third of the benchmark was not even running. C++ and JavaScript just weren't being tested properly.
So, I fixed the benchmark. I ran it again, and the final score: 36%. Baby, this means I also beat Google Schmoogle's thing as well. And I think I beat 4.1 or something. I don't know. Chat, it's a massacre out here. You're being destroyed. It's embarrassing. OpenAI stock just demolished. Just quit already. This is what pops the AI bubble. This. [laughter] But I was like, there are still issues. I'm going to do some post-training. 1,500 samples. Splinky blinky, five epochs. And I ran the benchmark again. Pure, decontaminated, let it be known. Purer than the fountain of youth.
39.1%, baby. Another one destroyed. I think with this one I beat Google Smoogle on that stupid benchmark. Yeah, Gemini. Gemini Pro. You guys pay for that? I did not think this would ever come close. But here's the embarrassing part about all of this: I trained on Qwen 2.5, but Qwen 3 is out, and it scores 40%. [laughter] So, unless I beat 40%, this means nothing. And I'm out of time. I don't have [ __ ] time. I need to send this video to my editor. Yes, there was one more thing. A model being good at one benchmark is stupid as [ __ ] Okay, I need to test this model on other benchmarks as well.
I want to test it on SWE-bench, all these other coding benchmarks. Unless it improved on those too, it really... I don't even know. I am just out of time. Okay, I don't have time to do it, but I will. And if this model ever gets to a point where I feel like it's good enough, I would love to share it. But I think, if anything, I might just move on to a different project in the background, because it takes a long time and, uh, I'm kind of tired. That being said, you've seen me fail a lot in this video.
I have become so accustomed to failure, you have no idea. I've almost given up on this project so many times. There are so many times where I'm just like, I don't know what I'm doing. This is the stupidest thing ever. I have graveyards full of just garbage, debunkle, schmunkle, deformed data that I have generated thinking this is the best. [laughter] I have gone through the whole alphabet of failures. I was just so way in over my head on this project. But I think the number one thing I've learned, how do I explain this? When you install Linux, here's what happens.
Linus Torvalds, the creator, inevitably becomes your godfather. And I was watching a random video of him talking, and he was talking about how he's doing this project and he was failing. But that's okay, because that's how you learn. Some people think that failure is a bad thing, and I happen to be one of those people who actually enjoy doing things I'm not good at, because it's how you learn. And I'm watching that lad and I'm like, he's speaking to me right now. Oh my god. But I really feel like that's the main thing I've learned from all this, because there's so much to learn from failure.
Learn from it, and iterate, and keep working. I think if you have expectations of how things should go for yourself, you're just going to get disappointed and you're going to want to give up. So expect to fail. Embrace failing. That's the message I want to send out to you kids. Last thing: this is a coding model, and I think a lot of people are looking at coding models like, are they going to replace us? And I saw Linus say this as well, and he was basically saying coding models, if anything, will just bring in more people interested in learning how to code again.
He's speaking to me right now. Like, I would never have wanted to learn how to code if it wasn't for AI coming into the picture. So on a final note, learn something new. Check out boot.dev. I highly recommend it. It's a great course. Pick out whatever you want. Challenge yourself with something difficult that may be above what you think you're capable of. That's it. Now scram. I'm just kidding, bro. This month we're traveling a lot. And what always happens whenever we travel? The inevitable connecting to public Wi-Fi. Free Wi-Fi is a trap. Get the reference?
If you connect to someone else's Wi-Fi, you might as well just give up your credit card and banking [music] information and all the information you ever had. You should always use a VPN. NordVPN. Say it with me: NordVPN. Connect, always. I'm even connected right now. I made a little module for me. Look, cuz whatever I do online, if I want to download, it's my goddamn business. It's my [clears throat]... it's about a goddamn right. So, thank you NordVPN for making it possible to free yourself, protect yourself online. If you go to nordvpn.com/pie, you get a huge deal on a 2-year plan, plus a bonus extra 4 months for free.
New Year's resolution, take your online privacy seriously. This is the best deal for NordVPN you're going to find. So, take advantage. Thank you, Nord, for sponsoring this video. That's nordvpn.com/piepie.