Cancel your AI subs, local LLMs are here
Explains why small, locally runnable models are appealing: no subscription, private conversations, and independence from external quotas.
Local LLMs like Gemma and Qwen 3.5 run on your own hardware, cutting AI-subscription reliance and boosting privacy and control.
Summary
Cloudflare Developers’ video showcases a pragmatic push toward local LLMs. The creator highlights how models like Gemma 4 and Qwen 3.6, now small enough to run on consumer laptops and phones, remove the need for expensive GPUs or paid AI subscriptions. He demonstrates a tangible local stack built around opencode with two models (Gemma and Qwen 3.5) served via a local OpenAI-compatible API on a machine with 36 GB of VRAM. Tools like nvtop reveal GPU usage, while llama-swap handles automatic model swapping to fit VRAM constraints. When larger models are needed, such as Kimi K2.5 (a trillion-parameter model), he taps into Cloudflare Workers AI for cloud-assisted workloads. The video includes a blog post link with detailed setup steps to keep the stack running after reboots. Overall, the takeaway is that local AI stacks can power everyday workflows with privacy, unlimited local compute, and offline capability, without surrendering access to high-end models via cloud services.
Key Takeaways
- Gemma 4 and Qwen 3.5/3.6 are small enough to run on consumer hardware, eliminating the need for $70k+ GPUs.
Who Is This For?
Essential viewing for developers and product teams exploring offline, private AI workflows. It’s especially valuable for those who want a hands-on blueprint to run local LLMs like Gemma and Qwen and to understand when to offload to cloud services for very large models.
Notable Quotes
"Last week, Quinn 3.6 got announced and Google dropped Gemma 4."
—Sets the premise that new, capable models are available in smaller, local-friendly forms.
"you can run them on the hardware you have today. And it's really important because it means um you don't need an AI subscription to have access to good models."
—Highlights the core benefit of local models: no subscription, local compute.
"it's private. It's not going out somewhere to be used to train models or used for ads. It works offline because you're actually running the models on your device."
—Emphasizes privacy and offline capability.
"I have 36 gigs of VRAM and I can actually run these models at like max config."
—Shows concrete hardware requirements and feasibility.
"When I really need to run something with like frontier level intelligence... I can't do that locally because it's a 1 trillion parameter model and that requires 2 gigs of VRAM."
—Sets expectations for limits and when cloud help is needed.
Questions This Video Answers
- How can I run Gemma 4 locally on a consumer PC?
- What is llama-swap and how does it help with local LLMs on limited VRAM?
- What are the trade-offs between local LLMs and cloud-hosted models for heavy workloads?
- Can Cloudflare Workers AI handle trillion-parameter models like Kimi K2.5 efficiently?
- What steps are involved in setting up an offline AI stack on a laptop?
Full Transcript
Last week, Qwen 3.6 got announced and Google dropped Gemma 4. I think it was really exciting, like everyone was excited, myself included, because these new models are really good, but also really small in the sense that they can run on any device you own, like the laptop or the mobile device you already have. And why this is important is because you don't have to go out and buy a $70,000-plus GPU to run these local models. You can run them on the hardware you have today. And it's really important because it means you don't need an AI subscription to have access to good models.
These AI labs have been very flaky in the last couple of weeks, limiting your quota usage or limiting what applications you can use their models in. When you have these models running locally, you can use them for whatever you want and as many times as you want. In fact, you have unlimited tokens because it's all compute running on your own hardware. And of course, the conversations are private. They're not going out somewhere to be used to train models or used for ads. It works offline because you're actually running the models on your device. So, I'm going to show you the kind of setup I have for my own local AI stack.
And the goal of this is to inspire you to actually get these models downloaded, run them locally, and use them to power your own workflow. It could be for just having general chats and conversations, or it could be for use in a coding agent like opencode or other coding agents you may already be using. So let me show you what I have. Right here I'm on my computer and I'm just going to show you my opencode configuration. All right, so this is my opencode configuration. I have a configuration for a local LLM setup, which is an OpenAI-compatible API.
You can see this is a local IP address; that's all local. And I have two models: the Gemma model with a maxed-out context window, and the Qwen 3.5 model, also with a maxed-out context window. I'm running both models maxed out, all local. To actually show this to you, I'll switch to another tab and run neofetch. Okay, so this is my hardware configuration. I have 36 gigs of VRAM and I can actually run these models at like max config. So let's head back here.
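For context, opencode reads its provider setup from a JSON config file, and a local OpenAI-compatible server can be registered as a custom provider. The following is only a sketch: the `baseURL`, provider key, model IDs, and display names are placeholders for whatever your local server actually exposes, and the exact schema may vary between opencode versions.

```json
{
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local llama-swap",
      "options": {
        "baseURL": "http://192.168.1.50:8080/v1"
      },
      "models": {
        "gemma-4": { "name": "Gemma 4 (local)" },
        "qwen-3.5": { "name": "Qwen 3.5 (local)" }
      }
    }
  }
}
```

Because the endpoint is OpenAI-compatible, opencode treats these models exactly like hosted ones; only the base URL points at your own machine.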
I'm just going to start up opencode. I'm in a project directory. I'm just going to start up opencode, and I have Gemma 4 selected; this is the model I'm using by default. I can ask it a question. I'm just going to say hi, for example. And this should come back with a response. While that's running, I'll give it a minute... Yep, it's coming back with a response. And I can ask it: what is this project about? All right, and it's going to read the README file and tell me what the project is about.
And yeah, look at it go. It's completely correct. It's a todo app built using Durable Objects, and that's what the project is about. Now, let me show you something really cool. I have a tool here called nvtop. It's just like htop, but it shows you the GPU utilization on your hardware. You can see we have 36 gigs of VRAM, and the model currently running is taking up 22 gigs of VRAM. That's because I'm maxing out the context window. One thing to note here is that I don't have enough VRAM to run both Gemma and Qwen simultaneously.
So, I'm using a cool piece of software called llama-swap, which swaps out the actively running model on the GPU. I can configure various models but have just one running at a time, and it does the swapping automatically. It's really good. If you want to see what my setup looks like, I have an article on my blog, which I'll be leaving in the description below, that goes through the details of how I actually got this set up. You can read through it, and hopefully it inspires you and gives you an idea of how to get your own local AI stack running.
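llama-swap is driven by a YAML config that maps each model name to the server command used to launch it; when a request names a model, llama-swap stops the currently running server and starts the matching one. A minimal sketch, where the model paths and context sizes are placeholders to adapt to your own setup:

```yaml
# llama-swap config.yaml (paths and flags are illustrative)
# ${PORT} is filled in by llama-swap when it launches the command.
models:
  "gemma-4":
    cmd: llama-server --port ${PORT} -m /models/gemma-4.gguf -c 131072
  "qwen-3.5":
    cmd: llama-server --port ${PORT} -m /models/qwen-3.5.gguf -c 131072
```

With this in place, only one llama.cpp server process holds the GPU at a time, which is what makes maxed-out context windows feasible on 36 GB of VRAM.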
Okay, I'll have the blog post linked below. So let's head back here. I'm just going to switch to using Qwen, for instance. Let's switch the model to Qwen 3.5, and I can say hello, send that, and see how that works in practice. I'm just going to switch to the other tab. If you look at this graph, you can see that it has unloaded the Gemma model: the orange line has gone down and come back up, which is the Qwen model being loaded into memory. And you can see the teal or green-looking line showing my actual GPU utilization, which is going up to like 100% right now because it's running the current inference on the new model. It's still on llama.cpp, but llama-swap handles the intelligent swapping of the models, and this is Qwen being executed right now as the model running on my GPU. So if we head back to the opencode instance where the actual inference is happening, you notice that it's done generating its response, and of course I can use both Qwen and Gemma running locally on my own hardware without having to pay for an AI subscription. One word of note here: when I really need to run something with frontier-level intelligence, like Kimi K2.5, I can't do that locally, because it's a 1-trillion-parameter model and that requires 2 terabytes of VRAM.
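Because llama-swap sits behind a single OpenAI-compatible endpoint, switching models from a client's point of view is just changing the `model` field in the request; the unload/load dance happens server-side. A minimal Python sketch, assuming a local endpoint at `127.0.0.1:8080` and placeholder model IDs matching a hypothetical llama-swap config:

```python
import json
import urllib.request

# Hypothetical local endpoint; point this at your llama-swap instance.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_chat_request(model: str, prompt: str) -> tuple[str, dict]:
    """Build an OpenAI-compatible chat completion request.

    llama-swap routes on the "model" field: naming a different model
    here is what triggers the swap on the GPU before inference runs.
    """
    url = f"{BASE_URL}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, payload


def chat(model: str, prompt: str) -> str:
    """Send the request to the local server and return the reply text."""
    url, payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("gemma-4", "hi")` and then `chat("qwen-3.5", "hello")` would reproduce the swap shown in the video: the first model is unloaded and the second loaded between the two requests.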
That's not something I can run locally. So what I do instead is reach out to a service like Cloudflare Workers AI. If we switch back to the browser, I'll show you this. Cloudflare Workers AI has Kimi K2.5 and many other really large models I can't run locally. So I reach out for this, because it's great: I have 10,000 neurons per day, so I can actually run the really large workloads on these massive models in the cloud. But everything else I need to do can all be local, because I have both Gemma and Qwen running locally.
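For the cloud fallback, Workers AI models can be invoked over Cloudflare's REST API with an account ID and API token. The sketch below is an assumption-laden illustration, not the video's exact setup: the model slug is a placeholder (check the Workers AI model catalog for real IDs), and the credentials are obviously fake.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_workers_ai_request(account_id: str, api_token: str,
                             model: str, prompt: str):
    """Build a request for Cloudflare Workers AI's run endpoint."""
    url = f"{API_BASE}/accounts/{account_id}/ai/run/{model}"
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
    }
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return url, headers, payload


def run_remote(account_id: str, api_token: str, model: str, prompt: str):
    """Send the prompt to Workers AI and return the parsed JSON response."""
    url, headers, payload = build_workers_ai_request(
        account_id, api_token, model, prompt
    )
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The point of the pattern is that the client code is nearly identical to the local case: same message shape, different base URL and auth, so routing "frontier-sized" prompts to the cloud is a one-line decision in your workflow.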
All right. So I hope this inspires you to go start your own local AI stack. I'm going to leave the blog post showing you how to actually configure this on your local device and set it up so that it's always up and running, even when you reboot your computer. I think this is really cool. I think local AI is going to be the future of LLMs. So let me know what your thoughts are, and I'll be making more videos on this topic. Don't forget to leave a like and subscribe, and I'll catch you in the next video.
Cheers.