Cancel your AI subs, local LLMs are here

Cloudflare Developers | 00:06:48 | Apr 9, 2026
Chapters: 7
Explains why small, locally runnable models are appealing: no subscription, private conversations, and independence from external quotas.

Local LLMs like Gemma and Qwen 3.5 run on your own hardware, cutting AI-subscription reliance and boosting privacy and control.

Summary

Cloudflare Developers’ video showcases a pragmatic push toward local LLMs. The creator highlights how models like Gemma 4 and Qwen 3.6, now small enough to run on consumer laptops and phones, remove the need for expensive GPUs or paid AI subscriptions. He demonstrates a tangible local stack built around opencode with two models (Gemma 4 and Qwen 3.5) served via a local OpenAI-compatible API on a machine with 36 GB of VRAM. Tools like nvtop reveal GPU usage, while llama-swap handles automatic model swapping to fit within VRAM constraints. When larger models are needed, such as Kimi K2.5 (a trillion-parameter model), he taps into Cloudflare Workers AI for cloud-assisted workloads. The video includes a blog post link with detailed setup steps to keep the stack running after reboots. Overall, the takeaway is that local AI stacks can power everyday workflows with privacy, unlimited local compute, and offline capability, without surrendering access to high-end models via cloud services.

Key Takeaways

  • Gemma 4 and Qwen 3.5/3.6 are small enough to run on consumer hardware, eliminating the need for $70k+ GPUs.

Who Is This For?

Essential viewing for developers and product teams exploring offline, private AI workflows. It’s especially valuable for those who want a hands-on blueprint to run local LLMs like Gemma and Qwen, and to understand when to offload to cloud services for very large models.

Notable Quotes

"Last week, Qwen 3.6 got announced and Google dropped Gemma 4."
Sets the premise that new, capable models are available in smaller, local-friendly forms.
"You can run them on the hardware you have today. And it's really important because it means you don't need an AI subscription to have access to good models."
Highlights the core benefit of local models: no subscription, local compute.
"it's private. It's not going out somewhere to be used to train models or used for ads. It works offline because you're actually running the models on your device."
Emphasizes privacy and offline capability.
"I have 36 gigs of VRAM and I can actually run these models at like max config."
Shows concrete hardware requirements and feasibility.
"When I really need to run something with frontier-level intelligence... I can't do that locally because it's a 1-trillion-parameter model and that requires about 2 terabytes of VRAM."
Sets expectations for limits and when cloud help is needed.

Questions This Video Answers

  • How can I run Gemma 4 locally on a consumer PC?
  • What is llama-swap and how does it help with local LLMs on limited VRAM?
  • What are the trade-offs between local LLMs and cloud-hosted models for heavy workloads?
  • Can Cloudflare Workers AI handle trillion-parameter models like Kimi K2.5 efficiently?
  • What steps are involved in setting up an offline AI stack on a laptop?
Tags: Local LLMs, Gemma 4, Qwen 3.6, Qwen 3.5, llama-swap, nvtop, opencode, Cloudflare Workers AI, Kimi K2.5, offline AI
Full Transcript
Last week, Qwen 3.6 got announced and Google dropped Gemma 4. I think it was really exciting, like everyone was excited, myself included, because these new models are really good, but also really small in the sense that they can run on any device you own, like on the laptop you own or on your mobile device. And why this is important is because you don't have to go out and buy a $70,000-plus GPU to run these local models. You can run them on the hardware you have today. And it's really important because it means you don't need an AI subscription to have access to good models. These AI labs have been very flaky in the last couple of weeks, limiting your quota usage or limiting what applications you can use their models on. When you have these models running locally, you can use them for whatever you want and as many times as you want. In fact, you have unlimited tokens because it's all compute running on your own hardware. And of course, the conversations are private. It's not going out somewhere to be used to train models or used for ads. It works offline because you're actually running the models on your device. So, I'm going to show you the kind of setup I have for my own local AI stack. And the goal of this is to inspire you to actually get these models downloaded, run them locally, and use them to power your own workflow. It could be for just having general chats and conversations, or it could be for use in a coding agent like opencode or other coding agents that you may already be using. So let me show you what I have. Right here I'm on my computer and I'm just going to show you my opencode configuration. All right, so this is my opencode configuration. I have a configuration for a local LLM setup, which is an OpenAI-compatible API. You can see this is a local IP address; that's all local. And I have two models.
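[For reference, an opencode configuration like the one he describes, a local OpenAI-compatible provider with two models, might look roughly like this. The IP address, port, model IDs, and context sizes below are placeholders, and the exact schema can vary between opencode versions:]

```json
{
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local llama.cpp",
      "options": {
        "baseURL": "http://192.168.1.50:8080/v1"
      },
      "models": {
        "gemma-4": { "name": "Gemma 4 (local)" },
        "qwen-3.5": { "name": "Qwen 3.5 (local)" }
      }
    }
  }
}
```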
I have the Gemma model with a maxed-out context window, and I also have the Qwen 3.5 model, also with a maxed-out context window. And I'm running both models maxed out, all locally. To actually show this to you, I'll switch to another tab and let's do neofetch. Okay, so this is my hardware configuration. I have 36 gigs of VRAM and I can actually run these models at like max config. So let's head back here. I'm just going to start up opencode. I'm in a project directory, I'm just going to start up opencode, and I have Gemma 4 selected. This is the model I'm using by default. I can ask it a question; I'm just going to say hi, for example. And this should come back with a response. I'm just going to give it a minute... Yep, it's coming back with a response. And I can ask it what this project is about. All right, and it's going to read the readme file and tell me what the project is about. And yeah, look at it go. It's completely correct: it's a todo app built using Durable Objects, and that's what the project is about. Now, let me show you something really cool. I have a tool here called nvtop. It's just like htop, but it shows you the GPU utilization on your hardware. You can see we have 36 gigs of VRAM, and the model currently running is taking up 22 gigs of VRAM. That's because I'm maxing out the context window. One cool thing I have here is that I don't have enough VRAM to run both Gemma and Qwen simultaneously. So, I'm using a cool piece of software called llama-swap, which swaps out the actively running model on the GPU. So I can actually configure various models, but just have one running at a time, and it does the automatic swapping. Like, it's really good.
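[A llama-swap setup like the one described, where only one model occupies the GPU at a time, is driven by a YAML config mapping model names to backend launch commands. The binary paths, GGUF filenames, and context sizes below are placeholders, so check the llama-swap README for the exact options your version supports:]

```yaml
# llama-swap config.yaml (sketch): one entry per model; llama-swap
# starts/stops the matching llama-server process on demand, so only
# one model sits in VRAM at a time.
models:
  "gemma-4":
    cmd: >
      llama-server --port ${PORT}
      -m /models/gemma-4.gguf
      -c 131072 -ngl 99
  "qwen-3.5":
    cmd: >
      llama-server --port ${PORT}
      -m /models/qwen-3.5.gguf
      -c 131072 -ngl 99
    ttl: 300   # optionally unload after 5 minutes of inactivity
```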
If you want to see what my setup looks like, I have an article on my blog, which I'll be leaving in the description below, that goes through the details of how I actually got this set up. You can just read through it, and hopefully it inspires you and gives you an idea of how to get your own local AI stack running. Okay, I'll have the blog post linked below. So let's head back here. I am just going to switch to using Qwen, for instance. So let's switch model to Qwen 3.5, and I can say hello, for instance, send that, and let's see how that works in practice. So I'm just going to switch to the other tab. If you look at this graph, you can see that this has unloaded the Gemma model. You see the orange line has gone down and then gone back up, which is the Qwen model being loaded into memory. And you can see the teal or green-looking line showing you my actual GPU utilization, which is going up to like 100% right now, because it's running the inference on the new model. It's still on llama.cpp, but llama-swap handles the intelligent swapping of the models, and this is Qwen being executed right now as the model running on my GPU. So if we head back to the opencode instance where the actual inference is happening, you notice that this is done generating its response, and of course I can use both Qwen and Gemma running locally on my own hardware without having to pay for an AI subscription. Just one word of note here: when I really need to run something with frontier-level intelligence, like Kimi K2.5, I can't do that locally, because it's a 1-trillion-parameter model and that requires about 2 terabytes of VRAM. That's not something I can run locally. So what I do instead is reach out to a service like Cloudflare Workers AI. If we switch back to the browser, I'm just going to show you this.
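[The swap he demonstrates is triggered by nothing more than the "model" field in an ordinary OpenAI-compatible chat request: llama-swap inspects it and loads the matching backend before proxying. A minimal sketch of such a request, with a placeholder host, port, and model ID:]

```python
import json

# Hypothetical local endpoint exposed by llama.cpp's OpenAI-compatible
# server behind llama-swap; host and port are placeholders.
BASE_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> bytes:
    """Build an OpenAI-compatible chat-completions request body.

    llama-swap keys off the "model" field: if a different model is
    currently loaded, it unloads it and starts the requested one
    before forwarding the request.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("qwen-3.5", "hello")
# To actually send it (requires the local server to be running):
# import urllib.request
# req = urllib.request.Request(BASE_URL, data=body,
#     headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```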
Cloudflare Workers AI has Kimi K2.5 and many other really large models that I can't run locally. So I reach out for this, because it's great: I have 10,000 neurons per day, so I can actually run the really large workloads on these massive models in the cloud. But everything else I need to do can all be local, because I have both Gemma and Qwen running locally. All right. So I hope this inspires you to go start your own local AI stack. I'm going to leave the blog post showing you how to actually configure this on your local device and set it up such that it's always up and running, even when you reboot your computer. I think this is really cool. I think local AI is going to be the future of LLMs. So let me know what your thoughts are, and I'll be making more videos on this topic. So don't forget to leave a like and subscribe, and I'll catch you in the next video. Cheers.
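[The cloud fallback he describes goes through the Workers AI REST endpoint, which follows the pattern accounts/{account_id}/ai/run/{model}. A sketch of building such a request; the account ID is a placeholder and the model slug is hypothetical, so check the Workers AI model catalog for the real one:]

```python
import json

ACCOUNT_ID = "your-account-id"            # placeholder
MODEL = "@cf/moonshotai/kimi-k2.5"        # hypothetical slug

def build_run_request(account_id: str, model: str, prompt: str):
    """Build the URL and JSON body for a Workers AI /ai/run call."""
    url = (
        "https://api.cloudflare.com/client/v4/"
        f"accounts/{account_id}/ai/run/{model}"
    )
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

url, body = build_run_request(ACCOUNT_ID, MODEL, "Summarize this repo")
# To send for real (needs an API token with Workers AI permissions):
# import urllib.request
# req = urllib.request.Request(url, data=body.encode(),
#     headers={"Authorization": "Bearer <API_TOKEN>",
#              "Content-Type": "application/json"})
```

This keeps the heavy, trillion-parameter inference in the cloud while the local stack handles everything else.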
