Your Guide To Local AI | Hardware, Setup and Models

Syntax | 00:25:00 | Mar 13, 2026
Chapters: 16
Overview of local AI on a mini PC, hardware options, and a look at whether such a setup can replace a $200 AI subscription, plus the author’s experiments with models, prompting, and agent workflows.

Get a practical tour of running local AI on a mini PC, with real-world hardware choices, models, and workflows you can actually try.

Summary

Syntax host CJ walks you through building a local AI setup at home, starting with a live demo of an LLM running on a mini PC and achieving over 40 tokens per second. He breaks down core terms like inference, GPU memory, and quantization, and explains why GPUs and lots of VRAM matter for 70B-parameter models. CJ compares unified memory systems (including AMD Strix Halo and DGX Spark) to traditional CPU-GPU RAM layouts, and weighs price versus performance across options like Nvidia and workstation GPUs, Macs, and AMD-based rigs. He shares practical setup steps: choosing a model, using Llama CPP with the Qwen family, and leveraging containerized toolboxes (Vulkan/ROCm) to optimize drivers and performance. You’ll see how he wires in tools like OpenCode for agentic workflows, web search grounding, and a VS Code Insiders setup for local AI. The video also covers pros and cons of a no-subscription, local-first approach, including guard rails, tests, and how to manage agent workflows to stay aligned. By the end, CJ candidly answers whether he’d cancel paid AI subscriptions, naming trade-offs and when a local setup makes sense for personal projects. Expect concrete model choices, memory calculations, and actionable tips you can replicate.

Key Takeaways

  • Running a 70B-parameter model locally requires substantial VRAM; quantization (8-bit/4-bit) can reduce memory needs from 140 GB to roughly 70 GB or less.
  • Unified memory architectures (AMD Strix Halo, DGX Spark, Mac Studio-style systems) let the CPU, GPU, and NPU share memory, improving feasibility and cost for local LLMs.
  • Qwen3 Coder 30B with 4-bit quantization fits in under 20 GB on the author's setup, enabling longer context windows and practical local inference.
  • Llama CPP, together with ROCm (or Vulkan drivers), provides fast, optimized local inference on AMD hardware, especially when paired with community toolboxes like the AMD Strix Halo Toolbox.
  • OpenCode or similar agent frameworks with guard rails and testing dramatically improve the reliability of local AI for coding tasks compared to ad-hoc prompting.
  • Grounding via web search is crucial for minimizing hallucinations when running locally; the setup can replace some external AI services for initial queries and summaries.
  • Local AI requires ongoing management (drivers, memory allocation, and environment tuning); it’s a one-time hardware investment with ongoing setup, but it avoids subscription costs.

Who Is This For?

Developers and tech enthusiasts curious about replacing cloud AI subscriptions with a privacy-conscious, home-based setup. It’s especially helpful for those exploring AMD-based hardware and local models like Qwen3 Coder or MiniMax, and who want practical setup steps and real-world trade-offs.

Notable Quotes

"Check this out. What you are looking at is an LLM running inside my home."
Intro demo of the local AI setup running on a mini PC.
"We need GPUs for this."
Explaining the hardware requirement for running LLMs locally.
"Unified memory that’s accessible by the CPU, the neural processing unit, and the GPU."
Describing unified memory architecture as a key feature.
"Am I going to cancel my Claude subscription? No."
CJ answers the subscription question candidly.
"Llama CPP is what I went with and it’s fully optimized."
Choosing a local inference path with Llama CPP.

Questions This Video Answers

  • How can I run a 70B-parameter model locally with limited VRAM?
  • What is unified memory and why does it matter for local AI setups?
  • Which models and quantizations work best on AMD Strix Halo hardware for local inference?
  • What are practical workflows to use local LLMs for coding in VS Code?
  • Is it feasible to replace cloud AI subscriptions with a home-based setup long-term?
Local AI, Unified memory, AMD Strix Halo, Qwen3 Coder 30B, Qwen 3.5, Llama CPP, GGUF, Vulkan/ROCm toolbox, DGX Spark, OpenCode integration, web search tool
Full Transcript
Check this out. What you are looking at is an LLM running inside my home. It's actually on that mini PC back there. And I'm getting over 40 tokens per second using this particular model to get a very decent answer to this JavaScript question that I just asked it. So, in this video, we're talking all about local AI. I am going to show you the machine that I got, talk about its specs, talk about your options when it comes to hardware for running local AI, and really just try to answer the question: can a mini PC like this replace a $200 subscription to an AI company? I'm also going to dive into what models are available and really just show you all the things that I've tried in terms of basic prompting, hooking it up to MCP tools or tool servers, using it inside my editor, and also some more agentic workflows, to give you my opinions on all this stuff so you can decide whether or not you want to get into this as well. That sounds good. Let's dive in. My name is CJ. Welcome to Syntax. All right, let's talk about hardware and some key terms and things you should know if you're getting into this world of local AI. First up, we're talking about inference. So, whenever you type a question into ChatGPT or Claude or whatever LLM you use, that is performing inference. It's essentially taking a model that's already been trained. You're passing in some new text and then it's predicting the output text. So, this is known as inference, and that's what I'm doing with my local AI. It's also possible to train your own model or fine-tune an existing model. That is something I have not tried yet and not something I was optimizing for when I was looking for hardware. So inference is what we're trying to do. The next thing to know is that we need GPUs for this. There's a video from 3Blue1Brown called Large Language Models Explained Briefly; there's a whole series on neural networks from 3Blue1Brown.
But they talk about how GPUs basically have tons of processors inside of them and can do tons of parallel processing. So they're much more powerful than just a CPU because they can do a lot of things in parallel. And these models are basically just big matrices of numbers, and GPUs can perform calculations across all of those matrices really, really fast. So we're performing inference, and we need a GPU to do that. Now, for any given model that you're trying to run inference on, you need a whole lot of memory. This article here talks about some basic calculations that you can do. Let's say you're trying to run a 70 billion parameter model. We'll talk about what the various parameter sizes are, but the base calculation here is that if you're running this model at full precision, that would need 140 GB of video memory. So it's a lot. The other aspect to think about is that you don't only need video RAM for the model itself (when you're running these models, the whole thing has to fit into VRAM so it can be processed quickly). You also need to care about the key-value cache. Every prompt that you type in needs to be vectorized so that it can be run through the model. And if you had to do that every single time for every word or every prompt that you're giving it, it would take a whole lot longer. So the way a lot of these things are set up is that they essentially pre-calculate those values and then reuse them on each subsequent prompt. So you also need VRAM to hold on to that cache as well. All of this is mounting up to: you need GPUs, and you need a whole lot of RAM. Now, we talked about this 70 billion parameter model running at 16-bit precision. There is this idea of quantization, and that's essentially taking an existing model that's at 16-bit precision and dumbing it down a little bit by rounding all of the numbers that exist inside of the model.
So at full precision you would need 140 gigabytes of RAM. But if you dumb down the model a little bit to 8-bit or 4-bit precision, you get to the point where a 70 billion parameter model maybe only needs 70 gigabytes of video memory, or 30 gigabytes, depending on the quantization. Now, if you want to learn more about quantization, there's a fantastic article from Maarten Grootendorst called A Visual Guide to Quantization. It talks about what it is; it's fascinating, but all you need to know is that these models that you're going to be trying to run locally come in different sizes and quantizations. A great place to start, and what I've been using to learn about these models and figure out how to run them, is Unsloth. Now, Unsloth also has a lot of guides on fine-tuning models, that is, taking an existing model and potentially repurposing it for something else you're trying to do. But they also have a lot of guides on just running the models that they themselves have quantized. This page runs through a bunch of models, but just as an example, if we look at Qwen 3.5, which is one of the best local models you can run right now: on the page for that model, they give you a chart that shows the various precisions and the models with their various numbers of parameters, so 27 billion parameters, 35 billion parameters, and then for each precision, how much video memory is required to load that model in. So you can start to see that as you find these models, you're going to need a GPU that has enough memory to load the model itself, plus a little more memory for the key-value cache as well. With all of that base knowledge, now you need to find yourself a GPU, and it is not cheap. So just looking on PCPartPicker, which is a place where you can spec out your own machines.
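The memory math above can be sketched in a few lines of Python. The weight formula is just parameter count times bytes per parameter; the KV-cache formula is a common approximation, and the layer/head counts used below are illustrative assumptions, not the specs of any particular model.

```python
def weight_gb(params_b: float, bits: int) -> float:
    """Approximate VRAM for the model weights alone, in GB."""
    return params_b * 1e9 * (bits / 8) / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bits: int = 16) -> float:
    """Rough KV-cache size: one K and one V vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * context * (bits / 8) / 1e9

# The 70B example from the video: 140 GB at 16-bit, 70 GB at 8-bit,
# and roughly 35 GB at 4-bit (the video rounds this to about 30).
for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: {weight_gb(70, bits):.0f} GB")

# Illustrative KV cache for a 32k-token context (layer/head counts made up)
print(f"KV cache: {kv_cache_gb(layers=48, kv_heads=8, head_dim=128, context=32768):.1f} GB")
```

The takeaway matches the transcript: the weights dominate, but you still need a few spare gigabytes on top of the model size for the cache.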
I went in and looked at video cards, and you can see, sorted by memory, that a video card with 48 gigabytes of memory, an RTX 6000, is $7,000. And of course, it depends on where you buy it; the RTX A6000 is like $1,000 less. You could potentially run an AMD GPU with a similar amount of video memory for about half the price, but then you're also going to potentially run into compatibility issues, because certain platforms only run on AMD versus only on Nvidia. So right off the bat, video cards themselves are steep. If you wanted to build a machine just with this 48 GB memory video card, your base price is $7,000. That's not including memory, CPU, a motherboard, everything else. PCPartPicker, for whatever reason, only has these consumer-grade GPUs, and right now, I don't know why, I can only see up to 48 GB of memory. They do make GPUs that have more memory. If you look at the RTX Pro 6000, it has 96 GB of memory, but it's almost $10,000. And there are other ones, including enterprise-level GPUs. All of that to say: you need a GPU with a lot of memory, and GPUs with a lot of memory are very expensive. One of the first platforms to try to make this more accessible for people running local AI was Nvidia with their DGX Spark. It's this little machine right here that actually has 128 GB of memory, unified memory that can be used by either the GPU that's on it or the CPU itself. This makes it so that you can have a single machine, not need a really expensive GPU, and still run some of these models. So this 128 GB of memory is shared between the GPU and the CPU, which means you can load a 70 billion parameter model and have room for the key-value cache on that particular machine. Now, the DGX Spark is a little more approachable in terms of price. It has, like I said, 128 GB of unified memory, and it's around $4,000. So that's great.
You could start there. But AMD came along and brought us the Ryzen AI Max+ 395 processor, which has really good scores. In all of their charts, they're comparing it to an Intel i7, which is not comparable because it doesn't use a unified memory architecture. But all that to say, their platform, very similar to the DGX Spark, uses a unified memory architecture and is a lot more approachable in terms of price. Now, when I say unified memory, you might be thinking of Apple, because they were the first to do it with their M1 architecture back in 2020. Essentially, with a traditional setup, you have a CPU that talks to the RAM you plug into the motherboard, and then you have a GPU with onboard video RAM baked into it, and they communicate with each other over PCIe. With the unified architecture, there is unified memory that's accessible by the CPU, the neural processing unit, and the GPU. And like I mentioned, Apple was one of the first to do this: back in 2020 when they released the M1, they came out with this unified architecture. That's why you can also run these local LLMs on Macs. But AMD is what I decided to go with, and that little machine running back there is an AMD Strix Halo machine. Strix Halo is the code name for the 395 processor with the unified architecture. If you go to the Strix Halo wiki, they actually list out all of the types of machines that run this processor. You might be familiar with the Framework Desktop, and this machine runs that exact same processor. You might have seen people getting it for running local AI. This one is very similar, but a bit more approachable in terms of price. Right now, if you look on Micro Center, the machine that I have is selling for $2,500. So, of all the stuff we've shown so far, this is probably one of the most approachable in terms of price.
And then, like I mentioned, there are other platforms that have the same processor with unified memory from Minisforum or from HP. There are a few others as well, like I showed in that wiki. All of them are a little more expensive than the one that I got, which is the GMKtec Evo X2. Now, I bought this thing 2 or 3 months ago, and I got it for $2,100, but with the increase in the price of RAM and everything else, this stuff is only getting more expensive. So I bought in early. This is still a fairly reasonable price to be able to run the kinds of things that I'm going to show you, but this is all the research that I did. It's a whirlwind tour. I'll provide links to all the stuff I showed you in the description if you want to look into it more. And of course, like I mentioned, you could also go with a Mac or Mac Studio, because with Mac Studios you can get 128 GB of unified memory, or 256, or 512, but the price is a little bit more. Now, when I bought the GMKtec Evo, like I said, I paid $2,100, and at the time, it would have been $1,000 more just to get the Mac Studio with the same amount of unified memory. So I just went with that. At this point, with the price of everything increasing, you're kind of getting to the point where it's almost the same price to get this HP or this Minisforum or the Framework Desktop, and it's probably about the same as a Mac Studio. So you can weigh those options. The other thing to consider, though, is that with a Mac you have to run macOS. Now, there is Asahi Linux; I have not looked into whether people are trying to do local AI with Asahi Linux, but typically you would just stick with macOS. So that might limit you in what you can actually run. These PCs usually come with Windows, but you can wipe them, install Ubuntu or Fedora, and then get access to a lot more community packages and workflows where people have been tweaking and working with this stuff to try to get local AI running.
So that was a whole lot to take in, but all of that to say, I have landed on this machine here, and let me show you how I set it up. Now, the machine I got was the GMKtec Evo X2, and it's a nice little machine. In terms of I/O, on the front we've got an SD card reader, a USB-C port (which is USB 4), two USB 3.2 ports, and a 3.5 mm headphone jack. On the back, you've got the DC power in, another headphone jack, a USB 3.2 port, another USB-C port, DisplayPort, HDMI, and two USB 2 ports. We've also got a 2.5 gigabit Ethernet adapter. Now, in terms of sizing, it's about one bread by one bread and half a bread wide. It's also got these nice little feet on the bottom so you can stand it upright, and you do want to use it that way (look at that little buddy) because of the airflow. It comes with a power cable and an HDMI cable, as well as the power brick, which is a 230 W brick running at about 19.5 volts. So, a nice little machine. Let's get this thing set up. Now, when you boot it up for the first time, you do get Windows. It actually took quite a while before it got to the initial Windows screen, and while I was waiting, I found this button on the side which changes the RGB color. And yeah, Windows took forever to boot. So I didn't even go through the getting-started flow. I immediately plugged in a USB drive with the Fedora installer on it, and then I went through the BIOS and updated all of the recommended settings in terms of performance mode, how much memory to allocate to the video card, and a few other options that I found by going through the wiki and the forums. Now, this was actually my first time installing Fedora. I prefer Ubuntu and run it on all my machines, but Fedora is pretty cool. Pretty cool. The install was pretty seamless, and they also had an option to enable the administration tools, which I didn't know what it was going to do.
But the first time it booted up, it immediately made itself available over the web. And that means I could just go back to my computer; I didn't even need to SSH into it. I could go to the web dashboard and start configuring it. They have a built-in terminal there, and I can see the status of everything that's running. This is actually some software called Cockpit. I've run this software on my Ubuntu machines, but it comes from the Fedora project, and it's really cool that you get it up and going out of the box. So, the moment your install is done, you can immediately head to a browser and start configuring your machine. Now, one of the first things I needed to do on this machine was to enable some settings for GTT, which is how Linux is able to allocate memory for the video card versus the CPU. There are quite a few guides out there. There's one from Jeff Geerling that helped initially, but this particular Strix Halo machine used different options for setting how much video RAM should be allocatable. Once I figured out those settings, I was good to go. Essentially, we're allocating up to 108 gigabytes for the video card. And like Jeff Geerling mentions, this essentially allows it to run stable; anything more, and it might kernel panic every now and then. So we have 20 GB of RAM for the OS and everything else that's running, and up to 108 GB that we can use for all of our AI workflows. All right, our hardware is all set up. Now we need to actually run some models, and there are a lot of different ways to do this as well. Ollama is probably one of the most popular platforms. Essentially, it's a CLI tool that can download models and then run them locally. There's also LM Studio, which is a desktop app. They also have a CLI tool where you can manage the downloaded models, and they give you a ChatGPT-like interface.
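As a rough sketch of the arithmetic behind the 108 GB GTT budget mentioned above: the kernel counts GTT memory in 4 KiB pages, so guides convert the desired allocation into a page count. The `ttm.pages_limit`/`ttm.page_pool_size` parameter names printed below are the ones commonly cited in Strix Halo community guides; treat them as an assumption and verify against the wiki for your exact kernel before touching your bootloader configuration.

```python
# Convert a desired GTT budget into the page counts used by ttm kernel
# module parameters (parameter names per community guides -- verify first).
GIB = 1024 ** 3
PAGE_SIZE = 4096  # ttm counts 4 KiB pages

gtt_gib = 108  # leave ~20 GB of the 128 GB for the OS
pages = gtt_gib * GIB // PAGE_SIZE

print(f"ttm.pages_limit={pages} ttm.page_pool_size={pages}")
```

The same page math applies if you pick a different split, such as 96 GB for the GPU and 32 GB for the OS.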
And then there's also vLLM, which is commonly used to cluster and network computers together so you can run LLMs at scale. And then there is Llama CPP, and this is actually what I've settled on. Llama CPP is one of the fastest in a lot of benchmarks, and they actually pioneered the format that a lot of these other tools use: the GGUF file format. So, Llama CPP is what I went with, and getting it set up is pretty involved. Fortunately, there are community packages and libraries for this. So, shout out to kyuz0, who created the AMD Strix Halo Toolbox, which essentially gets you ready to go out of the box with Llama CPP on any machine that has this Ryzen AI Max+ 395 on it. All of this is based on the Toolbox standard. Toolbox essentially works on top of Docker or Podman to package up the various dependencies and things you might need to get something running. And that's exactly what kyuz0 has done for these toolboxes, because when you're running Llama CPP on the AMD Strix Halo, you're going to need special drivers and various things set up and ready to go. You can do all of that manually; at one point, when I was trying to set this up without the toolbox, you'd have to compile Llama CPP from scratch. These toolboxes come with pre-compiled versions, so you're just ready to go. And once you've installed the toolbox CLI, it's just one command to spin up that toolbox, and now you can start running Llama CPP inside it. It's fully optimized, everything's compiled, you're good and ready to go. Now, there are two different toolboxes to choose from: Vulkan and ROCm. Vulkan is essentially the open-source implementation of the AMD drivers, and ROCm uses the proprietary drivers, but in my tests, ROCm gives the best performance. So that's the toolbox that I went with. You run this one command on your machine, and then you're ready to go.
You're ready to start running some models. Now, in the readme for this project, they show a command that will use the Hugging Face CLI to download Qwen3 Coder 30B at BF16 precision. So that's full precision, and it comes in two different files. But if you want to start picking your own models and running your own stuff, first, of course, install the Hugging Face CLI. This gives you access to the hf command, and then you can start pulling models from Hugging Face. Hugging Face is the most popular place for people to host and download models. And since we're using Llama CPP, if you head to the Hugging Face models page, you can filter by Llama CPP, and then every single model that you see will run under Llama CPP, and they typically include the recommended settings for running that specific model. Now, when you're looking through these models, you'll notice there are a lot from Unsloth. They're the ones I mentioned earlier in the video. Essentially, they create quantized versions of models and provide guides for fine-tuning them. Like I showed earlier, they basically have guides for all of these models. The next thing you need to do is actually pick a model. Once you get to the model page over on Hugging Face, you can take a look at the model card and see all of its quantizations. You can see that Kimi K2.5 in this format at full precision is 2 terabytes for the model alone. And then if you look at some of the other quantizations, the size gets smaller and smaller as you go. Now, the machine that I have has 128 GB of unified memory, and only about 108 GB of that is accessible as video memory if you want to run things in a stable way. So I could not run any one of these models. Essentially, you can look at the size of the model; that's how much video RAM it's going to take up. So we can't run Kimi K2.5. There is MiniMax M2.5, and you can see in the 3-bit quant we have a model that is 101 GB.
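The "will it fit" check being described can be made concrete with a small sketch. The quantization sizes below are hypothetical placeholder numbers, not actual Hugging Face listings; the 108 GB budget is from the setup above, and the KV-cache headroom is my own rough assumption.

```python
# Hypothetical on-disk sizes (GB) for quantizations of a single model.
# These numbers are placeholders, not real Hugging Face listings.
quant_sizes = {"BF16": 57.0, "Q8_0": 31.0, "Q4_K_M": 18.0}

VRAM_BUDGET_GB = 108   # usable GPU memory on this Strix Halo box
KV_HEADROOM_GB = 8     # rough allowance for the KV cache (assumption)

def fits(model_gb: float) -> bool:
    """A model fits if weights plus KV-cache headroom stay in budget."""
    return model_gb + KV_HEADROOM_GB <= VRAM_BUDGET_GB

print({q: fits(s) for q, s in quant_sizes.items()})
print(fits(101.0))  # a 101 GB quant leaves almost no room for context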
So you could run that one. In my testing, it works, but because we only have a few more gigabytes to work with after that's already loaded in, your context size is limited to maybe 4,000 or so tokens. So it works, but it starts to slow down the larger your prompts get. You could potentially try the smaller quantizations, but as they get smaller, they get a little bit dumber. And then there's Qwen3 Coder 30B A3B Instruct, and this is the one that I have been using. It's been fantastic. It's been really good. I'll show you some of my tests next, but you can see that the 4-bit quantized version is under 20 GB, so there's plenty of room for really long context windows. You could probably even run the higher-precision ones and still have room for large context windows. So, you'll pick a tool for running your models, you'll pick which model you want to run, take into account how big that model is and whether the format will work with the platform you're running on, and then you're ready to go. Now, let's talk about overall impressions of just prompting the AI. I've been using Qwen3 Coder Next and also the Qwen 3.5 model. For one-off questions, it's pretty decent. It is technically accurate for simple questions. It can generate CLI commands. It can answer questions about various popular JavaScript libraries. All of that seems to work perfectly fine, especially if you hook it up to some tools like web search or documentation search. So, I got Open WebUI up and running, connected it to my llama-server, and enabled the web search tool. Now, whenever you prompt, it can actually search the web and then use the results to ground its answer in the truth instead of potentially hallucinating something. I also found it very reasonable at summarizing things.
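Once llama-server is running, it exposes an OpenAI-compatible HTTP endpoint, which is how front ends like Open WebUI talk to it, and you can hit it from a few lines of your own code too. A minimal sketch, assuming the server is listening at the hypothetical address below:

```python
import json
import urllib.request

# Assumed address -- point this at wherever your llama-server is listening.
LLAMA_SERVER_URL = "http://192.168.1.50:8080/v1/chat/completions"

def build_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat payload for llama-server."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """POST a prompt to the server and return the assistant's reply."""
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        LLAMA_SERVER_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Usage (requires a running server):
# answer = ask("Explain JavaScript closures in two sentences.")
```

Open WebUI and most editor integrations speak this same `/v1/chat/completions` protocol, which is why swapping a cloud endpoint for a local one is usually just a URL change.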
Yeah, I will say it's a little bit slower than something like ChatGPT or Claude, but it is running inside my house on that little machine back there, so I can take that. Also, a lot of times you might fire off a query and come back to it later; that's totally fine. So personally, I see myself replacing Claude, ChatGPT, and Gemini with this for my initial searching of the web and initial question asking. It works perfectly fine for that. I don't have to worry about these AI companies training on my prompts, and it's working great. So for basic question answering and basic searching and summarizing from the web, this has absolutely replaced my usage of Claude, ChatGPT, and Gemini, and now I can do it from the comfort of my own home. Now, when it comes to coding, there are still some things that need to be worked out. It's not perfect. If you want to use VS Code, you have to use the Insiders version of VS Code, which is like the pre-release version, and then you can only use it for agentic workflows; you can't use it for autocomplete. There are plugins and extensions that work inside regular VS Code, but they essentially replace Copilot altogether. There's a tool I tried called Continue, and it basically works like Copilot, but you can customize which AI endpoint it talks to. So I can point it directly at my little machine back there that's running locally, and it works exactly like you would expect Copilot to: it has a chat window, it has agentic workflows where it can create files. But I found that when trying to create projects from scratch using this particular workflow, the AI struggled at various points. I would have to intervene every now and then to fix a config issue or tell it, oh, you're editing the wrong file, or you put it in the wrong place. So it really couldn't do it unguided from basic prompts.
However, if you start to use more structured tooling, so if you try to do more spec-driven development, that absolutely keeps this in line a lot more. And that's really my guidance here: if you're going to try to use this for bigger workflows, you have to be specific about the prompts you pass it and try to use sub-agent workflows. So, I also got this hooked up to OpenCode, and from my experience, that's the best way to work with local AI, because it's supported out of the box, it can do sub-agents, and it has a web-based dashboard, so I can prompt from the couch. I can hook this up to tools like Beads for task management, or Spec Kit for doing the initial spec and design of what I'm trying to build, and from there creating Beads tasks, and from there having an agent orchestrator that spins up sub-agents. So keep guard rails around what you're asking this little AI to do, and give it lots of ways to verify itself. If you have tests, if you have linting and type checking, all of those things are going to be ways for it to verify itself instead of just guessing. And that's basically what I found when I was trying to create a project from scratch with a basic prompt: it started to break down. Right now, you can do that kind of thing with Claude Opus 4.6, and it works perfectly fine. But with some of these dumber models, they need checks and balances. So anything you can do that allows them to verify themselves will keep them in line a whole lot more. I found that having extensive test suites, having the spec drawn out beforehand, having it run through all of those checks before moving on to the next thing, having it work on only isolated pieces at a time, and following this try-things-out, test, repeat cycle got it to stay aligned much more. So it requires a lot more work, and a lot more managing of these agents, to get it working on this little machine here. But again, you don't have to pay a monthly subscription.
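The guard-rail cycle described above (generate, run tests and lint, feed failures back, repeat) can be sketched as a small loop. The `generate` callable below is a placeholder for whatever agent or LLM call edits your code; it is not a real API, and the check commands are whatever your project already uses.

```python
import subprocess
from typing import Callable

def run_checks(cmds: list[list[str]]) -> bool:
    """Run test/lint/type-check commands; True only if all of them pass."""
    return all(subprocess.run(c, capture_output=True).returncode == 0
               for c in cmds)

def guarded_loop(generate: Callable[[str], None], task: str,
                 checks: list[list[str]], max_attempts: int = 3) -> bool:
    """Let the agent attempt the task, but only accept verified results."""
    feedback = task
    for attempt in range(max_attempts):
        generate(feedback)                  # agent edits the code
        if run_checks(checks):
            return True                     # verified: move to the next task
        feedback = f"{task} (attempt {attempt + 1} failed checks; fix and retry)"
    return False                            # escalate to a human
```

For a JavaScript project, `checks` might be something like `[["npm", "test"], ["npx", "tsc", "--noEmit"]]`; the point is that the model never gets to declare its own work done.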
It's a one-time cost, and AI companies don't have access to the source code that you're generating. So, there are some trade-offs. To really answer the question: am I going to cancel my Claude subscription? No. Right now, I'm actually getting some really good use out of Claude Opus 4.6 for some personal projects that I'll be talking about eventually, but the kind of work that I'm doing there is just too complex for this little machine. Now, like I mentioned, if you have the right guard rails, if you have tests and all of the things set up in a way that allows this little machine back there to work within guidelines, you can get it working, but it's a lot more work. It's much easier for me to prompt in a more general way to something like Claude Opus 4.6, and it figures out what I need without all the handholding. Now, I will say that the open-weight models being released by companies like Qwen, and GLM 4.7, and Kimi K2.5, and MiniMax 2.5, all of these models are really decent for being local models, and they're only going to get smaller, and they're only going to get better. That's one of the cool things about this whole ecosystem: if you just wait a month or two, a new model will be released, and all of a sudden the same hardware that you already have can actually have better capabilities because of the models that have been released. So for me, that's the exciting thing: being able to tinker with these things and, like I mentioned earlier, being able to prompt these LLMs from the privacy of my own home without AI companies training on my prompts. So that's all I got for you in this video. Let me know down in the comments if you're going to try this out. Also, let me know if you're already trying this out. Let me know what hardware you're working with.
And if you have any questions about how I set things up or anything else like that, throw those down in the comments as well. All right, I'll see you in the next one.
