How to Set Up OpenCode with Local Gemma 4 Models | Full Step-by-Step Guide

Tony Xhepa | 00:09:42 | Apr 10, 2026
Outline of installing OpenCode AI and preparing to use local models with Llama.cpp or Ollama.

Tony Xhepa demonstrates setting up OpenCode with Gemma 4 locally via Llama.cpp, detailing memory needs, quantization, and config tweaks.

Summary

Tony Xhepa walks you through configuring a local OpenCode setup using Gemma 4 with Llama.cpp (macOS focus), following a prior Ollama walkthrough. He shows how to install OpenCode AI, optionally Ollama, or switch to Llama.cpp, and where to grab the model from Hugging Face. He highlights memory considerations, noting that a 26B model can push RAM usage toward 28–30 GB depending on context size and caching. The video covers selecting a 4-bit quantization (the Q4_K variants) and running the llama-server command with the -hf Hugging Face flag, including --jinja for tool support. Xhepa starts the server with a 32K context on port 8033, then verifies memory usage in Activity Monitor while issuing a sample prompt. He also shows how to wire the local model into OpenCode by editing opencode.json (provider: llama-local; name: Gemma 4 26B) and querying a sample Laravel app, revealing a 30+ GB memory footprint. Finally, he updates .env to switch the app to MySQL with the root user and no password, and notes how context size and token count drive memory use. The video wraps with a reminder to like, subscribe, and share if you found the guide helpful.

Key Takeaways

  • Install prompts: on macOS, follow the steps to install Homebrew, Git, and CMake, then install or switch to Llama.cpp for local model serving.
  • Model choice and size: Gemma 4 26B (GGUF, 4-bit quantization) can be used with Llama.cpp, with 4-bit variants around 16–17 GB, and memory usage climbing to ~28–30 GB during operation.
  • Server command example: run 'llama-server -hf <model-name> --jinja -c 32768 --port 8033' to serve the model locally with a 32K context; the --jinja flag enables the chat template needed for tool calls.
  • OpenCode integration: add a local provider in opencode.json (provider: llama-local; npm: @ai-sdk/openai-compatible; model name: Gemma 4 26B) to route OpenCode queries to the local model.
  • Memory reality: running a Laravel example with PHP 8.4, Laravel 13, and Livewire 4.2 pushes total memory to about 30+ GB; context size and active prompts drive consumption.
  • .env tweaks: updating .env to switch the Laravel app's database to MySQL with the root user and an empty password shows how runtime changes can be tested without leaving the local setup.
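The macOS setup above can be sketched as a few terminal commands. The Hugging Face repo name below is a placeholder, not the exact one from the video; copy the real repo and quantization tag from the model page:

```shell
# Install prerequisites via Homebrew, then llama.cpp itself
brew install git cmake
brew install llama.cpp

# Serve a 4-bit GGUF from Hugging Face with a 32K context on port 8033.
# --jinja enables the chat template that tool calling relies on.
# <user>/gemma-4-26b-GGUF is illustrative; use the actual repo name.
llama-server -hf <user>/gemma-4-26b-GGUF --jinja -c 32768 --port 8033
```

Once the server is up, its OpenAI-compatible API is reachable at http://localhost:8033/v1.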

Who Is This For?

Essential viewing for developers setting up OpenCode with local Gemma 4 models or exploring Llama.cpp-based local LLM deployments; it covers concrete steps, memory expectations, and config nuances.

Notable Quotes

"First, run this command on the terminal. Then, install Homebrew, and then, install Git and CMake."
Shows the initial setup steps for macOS users.
"Here, we have models and the size here, which is 9.67, 9.6 also this one, 18, and 20."
Discusses model size options and their memory implications.
"If you download this 26, you can open the terminal and run the Ollama launch open code dash dash model, and then, copy this name and paste it after here."
Describes how to launch a local model with a chosen name.
"Memory used is 30.6 GB, which is okay. It's updating the database."
Notes practical RAM usage during a deployment scenario.
"So, that's it all about this video, what I wanted to show you."
Closing summary and sign-off.

Questions This Video Answers

  • How do I set up OpenCode with a local Gemma 4 model using Llama.cpp on macOS?
  • What RAM do I need to run Gemma 4 26B with 4-bit quantization in OpenCode?
  • How do I configure opencode.json to point to a local Llama.cpp model?
  • What are the differences between Ollama and Llama.cpp when running Gemma 4 locally?
  • How does changing the context size (e.g., 32K vs 62K) affect memory usage in a local LLM setup?
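The last question can be made concrete with a back-of-the-envelope KV-cache estimate. The architecture numbers below (layer count, KV-head count, head dimension) are illustrative placeholders, not Gemma's actual configuration; the point is that cache memory grows linearly with context length, on top of the ~17 GB of model weights:

```python
def kv_cache_gib(ctx_tokens: int,
                 n_layers: int = 48,     # placeholder layer count
                 n_kv_heads: int = 8,    # placeholder KV-head count
                 head_dim: int = 128,    # placeholder head dimension
                 dtype_bytes: int = 2) -> float:  # fp16 cache entries
    """Approximate KV-cache size in GiB:
    2 (K and V) x layers x KV heads x head dim x tokens x bytes per value."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * dtype_bytes / 2**30

# Doubling the context roughly doubles the cache:
print(kv_cache_gib(32768))  # 6.0
print(kv_cache_gib(65536))  # 12.0
```

With these placeholder numbers, moving from a 32K to a 64K context adds about 6 GiB of cache, which matches the pattern in the video where total usage climbs from ~28 GB toward 30+ GB as the context fills.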
OpenCode · Gemma 4 · Llama.cpp · Ollama · Hugging Face · LLM · 4-bit quantization · LM Studio
Full Transcript
Hello friends, Don here. Welcome. In today's video, I'm going to show you how to set up your machine to work with a open code and local models, in this case, Gemma 4. Now, in the previous video, I worked with Ollama, which is very easy. First, you need to download the open code AI, install the open code. So, you can go to open code AI and, for example, copy this command. And then, also download the Ollama. If you work with Ollama, or in this video, I'm going to work with the Llama.cpp. You can go to llama.cpp.com. Here, we have the download for Windows, Linux, and macOS. Because I'm using macOS, let's go to macOS installation. You can download here, or go to installation guide. And yeah, first, run this command on the on your terminal. Then, install the Homebrew, and then, install Git and CMake. At the end, install also the Llama.cpp. Okay. But, if you want to work with the Ollama, you just copy this command and install the Ollama, and then, you can go to models. For example, Gemma 4. Now, you need to know your uh how memory how RAM memory you have in your machine. In my machine, I have 32 GB. And you can see we have models and the size here, which is 9.67, 9.6 also this one, 18, and 20. Now, because this is 20, it's not mean that only 20 GB is going to be enough in your machine. 20 GB is only the model, but when you work, also you can see here, we have context 256K. And it's going to work also with a cache, and it's going to be much larger than the 20 GB of memory. You have also the 26 billion, which is uh 18 GB. I have downloaded this 26 billion, but with the Llama.cpp. And here, for example, if you download this 26, you can open the terminal and run the Ollama launch open code dash dash model, and then, copy this name and paste it after here, or you just say Gemma 4. And that's it. You can work with open code with the Gemma 4 local model. But, I'm going to work with Llama.cpp. 
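The Ollama route described above would look roughly like this. The model tag is a placeholder for whatever name you copy from the Ollama model page, and this is a sketch of the commands as described in the video, not a verbatim quote:

```shell
# Pull a Gemma variant from Ollama's registry (tag is illustrative)
ollama pull <gemma-model-tag>

# Launch OpenCode pointed at that local Ollama model
opencode --model ollama/<gemma-model-tag>
```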
I have installed, and now, to download the model, you can go to Hugging Face, and you can search for model here, for example, Gemma 4. And I have downloaded this Gemma 4 26 billion. Which is GGUF. And you can see here, I have chose 4-bit quantization. You can see 4-bit we have quantization small, which is 16.4 GB. I have this quantization 4K which is almost 17 GB. And if you want, you can go right here on the drop down, and for example, you can choose local apps, which is Llama.cpp, LM Studio, also LM Studio is very good, Jan, vLLM, Ollama, by, and so on. So, if you um because I'm going to work with Llama.cpp, you can click here, and it suggests the first install Llama, and then, say llama dash server dash HF Hugging Face, and the name of the model. And here, we have the drop down to select the quantization. By default, it's this Q M. Okay. You can copy this command, but let me just come here and say llama server HF dash HF and the name of the model. And here, I have added also dash dash ninja, because I want to work also with the tools. And here, dash C is the context. How much context? I'm going to give it first 32K, and then, the port is going to be 8033. Hit enter. And I'm going to show you I'm going to open the activity monitor and show you the memory here. How much memory? As you can see, it's going to start to use 26 27. Yeah, almost 28 GB of memory. Yeah, now it's 28. And we can open. So, let's copy this. Let's go. And we can open here. So, here is the Llama.cpp. We can say, for example, hi. As you can see, here is the model. [gasps] And yeah, hello. I'm going to zoom it a little bit. Because you can see, it is 40 five tokens per second, which is very good. Now, to use this with the open code, let's open another terminal, and I'm going to go to CD .config /open code. And I'm going to open this open code with code editor. And here, I have created this open code.json file. And you can copy and paste this in your machine, the same thing. 
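Under the hood, the browser chat shown here talks to llama-server over its OpenAI-compatible HTTP API. A minimal sketch of such a request follows; the port comes from the video, the model id is a placeholder, and the actual POST is left as a comment since it needs the server running:

```python
import json

BASE_URL = "http://localhost:8033/v1"  # llama-server port used in the video

def build_chat_request(prompt: str, model: str = "local-gemma") -> dict:
    """Assemble an OpenAI-style chat-completion payload for llama-server.
    The model id is a placeholder; llama-server serves whatever model it loaded."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_chat_request("hi")
print(json.dumps(payload, indent=2))
# To actually send it (requires the server to be running):
#   requests.post(f"{BASE_URL}/chat/completions", json=payload)
```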
So, the final JSON here with dollar sign schema open code.ai/config.json, provider llama local, name Llama.cpp, NPM @ai-sdk/openai-compatible, and options base URL is the URL we have here, /version 1. Okay? And the then, models local model ID, name. You need to copy the name you have. So, in my case, is this is the name, and tool call true. With that, now, let's close this, and let's come here. I'm going to open a new tab, which is in this third example app. And I'm going to say open code. Okay, you can see now it's going to use this Gemma 4 26 billion, and I'm going to ask what stack is used in this app, which is a Laravel app. And the memory used is here 29 GB of memory. Which is okay, because we have 32. As you can see, it's calling the Laravel boost application info, and this app uses the following stack: PHP 8.4, framework Laravel 13, front end Livewire 4.2, database SQLite, testing, and authentication Laravel Fortify 1.36. And here, we can see this is going to 30.4.5 GB. But, what if I change Let's stop this one, and I'm going to run again, but I'm going to change uh the context from 32 to 62. Okay, let's come here, and I'm going to exit, and I'm going to open the code again. Okay, 28. Now, I'm going to ask something. I'm going to say update the .env file to use MySQL database with the username root and empty password. Let's see now here. We're going to allow. Reading the .env file. And let's see here. Yeah, it's 30. 6 GB used, which is okay. It's updating the database. I have updated, and now, if I come here, here is the DB connection. MySQL. And database Laravel, username root, and password is empty password. Okay, so I updated .env file to use MySQL with root and an empty password. Okay. And the memory used is 30.6. And also, you can see here, 80.3 K context. If we ask more in this session, it's going to go up and up. So, I think, yeah, 60K is okay. So, that's it all about this video, what I wanted to show you. 
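The configuration read out above maps to roughly this opencode.json. The field names follow what the transcript describes; the model id under "models" must match the name your GGUF reports, so treat "gemma-4-26b" below as a placeholder:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama-local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Llama.cpp",
      "options": {
        "baseURL": "http://localhost:8033/v1"
      },
      "models": {
        "gemma-4-26b": {
          "name": "Gemma 4 26B",
          "tool_call": true
        }
      }
    }
  }
}
```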
Now, if you like such a videos, don't forget to subscribe to my channel, like the video, share with your friends, and I'm going to see you in another one. All the best. Thank you very much.
