Run Local AI 2x Faster on Mac: MLX & oMLX Setup Guide

Tony Xhepa | 00:16:55 | Apr 29, 2026
Chapters: 11
Intro to using MLX on macOS and why oMLX is chosen for efficient local model work.

Mac users can speed up local AI with MLX via oMLX, configure it as a custom provider, and run Gemma or Qwen models directly on macOS.

Summary

Tony Xhepa walks through setting up local AI on a Mac Studio with 32 GB of memory, using MLX because it is more efficient on macOS than GGUF. He installs oMLX, sets the port and API key, opens the admin dashboard, and downloads models either by pasting a Hugging Face repository ID into the oMLX downloader or by browsing trending models there. He filters for MLX-compatible models (Gemma and Qwen variants), explains 4-bit versus 8-bit quantization and the RAM each needs, and shows how to adjust global settings (memory, max context window, port), save them, and restart the server. The guide then covers OpenCode and Pi integration: oMLX is added as a custom provider in opencode.json under ~/.config/opencode and in models.json under ~/.pi/agent, so the local models can be used from a terminal inside a Laravel project. Tony also compares speeds in the chat view, where Gemma 4 generates about 48 tokens per second and Qwen 3.6 27B only about 11, which he considers too slow for coding. The walkthrough ends with a brief look at per-model settings.
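To put the two decode rates from the chat test (48 and 11 tokens per second) in coding terms, the arithmetic can be sketched as below. The reply length of 2,000 tokens is a hypothetical figure chosen for illustration, and both rates were measured on a one-word "hi" prompt, so real coding sessions with long contexts may behave differently.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Decode rates read off the chat view in the video.
GEMMA_TPS = 48.0
QWEN_27B_TPS = 11.0

# Hypothetical size of one agent reply; not a figure from the video.
REPLY_TOKENS = 2000

for name, tps in [("Gemma 4", GEMMA_TPS), ("Qwen 3.6 27B", QWEN_27B_TPS)]:
    print(f"{name}: {seconds_to_generate(REPLY_TOKENS, tps):.0f} s")
```

At these rates the same reply takes well under a minute on Gemma and about three minutes on the 27B model, which is the gap behind the "very slow" remark.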

Key Takeaways

  • MLX is specific to macOS and more efficient there than GGUF, so Tony hosts local models with oMLX instead of Llama.cpp or Ollama.
  • Configuration steps include setting the port to 8383 and the API key to 1234, restarting the server, and opening the oMLX dashboard at 127.0.0.1:8383/admin.
  • Model selection focuses on MLX builds: 4-bit quantization needs less RAM (21.6 GB in the example) than 8-bit (37.7 GB), and a 35B model with only 3B active parameters (A3B) runs faster than a 27B model in which all parameters are active.
  • Gemma 4 served by oMLX is the fast choice for local coding tasks (about 48 tokens per second), while Qwen 3.6 27B reaches only about 11 tokens per second.
  • OpenCode and Pi integration are shown: create models.json in ~/.pi/agent for Pi, and create opencode.json in ~/.config/opencode with the oMLX base URL, API key, and model IDs.
  • Downloader workflow is demonstrated: paste a Hugging Face repository ID, or browse trending and popular models with the MLX-only filter, then download several MLX models.
  • Global settings include the max context window (set to 126,000 tokens in the video) and memory limits; changes must be saved and the server stopped and started again.
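The RAM trade-off between 4-bit and 8-bit builds can be checked with a little arithmetic before downloading anything. The sketch below is only a rough estimate: `weight_size_gb` gives a lower bound for the weights alone, and the 6 GB headroom for macOS, the context cache, and other apps is a hypothetical allowance, not a figure from the video. The 21.6 GB and 37.7 GB download sizes and the 32 GB machine are the numbers Tony shows.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Lower-bound size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_in_memory(model_gb: float, total_ram_gb: float, headroom_gb: float = 6.0) -> bool:
    """True if the model plus a safety margin fits in unified memory.

    headroom_gb is a hypothetical allowance, not a value from the video.
    """
    return model_gb + headroom_gb <= total_ram_gb


MAC_RAM_GB = 32.0  # Mac Studio used in the video

# Rule of thumb: a 27B model at 4 and 8 bits per weight.
print(weight_size_gb(27e9, 4), weight_size_gb(27e9, 8))

# Download sizes shown on Hugging Face in the video.
print("4-bit build fits:", fits_in_memory(21.6, MAC_RAM_GB))
print("8-bit build fits:", fits_in_memory(37.7, MAC_RAM_GB))
```

On a 32 GB machine this is why the 4-bit build is described as the minimum requirement: the 8-bit download alone is larger than the installed memory.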

Who Is This For?

Essential viewing for Mac developers who want to run local AI models with MLX via oMLX, and for anyone who wants OpenCode or Pi agents to use local models in their coding workflow.

Notable Quotes

"Hello, friends. Tony here. Welcome."
Opening greeting sets the casual, tutorial tone.
"MLX is specific for macOS and is more efficient."
Core claim about MLX efficiency on Mac.
"We need to save the settings. And when we save the settings, we need to stop and start the server."
Important operational step for config changes.
"Gemma is more fast. So, if I come here and select Gemma and just say hi."
Demonstrates choosing Gemma for speed.
"Here now, we need to create this opencode.json file."
Shows how to wire oMLX into OpenCode.

Questions This Video Answers

  • How do I set up oMLX on a Mac with MLX for local models?
  • What's the difference between 4-bit and 8-bit quantization in MLX on macOS?
  • How can I integrate oMLX models with OpenCode and Pi agents for local inference?
  • Which models are best for fast local coding tasks on Mac (Gemma vs Qwen 3.6 27B)?
  • How do I configure oMLX dashboard settings and restart the server properly?
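Once the server has been restarted, a quick way to confirm the setup is to ask it which models it serves. The sketch below assumes the server exposes the usual OpenAI-style `/v1/models` route and accepts the API key as a Bearer token; the dashboard shows the exact base URL to use, so treat the `/v1` prefix here as an assumption. The port 8383 and key 1234 are the values from the video.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8383/v1"  # assumed OpenAI-compatible prefix
API_KEY = "1234"                       # API key set in the video


def build_models_request(base_url: str = BASE_URL, api_key: str = API_KEY) -> urllib.request.Request:
    """Build (but do not send) a GET request for the model list."""
    return urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def list_model_ids(base_url: str = BASE_URL, api_key: str = API_KEY) -> list:
    """Send the request and return the model IDs; needs the server running."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [model["id"] for model in payload.get("data", [])]
```

With the server running, `print(list_model_ids())` should show the same IDs as the models manager; those are the IDs the agent config files need.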
macOS MLX, oMLX, Gemma model, Qwen 3.6, 4-bit quantization, 8-bit quantization, Llama.cpp, GGUF vs MLX, OpenCode AI agents, Pi AI agent
Full Transcript
Hello, friends. Tony here. Welcome. In today's video, I'm going to show you how you can set up your machine. My machine is a Mac Studio with 32 GB of memory. And you need to know what memory you have for the local models. I created two or more videos about how we can work with LLMs, and I had some comments on YouTube. One of them was why I don't use MLX on Mac, because when we work with MLX it is more efficient than GGUF. I used Llama.cpp and also Ollama in other videos. And Llama.cpp uses GGUF, which is compatible with other machines. But MLX is specific for macOS and is more efficient. And in this video, I'm going to work with MLX.

Now, specifically, I'm going to work with this oMLX application. You can go to omlx.ai and download and install the application here. When you open the application for the first time, it is going to be this page, so the settings page. And I have changed the port to be 8383. And for the API key, I have added 1234. And then, you can go and stop and start the server. After you start the server, you can open the page at 127.0.0.1 and the port you have added, /admin. And you are going to see this page, so the oMLX dashboard. We have stats here: average speed, models, and so on. Also here, we have the OpenAI API or Claude API. And here, you can just, for example, select a model, copy this command, and open Codex, OpenCode, OpenClaw, and Pi. But I'm going to show you how you can configure the settings so you don't need to copy the command from here.

We have models here, the manager. So, I have downloaded four models, as you can see. You can go to the downloader here. You can paste the repository ID and download. You can search on Hugging Face. You can browse the models: trending, popular, search results. So, let's go to trending. And as you can see, what's trending is the Qwen 3.6 35 billion.
We have also the 3.5 27 billion, which is a Claude 4.6 Opus Distilled MLX 4-bit. And as you can see, I have clicked this MLX only because we need only MLX here. Then we have others. We have Qwen 3.6 27 billion, Gemma 4 26 billion, and so on.

I prefer to work with Hugging Face. If you go to huggingface.co and go to models, you can see all the models, and you can search here for a specific model. Let's say, for example, we want to work with Qwen. So, Qwen 3.6. And we have Qwen 3.6 35 billion and 27 billion. Now, here we have 35 billion, which is 35 billion parameters, and this one is 27 billion parameters. But this 27 billion parameters is better than this 35 billion parameters. And the reason is because, as you can see, we have here -A3B, which means active 3 billion parameters. We have only 3 billion parameters active. And instead, on this one, all 27 billion are active. Also, this one has 35 billion parameters, but is faster because only 3 billion are active. This one is slower than that one.

And, for example, let's say I want to work with this one. We have quantizations here. When we select this, we have quantizations, and you can see the models here. We have from Unsloth, for example, this one, 35 billion and GGUF, which is the best one to work with Llama.cpp. But I need to filter for MLX only. And for MLX, we can see we have 4-bit, we have 8-bit. I suggest you go and search for the models' names. So, for example, what this 4-bit means and what this 8-bit means. 8-bit is better, but is going to need more RAM. Let's see. Let's come here. This is only 21.6 GB. And if I look at 8-bit, yeah, 37.7 GB. The minimum requirement is to use the 4-bit. And yeah, you can choose this one, or you can choose the MLX community. Here now, as you can see also, this is 4-bit, this is NVFP4. NVFP4 is like 4-bit, but is from Nvidia and is better than this 4-bit.
You can see also other quantizations. And then, for example, if you like this one, just copy this ID, go to the downloader, and paste it right here. And click on download. Now, I have downloaded. Here I have the manager. I have downloaded four of them. Okay. Also, we have quantization here.

If you go to settings, we have global settings. And here you can change the settings. So, for example, we have server, localhost only. The port. Also the memory. And the generation. By default, I have added 128,000 for the max context window. I think for coding, less than 126 or 128 is not going to be very good. So, for me, the requirement is 126,000. So, let's say 126. Also here, 126. And yeah, we need to save down there. Save the settings. And when we save the settings, we need to stop and start the server. Also, we have logs here. We have bench: performance and intelligence. We have chat, so you can chat like ChatGPT. But we are not here for chatting. What we need is to work with local models using the OpenCode or Pi agents.

Now, for OpenCode, you can go to opencode.ai, copy this command and install, or use npm or brew or bun or something else. And the same thing for Pi, you can just copy this command here. Next, if you go to the documentation, we have providers. And here we have an example for providers. So, for example, let's see, we have Llama.cpp. We have also Hugging Face. We don't have this oMLX. But yeah, an example for Llama.cpp is here. So, you can configure OpenCode to use local models through the Llama.cpp llama-server utility. And you need to create this opencode.json, or update it, adding the provider schema here, and the provider is Llama.cpp with npm, which is OpenAI compatible. Name, you can give it any name. Options is the base URL. Now, the base URL is this one. In our case, because we're going to work with oMLX, it's going to be this one here.
And yeah, we can use this one. Okay. And then you can add models. Here we need the model ID. And then you can give it a name. You can also add a limit: context and output. Okay.

For Pi, you can go to the documentation, which was updated just right now. And you have providers here. I'm going to zoom it a little bit closer here. And we have subscriptions here for GitHub, Google, OpenAI, or API keys. So, let's scroll down. We need to say /login and add Claude, ChatGPT Plus, or Gemini, and so on. We have also custom providers here. And the custom providers are via models.json: add Ollama, LM Studio, vLLM. You can see models.md, for example, here. But because this is just a new site right now, we have page not found.

Let's go to the models. And yeah, for the models, I'm going to use Gemma. Gemma is more fast. So, if I come here and select Gemma and just say hi. Here now, we can see generated tokens is 48 tokens per second. If I select Qwen 3.6 27 billion. Okay, or let's just create a new chat. And let's say hi for this one. Okay, now let's see here. Yeah, we have only 11 tokens per second. And when you work with coding, this is going to be very slow. Okay, so let's go back to Gemma 4.

And now, let's work with OpenCode and pi.dev. Here I am inside the example of a Laravel project. I'm going to open a new tab. I'm going to zoom it. And I'm going to just say cd. So, we are at home now. Home directory. I'm going to cd here to .pi. And we are in .pi here. If I say ls -la, we have the agent directory. Let's navigate there: cd to agent. And I'm going to open this agent directory with the VS Code editor. Here we are. Here now, I'm going to create a new file and I'm going to name it models.json. You can read more in the documentation of Pi. Okay, so here we have custom provider keys from models.json. So, you need to create this models.json. But right now, we have page not found. And let's go to the GitHub repository. We have this one.
And if we scroll down, we have providers and models. We have providers here and also we have models.md. Now, custom providers and models: add providers via models.json in the Pi agent directory. And let's go to models.md. And here we already have an example. Okay, so base URL, API: OpenAI completions. We need the API key. And then add models. Okay, so let's copy this one and I'm going to go here and I'm going to paste it. Now, we are not working with Ollama. We are working with oMLX. We have to change the provider name to be oMLX. Base URL is... let's go back. Not the chat. Let's go to the dashboard and we have this one here. Let's paste it. API is OpenAI completions. Now, the API key in my case is 1234. Models: we need to go to the models manager. I'm going to copy this one, the ID, and paste it here. And also you can add all of them. So, let's add also this, also this one, and also this one. Let's save. I think we are good now.

Let's go here. If I say pi, and we need to change the model. Hit enter. We have Gemma from oMLX. We have this Qwen from oMLX. Also this Qwen 3.6 27 billion and this one. I'm going to choose Gemma because it's faster and I'm going to just say hi. Also here you can see it's from oMLX, Gemma 4 26 billion. And yeah, here we have "Hello, how can I help you today?" Okay, so this works.

Now, let's do the same thing for OpenCode, because if I open OpenCode here right now and search for models, we have models from OpenRouter or others, but not from oMLX. Let's fix this also. OpenCode is not the same. We have .opencode. So, if I say cd to .opencode, we have this directory, but we are not going to go there. We are going to go to .config. And here, if I say ls, we have opencode right here. So, go to .config and then opencode. And here, I'm going to open this one with a code editor. In my case, I'm going to open it with VS Code. Here now, we need to create this opencode.json file.
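For readers following along, the models.json edit described above can be sketched as a small generator script. The field names (`providers`, `baseUrl`, `api`, `apiKey`, `models`, `id`) follow the base URL, API, API key, and models entries read out from the models.md example, but their exact spelling is an assumption here, so compare against that example before relying on it. The provider key `omlx`, the `/v1` suffix, and the model ID are placeholders; the real base URL and IDs come from the dashboard and models manager.

```python
import json
from pathlib import Path


def build_pi_models_config(model_ids, base_url="http://127.0.0.1:8383/v1", api_key="1234"):
    """Return a models.json structure with one custom provider.

    Field names are assumed from the models.md example shown in the video.
    """
    return {
        "providers": {
            "omlx": {  # provider name is free-form; "omlx" is a placeholder
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": api_key,
                "models": [{"id": model_id} for model_id in model_ids],
            }
        }
    }


def write_pi_models_config(config, path=Path.home() / ".pi" / "agent" / "models.json"):
    """Write the config where the video creates it (~/.pi/agent/models.json)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return path


# Placeholder ID: copy the real one from the models manager.
example = build_pi_models_config(["your-model-id-from-models-manager"])
print(json.dumps(example, indent=2))
```

Calling `write_pi_models_config(example)` would overwrite an existing models.json, so it is left out of the script body on purpose.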
Let's copy. Let's create a new file. Give it this name. And then I'm going to copy this and paste it here. Okay, so we have a provider, not Llama.cpp. We have to change that. It's going to be oMLX. Name, change to oMLX. Options, let's just copy this. We need also to add here the API key. And the API key is 1234 in my case. Models: now let's copy the model. We can change the name of this model. So, I'm going to say Gemma like this, or just say Gemma 4. We have a limit context. I'm going to change that to be 26. Now, here we have the Gemma, but we can add more models the same way we have done. But just to save some time, I'm going to see if we can work with that. So, opencode again. And now let's say /models. We have here the Gemma 4 26 billion from oMLX. I'm going to hit enter and just say hi. As you can see, also this one is working.

And yeah, you can add more models here. You can download more in the downloader and add them. Also you can change the settings. The settings are also for the models. So, we have model settings just for one model. So, you just come here and change one. You can load the defaults for this model, for example, but I'm going to cancel.

And yeah, that's all about this video, what I wanted to show you. I hope I explained everything about how to set up your machine for working with local models using this oMLX application. If you have any questions, you can leave a comment on YouTube. And don't forget to subscribe and like this video. All the best and I'm going to see you in another one. Thank you.
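The opencode.json created in the last part of the video can be sketched the same way. The layout below mirrors the Llama.cpp provider example from the OpenCode documentation that Tony copies (schema, `npm`, `name`, `options.baseURL`, `models` with `limit`), with `apiKey` added under `options` as he describes. The provider key `omlx`, the `/v1` suffix, the model ID, and the output limit of 8192 are hypothetical placeholders; the context limit uses the 126,000 from the global settings rather than the value spoken at this point in the video.

```python
import json


def build_opencode_config(model_id, display_name, context_limit, output_limit,
                          base_url="http://127.0.0.1:8383/v1", api_key="1234"):
    """Return an opencode.json structure with one OpenAI-compatible provider."""
    return {
        "$schema": "https://opencode.ai/config.json",
        "provider": {
            "omlx": {  # placeholder provider key
                "npm": "@ai-sdk/openai-compatible",
                "name": "oMLX (local)",
                "options": {"baseURL": base_url, "apiKey": api_key},
                "models": {
                    model_id: {
                        "name": display_name,
                        "limit": {"context": context_limit, "output": output_limit},
                    }
                },
            }
        },
    }


config = build_opencode_config(
    model_id="your-model-id-from-models-manager",  # placeholder
    display_name="Gemma 4",
    context_limit=126000,
    output_limit=8192,  # hypothetical; not stated in the video
)
# Save the printed JSON as ~/.config/opencode/opencode.json.
print(json.dumps(config, indent=2))
```

More models can be added as extra keys under `models`, after which they should appear under /models in OpenCode as in the video.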
