
Running local models on an M4 with 24GB memory

Why This Matters

This article highlights the feasibility of running local AI models on a MacBook Pro with 24GB memory, offering a way for users to perform tasks without relying on internet-connected cloud services. While not matching the performance of state-of-the-art models, this setup provides increased privacy, independence from big tech, and practical utility for research and planning. The process involves complex configuration and model selection, but it opens new possibilities for local AI deployment on consumer hardware.

I’ve been experimenting with running local models on and off for a bit, and I’ve finally found a setup that seems to work reasonably well. It’s nothing like the output of a SOTA model, but the excitement of being able to have a local model do basic tasks, research, and planning more than makes up for it! No internet connection required! Not to mention that it’s a way of reducing your dependence on big US tech, even if just a tiny bit.

I gotta say though, it’s not easy to get this stuff set up. First you have to choose how you’re running the model: Ollama, llama.cpp, or LM Studio. Each one comes with its own quirks and limitations, and they don’t all offer the same models. Then, of course, you have to pick your model. You want the best model available that fits in memory while still leaving enough headroom to run your regular assortment of Electron apps, not to mention one where you can run at least a 64K context window, and ideally 128K or more. Most recently I’ve tried Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B, which all technically fit in memory but were unusable in practice, and Gemma 4B, which ran fine but really struggled with tool use.
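Before downloading anything, a rough back-of-the-envelope check helps tell whether a model plus its KV cache will even fit. The sketch below uses the usual approximations; the parameter count, layer/head numbers, and quantization level in the example are placeholders for illustration, not the real config of any particular model:

```python
# Rough memory estimate for a quantized model plus its KV cache.
# Standard approximations; the example numbers below are placeholders,
# not the actual config of any specific model.

def model_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    # weights ~ parameter count x bits per weight
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_value: int = 2) -> float:
    # KV cache ~ 2 (K and V) x layers x kv_heads x head_dim x context length
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value / 1e9

if __name__ == "__main__":
    # Hypothetical 9B model at ~4.5 bits/weight (q4_k_s-ish), 128K context,
    # fp16 KV cache. Quantizing the cache would roughly halve the second number.
    weights = model_weights_gb(9, 4.5)
    cache = kv_cache_gb(layers=32, kv_heads=4, head_dim=128, context=131_072)
    print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.1f} GB, "
          f"total ~{weights + cache:.1f} GB")
```

On a 24GB machine, numbers like these make it obvious why the context window and the cache quantization settings end up mattering as much as the model size itself.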

Then there’s a plethora of configuration options to tweak, from the better-known, like temperature, to more esoteric options like K Cache Quantization Type. Many of these tools come with a basic recommended set of options, but the appropriate ones can depend on things like whether you’re enabling thinking or not!

Qwen 3.5-9B (4-bit quant)

qwen3.5-9b@q4_k_s (HuggingFace link) is the best model I’ve gotten working, with a reasonable ~40 tokens per second, thinking enabled, successful tool use, and a 128K context window, running on LM Studio. Compared to a SOTA model, it gets distracted more easily, sometimes gets stuck in loops, misinterprets asks, etc. But it’s surprisingly good for something that can run on a 24GB MacBook Pro while leaving space for lots of other things running too!
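A quick way to sanity-check tool use is a minimal request against LM Studio’s OpenAI-compatible local server (it defaults to http://localhost:1234/v1). The sketch below assumes that setup; the model identifier and the weather tool are placeholders for illustration, not part of my actual configuration:

```python
# Minimal tool-use smoke test against LM Studio's OpenAI-compatible server.
# The model name and the tool are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, just to see if the model calls it
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-9b",  # placeholder; use the identifier LM Studio shows for the loaded model
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    print("tool call:", call.function.name, call.function.arguments)
else:
    print("no tool call, plain reply:", msg.content)
```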

These are the recommended settings for thinking mode and coding work:

Thinking mode for precise coding tasks (e.g., WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

To enable thinking I also had to select the model, go to configuration, scroll to the bottom of the Inference tab, and add {%- set enable_thinking = true %} to the Prompt Template.
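Those sampling settings can also be applied per request through the same OpenAI-compatible server rather than in the LM Studio UI. Temperature, top_p, and presence_penalty are standard fields; top_k, min_p, and repetition_penalty aren’t part of the OpenAI schema, so the sketch below passes them via extra_body, and whether the server honours them under those exact names is an assumption worth checking:

```python
# Applying the recommended thinking-mode sampling settings per request.
# top_k, min_p and repetition_penalty are non-standard fields, so they go via
# extra_body; support for them (and their exact names) depends on the server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3.5-9b",  # placeholder identifier for the loaded model
    messages=[{"role": "user", "content": "Plan a small refactor of a nested-loop function."}],
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={
        "top_k": 20,
        "min_p": 0.0,
        "repetition_penalty": 1.0,
    },
)
print(resp.choices[0].message.content)
```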

I’ve been using it through both pi and OpenCode. I still haven’t quite made my mind up on which one I prefer. Pi feels a bit snappier, but although I really appreciate the idea of the harness building itself and all that customization, I can’t help but wish it came with some sensible defaults. I feel like you could easily end up spending more time tweaking your pi setup to be just right than you do on your actual projects!

Pi setup

... continue reading