Ditching the cloud for local AI — how I use two mini PCs to process millions of tokens a day and save money on costly API fees

For heavy AI users, the economics of the current boom are starting to bite. Over the past year, major labs have nudged prices upward while tightening the screws on usage — whether through stricter rate limits, smaller context windows on lower tiers, or the gradual reshuffling of features behind more expensive plans. Even where per-token costs have fallen in headline terms, the reality for users is more complicated: higher volumes, more complex workflows, and new tooling expectations mean monthly bills are creeping up, not down.

At the same time, open-weight models have improved rapidly, consumer hardware has become more capable, and tools like LM Studio, Ollama, and llama.cpp have made local deployment far more accessible than it was even a year ago. The result is a renaissance in running models on your own machines.

I’m one of the people who have taken the leap. In mid-March, I bought a GMKtec mini PC with an AMD Ryzen AI Max+ 395 chip and 96GB of RAM. The purchase, which cost around £1,500 ($2,000) at the time, was a calculated decision. The volume of work I wanted to run through AI models would have blown through my current subscriptions (I have ChatGPT Plus and GLM Coding Lite plans, which together cost me around $23 a month) and forced me onto higher-cost monthly plans or API-based inference.


Going local

The decision I had to make was a simple one: did I want to pay for a subscription that would cost me several thousand dollars a year, and keep paying a recurring fee for years to come to an AI lab that would likely hike prices? Or did I want to pay a one-off charge for my own hardware and a smaller ongoing cost for electricity?

I chose the latter.
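
To make that trade-off concrete, here is a minimal break-even sketch in Python. Only the $2,000 hardware price comes from the figures above; the $200 monthly plan and the $15 monthly electricity bill are hypothetical placeholders, so swap in your own numbers.

```python
import math


def break_even_months(hardware_cost, monthly_cloud_cost, monthly_electricity_cost):
    """Return the first whole month in which cumulative local costs are no
    higher than cumulative cloud costs, or None if local never catches up."""
    monthly_saving = monthly_cloud_cost - monthly_electricity_cost
    if monthly_saving <= 0:
        # Running locally costs at least as much per month as the cloud plan,
        # so the hardware purchase is never recovered.
        return None
    return math.ceil(hardware_cost / monthly_saving)


HARDWARE_COST = 2000.0        # one-off purchase price quoted in the article
ASSUMED_CLOUD_PLAN = 200.0    # hypothetical higher-tier plan, per month
ASSUMED_ELECTRICITY = 15.0    # hypothetical power cost, per month

months = break_even_months(HARDWARE_COST, ASSUMED_CLOUD_PLAN, ASSUMED_ELECTRICITY)
print(f"Break-even after about {months} months")
```

The calculation ignores resale value, hardware failure and any subscriptions kept alongside the local machine, so treat the result as a rough guide rather than a forecast.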

When the mini PC arrived, setting it up was relatively easy, though (full disclosure) only possible with the help of the full-fat AI models I pay for from the big labs.

The system I set up on my hardware was designed to help me keep track of the constantly changing news in the areas I cover for sites like Tom’s Hardware Premium and others. It takes RSS feeds and ingests the contents of stories in the key beats that I cover, then grades them against a digital ‘brain’ that captures how I think about the world and what I report on, generated by analyzing nearly 2,000 of my past stories from the previous four years.
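
The ingest-and-grade step can be sketched roughly as follows. This is not the actual system: the sample feed, the `INTEREST_PROFILE` weights and the threshold are all invented for illustration, and simple keyword weighting stands in for whatever model-based grading the real ‘brain’ performs.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without a network call.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item>
  <title>New mini PC ships with 96GB of unified memory</title>
  <description>A compact desktop aimed at local AI inference.</description>
</item>
<item>
  <title>Ten recipes for summer</title>
  <description>Salads and cold soups.</description>
</item>
</channel></rss>"""

# Hypothetical stand-in for the 'brain': terms of interest and their weights.
INTEREST_PROFILE = {"ai": 2.0, "inference": 1.5, "mini": 1.0, "pc": 1.0}


def parse_items(feed_xml):
    """Extract the title and description of every <item> in an RSS document."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "summary": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


def grade(story, profile):
    """Score a story by summing the weights of profile terms it mentions."""
    text = f"{story['title']} {story['summary']}".lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    return sum(weight for term, weight in profile.items() if term in words)


def select_candidates(feed_xml, profile, threshold):
    """Return the stories whose score meets or exceeds the threshold."""
    return [s for s in parse_items(feed_xml) if grade(s, profile) >= threshold]


candidates = select_candidates(SAMPLE_FEED, INTEREST_PROFILE, threshold=2.0)
for story in candidates:
    print(story["title"])
```

In a real deployment the `grade` function is the part most likely to be replaced, for example by a call to a locally hosted model that compares each story with a written profile.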

When it finds candidates that are potentially interesting, those stories are ‘assigned’ to AI beat reporters, who then read around the subject on the web and produce pitches, similar to those that I send to my editors here and elsewhere. Those AI reporters then send their pitches to AI editors, who engage in a conversation with the reporters to fine-tune the idea’s framing, before presenting me, via Telegram, with a couple of paragraphs outlining a broad idea that is meant to be tailored to my tastes.
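
The reporter and editor exchange described above could be wired up along these lines. Everything here is an assumption made for illustration: the function names, the prompts, the "APPROVED" convention and the two-round limit are invented, and the stub functions stand in for calls to locally hosted models. The final step only builds a request for the Telegram Bot API's `sendMessage` method; it does not send anything.

```python
import json
import urllib.request


def run_pitch_desk(story_title, reporter, editor, max_rounds=2):
    """Draft a pitch, then let the editor ask for revisions for a few rounds.

    `reporter` and `editor` are any callables that take a prompt string and
    return text, for example thin wrappers around a local model.
    """
    pitch = reporter(f"Write a pitch about: {story_title}")
    transcript = [("reporter", pitch)]
    for _ in range(max_rounds):
        feedback = editor(f"Review this pitch: {pitch}")
        transcript.append(("editor", feedback))
        if feedback.strip().upper().startswith("APPROVED"):
            break
        pitch = reporter(f"Revise the pitch.\nPitch: {pitch}\nFeedback: {feedback}")
        transcript.append(("reporter", pitch))
    return pitch, transcript


def build_telegram_request(bot_token, chat_id, text):
    """Build, but do not send, a Telegram Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Deterministic stubs standing in for model calls (hypothetical behaviour).
def stub_reporter(prompt):
    if prompt.startswith("Revise"):
        return "Pitch v2: why buyers are moving AI workloads to mini PCs"
    return "Pitch v1: mini PCs and AI"


def stub_editor(prompt):
    if "v2" in prompt:
        return "APPROVED"
    return "Sharpen the angle: who is buying and why?"


final_pitch, transcript = run_pitch_desk(
    "Mini PCs for local AI", stub_reporter, stub_editor
)
# Placeholder credentials; a real bot token and chat ID would be needed to send.
request = build_telegram_request("HYPOTHETICAL_TOKEN", "12345", final_pitch)
print(final_pitch)
```

Capping the number of rounds keeps two models from debating indefinitely, and passing the models in as plain callables makes it easy to swap a stub for a real local endpoint.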
