
Usage-based pricing killing your vibe? Here's how to roll your own local AI

Why This Matters

As usage-based pricing models make cloud AI services more expensive, the shift toward local AI models offers a cost-effective alternative for developers and hobbyists. Advances in model architecture and efficiency now enable smaller, local models to perform complex tasks, reducing reliance on costly cloud services and empowering users to maintain control over their AI tools.

Key Takeaways

With model developers imposing more aggressive rate limits, raising prices, or even abandoning subscriptions in favor of usage-based pricing, that vibe-coded hobby project is about to get a whole lot more expensive. Fortunately, you're not without cost-saving options.

Over the past few weeks, we've seen Anthropic toy with dropping Claude Code from its most affordable plans, while Microsoft has skipped testing the waters and moved GitHub Copilot to a purely usage-based model. The whole debacle got us thinking: do we even need Anthropic's or OpenAI's top models, or can we get away with a smaller local model? Sure, it might be slower, less capable, and a little more frustrating to work with, but you can't beat the price of free... well, assuming you've already got the hardware, that is.

It just so happens that Alibaba recently dropped Qwen3.6-27B, which the cloud and e-commerce giant boasts packs "flagship coding power" into a package small enough to run on a 32 GB M-series Mac or 24 GB GPU.
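Some back-of-envelope arithmetic shows why a roughly 27-billion-parameter model can squeeze into a 24 GB GPU: quantized to 4 bits, the weights alone come to around 13.5 GB, leaving headroom for the KV cache and runtime buffers. The sketch below is our own rough estimate, not Alibaba's figures; the 4-bit format and ~10 percent overhead are assumptions.

```python
# Back-of-envelope VRAM estimate for a quantized LLM.
# Assumptions (ours, not the vendor's): weights quantized to the given
# bit width, plus roughly 10% overhead for dequantization scales and
# runtime buffers. KV cache grows with context length and is extra.

def estimate_weight_gb(params_billion: float, bits_per_param: float = 4.0,
                       overhead: float = 0.10) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    bytes_total = params_billion * 1e9 * (bits_per_param / 8)
    return bytes_total * (1 + overhead) / 1e9

if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"27B @ {bits}-bit ~= {estimate_weight_gb(27, bits):.1f} GB")
```

At 16-bit precision the same model would need nearly 60 GB, which is why quantization is what makes consumer hardware viable here.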

What's changed

This isn't the first time we've looked at local code assistants. Previously we explored using Continue's VS Code extension for tasks such as code completion and generation.

At the time, the models and software stack were quite immature, making them useful tools, but not necessarily good enough to compete with larger frontier models. Since then, model architectures and agent harnesses have improved dramatically.

"Reasoning" capabilities let small models make up for their size by "thinking" for longer; mixture-of-experts architectures mean you don't need terabytes per second of memory bandwidth for an interactive experience; and vastly improved function- and tool-calling support means these models can actually interact with codebases, shell environments, and the web.
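That last piece, tool calling, is what turns a chat model into an agent: the model emits a structured function call, the harness executes it, and the result goes back into the conversation. Below is a minimal sketch of the harness side of that loop, using the OpenAI-style tool schema that most local servers expose. The `run_shell` tool and the shape of `sample_call` are illustrative assumptions, not any particular framework's API.

```python
import json
import subprocess

# An OpenAI-style tool definition, the format most local inference
# servers accept. The "run_shell" tool itself is a made-up example.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute a tool call emitted by the model and return the result."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "run_shell":
        out = subprocess.run(args["command"], shell=True,
                             capture_output=True, text=True)
        return out.stdout + out.stderr
    raise ValueError(f"unknown tool: {name}")

# A tool call shaped like what a tool-capable model returns:
sample_call = {
    "id": "call_0",
    "function": {"name": "run_shell",
                 "arguments": json.dumps({"command": "echo hello"})},
}
```

A real harness would feed `dispatch()`'s return value back to the model as a tool-result message and loop until the model stops requesting tools.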

All vibes, no rate limits

In this hands-on, we'll look at how to deploy and configure local models like Qwen3.6-27B for coding on your machine, and explore some of the agent frameworks you can use with them.
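The reason the same tooling works across runtimes is that most of them expose an OpenAI-compatible HTTP endpoint, so anything that can talk to that API can be pointed at localhost instead. A minimal sketch using only the standard library; the port (8080, llama.cpp's llama-server default; Ollama uses 11434) and the model name are assumptions to adjust for your setup.

```python
import json
import urllib.request

# Assumed local endpoint: llama.cpp's llama-server listens on 8080 by
# default, Ollama on 11434 -- point this at whatever your runtime exposes.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt: str, model: str = "qwen") -> urllib.request.Request:
    """Build (but don't send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }).encode()
    return urllib.request.Request(
        BASE_URL, data=body,
        headers={"Content-Type": "application/json"})

# To actually send it (requires a running local server):
#   with urllib.request.urlopen(build_request("Write a haiku")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the wire format matches the cloud APIs, editor extensions and agent frameworks that accept a custom base URL can usually be swapped over with a single config change.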

What you'll need:
