
Quantization from the Ground Up

Why This Matters

This article highlights how quantization can significantly reduce the size and increase the speed of large language models, making advanced AI more accessible for everyday devices like laptops. By understanding and applying quantization techniques, developers can deploy powerful models with minimal accuracy loss, democratizing AI technology and expanding its practical applications.


Sam Rose is a Senior Developer Educator at ngrok, focusing on creating content that helps developers get the most out of ngrok.

Qwen-3-Coder-Next is an 80-billion-parameter model that is 159.4GB in size. That's roughly how much RAM you would need to run it, and that's before thinking about long context windows. This is not considered a big model. Rumor has it that frontier models have over 1 trillion parameters, which would require at least 2TB of RAM. The last time I saw that much RAM in one machine was never.
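The arithmetic behind those numbers is simple: size is just parameter count times bytes per parameter. A minimal sketch, assuming the weights are stored as 16-bit (2-byte) floats, which matches the figures quoted above:

```python
def model_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate in-memory size of a model's weights in gigabytes."""
    return n_params * bytes_per_param / 1e9

# 80 billion parameters at 2 bytes each: close to the 159.4GB quoted.
print(model_size_gb(80e9, 2))   # 160.0

# A rumored 1-trillion-parameter model at the same precision:
print(model_size_gb(1e12, 2))   # 2000.0, i.e. ~2TB
```

The same function shows why dropping to 1 byte per parameter (as quantization does) immediately halves these numbers.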

But what if I told you we can make LLMs 4x smaller and 2x faster, enough to run very capable models on your laptop, all while losing only 5-10% accuracy?

That's the magic of quantization.
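To make that "4x smaller" concrete, here is a minimal sketch of one common scheme, absmax int8 quantization, using NumPy. The function names are mine, not the article's; the article covers its own approach below:

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Absmax quantization: map floats into [-127, 127] via one scale factor."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate floats from the int8 values."""
    return q.astype(np.float32) * scale

weights = np.array([0.12, -0.5, 0.33, 0.91, -0.07], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# int8 uses 1 byte per value vs 4 for float32: a 4x size reduction,
# at the cost of a small rounding error per weight.
print(np.abs(weights - restored).max())
```

The maximum error here is at most half the scale factor, which is why well-behaved weight distributions survive quantization with little quality loss.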

In this post, you are going to learn:

- How a model's parameters make it so big
- How floating point precision works and how models sacrifice it
- How to compress floats using quantization
- How to measure model quality loss after quantization

If you already know what parameters are and how floats are stored, feel free to skip straight to quantization.

What makes large language models so large?

Parameters, also called "weights," are the majority of what an LLM is when it's in memory or on disk. In my prompt caching post I wrote that LLMs are an "enormous graph of billions of carefully arranged operations." What do those graphs look like? Let's start with the simplest example: 1 input, 1 parameter, 1 output.
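That simplest graph can be written in a couple of lines. This is a hypothetical sketch, not code from the article: one input, one parameter (the weight), one output.

```python
def tiny_model(x: float, weight: float) -> float:
    """The simplest possible model graph: output = input * parameter."""
    return x * weight

# With a weight of 2.0, the model simply doubles its input:
print(tiny_model(3.0, 2.0))  # 6.0
```

Real LLMs are billions of these multiplications arranged into layers, and each `weight` is one of the parameters that must be held in memory.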
