
A 30B Qwen Model Walks into a Raspberry Pi and Runs in Real Time


For this release, we optimize for what people actually experience when they run a model: fast, high-quality responses on a specific target device. We use ShapeLearn, our bitlength-learning method, to choose weight datatypes for Qwen3-30B-A3B-Instruct-2507 that maximize performance in terms of tokens per second (TPS) and output quality, with one practical constraint: the model must fit comfortably in the available memory. Once it fits, making the file smaller isn't a goal by itself. We only shrink further when it also improves the real tradeoff people care about: speed vs. quality. Approaching bitlength learning this way matters because in llama.cpp, "fewer bits" doesn't automatically mean "more speed." Different quantization formats can trigger different kernels and overheads, and on some GPUs, going lower-bit can even get slower, despite using less memory. Bottom line: treat memory as a budget to meet, then optimize what matters most: TPS and quality.
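
ShapeLearn's actual search is not shown in this post; purely as a sketch of the "memory is a budget, not an objective" selection principle described above, here is a minimal filter over per-variant measurements. The helper, structure, and footprint figures are illustrative (footprints are back-of-envelope estimates from BPW for a 30B model); TPS and accuracy numbers come from the Raspberry Pi results below.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str        # quantization variant, e.g. "Q3_K_S-2.70bpw [KQ-2]"
    size_gb: float   # resident footprint on the target device
    tps: float       # measured tokens per second on the target device
    accuracy: float  # output quality normalized to BF16, in [0, 1]

def pick_variant(variants, mem_budget_gb, min_accuracy):
    """Memory is a hard budget, not an objective: among variants that fit
    and clear the quality floor, take the fastest one."""
    ok = [v for v in variants
          if v.size_gb <= mem_budget_gb and v.accuracy >= min_accuracy]
    if not ok:
        raise ValueError("no variant fits the budget at this quality floor")
    return max(ok, key=lambda v: v.tps)

# Illustrative candidates: footprints roughly estimated as 30e9 * BPW / 8 bytes.
candidates = [
    Variant("Q3_K_S-2.70bpw [KQ-2]", 10.1, 8.03, 0.9418),
    Variant("Q4_K_S-3.92bpw [KQ-7]", 14.7, 5.30, 0.9886),
]
print(pick_variant(candidates, mem_budget_gb=16.0, min_accuracy=0.92).name)
```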

TL;DR: Yes, this 30B Qwen3 runs on a Raspberry Pi. On a Pi 5 (16GB), Q3_K_S-2.70bpw [KQ-2] hits 8.03 TPS at 2.70 BPW and maintains 94.18% of BF16 quality. It genuinely feels real-time. More broadly, the same pattern shows up everywhere else: ByteShape models give you a better TPS/quality tradeoff than the alternatives (here we look at Unsloth and MagicQuant).
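
The post itself doesn't include a run command. For reference, a minimal way to reproduce this kind of measurement is llama.cpp's Python bindings; the GGUF filename below is hypothetical, and the result will vary with prompt, context size, and sampling settings.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical local GGUF filename for the recommended Pi 5 variant.
llm = Llama(
    model_path="Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf",
    n_ctx=4096,
    n_threads=4,  # Pi 5 has a quad-core Cortex-A76
)

start = time.time()
out = llm("Explain, briefly, why the sky is blue.", max_tokens=128)
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens / (time.time() - start):.2f} TPS")
```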

CPUs

On CPUs, reducing the footprint via shorter bitlengths affects the TPS/accuracy tradeoff as one would expect: once the model fits, a smaller footprint tends to increase TPS in a fairly monotonic way. If datatypes are selected correctly, you can trade a bit of quality for speed predictably, which makes it much easier to pick a point on the curve that matches your constraints. We'll start with the most memory-constrained CPU case (Raspberry Pi 5 16GB), where "fits in RAM" is the limiting factor, then move to an Intel i7 with 64GB, where everything fits.

Raspberry Pi 5

The figure below shows TPS vs. normalized accuracy for the models that fit in RAM on the Raspberry Pi 5 16GB.

[Figure: Raspberry Pi 5: Tokens per second vs quality (bubble size = model footprint)]

Notably, sustaining 8.5 TPS at 92%+ baseline accuracy with a 30B model on a Raspberry Pi reshapes expectations for Pi-class systems. Overall, the trend shows that ShapeLearn consistently produces better models, with ByteShape trending up and to the right of Unsloth: higher tokens per second at the same quality, or higher quality at the same throughput. We highlight choices for two primary objectives: response time or accuracy.

Optimizing for response time while maintaining accuracy: For interactive, on-device use, perceived responsiveness is driven by how quickly text appears, not peak throughput. In practice, generation feels real-time once it reaches roughly 8 TPS, comfortably above typical reading speed. In this Raspberry Pi real-time regime, Q3_K_S-2.70bpw [KQ-2] (2.70 BPW, 8.03 TPS, 94.18% accuracy) is our go-to recommendation: it crosses the real-time threshold while maintaining high accuracy. Compared to Unsloth models at similar quality, ByteShape achieves real-time performance at lower BPW and higher TPS, making it the more efficient choice for interactive edge deployment.
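
As a sanity check on that ~8 TPS threshold, the arithmetic is simple, assuming the common rule of thumb of roughly 0.75 English words per token (an assumption; the ratio varies by tokenizer and text):

```python
tps = 8.0                # the "feels real-time" threshold discussed above
words_per_token = 0.75   # assumption: rough English average, tokenizer-dependent
wpm = tps * words_per_token * 60  # tokens/s -> words/minute
print(f"{wpm:.0f} words per minute")  # ~360 WPM, above typical 200-300 WPM silent reading
```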

Accuracy above all: The table below lists the models that achieve the highest accuracy while still being able to run on a Raspberry Pi. Within this set, ByteShape models make the best use of the available resources to maximize accuracy, occupying the lowest-error rows (~1.1–1.3% relative error, ~98.8% accuracy), while the strongest Unsloth entries remain around 2.1–2.2% error (~97.9% accuracy). Compared to Unsloth's UD-Q3_K_XL [8], ByteShape achieves up to a 1.87× lower error rate while still operating at ~5–6 TPS, comfortably usable on a Raspberry Pi, making it the better choice when accuracy is the priority.

Even when prioritizing maximum speed with some reduction in accuracy, Q3_K_S-3.25bpw [KQ-5] offers a better tradeoff: more accurate, smaller, and faster than the fastest Unsloth model.

Model                    Relative Error   BPW    TPS
Q4_K_S-3.92bpw [KQ-7]    1.14%            3.92   5.30
Q4_K_S-3.61bpw [KQ-6]    1.25%            3.61   5.94
Q3_K_S-3.25bpw [KQ-5]    2.03%            3.25   6.68
UD-IQ3_XXS [6]           2.22%            3.38   5.03
UD-Q3_K_XL [8]           2.13%            3.62   6.28

Many other Unsloth and MagicQuant models (some of ours too!) are not in this chart. We compare them in other sections, but they're not applicable in the Raspberry Pi case: they simply don't fit!

Intel i7

Next, we move to the Intel i7 with 64GB RAM. The figure below shows TPS vs. normalized accuracy for all models.

[Figure: Intel i7: Tokens per second vs quality (bubble size = model footprint)]

Overall, ByteShape models outperform both Unsloth and MagicQuant, delivering higher quality at comparable throughput using fewer bits per parameter. Only ByteShape offers models that run in the 26+ TPS range, extending performance well beyond the other methods. Highlights:

Quality-first: At the high-accuracy end of the table, IQ4_XS-4.67bpw [KQ-9] achieves the lowest relative error (0.25%), outperforming the best-running Unsloth models (Q6_K [20] and Q5_K_M [18], whose relative errors are 0.36% and 0.44%). Compared directly, ByteShape delivers up to a 1.44× lower error rate with higher throughput than Q6_K [20], and a 1.76× lower error rate at essentially the same speed as Q5_K_M [18]. MagicQuant mxfp4 [3] trails in this regime, with both higher error and lower TPS.

Balanced point: In the mid-accuracy, high-throughput region, Q3_K_S-3.25bpw [KQ-5] combines ~98% accuracy with 23.1 TPS at just 3.25 BPW, offering the best overall balance in the table. Matching or exceeding this accuracy with Unsloth (IQ4_XS [10]) requires higher BPW and lower TPS, while choosing an Unsloth model closer in speed (Q3_K_S [7]) incurs a 1.73× higher error rate. MagicQuant does not offer a competitive model in this range; its fastest entry (IQ4_NL [2]) is behind both ByteShape and Unsloth in accuracy and throughput.

Takeaway: Across both quality-first and balanced settings, ByteShape consistently converts the available bit budget into either higher accuracy or higher TPS, and it is the only approach that simultaneously covers the high-quality and 26+ TPS balanced-performance regions in this comparison.
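
Every comparison in this post reduces to dominance on the TPS/quality plane. Purely as an illustration (this is not the authors' tooling), here is a minimal Pareto-frontier filter over the Raspberry Pi table above, treating accuracy as 1 minus relative error:

```python
def pareto_frontier(points):
    """Keep the models that no other model beats on both TPS and accuracy."""
    return [p for p in points
            if not any(q["tps"] >= p["tps"] and q["acc"] >= p["acc"]
                       for q in points if q is not p)]

# Raspberry Pi 5 rows from the table above; acc = 1 - relative error.
models = [
    {"name": "Q4_K_S-3.92bpw [KQ-7]", "tps": 5.30, "acc": 0.9886},
    {"name": "Q4_K_S-3.61bpw [KQ-6]", "tps": 5.94, "acc": 0.9875},
    {"name": "Q3_K_S-3.25bpw [KQ-5]", "tps": 6.68, "acc": 0.9797},
    {"name": "UD-IQ3_XXS [6]",        "tps": 5.03, "acc": 0.9778},
    {"name": "UD-Q3_K_XL [8]",        "tps": 6.28, "acc": 0.9787},
]
for m in pareto_frontier(models):
    print(m["name"])  # only the three ByteShape variants survive
```

On these numbers, every Unsloth row is dominated by some ByteShape row, which is exactly the "up and to the right" pattern visible in the figures.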

GPUs: RTX 5090 (32GB) and RTX 4080 (16GB)

On GPUs, performance depends as much on kernel choice as on raw memory footprint. For matmul/matvec, llama.cpp's quantization-specific GPU decode paths incur very different overheads, so fewer bits per weight do not reliably translate to higher TPS. Instead, TPS often peaks at quantization-specific sweet spots, and pushing BPW lower can even increase VRAM traffic and instruction count, hurting performance rather than improving it. We dig into this behavior in more detail right after the GPU results, where the kernel-level tradeoffs become more apparent.

We evaluate on two GPUs: an RTX 5090 (32GB), which can run models above 4 BPW and typically reaches the fastest sweet spots, and an RTX 4080 (16GB), where >4 BPW models do not fit, forcing different tradeoffs and making the device-optimized curve easier to see.

RTX 5090 (32GB of VRAM)

Let's start with the 5090, which has enough VRAM to support all of the quantized models. The figure below shows TPS vs. normalized accuracy.

[Figure: RTX 5090: Tokens per second vs quality (bubble size = model footprint)]

Two things stand out immediately:

First, this GPU shows a clear ~4-bit sweet spot: several ~4b models cluster at very high TPS with nearly identical quality. Examples include Unsloth Q4_0 [12], Unsloth IQ4_XS [10], IQ4_XS-3.87bpw [IQ-6], and MagicQuant iq4_nl-EHQKOUD-IQ4NL [1], all running around ~302–303 TPS at ~98.4–98.9% accuracy. Within this tight cluster, Unsloth edges out slightly in throughput and quality.

Second, outside of that sweet spot, the tradeoff becomes much more uneven: many other Unsloth and MagicQuant models show significantly lower TPS, regardless of whether they are quantized more or less aggressively.

Past the ~4b region, only ByteShape continues to increase TPS with a more predictable reduction in quality.

Accuracy-critical workloads: When output quality is paramount, ByteShape delivers the most accurate model on the 5090: IQ4_XS-4.67bpw [IQ-8] (4.67 BPW, 272.98 TPS, 99.75% accuracy). It surpasses Unsloth Q6_K [20] (6.57 BPW, 264.88 TPS, 99.64% accuracy) while using fewer bits and achieving slightly higher throughput, and it clearly outperforms MagicQuant mxfp4_moe-H-B16-EUR-IQ4NL-KO-Q5K-QD-Q6K [3] (5.46 BPW, 240.42 TPS, 99.32% accuracy) in both accuracy and speed, making it the strongest choice when accuracy is a task-critical deployment requirement.
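
Because these sweet spots are kernel- and device-specific, the reliable way to find yours is to time each candidate on the actual target GPU. A minimal measurement loop with llama-cpp-python is sketched below; the filenames are hypothetical, and any GGUF variants from this comparison would slot in.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (built with CUDA)

# Hypothetical local files; lower BPW does not guarantee higher TPS, so
# each quantization format has to be timed on the device it will run on.
paths = [
    "qwen3-30b-iq4_xs-3.87bpw.gguf",
    "qwen3-30b-q3_k_s-2.70bpw.gguf",
]

for path in paths:
    llm = Llama(model_path=path, n_ctx=4096, n_gpu_layers=-1, verbose=False)
    start = time.time()
    out = llm("Summarize the history of model quantization.", max_tokens=256)
    tps = out["usage"]["completion_tokens"] / (time.time() - start)
    print(f"{path}: {tps:.1f} TPS")
    del llm  # release VRAM before loading the next variant
```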

Practical takeaway: If your GPU has enough VRAM to run a strong ~4b model that already meets your speed and accuracy requirements, that cluster is an excellent default. The curve becomes more interesting when task-critical deployment constraints demand higher accuracy or smaller models, for example under tighter memory budgets or constrained environments (as we'll see on the 4080).

RTX 4080 (16GB of VRAM)

Next, let's move to a more accessible GPU, especially in these memory-challenged times. The biggest stumbling block for the 4080 is its 16GB of VRAM, which is not sufficient to support the "magical" ~4b quantizations for a 30B model. How convenient! This "avoids" the 5090's ~4b sweet spot and forces a more real-world comparison under a hard VRAM budget. The figure below shows TPS vs. normalized accuracy for all models that fit on the 4080.

[Figure: RTX 4080: Tokens per second vs quality (bubble size = model footprint)]

On the RTX 4080, ByteShape consistently outperforms Unsloth under the same 16GB VRAM constraint, delivering a better TPS–quality tradeoff. In particular, ByteShape's highest-quality model that fits, IQ4_XS-3.87bpw [IQ-6] (3.87 BPW, 214.81 TPS, 98.66% accuracy), delivers a 1.59× lower error rate and 9.4% higher TPS vs. Unsloth Q3_K_XL [8] (3.62 BPW, 196.42 TPS, 97.87% accuracy).
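
For a rough sense of why the 16GB budget rules out the ~4b variants for a 30B network, here is a crude fit check. The 1GB overhead allowance for KV cache, activations, and runtime buffers is our assumption, not a figure from the post; real usage depends heavily on context length.

```python
def fits(params_billions, bpw, vram_gb, overhead_gb=1.0):
    """Back-of-envelope VRAM check: weight bytes = params * BPW / 8,
    plus an assumed allowance for KV cache, activations, and buffers."""
    weights_gb = params_billions * bpw / 8  # 1e9 params in bits -> GB cancels out
    return weights_gb + overhead_gb <= vram_gb

print(fits(30, 3.87, 16))  # IQ4_XS-3.87bpw [IQ-6]: True, fits the 4080
print(fits(30, 4.67, 16))  # IQ4_XS-4.67bpw [IQ-8]: False, exceeds 16GB
```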
