Tech News

We got 207 tok/s with Qwen3.5-27B on an RTX 3090

Why This Matters

This article highlights innovative software optimizations for running large language models on consumer-grade GPUs like the RTX 3090, achieving unprecedented inference speeds. By rewriting inference software with tailored kernels and decoding techniques, the tech industry can significantly enhance AI accessibility and performance without waiting for new hardware. These advancements demonstrate how software ingenuity can maximize existing hardware capabilities, benefiting both developers and end-users.

Key Takeaways

Open LLM inference, rewritten by hand for one specific chip at a time.

Kernels, speculative decoding, and quantization, tailored per target.

We don't wait for better silicon. We rewrite the software.

Inside the box

Two projects today, more coming. Each one is a self-contained release with its own benchmarks and paper-style writeup.

01 · Megakernel Qwen3.5 0.8B on RTX 3090

The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers of Qwen3.5-0.8B run in a single CUDA dispatch at 1.87 tok/J on a 2020 GPU, matching the efficiency of Apple's latest silicon at 2× the throughput.

```shell
# 1. clone + enter
git clone https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub/megakernel

# 2. install (Python 3.10+, CUDA 12+, PyTorch 2.0+). Weights stream from HF on first run.
pip install -e .

# 3. run the benchmark (prefill pp520 + decode tg128 vs llama.cpp BF16 + PyTorch HF)
python final_bench.py
```

| Method | Prefill pp520 (tok/s) | Decode tg128 (tok/s) | tok/J |
| --- | --- | --- | --- |
| Megakernel @ 220 W | 37,800 | 413 | 1.87 |
| llama.cpp BF16 @ 350 W | 11,247 | 267 | 0.76 |
| PyTorch HF | 7,578 | 108 | n/a |
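The tok/J column is just decode throughput divided by the power cap (watts are joules per second), which is worth spelling out since it is the headline efficiency number. A quick check against the table values:

```python
def tokens_per_joule(tok_per_s: float, watts: float) -> float:
    """Energy efficiency: (tokens/second) / (joules/second) = tokens/joule."""
    return tok_per_s / watts

# Values from the benchmark table: decode tg128 throughput and power cap.
megakernel = tokens_per_joule(413, 220)   # ~1.87-1.88, matches the table to rounding
llamacpp = tokens_per_joule(267, 350)     # ~0.76
print(f"megakernel: {megakernel:.2f} tok/J, llama.cpp: {llamacpp:.2f} tok/J")
```

Note that the PyTorch HF row has no power cap listed, hence "n/a" in the tok/J column.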

What makes it work: 82 blocks, 512 threads, one persistent kernel. No CPU round-trips between layers. Weights streamed straight from HuggingFace. Cooperative grid sync instead of ~100 kernel launches per token. Power ceiling hit before compute ceiling, so DVFS converts tight execution straight into saved watts.
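The persistent-kernel pattern described above can be sketched in a few lines of CUDA: one cooperative launch runs all layers inside a single kernel, with a grid-wide sync where a kernel-launch boundary would normally be. This is an illustrative sketch, not the release code; `layer_forward`, `Activations`, and `NUM_LAYERS` are hypothetical names, and only the launch geometry (82 blocks × 512 threads) comes from the article.

```cuda
// Minimal sketch of a persistent megakernel with cooperative grid sync.
// Hypothetical names; only the 82x512 launch geometry is from the article.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int NUM_LAYERS = 24;  // Qwen3.5-0.8B depth

struct Activations { float* hidden; };

__device__ void layer_forward(int layer, Activations acts) {
    // per-layer work (DeltaNet or attention block) would go here
}

__global__ void megakernel(Activations acts) {
    cg::grid_group grid = cg::this_grid();
    for (int layer = 0; layer < NUM_LAYERS; ++layer) {
        layer_forward(layer, acts);
        grid.sync();  // replaces a kernel-launch boundary: no CPU round-trip
    }
}

// Host side: a cooperative launch is required for grid.sync() to be legal,
// and the whole grid must be resident on the GPU at once.
void run(Activations acts) {
    void* args[] = { &acts };
    cudaLaunchCooperativeKernel((void*)megakernel, dim3(82), dim3(512), args);
}
```

The key design choice is that `grid.sync()` costs far less than a kernel launch, so ~100 per-token launches collapse into one dispatch, and the GPU never idles waiting on the CPU between layers.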

... continue reading