Open LLM inference, rewritten by hand for one specific chip at a time.
Kernels, speculative decoding, and quantization, tailored per target.
We don't wait for better silicon. We rewrite the software.
Inside the box
Two projects today, more coming. Each one is a self-contained release with its own benchmarks and paper-style writeup.
01 · Megakernel Qwen3.5 0.8B on RTX 3090
The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers of Qwen3.5-0.8B run in a single CUDA dispatch at 1.87 tok/J on a 2020 GPU, matching the energy efficiency of Apple's latest silicon while delivering 2× the throughput.
# 1. clone + enter
git clone https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub/megakernel

# 2. install (Python 3.10+, CUDA 12+, PyTorch 2.0+). Weights stream from HF on first run.
pip install -e .

# 3. run the benchmark (prefill pp520 + decode tg128 vs llama.cpp BF16 + PyTorch HF)
python final_bench.py
Method                 Prefill pp520   Decode tg128   tok/J
Megakernel @220W       37,800          413            1.87
llama.cpp BF16 @350W   11,247          267            0.76
PyTorch HF             7,578           108            n/a
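The tok/J column follows directly from the other two: decode throughput in tokens per second divided by sustained board power in watts gives tokens per joule. A quick sanity check using the table's own numbers (413 tok/s at the 220 W cap, 267 tok/s at 350 W):

```python
# Recompute the tok/J column from decode throughput and board power.
# Inputs are taken straight from the benchmark table above.

def tokens_per_joule(decode_tok_s: float, power_w: float) -> float:
    """tokens/s divided by watts (J/s) = tokens per joule."""
    return decode_tok_s / power_w

megakernel = tokens_per_joule(413, 220)   # RTX 3090 capped at 220 W
llamacpp   = tokens_per_joule(267, 350)   # llama.cpp BF16 at 350 W

print(f"megakernel: {megakernel:.2f} tok/J")  # ~1.88, matching the table's 1.87 to rounding
print(f"llama.cpp:  {llamacpp:.2f} tok/J")    # 0.76
```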
What makes it work:
- 82 blocks × 512 threads, one persistent kernel.
- No CPU round-trips between layers.
- Weights streamed straight from HuggingFace.
- Cooperative grid sync instead of ~100 kernel launches per token.
- Power ceiling hit before compute ceiling, so DVFS converts tight execution straight into saved watts.
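Why replacing ~100 launches per token with one persistent kernel matters can be seen with a toy latency model. The overhead and compute figures below are illustrative assumptions, not measurements from the Megakernel writeup; the point is only that fixed per-launch cost multiplied a hundred times per token becomes a meaningful slice of decode latency:

```python
# Toy model: serial per-token latency = launches * launch_overhead + compute.
# All constants here are assumed for illustration, not taken from the benchmark.

LAUNCH_OVERHEAD_S = 5e-6      # assumed CPU-side cost per kernel launch (~5 us)
COMPUTE_PER_TOKEN_S = 2.0e-3  # assumed pure GPU compute per decoded token
LAUNCHES_PER_TOKEN = 100      # roughly one launch per op across 24 layers

def tokens_per_sec(launches_per_token: int) -> float:
    """Decode throughput under the toy launch-plus-compute model."""
    latency = launches_per_token * LAUNCH_OVERHEAD_S + COMPUTE_PER_TOKEN_S
    return 1.0 / latency

baseline = tokens_per_sec(LAUNCHES_PER_TOKEN)  # ~100 launches per token
persistent = tokens_per_sec(1)                 # one persistent kernel

print(f"baseline:   {baseline:6.1f} tok/s")
print(f"persistent: {persistent:6.1f} tok/s")
```

Under these assumptions the persistent kernel recovers the 0.5 ms of launch overhead per token, a ~25% throughput gain; at single-token decode, where compute per step is small, the fixed launch cost is exactly where the time goes.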
... continue reading