Author's Note: I am currently an MS CS student at CU Boulder specializing in Edge AI & Embedded Systems. I am actively looking for Summer 2026 internships where I can optimize difficult workloads on constrained silicon. 📄 Resume | ✉️ Email | 💻 GitHub
The “Unsupported” Hardware Problem
If you look at the spec sheet for the Rockchip RK3588 (the chip inside the Orange Pi 5), it looks like a beast. It promises 6 TOPS of NPU performance. For $100, that’s a steal.
But if you try to run modern AI on it—specifically the Vision Encoder from SmolVLM—that promise falls apart.
The standard computer vision SDK (rknn-toolkit2) is optimized for older, predictable CNNs such as ResNet. When I fed it the SigLIP Vision Transformer that SmolVLM uses, the driver choked. Even though the model is “smol,” the massive attention matrices it generates triggered cryptic hex errors, and the model refused to compile.
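For context, the conversion path I was attempting looks roughly like this. It is a minimal sketch of the standard rknn-toolkit2 flow; the ONNX filename is a hypothetical placeholder, not taken from my actual scripts:

```python
# Minimal sketch of the standard rknn-toolkit2 conversion flow.
# The ONNX filename below is illustrative, not the real artifact.
from rknn.api import RKNN

rknn = RKNN(verbose=True)

# Target the RK3588's NPU.
rknn.config(target_platform='rk3588')

# Load a (hypothetical) ONNX export of SmolVLM's SigLIP vision encoder.
ret = rknn.load_onnx(model='siglip_vision_encoder.onnx')
assert ret == 0, 'load_onnx failed'

# This is the step that chokes: compiling the graph for the NPU.
ret = rknn.build(do_quantization=False)
assert ret == 0, 'build failed'

rknn.export_rknn('siglip_vision_encoder.rknn')
rknn.release()
```

This exact flow works fine for a ResNet-style graph; the failure is specific to the Transformer's attention layers.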
This left me with one option: running the model on the CPU. The result? Inference on a single image took ~30 seconds. The 6 TOPS accelerator sat idle while the CPU struggled.
I didn’t accept that. I decided to reverse-engineer the NPU to find out exactly why it was failing, and how to force it to run at full speed.
Context: Why do it the hard way? (First Principles)
A quick note for those following the ecosystem: you might see projects like QEngineering running the newer SmolVLM-v2 on Rockchip's rknn-llm SDK.
That approach uses a specialized “black box” toolchain designed specifically for Transformers. Rockchip engineers have likely already implemented complex memory management inside that SDK to handle these models.