
Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

Why This Matters

Hypura's storage-tier-aware LLM inference scheduler lets consumer Apple Silicon devices run large language models beyond their physical memory limits. By managing tensor placement across GPU, RAM, and NVMe storage, it enables models that exceed available memory to run efficiently on everyday hardware.

Key Takeaways

Run models too big for your Mac's memory

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities — enabling models that exceed physical memory to run without crashing the system.
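The placement idea can be sketched as a greedy assignment: the "hottest" tensor groups (most bandwidth demanded per byte) claim the fastest tier first, spilling to slower tiers as each fills. This is an illustrative sketch only; the tier bandwidths, group sizes, and every name below are assumptions, not Hypura's actual API.

```python
from dataclasses import dataclass

# Hypothetical tier and tensor-group models; capacities and bandwidths
# are illustrative, not measured values from Hypura.
@dataclass
class Tier:
    name: str
    capacity_gb: float
    bandwidth_gbps: float  # approximate sustained read bandwidth

@dataclass
class TensorGroup:
    name: str
    size_gb: float
    accesses_per_token: float  # how often the group is read per generated token

def assign_tiers(groups, tiers):
    """Greedy placement: groups with the highest access frequency per byte
    go to the fastest tier first, spilling to slower tiers as each fills."""
    hot_first = sorted(groups,
                       key=lambda g: g.accesses_per_token / g.size_gb,
                       reverse=True)
    placement = {}
    free = {t.name: t.capacity_gb for t in tiers}
    for g in hot_first:
        for t in tiers:  # tiers ordered fastest -> slowest
            if free[t.name] >= g.size_gb:
                placement[g.name] = t.name
                free[t.name] -= g.size_gb
                break
    return placement

tiers = [Tier("gpu", 8, 400), Tier("ram", 16, 100), Tier("nvme", 500, 5)]
groups = [
    TensorGroup("norms+embeddings", 0.5, 1.0),  # tiny, touched every token
    TensorGroup("attention", 6.0, 1.0),
    TensorGroup("dense_ffn", 24.0, 1.0),        # ~60% of model size
]
print(assign_tiers(groups, tiers))
# -> {'norms+embeddings': 'gpu', 'attention': 'gpu', 'dense_ffn': 'nvme'}
```

Under these assumed sizes, the greedy pass reproduces the split the article describes: norms, embeddings, and attention stay GPU-resident while the dense FFN weights fall to NVMe.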

Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.

Why does this matter?

Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.

Hypura solves this by understanding the model architecture:

Norms and embeddings are tiny but accessed every token — pinned to GPU.

MoE expert routing exploits sparsity — only 2 of 8 experts fire per token. Router interception identifies selected experts in the eval callback, then loads only the needed expert strides from NVMe (75% I/O reduction). A neuron cache tracks loaded expert slices across tokens, achieving a 99.5% hit rate from temporal locality. Co-activation tracking predicts which experts will fire next for speculative prefetch.

Dense FFN weights (gate, up, down — ~60% of model size) stream from NVMe through a dynamically sized pool buffer while attention and norms stay GPU-resident. Prefetch lookahead depth scales automatically with available memory.
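The neuron-cache idea above can be sketched as a small LRU cache keyed by expert ID, with a hit-rate counter to show why temporal locality pays off. All names here (ExpertCache, load_fn, the simulated routing trace) are hypothetical illustrations, not Hypura's actual interfaces, and an LRU policy is an assumption.

```python
from collections import OrderedDict

# Hypothetical "neuron cache" for MoE expert slices, assuming an LRU
# eviction policy. load_fn stands in for reading an expert stride off NVMe.
class ExpertCache:
    def __init__(self, capacity, load_fn):
        self.cache = OrderedDict()   # expert_id -> weight slice
        self.capacity = capacity
        self.load_fn = load_fn
        self.hits = self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)   # refresh LRU position
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Simulated router output: 2 of 8 experts fire per token, with the kind of
# temporal locality (experts repeating across adjacent tokens) that makes
# the cache effective.
cache = ExpertCache(capacity=6, load_fn=lambda eid: f"weights[{eid}]")
token_routes = [(0, 3), (0, 3), (1, 3), (0, 1), (5, 3), (0, 3)]
for experts in token_routes:
    for eid in experts:
        cache.get(eid)
print(f"hit rate: {cache.hit_rate:.2f}")  # -> hit rate: 0.67
```

Even this toy trace, where a few experts dominate consecutive tokens, shows most accesses being served from the cache rather than NVMe — the same locality effect behind the reported 99.5% hit rate on real workloads.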

The result: models that would crash your machine under naive mmap become runnable. Models that fit in memory run at full Metal GPU speed with zero overhead.
