
TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS

Why This Matters

SwiftLM introduces a native, high-performance inference server optimized for Apple Silicon, enabling faster and more efficient deployment of large language models through features like TurboQuant KV-cache quantization and SSD-based expert streaming. For developers and consumers alike, it improves the speed, memory management, and model scalability of AI model serving on macOS.

Key Takeaways

⚡️ SwiftLM

A blazingly fast, native Swift inference server that serves MLX models with a strict OpenAI-compatible API.

No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.
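
Because the server speaks the OpenAI wire format, any HTTP client or unmodified OpenAI SDK can talk to it by POSTing a standard chat-completions body to `/v1/chat/completions`. A minimal sketch of building such a request body; the model identifier is an illustrative assumption, and the endpoint path follows the OpenAI convention the article cites:

```python
import json

def chat_request(model: str, prompt: str, stream: bool = True) -> str:
    """Build an OpenAI-format /v1/chat/completions request body."""
    body = {
        "model": model,                                    # served model name
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,                                  # token streaming on/off
    }
    return json.dumps(body)

# POST this payload to http://<host>:<port>/v1/chat/completions on the
# local server; an unmodified OpenAI SDK produces an equivalent body.
payload = chat_request("my-local-model", "Hello from SwiftLM!")
```

Pointing an existing OpenAI SDK at the server's base URL instead of api.openai.com is the "drop-in replacement" path the feature list describes.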

🚀 Features

🍎 100% Native Apple Silicon: Powered natively by Metal and Swift.

🔌 OpenAI-compatible: Drop-in replacement for OpenAI SDKs (/v1/chat/completions, streaming, etc.).

🧠 Smart Model Routing: Loads HuggingFace-format models directly, with native Safetensors parsing.

⚡️ TurboQuantization Integrated: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out of the box.

💾 SSD Expert Streaming: Experimental zero-copy streaming that swaps Mixture-of-Experts (MoE) layers directly from the NVMe SSD to the GPU command buffer without thrashing macOS Unified Memory (prevents Watchdog OS kernel panics on 122B+ models).

🎛️ Granular Memory Control: Integrated layer partitioning (--gpu-layers) and Wisdom Auto-Calibration for squeezing massive models into RAM.
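
TurboQuant's Metal kernels are not shown in the article, but the general idea behind quantized KV caching can be sketched in plain Python: store cached keys and values at low precision with a shared scale, and dequantize on read. This is an illustrative stand-in for absmax int8 quantization, not the actual TurboQuant implementation:

```python
# Sketch of absmax int8 quantization for a slice of a KV cache. The real
# TurboQuant primitives run on the GPU via MLX/Metal; this pure-Python
# version only illustrates the round-trip.
def quantize(values):
    """Map floats to int8 range [-127, 127] with a shared absmax scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Recover approximate floats from quantized ints and their scale."""
    return [x * scale for x in q]

kv_slice = [0.5, -1.0, 0.25, 0.75]      # pretend cached key/value entries
q, scale = quantize(kv_slice)
restored = dequantize(q, scale)
# Each restored value lies within one quantization step of the original,
# while the cache stores one byte per entry instead of a float.
```

Storing the KV cache at 8 bits roughly quarters its footprint versus float32, which is what makes long contexts fit in unified memory.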
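
The SSD streaming idea can likewise be sketched with memory mapping: instead of copying a whole weights file into process memory, the OS pages in only the bytes of the expert actually touched. The file layout and tiny expert size below are assumptions for illustration; SwiftLM's actual zero-copy path feeds the Metal GPU command buffer directly:

```python
import mmap
import os
import struct
import tempfile

EXPERT_FLOATS = 4                    # tiny illustrative "expert" of 4 float32 weights
EXPERT_BYTES = EXPERT_FLOATS * 4

# Write a fake weights file holding two experts back to back.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<8f", *range(8)))

def load_expert(mm, index):
    """Read one expert's weights from the mapping without touching the rest."""
    offset = index * EXPERT_BYTES
    return struct.unpack_from(f"<{EXPERT_FLOATS}f", mm, offset)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    expert1 = load_expert(mm, 1)     # only these pages get faulted in from disk
    mm.close()
```

Because only the routed expert's pages are faulted in, inactive experts never occupy unified memory, which is what keeps very large MoE models from exhausting RAM.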
