
Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts


ZSE - Z Server Engine

Ultra memory-efficient LLM inference engine.

ZSE is designed to run large language models with a minimal memory footprint while maintaining high performance. Its key innovation is the Intelligence Orchestrator, which makes recommendations based on the memory actually available on your device (free memory), not its total capacity.
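A minimal sketch of the "recommend by free memory" idea. The function name, thresholds, and headroom heuristic below are illustrative assumptions, not ZSE's actual API; the mode names come from the Efficiency Modes listed below. In practice the free figure would come from something like `torch.cuda.mem_get_info()`, which reports free and total device memory separately.

```python
# Hypothetical sketch: recommend_mode and its thresholds are illustrative,
# not ZSE's real orchestrator logic.

def recommend_mode(free_gib: float, model_gib: float) -> str:
    """Pick an efficiency mode from memory actually available right now."""
    headroom = free_gib / model_gib  # free memory relative to weight size
    if headroom >= 1.5:
        return "speed"      # full-precision weights fit, with room for KV cache
    if headroom >= 1.0:
        return "balanced"   # weights fit; quantize the KV cache
    if headroom >= 0.5:
        return "memory"     # quantize weights (e.g. INT4) to fit
    return "ultra"          # quantize aggressively and stream layers

# A 16 GiB model with 24 GiB free vs. only 4 GiB free:
print(recommend_mode(24.0, 16.0))  # -> speed
print(recommend_mode(4.0, 16.0))   # -> ultra
```

The point of keying on free rather than total memory is that a 24 GiB GPU with 20 GiB already allocated should get the same recommendation as a 4 GiB GPU.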

Key Features

- 🧠 zAttention: Custom CUDA kernels for paged, flash, and sparse attention
- 🗜️ zQuantize: Per-tensor INT2-8 mixed-precision quantization
- 💾 zKV: Quantized KV cache with sliding precision (4x memory savings)
- 🌊 zStream: Layer streaming with async prefetch (run 70B on a 24GB GPU)
- 🎯 zOrchestrator: Smart recommendations based on free memory
- 📊 Efficiency Modes: speed / balanced / memory / ultra
