Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput
LLM Inference for Large-Context Offline Workloads

oLLM is a lightweight Python library for large-context LLM inference, built on top of Huggingface Transformers and PyTorch. It enables running models such as gpt-oss-20B, qwen3-next-80B, or Llama-3.1-8B-Instruct with 100k context on a ~$200 consumer GPU with 8GB VRAM. No quantization is used; only fp16/bf16 precision.

Latest updates (0.4.0) 🔥
- qwen3-next-80B (160GB model) added with ⚡️1tok/2s throughput (fastest model so far)
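For a rough point of reference, here is a minimal sketch of bf16 inference with plain Hugging Face Transformers, the stack oLLM builds on. This is not oLLM's own API: the `device_map="auto"` offloading shown is standard Transformers/accelerate behavior, and the model name, offload folder, and generation parameters are illustrative assumptions only.

```python
# Sketch only: bf16 (no quantization) inference with plain Hugging Face
# Transformers. oLLM's own API and offloading strategy are not shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed example model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # bf16 weights, no quantization
    device_map="auto",            # spill layers to CPU/disk if VRAM is short
    offload_folder="offload",     # assumed scratch directory for offloaded weights
)

prompt = "Summarize the following report: ..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Offloading weights out of VRAM trades throughput for memory, which is why speeds for the largest models are quoted in seconds per token rather than tokens per second.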