1.
2.
KVarN: Native vLLM KV-cache quantization back end by Huawei
(news.ycombinator.com)
3.
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
(news.ycombinator.com)
4.
Eagle 3.1: Collaboration Between the EAGLE Team, vLLM Team, and TorchSpec Team
(news.ycombinator.com)
5.
Boosting multimodal inference performance by >10% with a single Python dict
(news.ycombinator.com)
6.
Advanced Quantization Algorithm for LLMs
(news.ycombinator.com)
7.
8.
DeepSeek OCR
(news.ycombinator.com)
9.
Voxtral-Mini-3B-2507 – Open source speech understanding model
(news.ycombinator.com)
10.
mistralai/Voxtral-Mini-3B-2507 · Hugging Face
(news.ycombinator.com)
11.
vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
(news.ycombinator.com)
12.
Life of an inference request (vLLM V1): How LLMs are served efficiently at scale
(news.ycombinator.com)
13.
Lossless LLM 3x Throughput Increase by LMCache
(news.ycombinator.com)