How to Build Ultra-low-latency Voice Agents With NVIDIA Cache-aware Streaming ASR
This post accompanies the launch of NVIDIA Nemotron Speech ASR on Hugging Face. Read the full model announcement here.
In this post, we’ll build a voice agent using three NVIDIA open models:
This voice agent leverages the new streaming ASR model, Pipecat’s low-latency voice agent building blocks, and some fun code experiments to optimize all three models for very fast response times.
All the code for this post is in this GitHub repository.
You can clone the repo and run this voice agent:
At scale for multi-user workloads on the Modal cloud platform.
On an NVIDIA DGX Spark or RTX 5090 for single-user, local development and experimentation.
Feel free to just jump over to the code. Or read on for technical notes about building fast voice agents and the NVIDIA open models.
The state of voice AI agents in 2026