A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.
Voice AI has moved from research demos to shipping products in under three years. The modern stack is converging on a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.
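As a minimal sketch of that pattern, the core loop can be pictured as three streaming stages plus a turn-taking gate. The function names here (`stt_stream`, `llm_reply`, `tts_stream`, `should_speak`) are hypothetical stand-ins, not any particular framework's API:

```python
# Minimal sketch of the STT -> LLM -> TTS voice-agent loop.
# Every function below is a hypothetical placeholder for a real
# streaming API (Deepgram, OpenAI, ElevenLabs, etc.).

def stt_stream(audio_chunks):
    """Pretend speech-to-text: yields one transcript per utterance."""
    for chunk in audio_chunks:
        yield f"transcript of {chunk}"

def llm_reply(transcript):
    """Pretend LLM call: returns the agent's text response."""
    return f"reply to: {transcript}"

def tts_stream(text):
    """Pretend text-to-speech: returns synthesized audio (placeholder)."""
    return f"audio({text})"

def agent_turn(audio_chunks, should_speak=lambda t: True):
    """One pass through the pipeline. `should_speak` stands in for the
    turn-taking model that decides when the agent actually responds."""
    replies = []
    for transcript in stt_stream(audio_chunks):
        if should_speak(transcript):
            replies.append(tts_stream(llm_reply(transcript)))
    return replies

print(agent_turn(["utterance 1"]))
```

Real frameworks run these stages concurrently and stream partial results between them; the sequential version above only shows the data flow.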
Resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced. The list prefers free, official docs and vendor-neutral guides, and flags resources whose authors have commercial interests.
How to use this list
Read top-to-bottom if you're brand new. The recommended path:
1. Foundations → understand the pipeline and latency budget
2. Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
3. Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
4. Transport & telephony → connect to a real phone number
5. Evaluation, production, ethics → make it safe enough to ship
Table of contents
1. Foundational concepts and learning paths
Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you'll fight for the rest of your career.
2. Frameworks and orchestration platforms