
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

Why This Matters

This development is a major step forward in on-device multimodal AI: real-time voice and vision interactions without relying on cloud servers. Running advanced models on personal hardware makes them more private and cost-effective, which could reshape language learning and similar applications. As consumer hardware improves, local AI like this is poised to become mainstream.


Parlor

On-device, real-time multimodal AI. Have natural voice and vision conversations with an AI that runs entirely on your machine.

Parlor uses Gemma 4 E2B for understanding speech and vision, and Kokoro for text-to-speech. You talk, show your camera, and it talks back, all locally.

[Demo video: parlor_realtime_ai_with_audio_video_input_optimized.mp4]

Research preview. This is an early experiment. Expect rough edges and bugs.

I'm self-hosting a totally free voice AI on my home server to help people learn to speak English. It has hundreds of monthly active users, and I've been thinking about how to keep it free while making it sustainable.

The obvious answer: run everything on-device and eliminate server costs entirely. Six months ago I needed an RTX 5090 just to run the voice models in real-time.

Google just released a super capable small model that I can run on my M3 Pro in real-time, with vision too! Sure, you can't do agentic coding with it, but it's a game-changer for people learning a new language. Imagine, a few years from now, people running this locally on their phones: they could point their camera at objects and talk about them. And the model is multilingual, so learners can always fall back to their native language if they want. This is essentially what OpenAI demoed a few years ago.

How it works

Browser (mic + camera)
  │
  │ WebSocket (audio PCM + JPEG frames)
  ▼
FastAPI server
  ├── Gemma 4 E2B via LiteRT-LM (GPU) → understands speech + vision
  └── Kokoro TTS (MLX on Mac, ONNX on Linux) → speaks back
  │
  │ WebSocket (streamed audio chunks)
  ▼
Browser (playback + transcript)
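Since the browser multiplexes two payload kinds (PCM audio chunks and JPEG frames) over one WebSocket, the server needs some way to tell them apart. The post doesn't describe Parlor's actual wire format, so here is a minimal sketch of one plausible framing scheme: a one-byte type tag plus a four-byte length prefix on each message. All names and constants below are assumptions for illustration, not Parlor's real protocol.

```python
import struct

# Hypothetical message types for the browser -> server WebSocket stream.
MSG_AUDIO_PCM = 0x01   # raw 16-bit PCM audio chunk
MSG_VIDEO_JPEG = 0x02  # one JPEG-encoded camera frame

def encode_frame(msg_type: int, payload: bytes) -> bytes:
    """Prefix the payload with a 1-byte type tag and a 4-byte big-endian length."""
    return struct.pack("!BI", msg_type, len(payload)) + payload

def decode_frame(frame: bytes) -> tuple[int, bytes]:
    """Split a frame back into (msg_type, payload), validating the length."""
    msg_type, length = struct.unpack("!BI", frame[:5])
    payload = frame[5:]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return msg_type, payload
```

On the server side, a FastAPI WebSocket handler would read each binary message, decode the tag, and route audio to the speech pipeline and JPEGs to the vision input; the streamed TTS audio going back needs no tag since it is the only payload in that direction.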
