
Show HN: Morph Reflexes – Multi-head classifiers for agent traces

Why This Matters

Morph Reflexes introduces a scalable, fast, and cost-effective method for evaluating agent behaviors using multi-head classifiers. This innovation enables large-scale monitoring of agent performance, addressing the critical need for real-time insights in high-volume environments. Its efficient architecture significantly benefits both the tech industry and consumers by improving agent reliability and user experience.

Key Takeaways

The most common failures for production agents are behavioral: looping, reasoning leakage, user frustration, and more. Using a frontier model like GPT or Sonnet to judge every turn is too expensive and slow to run at scale.

To solve this, we built Reflexes: semantic signals from agent traces, served fast and cheap over API. Built on custom kernels and a custom inference engine forked from vLLM.

Under the hood, it is a small LLM architected around multi-head inference. Small models need to be trained for specific tasks, but running 50 separate small models on the same input for 50 tasks makes no sense.

How it works: We take a modern LLM with hybrid attention and remove the decode step. One shared backbone reads the trace once, then many heads classify different signals. Our inference engine reuses the KV cache across inputs and shares compute across all reflexes, so about 99% of prefill compute is reused from reflex to reflex. This is similar in spirit to 2019-era BERT/HYDRA and other older multi-head techniques. The result is sub-30ms inference with less than 0.1% overhead for each additional reflex.

We took the same high-level idea as those older multi-head techniques and did the hard work of making it work with a modern architecture and attention mechanism. With it, we run inference in under 30ms and serve the full request in under 90ms. Whether you run 4 reflexes or 100, the extra overhead is less than 2ms.
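To make the "one backbone, many heads" shape concrete, here is a minimal toy sketch. It is not Morph's engine: `SharedBackbone`, `Head`, and `run_reflexes` are hypothetical names, the "backbone" is a trivial trigram featurizer, the head weights are random and untrained, and nothing here models KV cache reuse or custom kernels. It only illustrates that the expensive pass runs once while each extra head costs a single dot product.

```python
import math
import random
from typing import Dict, List

EMBED_DIM = 8  # toy hidden size, chosen arbitrarily for this sketch


class SharedBackbone:
    """Stand-in for the expensive prefill pass; counts how often it runs."""

    def __init__(self) -> None:
        self.calls = 0

    def encode(self, trace: str) -> List[float]:
        self.calls += 1
        # Toy featurizer: bucket character trigrams into a fixed-size vector.
        vec = [0.0] * EMBED_DIM
        for i in range(len(trace) - 2):
            bucket = sum(ord(c) for c in trace[i:i + 3]) % EMBED_DIM
            vec[bucket] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]


class Head:
    """One cheap linear classifier over the shared representation."""

    def __init__(self, name: str, rng: random.Random) -> None:
        self.name = name
        # Random weights for illustration only; a real head would be trained.
        self.weights = [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]

    def score(self, hidden: List[float]) -> float:
        z = sum(w * h for w, h in zip(self.weights, hidden))
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid, so the score lies in (0, 1)


def run_reflexes(backbone: SharedBackbone, heads: List[Head], trace: str) -> Dict[str, float]:
    hidden = backbone.encode(trace)  # the trace is read exactly once
    return {head.name: head.score(hidden) for head in heads}


rng = random.Random(0)
backbone = SharedBackbone()
heads = [Head(f"reflex_{i}", rng) for i in range(50)]
scores = run_reflexes(backbone, heads, "user: it is still broken, same error as before")
print(len(scores), "scores from", backbone.calls, "backbone pass")
```

Running 50 separate small models would correspond to calling `encode` 50 times; here the per-head work is independent of how long the trace is.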

Why does optimizing this matter?

If you’re even a medium-sized startup, you’re dealing with tens of thousands of agent runs and millions of turns. If you want to track things like user frustration rates over time, frontier LLM-as-judge does not scale.
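As a rough illustration of what "tracking a rate over time" means once every turn carries a cheap signal, the sketch below aggregates per-turn booleans into a daily rate. The record shape `(day, fired)` and the function name `daily_rate` are assumptions made for this example, and the sample rows are made up; none of it is taken from the Reflexes API.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (day of the turn, whether the frustration signal fired).
TurnSignal = Tuple[date, bool]


def daily_rate(turns: Iterable[TurnSignal]) -> Dict[date, float]:
    """Return, for each day, the fraction of turns on which the signal fired."""
    fired: Dict[date, int] = defaultdict(int)
    total: Dict[date, int] = defaultdict(int)
    for day, is_frustrated in turns:
        total[day] += 1
        fired[day] += int(is_frustrated)
    return {day: fired[day] / total[day] for day in sorted(total)}


# Made-up sample rows, for illustration only.
sample = [
    (date(2025, 1, 1), False),
    (date(2025, 1, 1), True),
    (date(2025, 1, 2), False),
    (date(2025, 1, 2), False),
    (date(2025, 1, 2), False),
    (date(2025, 1, 2), True),
]
rates = daily_rate(sample)
```

The aggregation itself is trivial; the point of the paragraph above is that producing the per-turn boolean for millions of turns is the expensive part.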

I built a similar stack at Tesla. When ML engineers needed to sample data across petabytes for signals like `is_camera_obfuscated=true`, along with 200 other things, they had to 1) spin up new signals quickly and 2) run them at scale efficiently.

What it is not: a dashboard. 99% of dashboards go unused. Reflexes is 100% API-first and made for devs who want to use it to trigger their own stuff.
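One way "trigger their own stuff" could look on the consumer side is a small threshold dispatcher. This is a sketch under assumptions: the score dictionary, the signal names, the thresholds, and the `dispatch` helper are all hypothetical and do not reflect the actual Reflexes response format.

```python
from typing import Callable, Dict, List

Handler = Callable[[str, float], None]


def dispatch(scores: Dict[str, float], rules: Dict[str, float], handler: Handler) -> List[str]:
    """Call handler for each signal whose score meets its threshold; return the fired names."""
    fired: List[str] = []
    for name, threshold in rules.items():
        score = scores.get(name)
        if score is not None and score >= threshold:
            handler(name, score)
            fired.append(name)
    return fired


alerts: List[str] = []
fired = dispatch(
    scores={"looping": 0.91, "user_frustration": 0.12},  # hypothetical per-turn scores
    rules={"looping": 0.8, "user_frustration": 0.5},     # hypothetical thresholds
    handler=lambda name, score: alerts.append(f"{name}={score:.2f}"),
)
```

In practice the handler might page someone, retry the agent turn, or write to a queue; the lambda here just records a string.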

Vibetrain a custom reflex in our dashboard, then optionally let it self-improve in production: https://www.morphllm.com/dashboard/reflex
