Today, we’re announcing a research preview of interaction models: models that handle interaction natively rather than through external scaffolding. We think interactivity should scale alongside intelligence; the way we work with AI should not be treated as an afterthought. Interaction models let people collaborate with AI the way we naturally collaborate with each other—they continuously take in audio, video, and text, and think, respond, and act in real time.
We train an interaction model from scratch. To ensure real-time responsiveness, we adopt a multi-stream, micro-turn design. Our research preview demonstrates qualitatively new interaction capabilities, as well as state-of-the-art combined performance in intelligence and responsiveness.
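The post does not describe the multi-stream, micro-turn design in detail, so the following is only an illustrative sketch of what such a loop could look like: the function name `micro_turn_schedule`, the chunk granularity, and the per-turn token budget are all assumptions, not the actual architecture.

```python
from collections import deque

def micro_turn_schedule(streams, n_turns, tokens_per_turn=2):
    """Hypothetical multi-stream, micro-turn loop (illustrative only).

    `streams` maps a modality name to a deque of pending input chunks.
    Each micro-turn ingests at most one chunk per stream, then emits a
    small bounded burst of output tokens, so new input is never blocked
    by an in-flight response.
    """
    log = []
    for turn in range(n_turns):
        # Ingest: perception continues even while output is in flight.
        for name, chunks in streams.items():
            if chunks:
                log.append(("ingest", name, chunks.popleft()))
        # Generate: emit only a bounded burst, then yield control back
        # to ingestion instead of running a full turn to completion.
        for tok in range(tokens_per_turn):
            log.append(("emit", "text", f"tok{turn}.{tok}"))
    return log

streams = {
    "audio": deque(["a0", "a1", "a2"]),
    "video": deque(["v0", "v1"]),
}
events = micro_turn_schedule(streams, n_turns=3)
```

The point of the interleaving is visible in the event log: the second audio chunk is ingested after the first output tokens are emitted, rather than waiting for the whole response to finish.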
The collaboration bottleneck
AI labs often treat the ability to work autonomously as a model's most important capability (Kwa, T., West, B., Becker, J., et al., "Measuring AI Ability to Complete Long Tasks," METR, 2025). As a result, today's models and interfaces aren't optimized for keeping humans in the loop. A recent frontier model card states: "Importantly, we find that when used in an interactive, synchronous, 'hands-on-keyboard' pattern, the benefits of the model were less clear. When used in this fashion, some users perceived [our model] as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model's coding capabilities."
Autonomous interfaces are valuable, but in most real work, users can't fully specify their requirements upfront and walk away; good results come from a collaborative process in which the human stays in the loop, clarifying and giving feedback along the way. Yet humans increasingly get pushed out, not because the work doesn't need them, but because the interface has no room for them. People are most effective when they can collaborate with AI the way we collaborate with each other: messaging, talking, listening, seeing, showing, and interjecting as needed, and when the model can do the same. Communication improves with: (a) copresence: people can interact with what others are interacting with; (b) contemporality: people receive information as it is produced, with instant feedback; (c) simultaneity: people receive and produce information at the same time (Clark, H. and Brennan, S., "Grounding in Communication," in Perspectives on Socially Shared Cognition, 1991; on the participatory, rather than objectively distanced, nature of oral communication, see Ong, W. J., Orality and Literacy: The Technologizing of the Word, 1982).
To resolve this, we need to move beyond the current turn-based interface. Today's models experience reality in a single thread. (We are referring to commercial general-purpose frontier models; smaller-scale or specialized models such as Moshi, PersonaPlex, Nemotron VoiceChat, and GPT-Realtime-Translate exist.) Until the user finishes typing or speaking, the model waits, with no perception of what the user is doing or how the user is doing it. Until the model finishes generating, its perception freezes, receiving no new information until it finishes or is interrupted. This creates a narrow channel for human-AI collaboration that limits how much of a person's knowledge, intent, and judgment can reach the model, and how much of the model's work can be understood. (On the kind of practical, situated knowledge at stake: "Metis, with the premium it places on practical knowledge, experience, and stochastic reasoning…is the mode of reasoning most appropriate to complex material and social tasks where the uncertainties are so daunting that we must trust our (experienced) intuition and feel our way," Scott, J. C., Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed, 1998; "A little reflection will show that there is…a body of very important but unorganized knowledge…: the knowledge of the particular circumstances of time and place," Hayek, F. A., "The Use of Knowledge in Society," The American Economic Review, 1945.) Picture trying to resolve a crucial disagreement over email rather than in person.
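The single-thread, turn-based pattern described above can be caricatured in a few lines. This is a deliberately simplified sketch, not any particular product's implementation:

```python
def turn_based_loop(user_turns, respond):
    """Caricature of today's turn-based interface: strict alternation.

    While the user composes a turn the model sees nothing, and while
    `respond` runs to completion the model ingests nothing new. There
    is no point in the loop where perception and generation overlap.
    """
    transcript = []
    for turn in user_turns:          # model is blind until the turn ends
        transcript.append(("user", turn))
        reply = respond(turn)        # perception frozen until this returns
        transcript.append(("model", reply))
    return transcript
```

Everything the user does mid-generation (hesitating, gesturing at the screen, starting to object) is invisible to the model, which is exactly the narrow channel the paragraph describes.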
At Thinking Machines, we believe we can solve this bandwidth bottleneck by making AI interactive in real time across any modality. This enables AI interfaces to meet humans where they are, rather than forcing humans to contort themselves to AI interfaces.
Most existing AI models bolt on interactivity with a harness, stitching components together to emulate interruptions, multimodality, or concurrency. Most real-time commercial speech systems, for example, use voice-activity-detection components to detect turn boundaries. However, "the bitter lesson" (Sutton, R., The Bitter Lesson, 2019) suggests that these hand-crafted systems will be outpaced by the advance of general capabilities. For interactivity to scale with intelligence, it must be part of the model itself. With this approach, scaling a model makes it both smarter and a better collaborator.
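To make the scaffolding concrete, here is a toy version of the voice-activity-detection heuristic such harnesses use; the threshold and silence-run parameters are illustrative, and real systems use trained VAD models rather than raw energy:

```python
def vad_turn_boundaries(frames, energy_threshold=0.1, silence_frames=3):
    """Toy voice-activity-detection harness (illustrative only).

    Treats a run of `silence_frames` consecutive low-energy frames as
    the end of a user turn, the kind of hand-crafted heuristic that
    external scaffolding relies on to emulate interactivity.
    """
    boundaries, silent_run, speaking = [], 0, False
    for i, energy in enumerate(frames):
        if energy >= energy_threshold:
            speaking, silent_run = True, 0
        else:
            silent_run += 1
            if speaking and silent_run == silence_frames:
                boundaries.append(i)   # declare the turn over here
                speaking = False
    return boundaries
```

The brittleness is easy to see: a thoughtful pause reads the same as the end of a turn, so the system either interrupts too eagerly or waits too long, and no amount of threshold tuning recovers the judgment a model with native interactivity could apply.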
Capabilities
Having interactivity be part of the model unlocks a variety of capabilities that would otherwise need to be implemented in the harness.