AI Models Need a Virtual Machine

Applications using AI embed the AI model in a framework that interfaces between the model and the rest of the system, providing needed services such as tool calling and context retrieval. Software for early chatbots took user input, called the LLM, and returned the result to the user: essentially just a read-eval-print loop. But as the capabilities of LLMs have evolved and extension mechanisms such as MCP have been defined, the complexity of the control software that calls the LLM has increased. AI software systems require the same qualities that an operating system provides, including security, isolation, extensibility, and portability. For example, when an AI model needs to be given a file as part of its context, access control must be established that determines if the model should be allowed to view that file.

We believe it is time to consider standardizing the ways in which AI models are embedded into software and to think of that control software layer as a virtual machine, where one of the machine instructions, albeit a super-powerful one, is to call the LLM. Our approach decouples model development from integration logic, allowing any model to “plug in” to a rich software ecosystem that includes tools, security controls, memory abstractions, etc. Much as the Java Virtual Machine did for Java programs, a specification of a VM for the AI orchestrator could enable a “write once, run anywhere” execution environment for AI models while at the same time providing familiar constraints and governance to maintain security and privacy in existing software systems. Below we outline related work in this direction, the motivation behind it, and the key benefits of an AI Model VM.

Introduction

AI models are being leveraged in existing software as application copilots, embedded in IDEs, and, with the rise of MCP, are increasingly able to use tools, implement agents, etc.
This rapid evolution of valuable use cases brings with it a greater need to ensure that AI-powered applications maintain privacy, are secure, and operate correctly. Guarantees of security and privacy are best provided if the underlying system is secure by design, rather than added to systems as an afterthought.

We take the Java Virtual Machine (JVM) as our inspiration in making the case for the importance of a standard AI Virtual Machine. The Java Virtual Machine guarantees memory safety by design, defines access control policies, and prevents code injection with bytecode verification. These properties allow Java programs running on the JVM to be executed with trust despite being shipped remotely, enabling “write once, run anywhere” software distribution.

How does the JVM relate to applications that use AI models? We use the following example to explain. The diagram illustrates the role of the software layer that interacts with an AI model, which we call the Model Virtual Machine (MVM). That layer intermediates between the model and the rest of the world. For example, a chatbot user might type a prompt (1) that the MVM then sends unmodified to the AI model (2). In practice, the MVM will also add context, including the system prompt and chat history, to the AI model input. The AI model generates a response, which in the example requires a specific tool to be called (3). This response has a specific format that is mutually agreed upon between the model and the MVM, such as MCP. In our example, because it is important to restrict the model from making undesired tool calls, the MVM first consults the list of allowed tools (4) before deciding to call the tool the model requested (5). This check (4) guarantees that the model doesn’t make unauthorized tool calls.

Every commercial system using AI models requires some version of this control software. We make the analogy that the interface with the LLM should be a virtual machine.
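Steps (1) through (5) of the example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an implementation of any existing system or of MCP: the names `ModelVM`, `ToolCall`, `stub_model`, and the two toy tools are hypothetical, and the model is a hard-coded stub rather than a real LLM. For brevity the sketch returns the tool result directly instead of feeding it back to the model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Union


@dataclass
class ToolCall:
    """Structured tool request emitted by the model (step 3)."""
    name: str
    args: Dict[str, str]


ModelOutput = Union[ToolCall, str]


@dataclass
class ModelVM:
    """Hypothetical MVM: mediates between the user, the model, and tools."""
    model: Callable[[List[str]], ModelOutput]
    tools: Dict[str, Callable[..., str]]
    allowed_tools: Set[str]
    system_prompt: str = "You are a helpful assistant."
    history: List[str] = field(default_factory=list)

    def handle_prompt(self, prompt: str) -> str:
        # (1) The user prompt arrives; (2) the MVM adds the system prompt
        # and chat history, then calls the model.
        context = [self.system_prompt, *self.history, prompt]
        output = self.model(context)
        self.history.append(prompt)
        if isinstance(output, str):
            self.history.append(output)
            return output
        # (3) The model requested a tool. (4) Consult the allow list
        # before doing anything on the model's behalf.
        if output.name not in self.allowed_tools or output.name not in self.tools:
            return f"denied: tool '{output.name}' is not allowed"
        # (5) Only now call the tool the model asked for.
        result = self.tools[output.name](**output.args)
        self.history.append(result)
        return result


def stub_model(context: List[str]) -> ModelOutput:
    """Hard-coded stand-in for an LLM, keyed on the latest prompt."""
    prompt = context[-1]
    if "weather" in prompt:
        return ToolCall("get_weather", {"city": "Seattle"})
    if "delete" in prompt:
        return ToolCall("delete_file", {"path": "notes.txt"})
    return "Hello!"


tools = {
    "get_weather": lambda city: f"Sunny in {city}",
    "delete_file": lambda path: f"deleted {path}",
}
vm = ModelVM(model=stub_model, tools=tools, allowed_tools={"get_weather"})
print(vm.handle_prompt("What is the weather?"))    # Sunny in Seattle
print(vm.handle_prompt("Please delete my notes"))  # denied: tool 'delete_file' is not allowed
```

The point of the sketch is that the allow-list check lives in the control layer, not in the model: the stub model asks for `delete_file`, but the request never reaches the tool.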
If the interface to the LLM is a virtual machine, what are the instructions that the machine can execute? Here are examples of operations that existing AI model interfaces have:

- Certifying, loading, initializing, and unloading a given AI model
- Calling a model with context
- Parsing the output from the model
- Certifying, loading, initializing, and unloading tools
- Calling a tool
- Parsing the results from a tool call
- Storing the results from a tool call into memory
- Asking the user for input
- Adding content to a history memory
- Standard control constructs such as conditionals, sequencing, etc.

A VM would support all of these operations in a well-typed context where constraints are placed on the calls made, the arguments passed, etc.

Existing Work Informs What is Needed

Some of the required elements of a well-specified interface are emerging in AI systems explored in academic work and in applications that are widely deployed:

- OpenAI’s Structured Tool Calling Protocols: OpenAI introduced a JSON-based function calling API that lets models invoke code-defined functions in a structured way. This approach, along with OpenAI’s plugin system (which uses OpenAPI specifications for tools), showed how structured tool-calling protocols can reduce ambiguity and simplify integration.
- Anthropic’s Model Context Protocol (MCP, 2024): MCP is an open protocol for connecting AI assistants to external data and tools, explicitly aiming to be a universal interface. “Think of MCP like a USB-C port for AI applications,” Anthropic explains. Instead of every service having a custom AI integration, MCP provides a common schema and client-server approach. Despite being relatively new, MCP has seen rapid adoption, including in large companies.
- Secure Orchestrators – FIDES & AC4A (2025): Security remains a weak point in current AI systems. Two recent projects propose runtime-level controls. FIDES (by Microsoft Research) enforces information-flow policies on agents by tracking data confidentiality labels and adding new agent actions like “inspect” to limit what agents can access (where a quarantined LLM can safely summarize restricted data) (paper). AC4A (Access Control for Agents) (manuscript in preparation) takes an OS-style approach: all tools and data are organized into hierarchies (like files and folders), and the agent must request read/write access for each resource. AC4A’s runtime intercepts every agent action and blocks anything not permitted, forcing a least-privilege operation mode. These projects show how a standard AI VM could include built-in security and access control, just as modern operating systems do. Even with strong access controls built into a VM specification, AI models present new security challenges that need to be considered in the design. For example, an AI model, when prevented from accessing a particular item of data, might use its chain-of-thought reasoning to devise ways to gather accessible data that allows it to infer the inaccessible item. As such, security researchers have to devise new mitigations to prevent AI models from taking adversarial actions even within the virtual machine's constraints.
- Open-Source Agent Runtimes: Several projects are actively building general-purpose runtimes for AI. For example, LangChain and Semantic Kernel provide numerous common runtime services that make writing reliable AI-enabled applications easier. The AI Controller Interface (AICI) (later renamed llguidance) integrates a lightweight VM into the model-serving pipeline, allowing developers to script and constrain model behavior at a low level (e.g., token-by-token control of generation).

Defining a specification for a VM interface for AI systems from these emerging approaches will require more than an agreement on protocols and APIs. Because AI systems derive their behavior from training data, model training data must reflect the specification of the VM interface so that the models and the VM interface can co-evolve.
This will enable otherwise diverse models to exhibit broadly compatible behavior with respect to the VM interface specification.

Benefits of a Well-Specified AI Model VM

As mentioned, many applications that leverage AI models require reliability, privacy, and security. In addition, new models are developed almost daily, and updating the model used by an application is often necessary. Given this confluence of factors, creating robust AI software presents significant engineering challenges. We believe that a specification of the interface between the AI model and the surrounding software will address some of these challenges. The need for an AI Model VM specification is driven by several clear motivations:

- Separation of Concerns: An interface specification enforces a clean separation between model logic and integration logic. This means models become interchangeable components. You could swap in a new model (or move an agent to a different platform) and, as long as both adhere to the standard, everything still works. Likewise, virtual machine implementors can improve the performance, security, and tooling of the virtual machine while maintaining compatibility with the AI model interfaces.
- Built-in Safety and Governance: A VM specification can enforce safety by design. By routing all tool usage and external access through a well-defined interface, it becomes easier to apply permission checks, audit logs, and fail-safes.
As shown by projects like AC4A, the VM can act as a gatekeeper, restricting what models can do unless explicitly authorized. This creates a safer deployment solution for powerful AI systems: even if the model behaves unpredictably, the VM layer can contain its effects. Standards bodies could even define security requirements (e.g., certain calls must always require user confirmation), creating a shared foundation of trust. Similar to the benefits of signed assemblies in the Common Language Runtime, having a certification process around loading and unloading models and tools helps ensure the end-to-end security of the supply chain.
- Transparent Performance & Resource Tracking: A VM specification could also give developers visibility into runtime diagnostics. Post-execution manifests could report model performance, resource consumption, and data access levels, which helps developers evaluate overall efficiency and performance. Benchmarks for accuracy, utility, and responsiveness can be supported directly in the VM interface across models and platforms.
- Verifiability of Model Output: Leveraging a VM specification, experts can explore integrating formal methods to verify model behavior. Techniques such as zero-knowledge proofs could confirm the integrity of model outputs without revealing sensitive internal logic. While still emerging, this possibility hints at new levels of trust and accountability in AI systems and should be carefully considered during development.

Conclusion

We argue that a well-specified AI Model Virtual Machine is needed. Developments occurring in multiple directions, including work from tech companies, startups, and academia, all motivate the need for a VM specification that lets AI models safely and seamlessly interact with the world around them. The motivation is clear: reducing complexity and unlocking interoperability. The potential benefits range from technical (faster development, modular upgrades) to strategic (cross-platform AI ecosystems, improved safety). From enforcing controls for security and privacy to, potentially, formal proof capabilities for trust, the opportunities are wide-ranging. Learning a lesson from older generations of software virtualization, a VM specification can increase the portability, interoperability, security, and reliability of AI systems. The purpose of this document is to highlight these issues and start engaging with the community on building a consensus that such a specification is needed and what it should include.

Biographies:

Shraddha Barke is a Senior Researcher at Microsoft Research in Redmond, Washington in the Research in Software Engineering (RiSE) group.
Her research interests include AI for proof generation, training AI models for program-reasoning tasks using RL, and improving the reliability of AI agents.

Betül Durak is a Principal Researcher at Microsoft Research in Redmond, Washington in the Security, Privacy, and Cryptography group. Her research interests broadly include security analysis as well as secure and private protocol designs motivated by real-world problems.

Dan Grossman is a Professor at the University of Washington and the Vice Director of the Paul G. Allen School of Computer Science & Engineering. His research interests are in programming languages, particularly in applying programming languages concepts and analyses to emerging domains.

Peli de Halleux is a Principal Research Software Developer Engineer in Redmond, Washington working in the Research in Software Engineering (RiSE) group. His research interests include empowering individuals to build LLM-powered applications more efficiently.

Emre Kıcıman is a Senior Principal Research Manager and Head of Research for Copilot Tuning at Microsoft. His research interests include causal methods, the security of AI, and applications of LLM and AI-based systems, together with their implications for people and society.

Reshabh K Sharma is a PhD student at the University of Washington. His research lies at the intersection of PL/SE and LLMs, focusing on developing infrastructure and tools to create better LLM-based systems that are easier to develop reliably and correctly.

Ben Zorn is a Partner Researcher at Microsoft Research in Redmond, Washington working in (and previously having co-managed) the Research in Software Engineering (RiSE) group. His research interests include programming language design and implementation, end-user programming, and AI software including technology for ensuring responsible AI.

Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community.
Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.