Rethinking the Linux cloud stack for confidential VMs
There is an inherent limit to the privacy of the public cloud. While Linux can isolate virtual machines (VMs) from each other, nothing in the system's memory is ultimately out of reach for the host cloud provider. To accommodate the most privacy-conscious clients, confidential computing protects the memory of guests, even from hypervisors. But the Linux cloud stack needs to be rethought in order to host confidential VMs, juggling two goals that are often at odds: performance and security.
Isolation is one of the most effective ways to secure the system by containing the impact of buggy or compromised software components. That's good news for the cloud, which is built around virtualization — a design that fundamentally isolates resources within virtual machines. This is achieved through a combination of hardware-assisted virtualization, system-level orchestration (like KVM, the hypervisor integrated into the kernel), and higher-level user-space encapsulation.
On the hardware side, mechanisms such as per-architecture privilege levels (e.g., rings 0-3 in x86_64 or Exception Levels on ARM) and the I/O Memory Management Unit (IOMMU) provide isolation. Hypervisors extend this by handling the execution context of VMs to enforce separation even on shared physical resources. At the user-space level, control groups limit the resources (CPU, memory, I/O) available to processes, while namespaces isolate different aspects of the system, such as the process tree, network stack, mount points, MAC addresses, etc. Confidential computing adds a new layer of isolation, protecting guests even from potentially compromised hosts.
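To make the user-space layer concrete, the short sketch below (not part of any particular cloud stack, just a minimal illustration) uses the unshare() system call to move a process into its own UTS and network namespaces; the hostname it sets is invisible to the rest of the system, and its new network namespace starts out with nothing but loopback. It needs CAP_SYS_ADMIN to run.

    /* ns_demo.c: minimal namespace-isolation sketch (requires CAP_SYS_ADMIN).
     * Build: gcc -o ns_demo ns_demo.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Detach from the parent's UTS (hostname) and network namespaces. */
        if (unshare(CLONE_NEWUTS | CLONE_NEWNET) == -1) {
            perror("unshare");
            exit(EXIT_FAILURE);
        }

        /* This change is visible only inside the new UTS namespace. */
        const char *name = "isolated-guest";
        if (sethostname(name, strlen(name)) == -1) {
            perror("sethostname");
            exit(EXIT_FAILURE);
        }

        /* The new network namespace starts with nothing but an (inactive)
         * loopback device, fully separated from the host's network stack. */
        char buf[64];
        gethostname(buf, sizeof(buf));
        printf("hostname inside the namespace: %s\n", buf);
        return 0;
    }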
In parallel with the work on security, there is a constant effort to improve the performance of Linux in the cloud — both in terms of literal throughput and in user experience (typically measured by quality-of-service metrics like low I/O tail latency). Knowing there is room to improve, cloud providers increasingly turn to I/O passthrough to speed up Linux: bypassing the host kernel (and sometimes the guest kernel) to expose physical devices directly to guest VMs. This can be done with user-space libraries like the Data Plane Development Kit (DPDK), which bypasses the guest kernel, or hardware-access features such as virtio Data Path Acceleration (vDPA), which allow paravirtualized drivers to send packets straight to the smartNIC hardware.
But hardware offloading exemplifies a fundamental friction in virtualization, where security and performance often pull in opposite directions. While it is true that offloading provides a faster path for network traffic, it has some downsides, such as limiting visibility and auditing, increasing reliance on hardware and firmware, and circumventing OS-based security checks of flows and data. The uncomfortable reality is that it's tricky for Linux to provide fast access to resources while concurrently enforcing the strict separation required to secure workloads. As it happens, the strongest isolation isn't the most performant.
A potential solution to this tension is extending confidential computing to the devices themselves by making them part of the VM's circle of trust. Hardware technologies like AMD's SEV Trusted I/O (SEV-TIO) allow a confidential VM to cryptographically verify (and attest to) a device's identity and configuration. Once trust is established, the guest can interact with the device and share secrets by allowing direct memory access (DMA) to its private memory, which is encrypted with its confidential VM key. This avoids bounce buffers — temporary memory copies used when devices, like GPUs when they are used to train AI models, need access to plaintext data — which significantly slow down I/O operations.
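The pressure that bounce buffers put on I/O is observable from inside a guest. On kernels built with debugfs support that export the SWIOTLB counters (the paths below follow that interface and may differ between kernel versions), a rough check of bounce-buffer usage looks like this:

    /* swiotlb_stat.c: peek at bounce-buffer (SWIOTLB) usage via debugfs.
     * Paths assume CONFIG_DEBUG_FS and a kernel that exports these counters.
     * Build: gcc -o swiotlb_stat swiotlb_stat.c (run as root). */
    #include <stdio.h>

    static long read_counter(const char *path)
    {
        long val = -1;
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
        return val;
    }

    int main(void)
    {
        long total = read_counter("/sys/kernel/debug/swiotlb/io_tlb_nslabs");
        long used  = read_counter("/sys/kernel/debug/swiotlb/io_tlb_used");

        if (total < 0 || used < 0) {
            fprintf(stderr, "SWIOTLB debugfs counters not available\n");
            return 1;
        }

        /* Each slab is 2KB; heavy DMA through bounce buffers shows up here. */
        printf("swiotlb slabs in use: %ld of %ld\n", used, total);
        return 0;
    }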
The TEE Device Interface Security Protocol (TDISP), an industry standard published by PCI SIG, defines how a confidential VM and device establish mutual trust, secure their communications, and manage interface attachment and detachment. A common way to implement TDISP is using a device with single root I/O virtualization (SR-IOV) support — a PCIe feature that a physical device can use to expose multiple virtual devices.
In those setups, the host driver manages the physical device, and each virtual device assigned to a guest VM acts as a separate TEE device interface. Unfortunately, TDISP requires changes in the entire software stack, including the device's firmware and hardware, host CPU, and the hypervisor. TDISP also faces headwinds because not all of the vendors are on board. Interestingly, NVIDIA, one of the biggest players in the GPU arena, sells GPUs with its own non-TDISP architecture.
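For context, slicing a physical device into virtual functions is already a routine host-side operation; the sketch below writes to the standard sriov_numvfs sysfs attribute to create the virtual devices that would then be handed to guests as TEE device interfaces. The PCI address used is a placeholder.

    /* sriov_vfs.c: enable SR-IOV virtual functions on a PCI device.
     * The device address 0000:3b:00.0 is a placeholder; adjust it for
     * real hardware.  Build: gcc -o sriov_vfs sriov_vfs.c (run as root). */
    #include <stdio.h>
    #include <stdlib.h>

    #define SRIOV_ATTR "/sys/bus/pci/devices/0000:3b:00.0/sriov_numvfs"

    int main(int argc, char **argv)
    {
        int num_vfs = (argc > 1) ? atoi(argv[1]) : 1;
        FILE *f = fopen(SRIOV_ATTR, "w");

        if (!f) {
            perror("open sriov_numvfs");
            return 1;
        }

        /* Writing N creates N virtual functions; writing 0 removes them.
         * The count must be reset to 0 before choosing a different value. */
        if (fprintf(f, "%d\n", num_vfs) < 0)
            perror("write sriov_numvfs");

        fclose(f);
        return 0;
    }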
Secure Boot
Beyond devices, many other parts of the Linux cloud stack must change to accommodate confidential computing, starting right at boot. To understand how, we need to look at Secure Boot. A typical sequence is shown in the area outlined in red in the figure below. First, the firmware verifies the shim pre-bootloader using a cryptographic key embedded in the firmware's non-volatile memory by the OEM, along with a database of valid signatures (DB) and a revocation database (DBX) that lists known-bad binaries, such as a compromised first-stage bootloader, and revoked certificates. Once verified, shim is loaded into system memory and execution jumps to it.
Shim then does a similar check on the next step, the bootloader (usually GRUB), using a key provided by the Linux distribution. Finally, the bootloader verifies and loads the kernel inside the guest VM. The guest kernel can read the values of the Platform Configuration Registers (PCRs) stored in a virtual Trusted Platform Module (TPM) that the hypervisor provides (e.g. using swtpm) to get the digests of all previously executed components and verify that they match known-good values.
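The measurements recorded along the way are easy to inspect from inside the guest. On kernels that export per-bank PCR files in sysfs, a small program like the following reads the SHA-256 digests that attestation later compares against known-good values; the path assumes the first TPM device, tpm0.

    /* read_pcrs.c: dump the SHA-256 PCR values exported by the TPM driver.
     * Requires a kernel that provides /sys/class/tpm/tpm0/pcr-sha256/.
     * Build: gcc -o read_pcrs read_pcrs.c */
    #include <stdio.h>

    int main(void)
    {
        char path[64], digest[128];

        for (int pcr = 0; pcr < 24; pcr++) {
            snprintf(path, sizeof(path),
                     "/sys/class/tpm/tpm0/pcr-sha256/%d", pcr);

            FILE *f = fopen(path, "r");
            if (!f)
                continue;       /* bank or device not present */

            /* Each file holds the hex-encoded digest for one PCR. */
            if (fgets(digest, sizeof(digest), f))
                printf("PCR[%2d] = %s", pcr, digest);
            fclose(f);
        }
        return 0;
    }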
Extra steps need to take place during boot to set up for confidential computing. In the figure above, a secure VM service module (SVSM) on the left becomes the first component to execute, verifying the firmware itself while running in a special hardware mode known as VMPL0 (Intel's equivalent is VTL0). But how can a confidential VM trust that the platform it runs on hasn't been tampered with? In traditional Secure Boot, the chain of trust relies on a virtual TPM (vTPM) provided by the host. However, the hypervisor itself is now untrusted, so the guest cannot rely on a TPM controlled by it. Instead, the SVSM, or other trusted component isolated from the host, must provide a vTPM that supplies measurements for remote attestation. This allows the guest OS to verify the integrity of the platform and decide whether it is safe to run.
The details of remote attestation can vary depending on the model followed; the most well-known is the Remote ATtestation procedureS (RATS) architecture. In this model, three actors play a role:
Attester: Dedicated hardware, like AMD's Platform Security Processor (PSP), that generates evidence about its current state (e.g., firmware version) by signing measurements with a private key stored within it.

Verifier: A remote entity that evaluates the evidence's integrity and trustworthiness. To do so, it consults an endorser to validate that the signing key and reported measurements (digests) are legitimate. The verifier can also be configured to enforce appraisal policies — for example, rejecting systems with outdated firmware versions from receiving secrets.

Endorser: A trusted third party, typically the hardware vendor, that provides certificates confirming that the signing key belongs to genuine cryptographic hardware. The endorser also supplies reference measurement values used by the verifier for validation.
The final product is an attestation result prepared by the verifier, confirming that the measured platform components match expected good values. A Linux confidential VM can use this report — including a vTPM quote with the current PCR values signed by a vTPM private key and a nonce supplied by the guest (to prevent replay attacks) — to decide whether to continue booting.
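On AMD SEV-SNP guests, requesting that signed evidence boils down to a single ioctl() on /dev/sev-guest. The sketch below follows the structures in the kernel's include/uapi/linux/sev-guest.h (newer kernels may extend them) and places a caller-chosen nonce in the request's user_data field so that it comes back embedded in the signed report.

    /* snp_report.c: request an SEV-SNP attestation report from /dev/sev-guest.
     * Field layouts follow include/uapi/linux/sev-guest.h; must be run inside
     * an SNP guest.  Build: gcc -o snp_report snp_report.c */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/sev-guest.h>

    int main(void)
    {
        struct snp_report_req req = { 0 };
        struct snp_report_resp resp = { 0 };
        struct snp_guest_request_ioctl arg = {
            .msg_version = 1,
            .req_data = (__u64)&req,
            .resp_data = (__u64)&resp,
        };
        int fd;

        /* The nonce is echoed back inside the signed report, which lets the
         * verifier detect replayed evidence. */
        memcpy(req.user_data, "nonce-chosen-by-the-verifier", 28);

        fd = open("/dev/sev-guest", O_RDWR);
        if (fd < 0) {
            perror("open /dev/sev-guest");
            return 1;
        }

        if (ioctl(fd, SNP_GET_REPORT, &arg) < 0) {
            perror("SNP_GET_REPORT");
            close(fd);
            return 1;
        }

        /* resp.data now holds the PSP-signed report, ready to be sent to a
         * remote verifier along with the certificate chain. */
        printf("received a %zu-byte report buffer\n", sizeof(resp.data));
        close(fd);
        return 0;
    }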
Secure Boot helps prevent malicious code from executing early in the boot sequence, but it can also increase boot time by a few seconds. Adding confidential computing to the equation slows things down even more. For most Linux users, the slight delay of Secure Boot is negligible and well worth the security benefits. But, in cloud environments, even a few extra seconds of guest boot time can be consequential — small delays quickly add up at fleet scale. Since the cloud runs on Linux, it's important for cloud providers to focus on optimizing this process within it.
To complicate things even more, there are different flavors of confidential computing. For example, instead of using an SVSM, Microsoft's Linux Virtualization-Based Security (LVBS) opts for a paravisor, as shown in the figure below. In LVBS, the paravisor is a small Linux kernel that runs in a special hardware mode (e.g. VTL0) after the bootloader. This design has the advantage of being vendor-neutral, but also has drawbacks, such as a significantly larger attack surface than the SVSM. Even though there are many ways to implement confidential VMs in Linux, we still lack a clear, shared understanding of the trade-offs between them.
Once the confidential VM is booted, two major sources of runtime overhead are DRAM encryption and decryption, and the hardware enforcement of memory-access permissions. That said, because both happen inline within the memory controller, the delay is usually small, although the impact can vary depending on the workload, particularly for cache-sensitive applications.
A separate, more significant performance hit comes from the process of accepting memory pages. Before a confidential VM can access DRAM, each page must be explicitly accepted by the guest. This step binds the guest physical address (gPA) of the page to a system physical address (sPA), preventing remapping — that is, once validated, the hardware enforces this mapping, and any attempt by the hypervisor to remap the gPA to a different sPA via nested page tables will trigger a page fault (#PF). The validation process is slow and requires the guest kernel to spend virtual-CPU cycles issuing hypercalls and causing VMEXITs, since it cannot directly execute privileged instructions like PVALIDATE on x86 processors. Only components running in special hardware modes — such as the SVSM at VMPL0 — can call them directly. To avoid this cost at runtime, the SVSM (or whatever component is used) should pre-accept all memory pages early during the boot process.
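For the curious, the guest-side primitive itself is tiny. The wrapper below is modeled on the kernel's pvalidate() helper in arch/x86; it compiles anywhere, but the instruction only does something useful in a context that is allowed to issue it, such as an SVSM running at VMPL0. What makes acceptance expensive is not the instruction, but doing it once per page plus the hypercall round trips when the guest kernel has to ask another component to do it.

    /* pvalidate.c: sketch of the page-acceptance primitive, modeled on the
     * kernel's pvalidate() helper in arch/x86/include/asm/sev.h. */

    /* page_size: 0 for a 4KB page, 1 for a 2MB page.
     * validate:  1 to accept (validate) the page, 0 to rescind it.
     * Returns the PVALIDATE status code from EAX (0 means success); the
     * kernel helper additionally checks the carry flag to detect the case
     * where no RMP change was made. */
    int pvalidate(unsigned long vaddr, int page_size, int validate)
    {
        int rc;

        /* 0xF2 0x0F 0x01 0xFF encodes PVALIDATE; inputs are RAX = address,
         * ECX = page size, EDX = validate flag. */
        asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF"
                     : "=a"(rc)
                     : "a"(vaddr), "c"(page_size), "d"(validate)
                     : "memory", "cc");
        return rc;
    }

    /* Pre-accepting a whole range at boot amortizes the cost, so the guest
     * does not pay for acceptance (and the associated exits) at runtime. */
    void accept_range(unsigned long start, unsigned long end)
    {
        for (unsigned long addr = start; addr < end; addr += 4096)
            pvalidate(addr, 0, 1);
    }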
Scaling
Fleet scalability — meaning how many guest VMs can be created — is also impacted by confidential computing. The most significant hardware limitations come from architectural constraints: for example, the number of available address-space identifiers (ASIDs). Each confidential VM requires a unique ASID in order to be tagged and isolated; without a unique ASID, the hardware cannot differentiate between encrypted memory regions belonging to different VMs. The maximum number of ASIDs that Linux can use is typically capped by the BIOS and limited to a few hundred. That might seem enough, but modern multicore processors can have hundreds of cores, each hosting one or even two virtual CPUs with simultaneous multithreading. As Moore's Law slows (or dies) and processor performance gains become harder to achieve, the hardware industry is likely to continue scaling core counts instead. Thus, without scalable support in Linux for confidential VMs, the cloud risks underutilizing cores.
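That ceiling is easy to query on AMD hardware: CPUID leaf 0x8000001F reports, among other things, how many encrypted guests the processor supports simultaneously. The sketch below uses GCC's cpuid.h helper to print the numbers a hypervisor has to budget against.

    /* sev_asids.c: query AMD's CPUID leaf 0x8000001F for SEV ASID limits.
     * Build on an AMD machine: gcc -o sev_asids sev_asids.c */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid_count(0x8000001f, 0, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 0x8000001f not available\n");
            return 1;
        }

        /* Per the AMD manual: ECX is the number of encrypted guests supported
         * simultaneously, and EDX is the minimum ASID usable by an SEV guest
         * without SEV-ES (lower ASIDs are reserved for SEV-ES/SNP guests). */
        printf("SEV supported:       %s\n", (eax & (1u << 1)) ? "yes" : "no");
        printf("simultaneous guests: %u\n", ecx);
        printf("min SEV-only ASID:   %u\n", edx);
        return 0;
    }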
A possible solution to the hardware scalability problems would be hybrid systems, where Linux could run both confidential and conventional VMs side by side. Today, kernel-configuration options enforce an all-or-nothing approach — either the system hosts only encrypted VMs or it hosts no encrypted VMs. Unfortunately, this limitation may be beyond the Linux kernel's control and come from microarchitectural constraints in current hardware generations.
In confidential VMs, swap memory needs to be encrypted to preserve the confidentiality of data even when moved to disk. Likewise, when the VMs communicate over the network — particularly through host-managed NICs — they must establish secure end-to-end sessions to maintain data integrity and confidentiality across untrusted host networks. Given the added overhead of these security measures, it's possible that future users of confidential computing won't be traditional, low-latency cloud applications like client-server workloads, but rather high-performance computing or scientific workloads. While these batch-oriented applications may still experience some performance impact, they generally have a higher tolerance for latency — not because they are inherently less sensitive to it, but because they lack realtime human interaction (e.g., there are no users sitting in front of a browser waiting for a reply).
Live migration is another important aspect of the cloud, allowing VMs to move between hosts (such as during maintenance in specific regions of the fleet) with minimal impact on the VMs — ideally without a noticeable disruption, as IP addresses can be preserved using virtual LAN technologies like VXLAN. However, after migration, the attestation process must be repeated on the destination node. While pre-attesting a destination node (as a plan B option) can help reduce overhead, unexpected emergencies in the fleet may force the VM to migrate again shortly after arrival. Worse still, because the guest VM no longer implicitly trusts the host, it must also verify that its memory and execution context were correctly preserved during migration, and that any changes were properly tracked throughout the live migration. To facilitate all of this, a migration agent running in a separate confidential VM can help coordinate and secure live migration.
In conclusion
Hardware offloading has always implied a tradeoff in virtualization: it improves I/O performance but weakens security. Thanks to confidential computing, Linux can now achieve the former without sacrificing the latter. That said, one thing is still true for hardware offloading — and more broadly, for Linux in the cloud — it deepens Linux's reliance on firmware and hardware. In that sense, trust doesn't grow or shrink, it simply shifts. In this case, it shifts toward OEMs (hardware and device manufacturers).
But what happens if (or when) an attacker exploits vulnerabilities or backdoors in hardware or firmware? Unlike software, hardware is difficult to verify, leaving open the risk of hidden compromises that can undermine the entire security model. Open architectures like RISC-V may offer a solution with hardware designs that can be inspected and audited. This speaks to the security value of transparency and openness — ultimately the only way to eliminate the need to trust third parties.
Cloud providers are already expected to respect user privacy, but confidential computing turns that promise into more than just a leap of faith taken in someone else's computer. That shift puts the guest Linux kernel in an awkward spot. Cooperation with the host can be genuinely useful — say, synchronizing schedulers to make the most of NUMA layouts, or avoiding guest deadlocks. But the host is also, unavoidably, untrusted.
This means that Linux finds itself trying to work with something it's supposed to be protected from. As a consequence, a lot has to change in the Linux cloud stack to truly accommodate cloud confidential computing. Is this a worthwhile investment for the overall kernel community? As the foundation of the modern public cloud, Linux is in a good position to explore the potential of confidential VMs.