Rethinking the Linux cloud stack for confidential VMs

There is an inherent limit to the privacy of the public cloud. While Linux can isolate virtual machines (VMs) from each other, nothing in the system's memory is ultimately out of reach for the host cloud provider. To accommodate the most privacy-conscious clients, confidential computing protects the memory of guests, even from hypervisors. But the Linux cloud stack needs to be rethought in order to host confidential VMs, juggling two goals that are often at odds: performance and security.

Isolation is one of the most effective ways to secure the system by containing the impact of buggy or compromised software components. That's good news for the cloud, which is built around virtualization — a design that fundamentally isolates resources within virtual machines. This is achieved through a combination of hardware-assisted virtualization, system-level orchestration (like KVM, the hypervisor integrated into the kernel), and higher-level user-space encapsulation.
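The kernel side of that stack is exposed to user space through the KVM API. As a rough, minimal sketch (not how a full hypervisor such as QEMU does it), the C program below opens /dev/kvm, checks the API version, and creates an empty VM:

    /* Minimal sketch of user space talking to the in-kernel KVM hypervisor:
     * open /dev/kvm, check the API version, and create an (empty) VM.
     * Error handling is reduced to the bare minimum. */
    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        if (kvm < 0) {
            perror("open /dev/kvm");
            return 1;
        }

        int version = ioctl(kvm, KVM_GET_API_VERSION, 0);
        if (version != KVM_API_VERSION) {
            fprintf(stderr, "unexpected KVM API version: %d\n", version);
            return 1;
        }

        /* Each VM is represented by a file descriptor; vCPUs and guest
         * memory would be added through further ioctl() calls on it. */
        int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);
        if (vmfd < 0) {
            perror("KVM_CREATE_VM");
            return 1;
        }

        printf("created VM fd %d with KVM API version %d\n", vmfd, version);
        close(vmfd);
        close(kvm);
        return 0;
    }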

On the hardware side, mechanisms such as per-architecture privilege levels (e.g., rings 0-3 in x86_64 or Exception Levels on ARM) and the I/O Memory Management Unit (IOMMU) provide isolation. Hypervisors extend this by handling the execution context of VMs to enforce separation even on shared physical resources. At the user-space level, control groups limit the resources (CPU, memory, I/O) available to processes, while namespaces isolate different aspects of the system, such as the process tree, network stack, mount points, MAC addresses, etc. Confidential computing adds a new layer of isolation, protecting guests even from potentially compromised hosts.
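As a minimal sketch of the user-space isolation mentioned above, the program below unshares the UTS namespace (chosen simply because it is the easiest one to demonstrate) and changes the hostname without affecting the rest of the system; it needs CAP_SYS_ADMIN, so run it as root or inside a user namespace:

    /* Sketch: give the calling process its own UTS namespace, so that a
     * hostname change is invisible to the rest of the system. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        if (unshare(CLONE_NEWUTS) < 0) {
            perror("unshare(CLONE_NEWUTS)");
            return 1;
        }

        const char *name = "isolated-guest";
        if (sethostname(name, strlen(name)) < 0) {
            perror("sethostname");
            return 1;
        }

        char buf[64];
        gethostname(buf, sizeof(buf));
        printf("hostname inside the namespace: %s\n", buf);
        /* The original namespace, and every other process on the
         * system, still sees the old hostname. */
        return 0;
    }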

In parallel to the work on security, there is a constant effort to improve the performance of Linux in the cloud — both in terms of raw throughput and in user experience (typically measured by quality-of-service metrics such as low I/O tail latency). Knowing that there is room to improve, cloud providers increasingly turn to I/O passthrough to speed things up: bypassing the host kernel (and sometimes the guest kernel) to expose physical devices directly to guest VMs. This can be done with user-space libraries like the Data Plane Development Kit (DPDK), which bypasses the guest kernel, or with hardware-offload features such as virtio Data Path Acceleration (vDPA), which allow paravirtualized drivers to send packets straight to the smartNIC hardware.
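To give a rough idea of what kernel bypass looks like in practice, the fragment below sketches the core of a DPDK poll-mode loop. It is illustrative rather than a working application: the port number is a placeholder, and the device and queue setup (memory pools, rte_eth_dev_configure() and friends) is omitted.

    /* Sketch of a DPDK poll-mode receive/transmit loop; port and queue
     * setup is omitted for brevity, so this fragment is illustrative only. */
    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    static void poll_port(uint16_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Packets are pulled straight from the NIC's RX ring in user
             * space; the guest kernel's network stack is never involved. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
            if (nb_rx == 0)
                continue;

            /* ... application-level processing would happen here ... */

            uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, bufs, nb_rx);

            /* Free any packets the NIC could not queue for transmit. */
            for (uint16_t i = nb_tx; i < nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);
        }
    }

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)
            return 1;
        /* A real application would configure port 0 and its queues here
         * before polling; that setup is omitted from this sketch. */
        poll_port(0);
        return 0;
    }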

But hardware offloading exemplifies a fundamental friction in virtualization, where security and performance often pull in opposite directions. While it is true that offloading provides a faster path for network traffic, it has some downsides, such as limiting visibility and auditing, increasing reliance on hardware and firmware, and circumventing OS-based security checks of flows and data. The uncomfortable reality is that it's tricky for Linux to provide fast access to resources while concurrently enforcing the strict separation required to secure workloads. As it happens, the strongest isolation isn't the most performant.

A potential solution to this tension is extending confidential computing to the devices themselves by making them part of the VM's circle of trust. Hardware technologies like AMD's SEV Trusted I/O (SEV-TIO) allow a confidential VM to cryptographically verify (and attest to) a device's identity and configuration. Once trust is established, the guest can interact with the device and share secrets by allowing direct memory access (DMA) to its private memory, which is encrypted with its confidential-VM key. This avoids bounce buffers: the temporary copies through shared memory that are needed whenever a device (a GPU training an AI model, for example) must access plaintext data, and which significantly slow down I/O operations.
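To illustrate that cost, the sketch below simulates a bounce-buffered transfer in C. The helper functions are hypothetical stand-ins (backed here by malloc() and memset()), not a real kernel or driver API; in Linux that role is played by the swiotlb layer.

    /* Conceptual sketch of what a bounce buffer costs a confidential VM.
     * The helpers are hypothetical stand-ins, simulated with malloc()
     * and memset(), not a real kernel or driver interface. */
    #include <stdlib.h>
    #include <string.h>

    /* Shared (unencrypted) memory that a device is allowed to DMA into. */
    static void *alloc_shared_dma_buffer(size_t len) { return malloc(len); }
    static void free_shared_dma_buffer(void *buf)    { free(buf); }

    /* Stand-in for the device filling the buffer over DMA. */
    static void device_dma_read(void *buf, size_t len) { memset(buf, 0xab, len); }

    /* Without trusted I/O, the device cannot touch the guest's encrypted
     * private memory, so every transfer pays for an extra copy. */
    static void dma_into_private_memory(void *private_dst, size_t len)
    {
        void *bounce = alloc_shared_dma_buffer(len);

        device_dma_read(bounce, len);      /* DMA lands in shared memory */
        memcpy(private_dst, bounce, len);  /* extra copy into private,
                                              encrypted guest memory */
        free_shared_dma_buffer(bounce);
    }

    int main(void)
    {
        char private_page[4096];           /* pretend this is private memory */

        dma_into_private_memory(private_page, sizeof(private_page));
        /* With SEV-TIO/TDISP, a trusted device could DMA straight into
         * private_page, and both the copy and the shared buffer go away. */
        return 0;
    }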

The TEE Device Interface Security Protocol (TDISP), an industry standard published by the PCI-SIG, defines how a confidential VM and a device establish mutual trust, secure their communications, and manage interface attachment and detachment. A common way to implement TDISP is to use a device with single root I/O virtualization (SR-IOV) support — a PCIe feature that lets a physical device expose multiple virtual devices.
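On Linux, those virtual functions are typically created through the physical function's sysfs interface; the sketch below does that write from C, with the PCI address and VF count chosen as placeholders. It needs root privileges and a device whose driver supports SR-IOV.

    /* Sketch: create SR-IOV virtual functions by writing to the physical
     * function's sriov_numvfs attribute in sysfs.  The PCI address is a
     * placeholder. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path =
            "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs"; /* placeholder */
        const char *num_vfs = "4";

        int fd = open(path, O_WRONLY);
        if (fd < 0) {
            perror("open sriov_numvfs");
            return 1;
        }

        /* The PF driver reacts to this write by spawning the requested
         * number of VFs, each of which can then be assigned to a guest. */
        if (write(fd, num_vfs, strlen(num_vfs)) < 0) {
            perror("write sriov_numvfs");
            close(fd);
            return 1;
        }

        close(fd);
        printf("requested %s VFs on %s\n", num_vfs, path);
        return 0;
    }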

In these SR-IOV-based setups, the host driver manages the physical device, and each virtual device assigned to a guest VM acts as a separate TEE device interface. Unfortunately, TDISP requires changes across the entire stack, including the device's firmware and hardware, the host CPU, and the hypervisor. TDISP also faces headwinds because not all vendors are on board. Interestingly, NVIDIA, one of the biggest players in the GPU arena, sells GPUs with its own non-TDISP architecture.
