Around 20% of Firefox’s HTTP traffic today uses HTTP/3, which runs over QUIC, which in turn runs over UDP. This translates to substantial UDP I/O activity.

Firefox uses NSPR for most of its network I/O. When it comes to UDP I/O, NSPR only offers a limited set of dated APIs, most relevant here PR_SendTo and PR_RecvFrom, wrappers around POSIX’s sendto and recvfrom. The N in NSPR stands for Netscape, giving you a hint of its age.

Operating systems have evolved since. Many offer multi-message APIs like sendmmsg and recvmmsg. Some offer segmentation offloading like GSO (Generic Segmentation Offload) and GRO (Generic Receive Offload). Each of these promises significant performance improvements for UDP I/O.

Can Firefox benefit from replacing its aging UDP I/O stack with modern system calls?

This project began in mid-2024 with the goal of rewriting Firefox’s QUIC UDP I/O stack using modern system calls across all supported operating systems. Beyond performance improvements, we wanted to increase security by using a memory-safe language to do UDP I/O. Firefox’s QUIC state machine itself is already implemented in Rust. We therefore chose Rust for this project as well, giving us both increased security and easy integration with the existing QUIC codebase.

Instead of starting from scratch, we built on top of quinn-udp, the UDP I/O library of the Quinn project, a QUIC implementation in Rust. This sped up our development efforts significantly. A big thank you to the Quinn project.

Operating system calls are complex, with a myriad of idiosyncrasies, especially across versions. Firefox is multi-platform, focusing on Windows, Android, MacOS and Linux as tier 1. The main complexity, though, stems from Firefox supporting ancient versions of each of them, e.g. Android 5.

One year later, i.e. in mid-2025, this project is now rolling out to the majority of Firefox users. Performance benchmark results are promising.
In extreme cases, on purely CPU-bound benchmarks, we’re seeing a jump from less than 1 Gbit/s to 4 Gbit/s. Looking at CPU flamegraphs, the majority of CPU time is now spent in I/O system calls and cryptography code.

Below are the many improvements we were able to land, plus the ones we weren’t. I hope other projects in need of fast UDP I/O can benefit from our work. To make their lives easier, I am documenting the many lessons we learned below.

The basics

To understand the improvements, it’s helpful to first examine how UDP I/O traditionally works and how modern optimizations change this picture.

Single datagram

Previously Firefox would send (and receive) single UDP datagrams to (and from) the OS via the sendto (and recvfrom) system call family. The OS would send (and receive) that UDP datagram to (and from) the network interface card (NIC). The NIC would send (and receive) it to (and from) the Internet.

Thus each datagram would require leaving user space, which is cheap for one UDP datagram but expensive when sending at, say, 500 Mbit/s. In addition, all user space and kernel space overhead that is independent of the number of bytes sent and received is paid per datagram, i.e. once for every 1500 bytes or less.

+----------------------+
|       Firefox        |
|    +-----------+     |
|    |   QUIC    |     |
|    +-----------+     |
+----------------------+
            |
      [ datagram ]
            |
  === User / Kernel ===
            |
      [ datagram ]
            |
+----------------------+
|          OS          |
+----------------------+
            |
      [ datagram ]
            |
+----------------------+
|         NIC          |
+----------------------+
            |
      [ datagram ]
            |
+----------------------+
|       Internet       |
+----------------------+

Batch of datagrams

Instead of sending a single datagram at a time, some operating systems nowadays offer multi-message system call families, e.g. sendmmsg and recvmmsg on Linux. The idea is simple: send and receive multiple UDP datagrams at once, and save on the costs that are independent of the number of bytes sent and received.
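
To put the per-datagram cost into numbers, here is a rough back-of-the-envelope sketch in Rust. It only does arithmetic and does not touch any socket API. The 500 Mbit/s rate and the 1500-byte datagram size come from the text above; the batch size of 32 and the function names are assumptions chosen purely for illustration.

```rust
/// Datagrams per second needed to sustain `bits_per_sec`
/// when every datagram carries `datagram_bytes` bytes (rounded down).
fn datagrams_per_sec(bits_per_sec: u64, datagram_bytes: u64) -> u64 {
    bits_per_sec / (datagram_bytes * 8)
}

/// System calls per second when each call carries up to `batch` datagrams
/// (rounded up, since a partially filled batch still costs one call).
fn syscalls_per_sec(datagrams: u64, batch: u64) -> u64 {
    (datagrams + batch - 1) / batch
}

fn main() {
    let datagrams = datagrams_per_sec(500_000_000, 1500);
    // One datagram per system call, as with sendto.
    println!("single: {} syscalls/s", syscalls_per_sec(datagrams, 1));
    // Hypothetical batch of 32 datagrams per system call.
    println!("batched: {} syscalls/s", syscalls_per_sec(datagrams, 32));
}
```

Under these assumptions, the number of user/kernel transitions shrinks by roughly the batch size, while the number of bytes on the wire stays the same.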
  +--------------------------+
  |         Firefox          |
  |      +-----------+       |
  |      |   QUIC    |       |
  |      +-----------+       |
  +--------------------------+
                |
[ datagram, datagram, datagram ]
                |
    ===== User / Kernel =====
                |
[ datagram, datagram, datagram ]
                |
  +--------------------------+
  |            OS            |
  +--------------------------+
                |
[ datagram, datagram, datagram ]
                |
  +--------------------------+
  |           NIC            |
  +--------------------------+
                |
[ datagram, datagram, datagram ]
                |
  +--------------------------+
  |         Internet         |
  +--------------------------+

Single large segmented datagram

Some modern operating systems and network interface cards also support system call families with UDP segmentation offloading, e.g. GSO and GRO on Linux. Instead of sending multiple UDP datagrams in a batch, segmentation offloading enables the application to send a single large UDP datagram, i.e. one larger than the Maximum Transmission Unit, to the kernel. Next, either the kernel or, ideally, the network interface card will segment it into multiple smaller packets, add a header to each, and calculate the UDP checksum. The reverse happens on the receive path, where multiple incoming packets can be coalesced into a single large UDP datagram that is delivered to the application all at once.

+------------------------------+
|           Firefox            |
|        +-----------+         |
|        |   QUIC    |         |
|        +-----------+         |
+------------------------------+
                |
  [ large segmented datagram ]
                |
   ====== User / Kernel ======
                |
  [ large segmented datagram ]
                |
+------------------------------+
|              OS              |
+------------------------------+
                |
  [ large segmented datagram ]
                |
+------------------------------+
|             NIC              |
+------------------------------+
                |
[ datagram, datagram, datagram ]
                |
+------------------------------+
|           Internet           |
+------------------------------+

Note: Unfortunately, Wireshark does not yet support GSO, making network-level debugging more challenging when these optimizations are active.
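
On the receive path, the application has to undo the coalescing itself. The following is a minimal sketch of that step, assuming the OS reports a segment size alongside the received buffer. The function name split_segments and the Option-based signature are hypothetical and are not quinn-udp’s API.

```rust
/// Split a receive buffer into individual datagrams.
///
/// `segment_size` is the size the OS reported for a coalesced buffer.
/// `None` means the buffer holds exactly one ordinary datagram.
/// All segments have `segment_size` bytes, except the last, which may be shorter.
fn split_segments(buf: &[u8], segment_size: Option<usize>) -> Vec<&[u8]> {
    match segment_size {
        Some(size) if size > 0 => buf.chunks(size).collect(),
        _ => vec![buf],
    }
}

fn main() {
    // Hypothetical coalesced buffer: two full 1200-byte segments plus a 1100-byte tail.
    let buf = vec![0u8; 3500];
    for (i, datagram) in split_segments(&buf, Some(1200)).iter().enumerate() {
        println!("datagram {i}: {} bytes", datagram.len());
    }
    // Without a segment size, the whole buffer is treated as one datagram.
    println!("{} datagram(s)", split_segments(&buf, None).len());
}
```

The sketch also shows why the segment size matters: without it, the receiver cannot tell a single datagram from several coalesced ones.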
For performance analysis of these different approaches, Cloudflare’s comprehensive study provides excellent benchmarks and detailed explanations.

Replacing NSPR in Firefox

Batching and segmentation offloading aside for now, the first step in the project was to replace usage of NSPR with quinn-udp, still sending and receiving one UDP datagram at a time. We updated the Mozilla QUIC client and server test implementation, then integrated quinn-udp into Firefox itself.

Next we rewrote the UDP datagram processing pipeline in the Mozilla QUIC implementation to send and receive batches of datagrams. This is done in such a way that we can leverage both the multi-message style system calls and the segmentation offloading style, if available. We added this along with various other I/O improvements, e.g. Lars added in-place en-/decryption. Going into detail here is better done in a separate blog post. Let’s focus on UDP I/O here.

So far so good. This was the easy part. Up next, the edge cases by platform.

Platform details

Windows offers WSASendMsg and WSARecvMsg to send and receive a single UDP datagram. That UDP datagram can either be a classic MTU-size datagram or a large segmented datagram. For the latter, what Linux calls GSO and GRO, Windows calls USO and URO.

As described above, we started off rolling out quinn-udp using single-datagram system calls only. This went without issues on Windows. Next we tested WSARecvMsg with URO, i.e. receiving a batch of inbound datagrams as a single large segmented datagram, but got the following bug report:

fosstodon.org doesn’t load with network.http.http3.use_nspr_for_io=false on ARM64 Windows

fosstodon is a Mastodon server. It is hosted behind the CDN provider Fastly. Fastly is a heavy user of Linux’s GSO, i.e. it sends larger UDP datagram trains, perfect to be coalesced into a single large segmented UDP datagram when Firefox receives them. Why would Windows’ URO prevent Firefox from loading the site?
After many hours of back and forth with the reporter, luckily a Mozilla employee as well, I ended up buying the exact same laptop, same color, in a desperate attempt to reproduce the issue. Without much luck at first, I eventually needed a Linux command line tool, thus installed WSL, and to my surprise, that triggered the bug (reproducer).

It turns out that on Windows on ARM, with WSL enabled, a WSARecvMsg call with URO would not return a segment size, so Firefox was unable to differentiate a single datagram from a single segmented datagram. QUIC short header packets don’t carry a length, so there is no way to tell where one QUIC packet ends and another starts, leading to the above page load failures.

We have been in touch with Microsoft since. No progress thus far. Therefore we are keeping URO on Windows disabled in Firefox for now.

After URO we started using WSASendMsg with USO, i.e. sending a single large segmented datagram per system call. But this, too, we rolled back quickly after seeing increased packet loss on Firefox Windows installations. In addition, we have at least one report of a user seeing their network driver crash due to Firefox’s usage of USO. More debugging is needed.

The transition on MacOS from NSPR to quinn-udp for HTTP/3 QUIC UDP I/O involved switching from the system calls sendto and recvfrom to the system calls sendmsg and recvmsg. As with Windows, there were no issues on this first step, ignoring one report where MacOS 10.15 might be seeing IP packets other than v4 and v6 (fixed since).

Unfortunately MacOS does not offer UDP segmentation offloading, neither on the send nor on the receive side. What it does offer, though, are two undocumented system calls, namely sendmsg_x and recvmsg_x, allowing a user to send and receive batches of UDP datagrams at once. Lars from Mozilla added them to quinn-udp, exposed behind the fast-apple-datapath feature flag, off by default.
After multiple iterations with smaller bugfixes (#2154, #2214, #2216 …) we decided not to ship it to users, since we don’t know how MacOS would behave if Apple ever removed these system calls while Firefox was still calling them.

Linux provides the most comprehensive and mature UDP optimization support, offering both multi-message APIs (sendmmsg / recvmmsg) and segmentation offloading (GSO/GRO). The quinn-udp library makes a deliberate choice to prioritize GSO over sendmmsg for transmission, as GSO typically provides superior performance, with diminishing returns when both techniques are combined.

Thus far, this has proven the right choice for Firefox as well. In addition to segmentation offloading being superior in the first place, Firefox uses one UDP socket per connection in order to improve privacy. As each socket gets its own source port, it is harder to correlate connections. Why is this relevant here? GSO (and GRO) can only segment (and coalesce) datagrams from the same 4-tuple (src IP, src port, dst IP, dst port), whereas sendmmsg and recvmmsg can send and receive across 4-tuples. Given that Firefox uses one socket per connection, it cannot make use of that distinct benefit of sendmmsg (and recvmmsg), making segmentation offloading yet again the obvious choice for Firefox.

Ignoring minor changes required to Firefox’s optional network sandboxing, and an additional runtime GSO support check, replacing Firefox’s QUIC UDP I/O stack on Linux has been without issues, and Firefox on Linux now enjoys all the benefits of segmentation offloading.

During the time of this project I learned quickly that (a) Android is not Linux and (b) Firefox still supports Android 5, …, on x86 (32 bit).

On x86, Android dispatches advanced socket calls through the socketcall system call instead of calling e.g. sendmsg directly. In addition, Android has various default seccomp filters, crashing an app when e.g. not going through the required socketcall system call.
The combination of the two cost me a couple of days, resulting in this (basically single line) change in quinn-udp.

On Android API level 25 and below, calling sendmsg with an ECN bit set results in an EINVAL error. quinn-udp will now simply retry on EINVAL, disabling various optional settings (e.g. ECN) on the second attempt.

A great benefit of building on the Quinn community’s work is that Firefox benefits from any improvements made to quinn-udp. One example is this excellent find by Thomas, where Android in some cases would complain if we did a GSO send with a single segment only.

Explicit congestion notifications (ECN)

With Firefox using modern system calls across all major operating systems, a nice additional benefit is the ability to send and receive ancillary data like IP ECN. This too came with some minor surprises, but QUIC ECN in Firefox is well on its way now. Firefox Nightly telemetry shows around 50% of all QUIC connections running on ECN outbound capable paths. With L4S and thus ECN becoming more and more relevant in today’s Internet, this is a great step forward.
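
For readers unfamiliar with what the ECN ancillary data carries: the ECN codepoint is the lowest two bits of the IPv4 TOS byte (or the IPv6 traffic class), as defined in RFC 3168. Below is a minimal sketch of reading and writing those bits in Rust. The type and function names are hypothetical and do not mirror the types used in quinn-udp or Firefox.

```rust
/// ECN codepoints as defined in RFC 3168.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ecn {
    NotEct, // 0b00: not ECN-capable transport
    Ect1,   // 0b01: ECN-capable transport, codepoint 1
    Ect0,   // 0b10: ECN-capable transport, codepoint 0
    Ce,     // 0b11: congestion experienced
}

/// Extract the ECN codepoint from a TOS / traffic class byte.
fn ecn_from_tos(tos: u8) -> Ecn {
    match tos & 0b11 {
        0b00 => Ecn::NotEct,
        0b01 => Ecn::Ect1,
        0b10 => Ecn::Ect0,
        _ => Ecn::Ce,
    }
}

/// Return `tos` with its ECN bits replaced, leaving the upper six (DSCP) bits untouched.
fn tos_with_ecn(tos: u8, ecn: Ecn) -> u8 {
    let bits = match ecn {
        Ecn::NotEct => 0b00,
        Ecn::Ect1 => 0b01,
        Ecn::Ect0 => 0b10,
        Ecn::Ce => 0b11,
    };
    (tos & !0b11) | bits
}

fn main() {
    // Mark an outgoing datagram as ECN-capable.
    let tos = tos_with_ecn(0x00, Ecn::Ect0);
    println!("outgoing TOS byte: {tos:#04x}");
    // A router on the path signalled congestion by rewriting the bits to CE.
    println!("received codepoint: {:?}", ecn_from_tos(0x03));
}
```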