The Limits of NTP Accuracy on Linux
Lately I’ve been trying to find (and understand) the limits of time syncing between Linux systems. How accurate can you get? What does it take to get that? And what things can easily add measurable amounts of time error?
After most of a month (!), I’m starting to understand things. This is kind of a follow-on to a previous post, where I walked through my setup and goals, plus another post where I discussed time syncing in general. I’m trying to get the clocks on a bunch of Linux systems on my network synced as closely as possible so I can trust the timestamps on distributed tracing records that occur on different systems. My local network round-trip times are in the 20–30 microsecond (μs) range and I’d like clocks to be less than 1 RTT apart from each other. Ideally, they’d be within 1 μs, but 10 μs is fine.
It’s easy to fire up Chrony against a local GPS-backed time source (technically GNSS, which covers multiple satellite-backed navigation systems, not just the US GPS system, but I’m going to keep saying “GPS” for short) and see it claim to be within X nanoseconds of GPS, but it’s tricky to figure out whether Chrony is right or not. Especially once it’s claiming to be more accurate than the network’s round-trip time (20 μs or so), the amount of time needed for a single CPU cache miss (50-ish nanoseconds), or even the amount of time that light takes to span the gap between the server and the time source (about 5 ns per meter).
I’ve spent way too much time over the past month digging into time, and specifically the limits of what you can accomplish with Linux, Chrony, and GPS. I’ll walk through all of that here eventually, but let me spoil the conclusion and give some limits:
GPSes don’t return perfect time. I routinely see up to 200 ns differences between the 3 GPSes on my desk when viewing their output on an oscilloscope. The time gap between the 3 sources varies every second, and it’s rare to see all three within 20 ns of each other. Even the best GPS timing modules that I’ve seen list ~5 ns of jitter on their datasheets. I’d be surprised if you could get 3-5 GPS receivers to agree within 50 ns or so without careful management of consistent antenna cable length, etc.
Even small amounts of network complexity can easily add 200-300 ns of systemic error to your measurements.
Different NICs and their drivers vary widely on how good they are for sub-microsecond timing. From what I’ve seen, Intel E810 NICs are great, Intel X710s are very good, Mellanox ConnectX-5 are okay, Mellanox ConnectX-3 and ConnectX-4 are borderline, and everything from Realtek is questionable.
A lot of Linux systems are terrible at low-latency work. There are a lot of causes for this, but one of the biggest is random “stalls” when the system’s firmware (via System Management Mode) takes over to handle power management or other housekeeping, “pausing” the observable computer for hundreds of microseconds or longer. In general, there’s no good way to know whether a given system (especially a cheap one) will be good or bad for timing without testing it. I have two cheap mini PC systems with inexplicably bad time-syncing behavior (1300–2000 ns of error) and two others with inexplicably good time syncing (20–50 ns). Dedicated server hardware is generally more consistent. (One way to check a box for these firmware stalls is sketched just below.)
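If you want to check whether a particular box suffers from these stalls, one rough test on Intel CPUs is to watch the SMI counter: turbostat reports it as an SMI column, or you can read the MSR directly with msr-tools. A minimal sketch, assuming the msr kernel module and rdmsr are available and you’re on an Intel system:

```
# Intel-only: MSR 0x34 (MSR_SMI_COUNT) counts System Management Interrupts since boot
sudo modprobe msr
before=$(sudo rdmsr -u 0x34)
sleep 10
after=$(sudo rdmsr -u 0x34)
echo "SMIs in the last 10 seconds: $((after - before))"
```

A machine that racks up SMIs while sitting idle is a reasonable suspect for the multi-hundred-microsecond pauses described above.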
All in all, I’m able to sync clocks to within 500 ns or so on the bulk of the systems on my network. That’s good enough for my purposes, but it’s not as good as I’d expected to see.
Now, it’s certainly possible to do better than this in specific cases. For example, see Chrony’s examples page, where they get <100 ns of error over the network for a specific test case. In general, though, it’s going to be hard to do much better than 200 ns consistently across a real network without a lot of careful engineering.
I’ll explain my conclusions in a bit, but first some background and context.
My Setup
For the sake of testing time, I’m using 8 different (but identical) servers as time clients and 5 different GPS-backed time sources, all local.
Relevant bits of my network for time testing. NTP sources are in blue circles, the servers tested are the purple rectangle, and network switches are orange or yellow rectangles.
Time Sources
ntp1: an older LeoNTP GPS-backed NTP server. In the garage, connected to its own outdoor GPS antenna. Only has a 10/100 Mbps Ethernet connection, but this hasn’t mattered in practice.
ntp2: identical hardware to ntp1. Sitting on my desk and connected to a different Ethernet switch. Connected to a GPS antenna splitter and an outdoor antenna.
My desktop: a 32-core AMD Threadripper 5975WX with a ConnectX-6 NIC (2x40 Gbps) for network traffic and an Intel E810-XXVDA4T (using 2x10 Gbps, one to each switch) with a GPS receiver and hardware timing support. Shares the antenna with ntp2, ntp4, and ntp5.
ntp4: a Raspberry Pi CM5 with a Timebeat GPS module including PPS timing straight to the NIC. Connected via 1 GbE. (Where is ntp3, you ask? I ran out of antenna ports, and anyway the system I dubbed ntp3 only supports PTP, not NTP.)
ntp5: a Raspberry Pi CM5 with a Waveshare GPS module with GPIO PPS but no working Ethernet PPS. Connected via 1 GbE.
Test Devices
Eight identical servers (d1 through d8) running Ubuntu 24.04 with identical Chrony configs. The servers are HPE M510 blades with 16 Xeon-D cores in a pair of HPE EL4000 enclosures. Each enclosure is connected to both of the core switches, giving each of the 8 servers 2 dedicated 10 GbE links via a built-in Mellanox ConnectX-3 NIC.
Chrony metrics are collected every 10 seconds and stored in Prometheus for analysis.
A Siglent SDS1204X-E oscilloscope connected to the PPS outputs from ntp2, ntp4, and my desktop. It can show relative differences in PPS times within about a nanosecond. The oscilloscope only has 200 MHz of bandwidth, but it captures 1 billion samples per second, so I’d expect it to resolve differences between PPS sources to somewhere between 1 and 5 nanoseconds. In any case, the observed differences are much larger than this; see below.
Network
The core of the network is a pair of Arista 7050QX-32S switches. These are 32-port 40 GbE switches with hardware support for PTP. They’re older, but very solid.
Linux systems with multiple network connections (the 8 test servers and my desktop) are connected to each core switch with a /30 link per interface and then run OSPF with ECMP to provide redundancy. Devices with a single network interface (the Raspberry Pis and LeoNTP devices in this example) are connected to layer 2 switches, which are in turn connected to the core switches via MLAG links. This means that there are multiple possible paths between any two devices through the network, as both ECMP and LAG use a mix of source and destination addresses to decide which link to use. So the path between d1 and ntp1 may be almost completely different from the path between d2 and ntp1, even though d1 and d2 are sitting less than an inch from each other and share all of the same physical network links. Even more entertaining, the path back from ntp1 to d1 and d2 may or may not be the same as the forward path. This only matters when nanosecond-level timings are involved, as we’ll see in a bit.
Sources of Error
So — finally — I have multiple NTP servers, presumably synced to GPS satellites as accurately as possible, and multiple servers, all synced to the NTP servers over a relatively low-latency network. How accurately are my servers syncing to GPS time? And where is that going wrong?
Chrony’s claims
So, if you’re trying to see how accurate Chrony’s time syncing is, the easiest place to start is with Chrony’s own metrics. In this case, Chrony claims that it’s had a median offset of 25–110 ns over the past day:
Chrony’s median offset over the past day.
Now, this isn’t the best metric for a number of reasons, but it’s a start. It says that Chrony thinks that it’s synced to within 110 ns of something, but it doesn’t really tell us anything about what it’s synced to or how accurate it actually is. So, let’s dig in a bit deeper.
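(As an aside: if you don’t have metrics collection wired up, chronyc gives a similar point-in-time view; the exact fields vary a bit between versions.)

```
# What chronyd currently thinks its own error is
chronyc tracking        # see "Last offset", "RMS offset", and "Root dispersion"
chronyc sourcestats -v  # per-source offset estimates and standard deviation
```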
GPS error and drift
First, the GPS receivers in my NTP time servers aren’t perfectly accurate. Even top-tier GPS receivers will still have ~5 ns of timing noise, and lower-tier ones will be 20–60 ns (or possibly higher). Datasheet links: the u-blox ZED-F9T in my desktop claims 5 ns of accuracy and 4 ns of jitter, the u-blox NEO-M8T in ntp5 (not graphed here) claims 20–500 ns of accuracy depending on the antenna, and the LeoNTP claims 30 ns of RMS accuracy and 60 ns of 99th-percentile accuracy.
Fortunately, this is relatively easy to measure, at least when the devices are within a few feet of each other. You can connect an oscilloscope to their PPS outputs and directly view the differences between them. Here’s the result for ntp2 , ntp4 , and my desktop:
Oscilloscope output. The Raspberry Pi/Timebeat Timecard Mini Essential is on top in yellow, then the LeoNTP in purple, and an Intel E810 on the bottom in blue. Animated; each update covers 1 second of real time.
Notice that (a) they don’t all agree and (b) they move around relative to each other. In this sample, there’s about a 200 ns difference between ntp4 (top, yellow) and my desktop (bottom, blue). Some of this is due to cable-length differences (my antenna and PPS leads aren’t all identical lengths, so there’s probably ~20 ns of difference from that alone), but that doesn’t explain all of it.
Even ignoring ntp4, there’s ~25 ns of variance between ntp2 (middle, purple) and my desktop (bottom, blue). Notice that they move relative to each other over time in a bit of a pattern.
In general, offsets can mostly be compensated for, either in Chrony or directly on the GPS device, but jitter is trickier.
Depending on how you look at things, I’m seeing a minimum of 25 ns of error at this level, and potentially up to 200ns.
When you give Chrony multiple time sources that are all equivalently good, then it’ll generally average its time across the whole set of sources. So adding one time source with 200 ns of offset to 2 other mostly-identical time sources should only add ~67 ns of error at most, and possibly no error at all, if Chrony decides that the 200 ns source is too far off to be used.
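You can watch those decisions directly with chronyc; the state column shows which sources are being combined and which have been thrown out:

```
chronyc sources -v
# '*' = the source chronyd is currently synced to, '+' = combined with it,
# '-' = acceptable but not combined, 'x' = rejected as a falseticker, '?' = unreachable or failing tests
```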
Network error
Chrony tries to compensate for network delays when it syncs to NTP sources over the network, but it has to make some assumptions that aren’t always true. It assumes that network delays are symmetrical (that is, if it takes 30 μs for network traffic to get from the client to the server and back, then it takes 15 μs each way). This isn’t strictly true, but for a lot of networks it’s close enough.
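For reference, this is the standard NTP on-wire calculation that Chrony (and everything else) uses; the symmetry assumption is baked into the offset formula:

```
# t1 = client transmit, t2 = server receive, t3 = server transmit, t4 = client receive
offset = ((t2 - t1) + (t3 - t4)) / 2   # only exact if both one-way delays are equal
delay  = (t4 - t1) - (t3 - t2)         # round trip, minus the server's processing time
# If the forward path is slower than the return path by d, the computed offset
# is wrong by d/2, and a single exchange gives the client no way to detect that.
```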
Apparently it’s not particularly true for my network.
One of the things that I’m monitoring with Chrony and Prometheus is the current offset for each time source on each Chrony client. I have data for my 8 test servers ( d1 through d8 ) tracking the relative offsets for ntp1 and ntp2 . I was expecting to see that either ntp1 or ntp2 was consistently ahead of the other one, given cable lengths, network delays, antenna differences, and so forth.
Instead, half of the servers see ntp1 as running faster, while half show ntp2 as running faster:
The relative time offsets for ntp1 vs ntp2 across d1 through d8 . Each line is one of the d* servers. Note that half of the servers see ntp1 as being ahead of ntp2 and half see the opposite.
Prometheus query for the graph:
quantile_over_time(
  0.5,
  chrony_sources_last_sample_offset_seconds{instance=~"${client}",source_address="10.1.0.238"}[1h]
)
- on (instance)
quantile_over_time(
  0.5,
  chrony_sources_last_sample_offset_seconds{instance=~"${client}",source_address="10.1.0.239"}[1h]
)
The servers can’t agree on whether ntp1 runs faster than ntp2 or not — 4 of the 8 see ntp1 as faster, while 4 see ntp2 as faster, with the servers in two bands around +100ns and -300ns. This has been consistent for weeks. To be clear, since half of the d* servers are in one enclosure and half are in another: the timing differences are basically random, and don’t follow which chassis they’re in or which network cables they use. Of the 4 physical servers in each enclosure, 2 think ntp1 is faster and 2 think ntp2 is faster, but which two aren’t even consistent between enclosures.
Presumably this is caused by asymmetric traffic paths in my network. If you look back to the network diagram above, you’ll see that the test servers each have a link to each core switch, and that the L2 switches that the NTP servers use are each connected to both core switches. Any time you have redundant links like this, something has to decide which path any given packet is going to take over the network. In general, network people really dislike picking paths at random, largely because packets could then arrive out of order, and a lot of TCP stacks hate out-of-order traffic. So, generally, traffic is assigned to a path using a hash of source and destination addresses. The exact implementation varies widely and is frequently configurable on higher-end devices. For L2 links, most devices just hash the source and destination MACs, while for L3 links the hash usually includes the source and destination IPs and may include the TCP/UDP port numbers or other easy-to-locate data.
Presumably one of the possible paths between servers and time sources on the network is faster than the others, and paths that hash onto the faster path consistently skew the results in one direction or the other. A less complex (and less redundant!) network would have less of this sort of error, but asymmetric round trip times show up everywhere in networking when you’re counting nanoseconds. At some level, this isn’t avoidable.
So, on my network, this seems to cause a minimum of 200ns of potential error, as various paths take different amounts of time, and Chrony isn’t able to compensate automatically. Chrony has a per-source setting for adjusting latency asymmetry, so I could probably hand-adjust all 16 ( d* -> ntp* ) config lines to minimize the error if I really cared about ~200 ns of error, but it’s unlikely that it’d buy me much useful accuracy.
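For the record, a hand-tuned version would look something like the sketch below on each d* server. The offset and asymmetry values here are placeholders for illustration, not numbers I’ve measured; check the chrony.conf man page for your version before copying this.

```
# Hypothetical per-source corrections in chrony.conf
server ntp1 xleave offset 0.000000100   # apply a fixed +100 ns correction to ntp1's measurements
server ntp2 xleave asymmetry 0.2        # tell chronyd the delay to ntp2 isn't split 50/50
```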
Cross-server synchronization
As an experiment, I told all 8 of my test servers to use each other as time sources. I added them using Chrony’s noselect flag, so they wouldn’t try to use each other as authoritative sources; they’d just monitor the relative offsets between servers and record them over time. I’m measuring time really aggressively between the d* servers: polling every 250 ms and averaging across 5 samples to try to minimize noise. Flags from chrony.conf: server xxx noselect xleave presend 9 minpoll -2 maxpoll -2 filter 5 extfield F323
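Written out as a full stanza, with my reading of what each flag does (taken from the chrony.conf documentation, so double-check against your version):

```
# Monitor-only peers: measured constantly, never used to steer the clock
server d2 noselect xleave presend 9 minpoll -2 maxpoll -2 filter 5 extfield F323
server d3 noselect xleave presend 9 minpoll -2 maxpoll -2 filter 5 extfield F323
# ... one line for each of the other d* servers ...
# noselect             - never select this source, just record offsets
# xleave               - interleaved mode, so more accurate (hardware/driver) transmit timestamps can be used
# minpoll/maxpoll -2   - poll every 2^-2 s, i.e. every 250 ms
# filter 5             - reduce each group of 5 samples to a single, less noisy measurement
# presend 9            - send a warm-up packet before measurements at long polling intervals
# extfield F323        - request chrony's experimental NTPv4 extension field
```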
Here’s the median offset between servers, in nanoseconds, over 4 hours. (The median isn’t really the best way to look at offsets in general, but Chrony maintains its own view of time and adjusts it slowly relative to its sources, so a few wildly inaccurate responses won’t change Chrony’s time much, if at all.)
        d1     d2     d3     d4     d5     d6     d7     d8
d1             83    -70   -138    -18   -161   -207   -132
d2    -145           -29    -75     -4   -138   -158    -29
d3      51    -23           -33     28    -31    -65    -74
d4      74    106    -42           106    -23     27     91
d5       5    -40    -66    -89           -49    -48      0
d6     153    173      0     62     63           -28     47
d7     190    173     36    -32     43     19            58
d8     131     -6     58    -64      5    -52    -47
Prometheus query for the chart (plus some work in Grafana to turn it into a table):
quantile_over_time(
  0.5,
  chrony_sources_last_sample_offset_seconds{instance=~"d[1-8].*",source_address=~"10.0.0.10[0-9]"}[5m]
)
Notice that they’re all within 207 ns of each other, but the timings aren’t particularly consistent. For instance, looking at the timings between d2 and d3 shows that they’re 29 ns apart when you query in one direction and 23 ns apart when you query in the other, but they’re both off in the same direction. If network error weren’t a factor, I’d expect one number to be positive and the other to be negative; that’s not always the case here.
In general, this aligns nicely with the 200-300ns of error seen in the previous section, but it shows that there’s a serious limit to how accurately Chrony can measure nanoseconds on this hardware.
Observed offsets across all sources
Earlier, I discussed the difference between ntp1 and ntp2 , and how each server had a different view of the difference between them. On average, ntp1 seems to run 50–150 ns ahead of ntp2 .
Remember that my big goal here is less about accurate time and more about consistent time. This 50–150 ns of inconsistency isn’t a big deal, but when I started adding additional time sources, I discovered that some of them were even further away from ntp1 and ntp2, and I wanted to minimize the total time spread. I’d really like it if adding additional NTP sources to the mix didn’t make things even less consistent.
There are a lot of things to like about the LeoNTP time servers, but configurability isn’t one of them. There’s no way that I can see to add an offset between GPS time and the NTP time that they export. On the other hand, the 3 Chrony-based time servers (my desktop, ntp4, and ntp5) can be adjusted to control the offset between GPS time and NTP time. And, in fact, you can’t really run with a 0-second offset, because GPS time is based on TAI while NTP time is usually UTC (strictly speaking, it’s usually a weird mutant that mostly ignores leap seconds; look up “leap smear” for the ugly mess), and the two are currently 37 seconds apart. Leap seconds make life hard.
Originally, I discovered that time from my desktop was around 1 μs off when compared with ntp1 and ntp2 by the d* servers, and time from the Raspberry Pi-based ntp4 was almost 38 μs off! To mitigate this, I graphed the average offset of each time source across all 8 servers and then adjusted the offsets on my desktop and ntp4 to land as close as possible to the median of ntp1 and ntp2. Concretely, I changed my desktop’s TAI offset from -37 to -36.999999160 and the offset of ntp4 to -37.000033910.
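On the Chrony-based servers, that correction ends up as the offset on the refclock line. A minimal sketch, assuming the PPS/GNSS signals come into Chrony as refclocks; the refclock types and device paths here are illustrative, not my actual configs:

```
# Desktop: PPS into the E810's PTP hardware clock, TAI-UTC offset nudged by ~840 ns
refclock PHC /dev/ptp1 poll 0 offset -36.999999160
# ntp4: PPS on the Pi (locked to a GPS/NMEA refclock not shown here), with the larger correction
refclock PPS /dev/pps0 lock GPS offset -37.000033910
```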
Now, all 4 sources are basically in unison:
Observed offsets over the past day.
Prometheus query for the graph:
avg by (source_address) (
  quantile_over_time(
    0.5,
    chrony_sources_last_sample_offset_seconds{instance=~"d[1-8].*", source_address=~"10[.].*", source_address=~"${ntpsource}"}[1h]
  )
)
Why were times so far off? For my desktop, it’s probably a mix of multipath weirdness and delay in the network stack. 840 ns isn’t a huge amount of time, although it’s bigger than what I’ve seen elsewhere.
I’m less sure what’s going on with ntp4. It was originally seeing over 50 μs of error, but reducing the Ethernet coalescing limits on eth0 (ethtool -C eth0 tx-usecs 0 rx-usecs 0) helped quite a bit. I’m going to have to keep poking at this for a while.
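For reference, the fuller sequence I’d run on a Pi like this; whether adaptive coalescing also needs to be disabled depends on the driver:

```
# See what the NIC is currently doing, then turn interrupt coalescing off
ethtool -c eth0
ethtool -C eth0 rx-usecs 0 tx-usecs 0
# Some drivers won't accept 0 until adaptive coalescing is off:
ethtool -C eth0 adaptive-rx off adaptive-tx off
```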
Observed NTP jitter across all sources
I can compare the jitter of 4 of my GPS time sources across all 8 d* servers. To calculate jitter in this case, I’m looking at the difference between the 1st and 99th percentile of each source’s offset from Chrony’s best estimate of the current time. I’m calculating the percentiles over 15 minute windows, subtracting the 1st percentile from the 99th, and then averaging those results across all 8 servers. It’s not the best way to do this statistically, but there’s a limit to what you can do with Prometheus easily.
Graph of jitter by source across all 8 d* servers.
Prometheus query for the graph:
(
  avg by (source_address) (
    quantile_over_time(
      0.99,
      chrony_sources_last_sample_offset_seconds{instance=~"d[1-8].*", source_address=~"10[.].*", source_address=~"${ntpsource}"}[15m]
    )
  )
)
-
(
  avg by (source_address) (
    quantile_over_time(
      0.01,
      chrony_sources_last_sample_offset_seconds{instance=~"d[1-8].*", source_address=~"10[.].*"}[15m]
    )
  )
)
Over the past hour, that works out to:
Time Source    Jitter
desktop        1.01 μs
ntp1           1.28 μs
ntp2           1.40 μs
ntp4           2.02 μs
So, my desktop (with a fast NIC and a very good GNSS module) has the least jitter. The two LeoNTP boxes are next, with a bit more, and the Raspberry Pi has 2x the jitter of my desktop. Since Chrony averages out offsets across sources and over time, jitter isn’t necessarily a big deal as long as it’s under control.
Which brings up ntp5 , which I’d excluded from the previous graph. Here’s why:
Graph of jitter by source across all 8 d* servers including ntp5 , which has accuracy issues every 2 hours.
I still haven’t figured out why this loses accuracy every 2 hours, but there are other weird things about ntp5 , so I’m not all that worried about it overall.
Things that hurt syncing
Along the way, I’ve found a bunch of things that hurt time syncing. A short list:
Network cards without hardware timestamps. Realtek, for instance. (You can check what a given NIC supports with ethtool; see the sketch after this list.)
Tunnels. I had 3 servers that were originally sending traffic to the network with ntp1 and ntp2 over VxLAN, and their time accuracy was terrible. I suspect that the NICs’ hardware timestamps weren’t propagated correctly through the tunnel decapsulation. Plus, it made network delays even less symmetrical.
NIC packet coalescing. On Raspberry Pi CM5s especially, I had to disable NIC coalescing via ethtool -C or I had terrible accuracy.
Software in general. I get the best results on NTP servers where the GPS’s PPS signal goes directly into the NIC’s hardware, bypassing as much software as possible.
Running ptp4l and Chrony on the same ConnectX-4 NIC, and potentially the same ConnectX-3 or -5 NICs. Intel NICs seem perfectly happy in the same situation.
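For the first item, the quickest way to find out what a NIC and driver can actually do is ethtool’s timestamping query:

```
# List the timestamping capabilities the driver advertises
ethtool -T eth0
# Look for hardware-transmit/hardware-receive capabilities and a "PTP Hardware Clock" index;
# if you only see software-* modes, NTP on that NIC is limited to software timestamps.
```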
Summary
So, in all, I’m seeing time syncing somewhere in the 200–500 ns range across my network. The GPS time sources themselves are sometimes as far as 150 ns apart, even after compensating for systemic differences, and the network itself adds another 200–300 ns of noise.
In an ideal world, it’d be cool to see ~10 ns accuracy, but it’s not really possible at any level with this hardware. My time sources aren’t that good, my network adds more systemic error than that, and when I try to measure the difference between test servers I see a couple hundred nanoseconds of noise. So 10 ns isn’t going to happen.
On the other hand, though, I’m almost certainly accurate to within 1 μs across the set of 8 test servers most of the time, and I’m absolutely more accurate than my original goal of 10 μs.