Speaking from personal experience, using a Mac as a server or server-like contraption is quite an interesting proposition: despite its Unix roots, the operating system isn't exactly designed for unattended, 24/7 usage and is difficult to set up and use as such. Fighting words, but I stand by them. While almost every user will reboot their Mac at least once in the space of a few weeks, if you happen to leave one running for precisely 49 days, 17 hours, 2 minutes, and 47 seconds, many parts will suddenly stop working as its TCP/IP networking stack dies.
Those are the findings of the folks at Photon, who did some serious sleuthing after encountering a mysterious issue in a fleet of Macs they use to monitor iMessage services. The problem revealed itself when some machines just up and stopped responding to network connections out of the blue, even though they answered ping requests with an "all good here, boss!"
Said machines kept their existing network connections going, making the situation even harder to diagnose, as the failure was unexplainable and otherwise invisible. Not left with much of an option, Photon's boffins had to reboot the machines to clear the issue, something any systems administrator hates as a "solution" to a mystery issue. After all, if it happened once, it'll happen again, and assuredly at the worst possible time.
After the team spotted another set of machines approaching 49.7 days of uptime, they set up some scripts to test their theory. Sure enough, when the fateful moment arrived, the Mac they had continuously creating new connections just stopped doing so without so much as an error.
The team then turned its attention to the root cause, which clearly involved a timer in the networking code. They found the culprit to be the "tcp_now" internal counter, a figure that was "destined to overflow." The job of tcp_now is to keep track of the time since boot as far as the TCP stack is concerned, down to the millisecond. tcp_now is represented as a 32-bit unsigned integer, and those have a maximum value of 4,294,967,295 (2^32 - 1) before they wrap around to zero. Since it tracks milliseconds, tcp_now's maximum works out to 4,294,967 seconds, or 49.7 days.
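The arithmetic behind that oddly specific uptime is easy to check. The snippet below is a minimal sketch that assumes only what is stated above (a 32-bit unsigned counter ticking once per millisecond); the function names are made up for illustration and have nothing to do with Apple's kernel code.

```python
from datetime import timedelta

UINT32_MAX = 2**32 - 1  # largest value a 32-bit unsigned integer can hold


def wrap_uptime(ticks_per_second: int = 1000) -> timedelta:
    """Uptime at which a 32-bit tick counter reaches its maximum value."""
    return timedelta(seconds=UINT32_MAX // ticks_per_second)


def format_uptime(uptime: timedelta) -> str:
    """Break a timedelta into days, hours, minutes, and seconds."""
    hours, remainder = divmod(uptime.seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{uptime.days} days, {hours} hours, {minutes} minutes, {seconds} seconds"


print(format_uptime(wrap_uptime()))
# prints "49 days, 17 hours, 2 minutes, 47 seconds"
```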
As defined by standards, operating systems collect and remove closed TCP connections after a short while; 30 seconds in the case of macOS. The result of attempting to clean up these inactive connections when tcp_now is close to or at its limit (and gets stuck there thanks to a bug in Apple's XNU kernel) is that any connection's expiration status is calculated against that frozen number, resulting in a value that always overflows a 32-bit unsigned integer. When the periodic check comes to see whether a closed connection is meant to be deleted, the result is always "no," because the comparison math doesn't work.
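To see why a frozen clock means "never," here is a simplified model of the situation. It is not Apple's code: the names, the wrap-safe signed comparison, and the starting value are assumptions chosen for illustration. It only shows the general effect described above, namely that a deadline computed past the 32-bit limit can never be reached by a clock that has stopped just short of it.

```python
MASK32 = 0xFFFFFFFF
HOLD_MS = 30_000  # the 30-second hold time for closed connections mentioned above


def to_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def has_expired(now: int, deadline: int) -> bool:
    """Wrap-safe check: has the 32-bit clock reached the deadline?"""
    return to_signed32(now - deadline) >= 0


def is_reaped(clock_frozen: bool, closed_at: int, elapsed_ms: int) -> bool:
    """Would a connection closed at closed_at be cleaned up after elapsed_ms?"""
    deadline = (closed_at + HOLD_MS) & MASK32  # wraps past zero near the limit
    if clock_frozen:
        now = closed_at  # hypothetical stuck clock: real time passes, the counter does not
    else:
        now = (closed_at + elapsed_ms) & MASK32  # healthy clock wraps and keeps going
    return has_expired(now, deadline)


# A connection closed one second before the counter hits its maximum.
closed_at = MASK32 - 1000

print(is_reaped(clock_frozen=False, closed_at=closed_at, elapsed_ms=60_000))  # True
print(is_reaped(clock_frozen=True, closed_at=closed_at, elapsed_ms=60_000))   # False
```

With a healthy counter, the clock wraps through zero, passes the deadline, and the connection is cleaned up. With the counter stuck, the answer stays "no" regardless of how much real time goes by.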
The TCP stack then fills up with errantly held ephemeral ports and effectively grinds to a halt when no more are available. How quickly that happens depends on the amount of network activity, but in any server or professional environment that's bound to be a rapid event. This class of problem is hardly unknown: integer overflows were the cause of Windows 98's famous 49.7-day crash and are behind the upcoming Year 2038 problem.
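For a rough sense of "rapid," consider a back-of-the-envelope estimate. It assumes the IANA dynamic port range (49152 to 65535) as the ephemeral pool and that every new outbound connection permanently consumes one port once cleanup has stopped. Both are simplifying assumptions rather than figures from Photon's report, and real port reuse rules are more nuanced.

```python
EPHEMERAL_FIRST = 49152  # assumption: IANA dynamic/private port range
EPHEMERAL_LAST = 65535


def seconds_until_exhaustion(new_connections_per_second: float) -> float:
    """Time to use up the pool if closed connections are never cleaned up."""
    pool_size = EPHEMERAL_LAST - EPHEMERAL_FIRST + 1  # 16,384 ports
    return pool_size / new_connections_per_second


for rate in (1, 10, 100):
    minutes = seconds_until_exhaustion(rate) / 60
    print(f"{rate:>3} new connections/s -> pool exhausted in about {minutes:.1f} minutes")
```

Under those assumptions, a machine opening ten connections a second would run dry in under half an hour, and even one connection a second gets there in a few hours.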
According to Photon, the current mitigation is a reboot, although the team says it's working on an alternative solution. They also found this issue to be the source of some bugs discussed online in the Apple Community forums. The long-standing RFC 7323 specifies what should happen to the timestamp clock (tcp_now) when it reaches its limit, but Apple's kernel implements it incorrectly. It's safe to say this issue will likely be fixed quickly, and hopefully before 49.7 days have passed since the report.