Or, why does sort(1) order differently on macOS and Linux?

Zhiming Wang
2020-06-03

Today I noticed something interesting while working with a sorted list of package names: sort(1) orders them differently on macOS and Linux (Ubuntu 20.04). A very simple example, with locale set explicitly:

(macOS)

    $ LC_ALL=en_US.UTF-8 sort <<<$'python-dev\npython3-dev'
    python-dev
    python3-dev

(Linux)

    $ LC_ALL=en_US.UTF-8 sort <<<$'python-dev\npython3-dev'
    python3-dev
    python-dev

What the hell? Same locale, different order (or technically, collation). This is not even a difference between GNU and BSD userland; coreutils sort on macOS produces the same output as /usr/bin/sort. (Of course, when LC_ALL=C is used, the results are the same, matching the macOS result above, since “-” as 0x2D on the ASCII table comes before “3” as 0x33.) Therefore, the locale itself becomes the prime suspect.

macOS

LC_COLLATE for any locale on macOS is very easy to find: just look under /usr/share/locale/. Somewhat surprisingly, /usr/share/locale/en_US.UTF-8/LC_COLLATE is a symlink to ../la_LN.US-ASCII/LC_COLLATE. The US-ASCII part is a giveaway for lack of sophistication, while the unfamiliar language code la and unfamiliar country code LN gave me pause. Turns out la is code for Latin and LN isn’t really code for anything (I guess they invented it for the Latin script influence sphere?).

(realpath in the command below is part of GNU coreutils. In fact I’ll be liberally using coreutils commands in this article. You can brew install coreutils; make sure you read the caveats.)

In fact, if we look a little bit closer, most locales’ LC_COLLATE are symlinked to la_LN dot something (mostly dot US-ASCII), which isn’t very remarkable once we realize it stands for Latin:

    $ realpath /usr/share/locale/*/LC_COLLATE | sort | uniq -c | sort -nr
    122 /usr/share/locale/la_LN.US-ASCII/LC_COLLATE
     21 /usr/share/locale/la_LN.ISO8859-1/LC_COLLATE
     20 /usr/share/locale/la_LN.ISO8859-15/LC_COLLATE
      5 /usr/share/locale/la_LN.ISO8859-2/LC_COLLATE
      3 /usr/share/locale/de_DE.ISO8859-15/LC_COLLATE
      3 /usr/share/locale/de_DE.ISO8859-1/LC_COLLATE
      2 /usr/share/locale/is_IS.ISO8859-1/LC_COLLATE
      2 /usr/share/locale/cs_CZ.ISO8859-2/LC_COLLATE
      1 /usr/share/locale/uk_UA.KOI8-U/LC_COLLATE
      1 /usr/share/locale/uk_UA.ISO8859-5/LC_COLLATE
      1 /usr/share/locale/sv_SE.ISO8859-15/LC_COLLATE
      1 /usr/share/locale/sv_SE.ISO8859-1/LC_COLLATE
      1 /usr/share/locale/sr_YU.ISO8859-5/LC_COLLATE
      1 /usr/share/locale/sl_SI.ISO8859-2/LC_COLLATE
      1 /usr/share/locale/ru_RU.KOI8-R/LC_COLLATE
      1 /usr/share/locale/ru_RU.ISO8859-5/LC_COLLATE
      1 /usr/share/locale/ru_RU.CP866/LC_COLLATE
      1 /usr/share/locale/ru_RU.CP1251/LC_COLLATE
      1 /usr/share/locale/pl_PL.ISO8859-2/LC_COLLATE
      1 /usr/share/locale/lt_LT.ISO8859-4/LC_COLLATE
      1 /usr/share/locale/lt_LT.ISO8859-13/LC_COLLATE
      1 /usr/share/locale/la_LN.ISO8859-4/LC_COLLATE
      1 /usr/share/locale/kk_KZ.PT154/LC_COLLATE
      1 /usr/share/locale/is_IS.ISO8859-15/LC_COLLATE
      1 /usr/share/locale/hy_AM.ARMSCII-8/LC_COLLATE
      1 /usr/share/locale/hi_IN.ISCII-DEV/LC_COLLATE
      1 /usr/share/locale/et_EE.ISO8859-15/LC_COLLATE
      1 /usr/share/locale/es_ES.ISO8859-15/LC_COLLATE
      1 /usr/share/locale/es_ES.ISO8859-1/LC_COLLATE
      1 /usr/share/locale/el_GR.ISO8859-7/LC_COLLATE
      1 /usr/share/locale/de_DE-A.ISO8859-1/LC_COLLATE
      1 /usr/share/locale/ca_ES.ISO8859-15/LC_COLLATE
      1 /usr/share/locale/ca_ES.ISO8859-1/LC_COLLATE
      1 /usr/share/locale/bg_BG.CP1251/LC_COLLATE
      1 /usr/share/locale/be_BY.ISO8859-5/LC_COLLATE
      1 /usr/share/locale/be_BY.CP1251/LC_COLLATE
      1 /usr/share/locale/be_BY.CP1131/LC_COLLATE

Oddly enough though (until we realize it’s just lack of sophistication), many of the outliers are in fact Latin script-based languages, while markedly non-Latin ones are lumped together under the Latin
arm:

    $ realpath /usr/share/locale/{zh_CN,ja_JP,ko_KR}.UTF-8/LC_COLLATE
    /usr/share/locale/la_LN.US-ASCII/LC_COLLATE
    /usr/share/locale/la_LN.US-ASCII/LC_COLLATE
    /usr/share/locale/la_LN.US-ASCII/LC_COLLATE

Of course, these locale files are compiled binaries, so it’s hard to glean the collation rules from them (with my untrained eyes). We still need to find the source code.

Looking for OS X / macOS source code is always kind of a pain. Fortunately, searching for la_LN.US-ASCII site:opensource.apple.com led me to the adv_cmds package, or more precisely, an old version of it. This package contains source code for the locale-related commands colldef, locale, localedef, and mklocale (among other things), and until v118 (from the Mac OS X 10.5 era) it contained a usr-share-locale.tproj directory with locale definitions in source form. You can download a tarball from here. They sure don’t make it easy to find the link.

The collation definitions are in usr-share-locale.tproj/colldef, and looking at the list usr-share-locale.tproj/colldef/*.src we immediately notice the overlap with the resolved list above. In fact, it’s a perfect match save for de_DE-A.ISO8859-1 in the list above, which wasn’t present in the OS X 10.5 era source package.

And here’s the entirety of the la_LN.US-ASCII ruleset (link):

    # ASCII
    #
    # $FreeBSD: src/share/colldef/la_LN.US-ASCII.src,v 1.2 1999/08/28 00:59:47 peter Exp $
    #
    order \
        \x00;...;\xff

I’m no expert on locale definitions (in fact this doesn’t seem to follow the standard, and looks more like a colldef-specific language – see man 1 colldef), but the meaning is crystal clear: just compare the byte values one by one, semantics be damned. Same as the POSIX locale (aka C locale). That explains why LC_COLLATE=en_US.UTF-8 sorts the same as LC_COLLATE=C.

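To make the byte-by-byte comparison concrete, here is a minimal sketch that should run on either platform. The helper names byte_value and bytewise_sort are made up for illustration, and LC_ALL=C stands in for the la_LN.US-ASCII table, on the assumption (argued above) that the two order bytes identically.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric byte value of one ASCII character.
# POSIX printf reads an argument with a leading quote as "the value of the
# character that follows".
byte_value() {
    printf '%d\n' "'$1"
}

# Hypothetical helper: sort stdin lines by raw byte value. This uses the
# C locale as a stand-in for the \x00;...;\xff order.
bytewise_sort() {
    LC_ALL=C sort
}

byte_value -    # prints 45 (0x2D)
byte_value 3    # prints 51 (0x33)

# The two names first differ at "-" versus "3", so python-dev comes out first.
printf 'python3-dev\npython-dev\n' | bytewise_sort
```

Since 45 is less than 51, the byte-wise order puts python-dev ahead of python3-dev, which is the macOS result from the beginning of the article.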
Also, the README (link) for context:

    $FreeBSD: src/share/colldef/README,v 1.2 2002/04/08 09:28:22 ache Exp $

    WARNING: For the compatibility sake try to keep collating table backward
    compatible with ASCII, i.e. add other symbols to the existent ASCII order.

The content and timestamps place these source files perfectly in the FreeBSD 5.0.0 tree. It just so happens to be known that OS X’s BSD layer was synchronized with FreeBSD 5 back in 10.3 Panther, so the story as told by the source files checks out.

However, do recall usr-share-locale.tproj has been long gone from the adv_cmds package. Have the rules changed? One simple test:

    $ colldef -o /dev/stdout usr-share-locale.tproj/colldef/la_LN.US-ASCII.src | sha256sum
    9ec9b40c837860a43eb3435d7a9cc8235e66a1a72463d11e7f750500cabb5b78  -
    $ sha256sum