Unicode is good. If you’re designing a data structure or protocol that has text fields, they should contain Unicode characters encoded in UTF-8. There’s another question, though: “Which Unicode characters?” The answer is “Not all of them, please exclude some.”
This issue keeps coming up, so Paul Hoffman and I put together an individual-submission draft to the IETF and now (where by “now” I mean “two years later”) it’s been published as RFC 9839. It explains which characters are bad, and why, then offers three plausible less-bad subsets that you might want to use. Herewith a bit of background, but…
Please · If you’re actually working on something new that will have text fields, please read the RFC. It’s only ten pages long, and that’s with all the IETF boilerplate. It’s written specifically for software and networking people.
The smoking gun · The badness that 9839 focuses on is “problematic characters”, so let’s start with a painful example of what that means. Suppose you’re designing a protocol that uses JSON and one of your constructs has a username field. Suppose you get this message (I omit all the non-username fields). It’s a perfectly legal JSON text:
{ "username" : "\u0000\u0089\uDEAD\uD9BF\uDFFF" }
Unpacking all the JSON escaping gibberish reveals that the value of the username field contains four numeric “code points” identifying Unicode characters:
1. The first code point is zero, in Unicode jargon U+0000. In human-readable text it has no meaning, but it will interfere with the operation of certain programming languages.

2. Next is Unicode U+0089, official name “CHARACTER TABULATION WITH JUSTIFICATION”. It’s what Unicode calls a C1 control code, inherited from ISO/IEC 6429:1992, adopted from ECMA-48 (1991), which calls it “HTJ” and says:

   “HTJ causes the contents of the active field (the field in the presentation component that contains the active presentation position) to be shifted forward so that it ends at the character position preceding the following character tabulation stop. The active presentation position is moved to that following character tabulation stop. The character positions which precede the beginning of the shifted string are put into the erased state.”

   Good luck with that.

3. The third code point, U+DEAD, in Unicode lingo, is an “unpaired surrogate”. To understand, you’d have to learn how Unicode’s much-detested UTF-16 encoding works. I recommend not bothering. All you need to know is that surrogates are only meaningful when they come in pairs in UTF-16 encoded text. There is effectively no such text on the wire and thus no excuse for tolerating surrogates in your data. In fact, the UTF-8 specification says that you mustn’t use UTF-8 to encode surrogates. But the real problem is that different libraries in different programming languages don’t always do the same things when they encounter this sort of fœtid interloper.

4. Finally, \uD9BF\uDFFF is JSON for the code point U+7FFFF. Unicode has a category called “noncharacter”, containing a few dozen code points that, for a variety of reasons, some good, don’t represent anything and must not be interchanged on the wire. U+7FFFF is one of those.
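To see the trouble concretely, here’s a little Python sketch that feeds the example message to a stock JSON parser. Python’s json module happily accepts all of it: it pairs \uD9BF\uDFFF up into U+7FFFF, but leaves \uDEAD sitting there as a lone surrogate, which then makes the value impossible to re-encode as UTF-8.

```python
import json

# The example message, with the JSON escapes exactly as they
# would arrive on the wire.
message = '{ "username" : "\\u0000\\u0089\\uDEAD\\uD9BF\\uDFFF" }'

name = json.loads(message)["username"]
print([hex(ord(c)) for c in name])
# ['0x0', '0x89', '0xdead', '0x7ffff']

# The lone surrogate U+DEAD can't legally be encoded as UTF-8,
# so this value can never go back on the wire intact.
try:
    name.encode("utf-8")
except UnicodeEncodeError as e:
    print("refused:", e.reason)
```

Other languages’ parsers behave differently; some reject the lone surrogate outright, some silently substitute U+FFFD. That inconsistency is exactly the problem.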
The four code points in the example are all clearly problematic. The just-arrived RFC 9839 formalizes the notion of “problematic” and offers easy-to-cite language saying which of these problematic types you want to exclude from your text fields. Which, if you’re going to use JSON, you should probably do.
Don’t blame Doug · Doug Crockford, I mean, the inventor of JSON. If he (or I or really anyone careful) were inventing JSON now that Unicode is mature, he’d be fussier about its character repertoire. Having said that, we’re stuck with JSON-as-it-is forever, so we need a good way to say which of the problematic characters we’re going to exclude even if JSON allows them.
PRECISion · You may find yourself wondering why the IETF waited until 2025 to provide help with Bad Unicode. It didn’t; here’s RFC 8264: PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols; the first PRECIS predecessor was published in 2002. 8264 is 43 pages long, covering many more potential Bad Unicode issues than 9839 does.