Unicode is good. If you’re designing a data structure or protocol that has text fields, they should contain Unicode characters encoded in UTF-8. There’s another question, though: “Which Unicode characters?” The answer is “Not all of them, please exclude some.”
This issue keeps coming up, so Paul Hoffman and I put together an individual-submission draft to the IETF and now (where by “now” I mean “two years later”) it’s been published as RFC 9839. It explains which characters are bad, and why, then offers three plausible less-bad subsets that you might want to use. Herewith a bit of background, but…
Please · If you’re actually working on something new that will have text fields, please read the RFC. It’s only ten pages long, and that’s with all the IETF boilerplate. It’s written specifically for software and networking people.
The smoking gun · The badness that 9839 focuses on is “problematic characters”, so let’s start with a painful example of what that means. Suppose you’re designing a protocol that uses JSON and one of your constructs has a username field. Suppose you get this message (I omit all the non-username fields). It’s a perfectly legal JSON text:
{ "username" : "\u0000\u0089\uDEAD\uD9BF\uDFFF" }
Unpacking all the JSON escaping gibberish reveals that the value of the username field contains four numeric “code points” identifying Unicode characters:
1. The first code point is zero, in Unicode jargon U+0000. In human-readable text it has no meaning, but it will interfere with the operation of certain programming languages.

2. Next is U+0089, official name “CHARACTER TABULATION WITH JUSTIFICATION”. It’s what Unicode calls a C1 control code, inherited from ISO/IEC 6429:1992, adopted from ECMA-48 (1991), which calls it “HTJ” and says:

  HTJ causes the contents of the active field (the field in the presentation component that contains the active presentation position) to be shifted forward so that it ends at the character position preceding the following character tabulation stop. The active presentation position is moved to that following character tabulation stop. The character positions which precede the beginning of the shifted string are put into the erased state.

Good luck with that.

3. The third code point, U+DEAD, is what Unicode calls an “unpaired surrogate”. To understand it, you’d have to learn how Unicode’s much-detested UTF-16 encoding works; I recommend not bothering. All you need to know is that surrogates are only meaningful when they come in pairs in UTF-16-encoded text. There is effectively no such text on the wire, and thus no excuse for tolerating surrogates in your data. In fact, the UTF-8 specification says you mustn’t use UTF-8 to encode surrogates. But the real problem is that different libraries in different programming languages don’t all do the same thing when they encounter this sort of fœtid interloper.

4. Finally, \uD9BF\uDFFF is JSON for the code point U+7FFFF. Unicode has a category called “noncharacter”, containing a few dozen code points that, for a variety of reasons, some good, don’t represent anything and must not be interchanged on the wire. U+7FFFF is one of those.
The four code points in the example are all clearly problematic. The just-arrived RFC 9839 formalizes the notion of “problematic” and offers easy-to-cite language saying which of these problematic types you want to exclude from your text fields. Which, if you’re going to use JSON, you should probably do.
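That point about libraries doing different things is easy to demonstrate. Here’s a small Go sketch showing what the standard library’s encoding/json does with the example message; the behavior is as I’ve observed in recent Go releases, and decoders in other languages make different choices:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	raw := `{ "username" : "\u0000\u0089\uDEAD\uD9BF\uDFFF" }`
	var msg struct {
		Username string `json:"username"`
	}
	if err := json.Unmarshal([]byte(raw), &msg); err != nil {
		panic(err)
	}
	// Print the code points that actually landed in the string.
	for _, r := range msg.Username {
		fmt.Printf("U+%04X ", r)
	}
	fmt.Println()
	// Recent Go releases print: U+0000 U+0089 U+FFFD U+7FFFF
	// The unpaired surrogate was silently replaced with U+FFFD; the
	// NUL, the C1 control, and the noncharacter sailed through.
}

So Go quietly papers over the surrogate but delivers the other three problematic code points straight to your application; some other decoders preserve the lone surrogate, and still others reject the message outright. That disagreement is the interoperability problem.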
Don’t blame Doug · Doug Crockford, I mean, the inventor of JSON. If he (or I, or really anyone careful) were inventing JSON now that Unicode is mature, he’d be fussier about its character repertoire. Having said that, we’re stuck with JSON-as-it-is forever, so we need a good way to say which of the problematic characters we’re going to exclude even though JSON allows them.
PRECISion · You may find yourself wondering why the IETF waited until 2025 to provide help with Bad Unicode. It didn’t; here’s RFC 8264: PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols; the first PRECIS predecessor was published in 2002. 8264 is 43 pages long and contains a very thorough discussion of many more potential Bad Unicode issues than 9839 does.
Like 9839, PRECIS specifies subsets of the Unicode character repertoire and goes further, providing a mechanism for defining more.
Having said that, PRECIS doesn’t seem to be very widely used by people who are defining new data structures and protocols. My personal opinion is that there are two problems which make it hard to adopt. First, it’s large and complex, with many moving parts, and requires careful study to understand. Developers are (for good reason) lazy.
Second, using PRECIS ties you to a specific version of Unicode. In particular, it forbids the use of the (nearly a million) unassigned code points. Since each release of Unicode includes new code point assignments, a sender and receiver need to agree on exactly which version of Unicode they’re both going to use if they want reliably interoperable behavior. This makes life difficult for anyone writing general-purpose code designed to be used in lots of different applications.
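To make the version-pinning concrete: every language runtime bakes the tables from some particular Unicode release into its standard library, so a “reject unassigned code points” rule is implicitly tied to that release. A tiny Go illustration (unicode.Version is a real constant; the value you see depends on which Go you’re running):

package main

import (
	"fmt"
	"unicode"
)

func main() {
	// The Unicode release whose character tables ship with this Go.
	// A PRECIS-style "no unassigned code points" rule means sender
	// and receiver must agree on this value.
	fmt.Println(unicode.Version) // e.g. "15.0.0"
}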
I personally think that the only version of Unicode anybody wants to use is “as recent as possible”, so they can be confident of having all the latest emojis.
Anyhow, 9839 is simpler and dumber than PRECIS. But I think some people will find it useful and now the IETF agrees.
Source code · I’ve written a little Go-language library to validate incoming text fields against each of the three subsets that 9839 specifies, here. I don’t claim it’s optimal, but it is well-tested.
It doesn’t have a version number or release just yet; I’ll wait till a few folk have had a chance to spot any dumb mistakes I probably made.
Details · Here’s a compact summary of the world of problematic Unicode code points and data formats and standards.
Problematic classes excluded?

             Surrogates   Legacy controls   Noncharacters
CBOR         yes          no                no
I-JSON       yes          no                yes
JSON         no           no                no
Protobufs    no           no                no
TOML         yes          no                no
XML          yes          partial [1]       partial [2]
YAML         yes          mostly [3]        partial [2]

RFC 9839 subsets:
Scalars      yes          no                no
XML          yes          partial           partial
Assignables  yes          yes               yes
Notes:
[1] XML allows C1 controls.
[2] XML and YAML don’t exclude the noncharacters outside the Basic Multilingual Plane.
[3] YAML excludes all the legacy controls except for the mostly-harmless U+0085, another version of the newline, used in IBM mainframe documents.
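To make the table concrete, here’s a minimal Go sketch of its strictest row, the Assignables subset: a code point is acceptable unless it’s a surrogate, a legacy control, or a noncharacter. The function name and shape are mine for this sketch, not the library’s actual API; the ranges are the standard Unicode ones, with 9839’s carve-out for the useful controls (tab, LF, CR):

package main

import "fmt"

// problematic reports whether a code point falls into one of the three
// classes 9839 worries about; everything it rejects is outside the
// Assignables subset.
func problematic(r rune) bool {
	switch {
	case r < 0 || r > 0x10FFFF:
		return true // not a Unicode code point at all
	case r == 0x09 || r == 0x0A || r == 0x0D:
		return false // the useful controls: tab, LF, CR
	case r < 0x20, 0x7F <= r && r <= 0x9F:
		return true // legacy controls: the rest of C0, DEL, all of C1
	case 0xD800 <= r && r <= 0xDFFF:
		return true // surrogates
	case 0xFDD0 <= r && r <= 0xFDEF:
		return true // the contiguous block of noncharacters in the BMP
	case r&0xFFFE == 0xFFFE:
		return true // the two noncharacters at the end of each plane
	default:
		return false
	}
}

func main() {
	for _, r := range "\u0000\u0089\uFFFD\U0007FFFF" {
		fmt.Printf("U+%04X problematic: %v\n", r, problematic(r))
	}
}

Note that, unlike PRECIS, nothing here consults a character database, which is why the check doesn’t care what version of Unicode you have; unassigned-but-assignable code points pass.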
Thanks! · 9839 is not a solo production. It received an extraordinary amount of discussion and improvement from a lot of smart and well-informed people and the published version, 15 draft revisions later, is immensely better than my initial draft. My sincere thanks go to my co-editor Paul Hoffman and to all those mentioned in the RFC’s “Acknowledgements” section.
On individual submissions · 9839 is the second “individual submission” RFC I’ve pushed through the IETF (the other is RFC 7725, which registers the HTTP 451 status code). While it’s nice to decide something is worth standardizing and eventually have that happen, it’s really a lot of work. Some of that work is annoying.
I’ve been involved in other efforts as a Working-Group member, WG chair, and WG specification editor, and I can report authoritatively that creating an RFC the traditional way, through a Working Group, is easier and better.
I feel discomfort advising others not to follow in my footsteps, but in this case I think it’s the right advice.