UTF-8 is decent and all, but it contains some design errors, partly because its original designers just messed up, and partly because of ISO and Unicode Consortium internal politics. We’re probably going to be using it forever, so it would be good to correct these design errors before they get any more entrenched than they already are.
Corrected UTF-8 is almost the same as UTF-8. We make only three changes: overlength encodings become impossible instead of just forbidden; the C1 controls and the Unicode surrogate characters are not encoded; and the artificial upper limit on the code space is removed.
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
Eliminating overlength encodings
The possibility of overlength encodings is the design error in UTF-8 that’s just a plain old mistake. As originally specified, the codepoint U+002F (SOLIDUS, / ) could be encoded as the one-byte sequence 2F , or the two-byte sequence C0 AF , or the three-byte sequence E0 80 AF , etc. This led to security holes and so the specification was revised to say that a UTF-8 encoder must produce the shortest possible sequence that can represent a codepoint, and a decoder must reject any byte sequence that’s longer than it needs to be.
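To see why overlength encodings were dangerous, here is a minimal sketch of a non-validating decoder (the function name is mine, not from any spec). Because it ignores the shortest-form rule, three different byte sequences all decode to U+002F, which historically let attackers smuggle a `/` past filters that only checked for the one-byte form:

```python
def naive_utf8_decode_one(data: bytes) -> int:
    """Decode the first codepoint of `data`, wrongly accepting
    overlength forms (no shortest-form check, no range checks)."""
    b0 = data[0]
    if b0 < 0x80:                        # 0xxxxxxx
        return b0
    elif b0 < 0xE0:                      # 110xxxxx 10yyyyyy
        return ((b0 & 0x1F) << 6) | (data[1] & 0x3F)
    elif b0 < 0xF0:                      # 1110xxxx 10yyyyyy 10zzzzzz
        return (((b0 & 0x0F) << 12) | ((data[1] & 0x3F) << 6)
                | (data[2] & 0x3F))
    else:                                # 11110xxx 10yy... 10zz... 10ww...
        return (((b0 & 0x07) << 18) | ((data[1] & 0x3F) << 12)
                | ((data[2] & 0x3F) << 6) | (data[3] & 0x3F))

# All three sequences decode to 0x2F under this naive decoder:
for seq in (b"\x2F", b"\xC0\xAF", b"\xE0\x80\xAF"):
    print(hex(naive_utf8_decode_one(seq)))   # prints 0x2f each time
```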
Corrected UTF-8 instead adds offsets to the codepoints encoded by all sequences of at least two bytes, so that every possible sequence is the unique encoding of a single codepoint. For example, a two-byte sequence, 110xxxxx 10yyyyyy, encodes the codepoint 0000 0xxx xxyy yyyy plus 160; therefore, C0 AF becomes the unique encoding of U+00CF (LATIN CAPITAL LETTER I WITH DIAERESIS, Ï ).
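A sketch of the two-byte case (function names are mine): the 11-bit value carried by 110xxxxx 10yyyyyy is offset by 160, so every two-byte sequence names exactly one codepoint, and the shortest-form rule becomes unnecessary.

```python
TWO_BYTE_OFFSET = 160  # skips ASCII (0-127) and the C1 controls (128-159)

def decode_two_byte(b0: int, b1: int) -> int:
    """Decode a Corrected UTF-8 two-byte sequence 110xxxxx 10yyyyyy."""
    value = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    return value + TWO_BYTE_OFFSET

def encode_two_byte(cp: int) -> bytes:
    """Encode a codepoint in the two-byte range U+00A0..U+089F."""
    value = cp - TWO_BYTE_OFFSET
    assert 0 <= value <= 0x7FF
    return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])

print(hex(decode_two_byte(0xC0, 0xAF)))  # 0xcf -> U+00CF, Ï
print(encode_two_byte(0xCF).hex())       # c0af
```

Note the consequence: two-byte sequences now cover U+00A0 through U+089F rather than U+0080 through U+07FF.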
Not encoding C1 controls or surrogates
The C1 control character range (U+0080 through U+009F) is included in Unicode primarily for backward compatibility with ISO/IEC 2022, an older character encoding standard in which the byte ranges 00 through 1F and 7F through 9F are reserved for control characters.
It is never appropriate to use the C1 controls in interchangeable text, as they are very likely to be misinterpreted according to one of the DOS code pages that defined bytes 80 through 9F as graphic characters. Corrected UTF-8 skips over them entirely; this is why the offset for two-byte sequences is 160 rather than 128. (I would like to discard almost all of the C0 controls as well—preserving only U+0000 and U+000A—but that would break ASCII compatibility, which is a step too far.) If there is a need to represent U+0080 through U+009F, perhaps for round-tripping historical documents, they can be mapped to some convenient private-use codepoints.
Similarly, the only reason the surrogate space (U+D800 through U+DFFF) exists is to support UTF-16. These codepoints will never appear in well-formed Unicode text, and the current generation of the UTF-8 specification actually forbids the three-byte sequences ED A0 80 through ED BF BF from being emitted or accepted at all, rather like the overlength sequences. In Corrected UTF-8, we skip this range just as we do for the C1 controls. (This unfortunately does mean that the three-byte sequences are split into two ranges with two different offsets.) Again, programs that need to represent actual surrogates (perhaps for the same reasons that motivated the creation of WTF-8) can map them into private-use space.
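The three-byte case can be sketched as follows. The exact offsets are my own derivation from the rules above, not quoted from a spec: two-byte sequences top out at 0x7FF + 160 = U+089F, so the three-byte range starts mapping at U+08A0, and the 0x800 surrogate codepoints are jumped over mid-range.

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Decode a Corrected UTF-8 three-byte sequence
    1110xxxx 10yyyyyy 10zzzzzz, skipping the surrogate range."""
    value = ((b0 & 0x3F) << 12) & 0xF000  # keep only the low 4 bits of b0
    value = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    cp = value + 0x8A0            # three-byte range begins at U+08A0
    if cp >= 0xD800:              # jump over U+D800..U+DFFF entirely
        cp += 0x800
    return cp

print(hex(decode_three_byte(0xE0, 0x80, 0x80)))  # 0x8a0: first 3-byte codepoint
```

With this rule, the codepoints on either side of the surrogate gap (U+D7FF and U+E000) are encoded by adjacent three-byte sequences, and no sequence at all maps into the gap.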