Dark Corners of Unicode (2015)

I’m assuming, if you are on the Internet and reading kind of a nerdy blog, that you know what Unicode is. At the very least, you have a very general understanding of it — maybe “it’s what gives us emoji”.

That’s about as far as most people’s understanding extends, in my experience, even among programmers. And that’s a tragedy, because Unicode has a lot of… ah, depth to it. Not to say that Unicode is a terrible disaster — more that human language is a terrible disaster, and anything with the lofty goals of representing all of it is going to have some wrinkles.

So here is a collection of curiosities I’ve encountered in dealing with Unicode that you generally only find out about through experience. Enjoy.

Also, I strongly recommend you install the Symbola font, which contains basic glyphs for a vast number of characters. They may not be pretty, but they’re better than seeing the infamous Unicode lego.

There are already plenty of introductions to Unicode floating around (Wikipedia, nedbat, joel), and this is not going to be one. But here’s a quick refresher.

Unicode is a big table that assigns numbers (codepoints) to a wide variety of characters you might want to use to write text. We often say “Unicode” when we mean “not ASCII”, but that’s silly since of course all of ASCII is also included in Unicode.
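To make the "big table" idea concrete, here is a minimal Python sketch that looks up the codepoint for a few characters and converts back again. The helper names `codepoint_label` and `is_ascii_char` are made up for this illustration; `ord` and `chr` are Python built-ins.

```python
def codepoint_label(ch: str) -> str:
    """Format a single character's codepoint in the usual U+XXXX style."""
    return f"U+{ord(ch):04X}"


def is_ascii_char(ch: str) -> bool:
    """ASCII occupies the first 128 entries of the Unicode table."""
    return ord(ch) < 128


# "\u00e9" is e with an acute accent; "\N{FROG FACE}" is the frog emoji.
for ch in ["A", "\u00e9", "\N{FROG FACE}"]:
    # chr() is the inverse of ord(): number back to character.
    assert chr(ord(ch)) == ch
    print(ch, codepoint_label(ch), "ASCII" if is_ascii_char(ch) else "not ASCII")
```

Note that "A" is both ASCII and Unicode, which is why using "Unicode" to mean "not ASCII" is a bit off.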

UTF-8 is an encoding, a way of turning a sequence of codepoints into bytes. All Unicode codepoints can be encoded in UTF-8. ASCII is also an encoding, but it supports only 128 characters, mostly English letters and punctuation.
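As a rough sketch of the difference between the two encodings, the Python below encodes a few strings as UTF-8 and then tries the same with ASCII. The function names `utf8_bytes` and `try_ascii` are hypothetical helpers for this example; `str.encode` is the standard library call doing the work.

```python
from typing import Optional


def utf8_bytes(text: str) -> bytes:
    """Encode a string (a sequence of codepoints) as UTF-8 bytes."""
    return text.encode("utf-8")


def try_ascii(text: str) -> Optional[bytes]:
    """Encode as ASCII, or return None if any character is outside ASCII."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


# UTF-8 uses between one and four bytes per codepoint.
for s in ["A", "\u00e9", "\u20ac", "\N{FROG FACE}"]:
    encoded = utf8_bytes(s)
    print(f"U+{ord(s):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")

print(try_ascii("hello"))   # plain English text fits in ASCII
print(try_ascii("\u00e9"))  # anything past codepoint 127 does not
```

For pure ASCII text the UTF-8 bytes and the ASCII bytes are identical, which is a large part of why UTF-8 caught on.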

A character is a fairly fuzzy concept. Letters and numbers and punctuation are characters. But so are Braille and frogs and halves of flags. Basically, a character is anything that appears somewhere in the Unicode table.

A glyph is a visual representation of some symbol, provided by a font. It might represent a single character, or it might represent several. Or both!
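Here is a small Python sketch of text that many fonts draw as one glyph even though it is made of several characters. The `describe` helper is invented for this example, and whether you actually see a single glyph depends entirely on your font and renderer, so treat the comments about rendering as assumptions.

```python
import unicodedata


def describe(text: str) -> list:
    """List each character in a string as 'U+XXXX NAME'."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}" for c in text]


# Two "halves of a flag": regional indicator symbols C and A.
# Fonts with flag support typically draw the pair as one flag glyph.
flag = "\U0001F1E8\U0001F1E6"

# A plain "e" followed by a combining acute accent.
# Usually drawn as a single accented letter.
accented = "e\u0301"

for sample in (flag, accented):
    print(len(sample), "characters:")
    for line in describe(sample):
        print("   ", line)
```

Both strings have a length of 2 as far as Python is concerned, whatever ends up on screen.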

Unicode is divided into seventeen planes, numbered zero through sixteen. Plane 0 is also called the Basic Multilingual Plane, or just BMP, so named because it contains the alphabets of most modern languages. The other planes are much less commonly used and are sometimes informally referred to as the astral planes.
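Each plane holds 65,536 (0x10000) codepoints, so a character's plane is just its codepoint shifted right by 16 bits. A minimal Python sketch, with `plane_of` and `is_astral` as made-up names for illustration:

```python
MAX_CODEPOINT = 0x10FFFF  # the highest codepoint Unicode allows


def plane_of(ch: str) -> int:
    """Return the plane number (0 through 16) of a single character."""
    return ord(ch) >> 16


def is_astral(ch: str) -> bool:
    """True for anything outside the Basic Multilingual Plane (plane 0)."""
    return plane_of(ch) != 0


# Seventeen planes in total: 0 through 16.
print((MAX_CODEPOINT >> 16) + 1)

for ch in ["A", "\u20ac", "\N{FROG FACE}"]:
    kind = "astral" if is_astral(ch) else "BMP"
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)} ({kind})")
```

The frog lands in plane 1, which is where most emoji live.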
