
Confusables.txt and NFKC disagree on 31 characters


and why your homoglyph detection needs to know about both

If you’ve ever built a login system, you’ve probably dealt with homoglyph attacks: someone registers аdmin with a Cyrillic “а” (U+0430) instead of Latin “a” (U+0061). The characters are visually identical, and if your system accepts Unicode identifiers, you have an impersonation vector.
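The spoof is easy to demonstrate: the two strings render identically in most fonts, but they are different sequences of code points and compare unequal. A minimal illustration:

```python
# The Cyrillic spoof looks identical to the Latin original in most
# fonts, but it is a different string entirely.
spoof = "\u0430dmin"  # Cyrillic а (U+0430) followed by Latin "dmin"
real = "admin"        # all Latin

print(spoof == real)       # False
print(hex(ord(spoof[0])))  # 0x430
print(hex(ord(real[0])))   # 0x61
```

A naive uniqueness check on the raw strings will happily register both accounts.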

The Unicode Consortium maintains an official defence against this: confusables.txt, part of Unicode Technical Standard #39 (Security Mechanisms). It’s a flat file mapping ~6,565 characters to their visual equivalents. Cyrillic а → a, Greek ο → o, Cherokee Ꭺ → A, and thousands more.
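The file's entries are semicolon-separated fields: source code points, target code points, and a type tag. A minimal parser might look like the sketch below; the sample lines are paraphrased in the file's format rather than copied verbatim, and a real application would download the full file from unicode.org.

```python
# Minimal parser for the confusables.txt line format:
#   <source code points> ; <target code points> ; <type>  # comment
SAMPLE = """\
0430 ; 0061 ; MA # ( а → a ) CYRILLIC SMALL LETTER A → LATIN SMALL LETTER A
03BF ; 006F ; MA # ( ο → o ) GREEK SMALL LETTER OMICRON → LATIN SMALL LETTER O
"""

def parse_confusables(text: str) -> dict[str, str]:
    table = {}
    for line in text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and blanks
        if not line:
            continue
        source, target, _kind = (field.strip() for field in line.split(";"))
        src = "".join(chr(int(cp, 16)) for cp in source.split())
        tgt = "".join(chr(int(cp, 16)) for cp in target.split())
        table[src] = tgt
    return table

table = parse_confusables(SAMPLE)
print(table["\u0430"])  # a
```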

It’s worth noting that confusables.txt is designed for detection, not normalization. TR39 itself says skeleton mappings are “not suitable for display to users” and “should definitely not be used as a normalization of identifiers.” The correct use is to check whether a submitted identifier contains characters that visually mimic Latin letters, and if so, reject it — not to silently remap those characters and let it through.
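The reject-don't-remap policy can be sketched as follows. `CONFUSABLE_MAP` here is a hypothetical stand-in for a table built from confusables.txt, not a real library:

```python
# Hedged sketch: reject identifiers containing non-ASCII characters
# that confusables.txt maps to an ASCII target. CONFUSABLE_MAP is a
# tiny stand-in subset for illustration.
CONFUSABLE_MAP = {"\u0430": "a", "\u03bf": "o"}  # Cyrillic а, Greek ο

def check_identifier(name: str) -> str:
    for ch in name:
        mapped = CONFUSABLE_MAP.get(ch)
        if mapped is not None and mapped.isascii() and not ch.isascii():
            # Reject outright -- do NOT silently remap and accept.
            raise ValueError(f"confusable character U+{ord(ch):04X} in {name!r}")
    return name

print(check_identifier("admin"))      # passes
# check_identifier("\u0430dmin")      # raises ValueError
```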

Here’s the wrinkle. If your application also runs NFKC normalization (which it should — ENS, GitHub, and Unicode IDNA all require it), then 31 entries in confusables.txt map the same character to a different target than NFKC. If you’re building a confusable map for use after NFKC normalization, those entries are unreachable. NFKC has already transformed the character before your confusable check sees it.
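You can test for this class of dead entry directly: if NFKC already rewrites an entry's source character, a confusable check running after normalization can never see it. A small probe (U+2113 SCRIPT SMALL L is used as an example of a character NFKC rewrites; whether it is among the 31 divergent entries is not claimed here):

```python
import unicodedata

def unreachable_after_nfkc(src: str) -> bool:
    """A confusable entry whose source is rewritten by NFKC can never
    be hit by a check that runs after NFKC normalization."""
    return unicodedata.normalize("NFKC", src) != src

# U+2113 SCRIPT SMALL L compatibility-decomposes to "l" under NFKC,
# so an entry keyed on it is dead code post-normalization.
print(unreachable_after_nfkc("\u2113"))  # True
# Cyrillic а has no compatibility decomposition; it survives NFKC.
print(unreachable_after_nfkc("\u0430"))  # False
```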

What NFKC normalization does

NFKC (Normalization Form Compatibility Composition) is Unicode’s way of collapsing “compatibility variants” to their canonical form. Fullwidth letters → ASCII, superscripts → normal digits, ligatures → component letters, mathematical styled characters → plain characters:

Ｈｅｌｌｏ → Hello (fullwidth → ASCII)
ﬁnance → finance (fi ligature → fi)
𝐇ello → Hello (mathematical bold → plain)

This is the right first step for slug/handle validation. You want Ｈｅｌｌｏ to become hello, not to be rejected as containing confusable characters. NFKC handles hundreds of these compatibility forms automatically.
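All of the transformations above are one call to the standard library's `unicodedata.normalize`; for slug validation you would typically follow it with case folding:

```python
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

print(nfkc("\uFF28\uFF45\uFF4C\uFF4C\uFF4F"))  # fullwidth Ｈｅｌｌｏ → Hello
print(nfkc("\uFB01nance"))                     # ﬁ ligature → finance
print(nfkc("\U0001D407ello"))                  # mathematical bold 𝐇 → Hello

# Slug/handle validation: normalize first, then lowercase.
print(nfkc("\uFF28\uFF45\uFF4C\uFF4C\uFF4F").casefold())  # hello
```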

NFKC and confusables serve different purposes. NFKC is for normalization: producing a canonical form for storage and comparison. Confusables detection is for security: flagging characters that could fool a human reader. They answer different questions about the same input, and in a well-designed system they’re applied separately rather than chained together to produce a single output.
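A sketch of the two-step shape this implies: one function normalizes for storage, a separate check flags confusables on the result and rejects. `CONFUSABLE_MAP` is again a hypothetical stand-in subset, not a real library:

```python
import unicodedata

# Stand-in subset of a confusables.txt-derived table (illustrative only).
CONFUSABLE_MAP = {"\u0430": "a", "\u03bf": "o"}  # Cyrillic а, Greek ο

def validate_handle(raw: str) -> str:
    # Step 1, normalization: canonical form for storage and comparison.
    canonical = unicodedata.normalize("NFKC", raw).casefold()
    # Step 2, security: flag non-ASCII characters that mimic Latin
    # letters, and reject rather than remap.
    for ch in canonical:
        if ch in CONFUSABLE_MAP and not ch.isascii():
            raise ValueError(f"confusable U+{ord(ch):04X} rejected")
    return canonical

print(validate_handle("\uFF28ello"))  # fullwidth Ｈ normalizes → hello
# validate_handle("\u0430dmin")       # raises ValueError
```

Because the check runs on the post-NFKC string, it is exactly the place where those 31 divergent entries would silently go dead if you built the map straight from confusables.txt.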
