Introduction to the experience of rendering Arabic typography&its technical debt

To understand why every machine since Gutenberg has wrestled this script and mostly lost, you need one structural fact: Arabic is cursive always. There is no print-versus-handwriting distinction, no block letters. The letters connect in stone inscriptions, in manuscripts, in metal, on screens. Each letter therefore changes shape depending on its neighbours (an isolated form, an initial, a medial, a final), and six letters refuse to connect forward at all, which breaks words into joined clusters and gives the script its rhythm. The shapes are not costumes over some underlying "real" letter. The positional variation is the letter.

And the alphabet is bigger than Arabic the language. Persian extends it with four letters Arabic does not have (پ pe, چ che, ژ zhe, گ gaf) and uses two of the existing letters in subtly different forms (ی for the final yāʾ, ک for kaf). Urdu adds an aspirated do-chashmī he (ھ), a retroflex set (ٹ ڈ ڑ), and a hanging ye barree (ے), and writes most of its everyday text in Nastaʿlīq, which a Naskh-shaped font will produce as a phonetically correct but visually unrecognisable approximation. Sindhi has more again. Pashto, Kurdish, Uyghur, Kashmiri, and Punjabi each take the alphabet, add what their phonology requires, and ship. Any font that calls itself "Arabic" without consulting the Persian and Urdu communities will produce, for hundreds of millions of readers in Iran and South Asia, text that is technically rendered but functionally wrong: the kaf has the wrong terminal, the heh fuses where it shouldn't, the digits are from the wrong belt. The Noto Sans Arabic family ships separate sub-fonts to cover these (NotoNaskhArabic, NotoNastaliqUrdu, NotoSansArabicUI), and OS font fallback chains usually get it right. Usually.

stored codepoint isolated initial medial final U+0639 ʿAYN ع عـ ـعـ ـع U+0647 HEH ه هـ ـهـ ـه One codepoint, four shapes, chosen at render time by the shaping engine. The medial heh and the isolated heh are, to an untrained eye, different letters; I have watched students of Arabic meet the medial heh in week three and file a complaint with management. A Latin font that ships 26 lowercase shapes needs no opinion about any of this. An Arabic font is wrong unless it has opinions about all of it.

The arrangement we eventually settled on, after decades of wrong answers, is this: the encoding stores the abstract letter, and the font supplies the shapes. Unicode gives you one codepoint for ʿayn; the font carries the four positional glyphs; a shaping engine applies the OpenType features ( isol , init , medi , fina , plus rlig for the ligatures the script requires, plus mark and mkmk for stacking the vowel signs) at render time. An Arabic font is a small program. The text you store is its input, not its output. The word is performed fresh every time you look at it, like music from a score.

The cleanest way to feel this is to assemble a word one letter at a time and watch every prior letter renegotiate its shape as the next one arrives:

click letters to add them to the word, in the order they appear in the codepoint stream: STORED CODEPOINTS — RENDERED BY THE SHAPING ENGINE — ⌫ back clear Try م , then ح , then م , then د : build the name Muḥammad. The first م drops into its initial form the moment you add the ح , the ح goes medial when the next م arrives, and so on through to the د , one of the six non-joining letters, which interrupts the flow and forces what would have been the third م into a final form. Four codepoints in storage, one continuous stroke on screen. None of it happens without a shaping engine; a PDF generator that lacks one will render the same four codepoints as four disconnected isolated forms.

The wrong answers are still in the standard, fossilised, and they make excellent souvenirs. Before shaping engines existed, the 8-bit code pages of the DOS and early Windows era encoded the shapes themselves: a separate character for initial ʿayn, medial ʿayn, and so on. Unicode, which promised round-trip compatibility with anything else, had to swallow those sets whole, and they live on at U+FB50 through U+FEFF under the name Arabic Presentation Forms: several hundred codepoints that no new document should ever contain and that PDF text extractors merrily emit to this day, which is one of the reasons searching an Arabic PDF so often fails in silence. The haystack is encoded as shapes and your needle is encoded as letters. My favourite resident of the block, and one of my favourite characters in all of Unicode, is U+FDFD, ﷽ : four-word invocation, bismillāh ar-raḥmān ar-raḥīm, as a single codepoint. A monument from the era when rendering was baked into the encoding because nobody trusted the renderer to do anything, preserved forever, like a fly in amber that recites.

This bites because the two encodings render identically and compare differently. The customer search bug I mentioned at the top of this article was, specifically, this:

a small customer database. Some names were entered through a modern Arabic keyboard. Others were migrated in 2017 from a legacy system that stored them in Arabic Presentation Forms. Look identical, don't they? NAME (as rendered) ENCODING IN STORAGE ACCOUNT محمد علي modern Unicode EGP-9341-0021 ﻣﺤﻤﺪ ﻋﻠﻲ presentation forms EGP-2014-7732 سارة أحمد modern Unicode EGP-9341-0044 ﺳﺎﺭﺓ ﺃﺣﻤﺪ presentation forms EGP-2014-8810 search by name: apply NFKC normalisation Type a name into the search. Or click the seed button below to paste a query that exactly matches one of the records, the way a customer-service agent might copy the name from an SMS. Without normalisation, you get the records that share your query's encoding and miss the others. Tick the box and the search runs after NFKC, which collapses the presentation forms back to their abstract letters. The fix is one Unicode call. It took a quarter to find because the bug presents as "customer not in system", and customer-service tickets do not arrive with codepoint dumps attached. paste modern-Unicode query paste presentation-forms query

And if you want to know what the world looks like when software skips all of this, the shaping engine, the bidi algorithm, the whole apparatus, you do not have to imagine it, because an enormous amount of software still skips all of it: