A years-long Turkish alphabet bug in the Kotlin compiler

Muhammed Demirbaş couldn’t have been more spot on in his investigation and assessment of the compiler bug. Since Kotlin is open source, he was able to search the compiler’s code for the exact line where that “Unknown compiler message tag” string appears:

    val qNameLowerCase = qName.toLowerCase()
    var category: CompilerMessageSeverity? = CATEGORIES[qNameLowerCase]
    if (category == null) {
        messageCollector.report(ERROR, "Unknown compiler message tag: $qName")
        category = INFO
    }

So what does this code do, and why does it sometimes go wrong?

The code is part of a class named CompilerOutputParser, and is responsible for reading XML files containing messages from the Kotlin compiler. Those files look something like this:

    <INFO>This is a message from the compiler about a line of code.</INFO>

At the time, the tags in this file were named in all-caps: <ERROR>, <INFO>, and so on (source: GitHub), like the HTML 1.0 webpages your grandpa used to write.

In the Kotlin code we just saw, qName is the name of an XML tag that we’re parsing from this file. If we’re looking at an <INFO> tag, the qName is “INFO.” To determine what the message means, the CompilerOutputParser next looks up that string in its CATEGORIES map to find its corresponding CompilerMessageSeverity enum entry.

But wait: the keys in the CATEGORIES map are lower case! (source: GitHub)

    val CATEGORIES = mapOf(
        "error" to CompilerMessageSeverity.ERROR,
        "info" to CompilerMessageSeverity.INFO,
        …
    )

Instead of searching for “INFO,” we need to search for “info.” That’s why the code we looked at calls qName.toLowerCase() before looking it up in the CATEGORIES map. Here’s the code again, or at least the relevant lines:

    val qNameLowerCase = qName.toLowerCase()
    var category: CompilerMessageSeverity? = CATEGORIES[qNameLowerCase]

And that’s where the bug sneaks in. If your computer is configured in English, "INFO".toLowerCase() is "info", just like we wanted.
But if your computer is configured in Turkish, "INFO".toLowerCase() turns out to be "ınfo". Notice the difference? In the Turkish version, the lower case letter ‘ı’ has no dot above it. The tiny discrepancy might be hard for a human to spot, but to a computer, these are two completely different strings. The dotless "ınfo" string isn’t one of the keys in the CATEGORIES map, so the code fails to find the correct CompilerMessageSeverity for our <INFO> tag, and complains that “INFO” must be a completely unknown category of message.

So why does calling toLowerCase() on a Turkish computer produce this strange result? Muhammed already provided part of the answer in his reply to Mehmet Nuri’s forum post. Turkic languages have two versions of the letter ‘i’: an ‘i’ with a dot, as in the word insan (human), and a separate ‘ı’ without a dot, as in the word ırmak (river). What’s more, the dotted/dotless distinction is also preserved in the upper case letters: capital ‘i’ is ‘İ’, as in insan → İnsan, and capital ‘ı’ is ‘I’, as in ırmak → Irmak.

That uppercase dotless ‘I’ is the same one we use in English. As a result, the single Unicode character I (U+0049) has two different lower case forms: dotted i (U+0069) in English, and dotless ı (U+0131) in Turkish.

For Kotlin’s toLowerCase() function, that’s a problem! When toLowerCase() sees an I character, which lower case form should it use? The lower case form of the Turkish word IRMAK should be ırmak, with no dot. But the lower case form of the English word INFO, which starts with exactly the same character, should be info, with a dot.

When you ask your computer to convert text to lower case, you should technically also specify the alphabet rules to use: English, Turkish, or something else entirely. But that’s a lot of hard work, so if you don’t specify, many systems (including, in those days, Kotlin’s toLowerCase() function) will just use the language settings you chose when you set up your computer.
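You don’t need a Turkish computer to see this for yourself. The sketch below passes the locale explicitly instead of relying on the machine’s settings. It assumes Kotlin 1.5 or later on the JVM, where the standard library offers lowercase(Locale). The names severityByTag, lowerCaseTag and lookUpSeverity are made up for this illustration and are not part of the compiler, and the map holds plain strings rather than the real CompilerMessageSeverity entries.

```kotlin
import java.util.Locale

// Hypothetical stand-in for the compiler's CATEGORIES map (plain strings as values).
val severityByTag = mapOf("error" to "ERROR", "info" to "INFO")

// Lower-cases a tag name using the casing rules of an explicitly chosen locale.
fun lowerCaseTag(tagName: String, locale: Locale): String = tagName.lowercase(locale)

// Looks up a tag's severity after lower-casing it with the given locale's rules.
fun lookUpSeverity(tagName: String, locale: Locale): String? =
    severityByTag[lowerCaseTag(tagName, locale)]

fun main() {
    val turkish = Locale.forLanguageTag("tr-TR")

    println(lowerCaseTag("INFO", Locale.ROOT))   // info
    println(lowerCaseTag("INFO", turkish))       // ınfo (dotless ı, U+0131)

    println(lookUpSeverity("INFO", Locale.ROOT)) // INFO
    println(lookUpSeverity("INFO", turkish))     // null: "ınfo" is not a key in the map
}
```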
That’s why "INFO".toLowerCase() is "ınfo" when you run it on a Turkish machine, and that’s why IntelliJ installations in Turkey couldn’t match the Kotlin compiler’s messages to the lowercase "info" string they were expecting to see.

But in 2016, all of that was still just a bug ticket waiting to be worked on. Muhammed Demirbaş had identified the right place to start the search, but the YouTrack issue linked to his findings was just one of hundreds of tickets in the Kotlin project backlog. With only a tiny number of people reporting that they were affected by the bug, a more thorough investigation was never a priority. That would all change with the release of coroutines two years later, when the unassuming little bug wormed its way even deeper into the foundations of the Kotlin compiler.
