The Missing 11th of the Month
Source: xkcd. Image licensed under CC-BY-NC.
On November 28th, 2012, Randall Munroe published an xkcd comic: a calendar in which the size of each date is proportional to how often that date is referenced by its ordinal name (e.g. "October 14th") in the Google Ngrams database since 2000. Most of the large days are what you would expect: July 4th, December 25th, the 1st of every month, the last day of most months, and of course a September 11th that shoves its neighbors into the margins. Few days look smaller than the typical size; February 29th is a tiny speck, for instance. But if you stare at the comic long enough, you may get the impression that the 11th of most months is unusually small. The title text of the comic concurs, reading "In months other than September, the 11th is mentioned substantially less often than any other date. It's been that way since long before 9/11 and I have no idea why." After digging into the raw data, I believe I have figured out why.
First, I confirmed that the 11th is actually interesting. There are 31 days and one of them has to be the smallest. Maybe the 11th isn't an outlier; it's just on the smaller end and our eyes are picking up on a pattern that doesn't exist. To confirm this is real, I compared actual numbers, not text size. The Ngrams database returns the total number of times a phrase is mentioned in a given year, normalized by the total number of books published that year. The database only goes up to the year 2008, so it is presumably unchanged from when Randall queried it in 2012.
I retrieved the count for each day of the year (January 1st, January 2nd, etc.) and took the median over the months for each day (the median of January 1st, February 1st, etc.) for each year. This summarizes how often the 11th and the other 30 days of the month appear in a given year. Using the median prevents outlier days like July 4th from dragging up the average for their corresponding ordinal (the 4th). Only if an ordinal is unusual for at least 6 of the 12 months will its median appear unusual.
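The median-over-months step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis: the names `MONTH_LENGTHS`, `ordinal`, and `median_by_ordinal` are hypothetical, and the `toy` frequencies are made up rather than taken from Ngrams.

```python
from statistics import median

MONTH_LENGTHS = {
    "January": 31, "February": 29, "March": 31, "April": 30,
    "May": 31, "June": 30, "July": 31, "August": 31,
    "September": 30, "October": 31, "November": 30, "December": 31,
}


def ordinal(day):
    """Return the day with its English ordinal suffix, e.g. 11 -> '11th'."""
    if 11 <= day <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def median_by_ordinal(freq):
    """Collapse one year's per-date frequencies into one median per ordinal.

    `freq` maps phrases such as 'July 4th' to that year's frequency.
    Dates that do not exist (e.g. 'February 30th') are simply absent.
    """
    medians = {}
    for day in range(1, 32):
        values = [freq[f"{month} {ordinal(day)}"]
                  for month in MONTH_LENGTHS
                  if f"{month} {ordinal(day)}" in freq]
        if values:
            medians[day] = median(values)
    return medians


# Toy data (made up, not Ngrams output): every date gets 1.0,
# the 11th gets 0.5 in every month, and July 4th is a large outlier.
toy = {}
for month, length in MONTH_LENGTHS.items():
    for day in range(1, length + 1):
        toy[f"{month} {ordinal(day)}"] = 0.5 if day == 11 else 1.0
toy["July 4th"] = 50.0

medians = median_by_ordinal(toy)
mean_4th = sum(toy[f"{m} 4th"] for m in MONTH_LENGTHS) / 12
print(medians[4], mean_4th, medians[11])
```

In the toy data, the single July 4th spike pulls the mean for the 4th well above 1.0 while its median stays at 1.0; the 11th, which is low in every month, keeps a low median.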
I took the median for each ordinal over the years 2000-2008. The graph below is a histogram of the 31 medians. The 1st of the month stands out far above them all and the 15th just barely distinguishes itself from the remainder. Being the first day and the middle day of the month, these two make sense. However, the 11th stands out as the lowest by a significant margin (p-value < 0.05), with no immediate explanation.
This deficit has been around for a long time. Below are all the ordinals for every year in the data set, 1800-2008. The data is smoothed over eleven years to flatten out the noise. Even at the beginning, the 11th is significantly lower than the main group. This mild deficit continues for a few decades and then something weird happens in the 1860s: the 11th suddenly diverges from its place just below the pack. The gap between the 11th and the ordinary ordinals expands rapidly until the 11th is about half of what one would expect it to be throughout the first half of the twentieth century. The gap shrinks in the second half of the twentieth century, but still persists at a smaller level until the end.
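The post does not say which smoothing method was used. One plausible reading of "smoothed over eleven years" is a centered moving average, sketched here as an assumption; the function name `smooth` and the choice to shorten the window at the two ends of the series are mine, not the author's.

```python
def smooth(values, window=11):
    """Centered moving average; the window is truncated at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Made-up series: a single one-year spike gets spread across its neighbors.
spike = [0.0] * 5 + [11.0] + [0.0] * 5
print(smooth(spike))
```

With an eleven-year window, a one-year blip contributes only one eleventh of its height to any smoothed point, which is why sustained trends like the widening gap survive the smoothing while year-to-year noise does not.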
Astute graph readers will notice that something else weird is going on. There are four other lines that are much lower than they should be. From highest to lowest, they are the 2nd, the 3rd, the 22nd, and the 23rd. They were even lower than the 11th from 1800 until the 1890s. However, starting around 1900, their gaps started shrinking even as the 11th diverged, until the gap disappeared completely in the 1930s. There is an interesting story there, but because their effect doesn't persist to the present, I'll continue to focus on the 11th and leave the others for a future post.
Typographical hijinks
When I began this study, I was hoping to find a hidden taboo against holding events on the 11th or a typographical bias against the shorthand ordinal. Alas, the reason is far more mundane: a numeral 1 looks a lot like a capital I, a lowercase l, or a lowercase i in most of the fonts used for printing books. An 11 also looks like an n, apparently. Google's algorithms made mistakes when reading the 11th from a page, interpreting the ordinal as some other word.
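To get a feel for how many ways the 11th can be misread, the lookalike characters named above can be combined mechanically. This is a rough sketch, not the method used to find the errors: `LOOKALIKES` and `misread_variants` are hypothetical names, and the list covers only the confusions mentioned in this post.

```python
from itertools import product

# Characters the post says a printed "1" can be mistaken for (plus "1" itself).
LOOKALIKES = ("1", "I", "l", "i")


def misread_variants(suffix="th"):
    """List candidate misreadings of '11' + suffix, excluding the correct one."""
    variants = ["".join(pair) + suffix for pair in product(LOOKALIKES, repeat=2)]
    variants.remove("11" + suffix)   # drop the correct reading
    variants.append("n" + suffix)    # the whole "11" read as a single n
    return variants


print(misread_variants())
```

Each of these strings could then be searched for after a month name (e.g. "March nth") to see which misreadings actually occur in the scanned text.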