Transcription
Chandler’s book includes population data from 2250 BC to AD 1975 in various charts and tables. The book contains 656 pages measuring 9×5.5 inches and is divided into multiple sections, including Sources and Methods, Continental Tables and Maps (highlighting locations of major cities as illustrated in Fig. 4), Data Sheets for Ancient Cities (the main tables of the book, shown in Fig. 1), Tables of the World’s Largest Cities, and Whereabouts of Unfamiliar Cities. Each page in the Data Sheets for Ancient Cities section (Fig. 1) contains 15 to 30 data points. These pages are divided into four columns: (1) data year, (2) the population value (underlined values are Chandler’s estimates), (3) text describing the origin of the population estimate, and (4) citation information for each entry.
Figure 4: Sample Chandler Map. Continental map illustration from Chandler’s book, located in the Continental Tables and Maps section of the text. Although useful in locating some cities, the image quality of these maps is variable even in the original text and they provide approximate locations only.
As with any digitization project, a significant component of the work is converting the printed text (in this case a hardcover book) into digital format. There are several ways this task could be done. Because the Chandler book is 656 pages long, its size warranted use of a Kirtas machine. A Kirtas machine uses optical character recognition (OCR) to convert printed text into an encoded format. An OCR system can convert text into a Portable Document Format (PDF) file, which can be manipulated using a word processing program. This differs from a scanner, which converts print media to a picture that cannot be readily manipulated.
We had planned to use a Kirtas machine to convert the printed text to digital format. However, because the font of the printed book was not easily recognized by the Kirtas machine and the quality of the printed pages was variable (e.g., Fig. 1), none of the OCR software we tested (Microsoft OneNote, Adobe Acrobat Pro, and Free OCR) was able to convert the printed text accurately. After multiple attempts with the Kirtas machine, this approach was discarded and the text was manually transcribed into Microsoft Excel (Fig. 3). In total, 1,746 city locations were originally transcribed and then checked twice by research assistants for transcription errors and accuracy. If entries did not match in all three cases, we referred back to the original documents for assessment and amendment. The final Chandler dataset contains 1,599 city locations, since some originally transcribed cities were later combined or could not be geocoded accurately.
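The three-way comparison described above (one transcription plus two checks) can be illustrated with a minimal sketch. The source does not describe how the comparison was carried out in practice, so the dictionary layout, the `(city, year)` key, and the function name `find_disagreements` are all hypothetical, and the sample values in the usage note are invented placeholders rather than data from the book.

```python
def find_disagreements(original, first_check, second_check):
    """Return the sorted keys whose value is not identical in all three passes.

    Each argument is assumed (hypothetically) to be a dict mapping a
    (city, year) tuple to the transcribed population value. A key that is
    missing from any pass is also flagged, because .get() yields None for it.
    """
    all_keys = set(original) | set(first_check) | set(second_check)
    flagged = []
    for key in sorted(all_keys):
        values = {original.get(key), first_check.get(key), second_check.get(key)}
        if len(values) != 1:  # more than one distinct value: passes disagree
            flagged.append(key)
    return flagged
```

For example, with placeholder entries `{("City A", 100): 50000, ("City B", 200): 30000}` in the first two passes and a third pass that records 80000 for `("City B", 200)`, the function returns `[("City B", 200)]`, marking the one entry that would be referred back to the original document.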
We received Modelski’s dataset directly from the author in the digital text format depicted in Fig. 2. The book itself, which consists of 245 pages, contains descriptive text recounting shifts in population values and their origins. We converted the Microsoft Word tables into Excel tables using a format similar to that of the Chandler dataset. This format includes country names along the y-axis and time periods across the x-axis, as depicted in Supplementary Fig. 1.
Geolocation
Geocoding is the process of assigning geo-referenced coordinates, or longitude and latitude values, to a record to identify its location on Earth’s surface. It is often the first step in any spatial analysis when the data are not already geolocated18. Online geocoding platforms, such as CartoDB or the Google Places API (Application Programming Interface), can be used to process large amounts of data when the entries can be matched in batch mode. This process allows all locations (up to a pre-determined limit for some geocoding services, such as 10,000 queries per day for the Google Places API) to be submitted as a single batch, rather than interactively or individually19,21.
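As a rough illustration of how a per-day query limit shapes batch submission, the sketch below splits a list of location names into batches that each fit within a daily quota. It does not call any geocoding service; the function name `split_into_daily_batches` and the default limit (taken from the 10,000 queries per day figure quoted above) are assumptions made for illustration only.

```python
def split_into_daily_batches(locations, daily_limit=10_000):
    """Split a list of location queries into batches no larger than daily_limit.

    Each returned batch could be submitted on a separate day so that a
    service's per-day query quota is never exceeded.
    """
    if daily_limit < 1:
        raise ValueError("daily_limit must be a positive integer")
    return [
        locations[start:start + daily_limit]
        for start in range(0, len(locations), daily_limit)
    ]
```

Under this sketch, a list the size of the original Chandler transcription (1,746 locations) fits in a single batch at a 10,000-query limit, whereas a hypothetical limit of 500 queries per day would require four batches.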
Geocoding or geolocating tools have been used and discussed most frequently in relation to studies in medical fields, such as public health or epidemiology19–21. In these studies, geocoding is often done at the individual address level. Geocoding at the address level allows accuracy validation techniques and procedures to be applied, since address locations can be checked by multiple geocoding services. When geocoding at the address level, accuracy can be measured by comparing the distances between the points returned by different geocoding methods22,23.
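One common way to compute the distance between two geocoded points is the haversine formula, which gives the great-circle distance on a sphere. The sketch below is illustrative only: the cited studies may use other distance measures, and the function name `haversine_km` and the choice of a mean Earth radius are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; assumes a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

If two geocoding services return coordinates for the same record, a small `haversine_km` value between them suggests agreement, while a large value flags the record for manual review. The threshold separating the two is a judgment call that this sketch does not attempt to set.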
Here, we geocoded population data for cities using a single, central latitude and longitude point with 2 to 8 significant figures, depending on the geocoding database used. Urban extent data, or polygons defining the city boundary rather than city center points, are not included in our dataset due to a lack of available data. This lack of area extent information may limit the type of analysis possible using this dataset, but the point estimates of population size are a first step towards developing a more comprehensive dataset of urban extent. For example, users of this dataset could estimate area extents based on assumptions about population densities and land use, but this adds another level of uncertainty. Ultimately, the quality of the final geocoded dataset is partially determined by the quality and limitations of the original data.
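To make the area-extent idea concrete, the sketch below derives an area from a point population estimate and an assumed uniform population density, then reports the radius of a circle with that area centered on the city point. This is not a method from the dataset itself: the function name `estimate_extent`, the uniform-density assumption, and the circular shape are all simplifications introduced here for illustration.

```python
import math


def estimate_extent(population, density_per_km2):
    """Return (area_km2, radius_km) implied by a population and an assumed density.

    area   = population / density
    radius = sqrt(area / pi), i.e. a circle of equal area around the city point
    """
    if density_per_km2 <= 0:
        raise ValueError("density_per_km2 must be positive")
    area_km2 = population / density_per_km2
    radius_km = math.sqrt(area_km2 / math.pi)
    return area_km2, radius_km
```

For a hypothetical city of 100,000 people at an assumed 10,000 people per square kilometre, this gives an area of 10 km² and an equivalent radius of about 1.78 km. Because the result scales directly with the density assumption, any error in that assumption carries straight through to the estimated extent.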