“Data is multidisciplinary” is my mantra—it’s 2025, and I’ve now worked 20 years in every possible flavour of data: data visualization, open data advocacy, data pipelines in healthcare, data-driven national-scale services, AI innovation, and more. Whatever the application or project, my take on data literacy is that it is the fundamental ability to challenge your own assumptions about the data you have (or don’t), the appropriateness of using it, and the ethics of your application, and to ask yourself: is there a different way, perhaps? Here is a gallery of some of my most treasured eureka moments working with data.
You have a clear purpose but the data isn’t quite right for it
I regularly walk through Turnpike Lane Bus Station; there’s a pretty big sign pointing to it. It’s a major node for North London public transport, and yet, a few years back, I found out that it did not exist… in the data, at least. I used to run the official dataset of bus stops for the UK Government—a rather obscure dataset that made its way into powering a few popular journey planners like Google Maps and Citymapper.
This was 2020, during COVID, and one of my colleagues wanted a list of all bus stations in the country in order to send out posters advertising social distancing. While the dataset contained over 500,000 points, it did not contain this bus station. The problem was data definitions: the dataset listed bus stops, which are not the same thing as bus stations. While the words “bus station” have a common-sense meaning in our minds as a collection of bus stops, that meaning was not translated into the dataset. The individual bus stops making up the bus station are all in the dataset, but there was no way to group them together other than inferring that they belong to the same bus station from their proximity.
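That kind of proximity-based inference can be sketched with a simple union-find over pairwise distances. Everything below is illustrative: the coordinates, the 150-metre threshold, and the function names are my own assumptions, not the real dataset's schema, and the O(n²) loop would need a spatial index before it could touch 500,000 points.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000 * math.asin(math.sqrt(h))

def cluster_stops(stops, threshold_m=150):
    """Group stops lying within threshold_m of each other (union-find).

    `stops` is a list of (lat, lon) tuples; returns a list of clusters,
    each a list of indices into `stops`. Quadratic in len(stops), so
    this is a sketch, not a national-scale pipeline.
    """
    parent = list(range(len(stops)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(stops)):
        for j in range(i + 1, len(stops)):
            if haversine_m(stops[i], stops[j]) < threshold_m:
                parent[find(i)] = find(j)  # merge the two groups

    clusters = {}
    for i in range(len(stops)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Hypothetical coordinates: three stands near one station, one stop far away.
stops = [
    (51.5904, -0.1030),  # stand A (illustrative)
    (51.5906, -0.1027),  # stand B (illustrative)
    (51.5903, -0.1033),  # stand C (illustrative)
    (51.5100, -0.1300),  # an unrelated stop roughly 9 km south
]
groups = sorted(cluster_stops(stops), key=len, reverse=True)
```

The catch, of course, is that proximity is a heuristic: two stops across a busy junction may be closer together than two stands of the same bus station, which is exactly why the grouping belongs in the data definition rather than in downstream guesswork.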
I found other interesting issues in the dataset. Some were easy to spot, like bus stations in the middle of the North Sea. Other stations were a few metres away from their real location, which would not have a huge impact unless we were trying to use the dataset to get self-driving buses to park automatically. So, why weren’t these groupings captured in the first place? The process that created and populated the data never asked itself: “are we capturing everything we need about this bus stop?” As a result, the dataset wasn’t quite fit for the purpose we were looking to deliver. Translating common-sense concepts into data definitions is a major element of making sure that a dataset is usable and stays current, and having a process that allows that question to emerge is an ingredient of good data management. At the time, to my surprise, we had neither.
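The easy-to-spot errors lend themselves to an automated sanity check. A minimal sketch, assuming a crude bounding box for Great Britain (the bounds and function name are mine, not the dataset's): it catches gross errors such as (0, 0) “null island” points or swapped latitude/longitude, but not a stop that is a few metres off, and since a bounding box still contains plenty of open sea, a real pipeline would test points against a coastline polygon instead.

```python
def flag_suspect(stops, lat_range=(49.8, 60.9), lon_range=(-8.7, 1.8)):
    """Return indices of stops falling outside a crude GB bounding box.

    The default bounds roughly cover Great Britain; they are an
    illustrative assumption, not an official definition.
    """
    bad = []
    for i, (lat, lon) in enumerate(stops):
        inside = lat_range[0] <= lat <= lat_range[1] and lon_range[0] <= lon <= lon_range[1]
        if not inside:
            bad.append(i)  # gross error: null island, swapped coords, etc.
    return bad

# Hypothetical examples: a plausible London stop, a null-island point,
# and a stop with latitude and longitude accidentally swapped.
suspects = flag_suspect([
    (51.5904, -0.1030),
    (0.0, 0.0),
    (-0.1030, 51.5904),
])
```

Checks like this are cheap to run on every update; the harder failures, like a stop a few metres from its true kerb, only surface when someone tries to use the data for a purpose its creators never imagined.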
Disappointingly, data that may appear suitable for your purpose is not always so; and if you are in the fortunate position of owning a dataset, always ask: are there any use cases that would be out of scope for this dataset, and is it worth expanding?
Image credit: Giuseppe Sollazzo
Sometimes the data is really incomplete or missing
W.E.B. Du Bois is widely remembered for his infographics about the conditions of African Americans at the end of the nineteenth century. What I always hail him for is having shown that a lack of data should not stop a good data project, and that sometimes the hard work is putting the data together. When he realised the US Census lacked data about African Americans, he assembled his own survey and team, collecting data that resulted in his now famous infographics. Incomplete or missing data is something I’ve regularly had to cope with, deciding whether to pursue the initial project or pivot to something different. Once again, during the pandemic, we were trying to see if there was a way to check the density of people on pavements, and went down a rabbit hole trying to find accurate measurements of pavements for the whole of the UK—an impossible task. This is when I realised that using a proxy would have given informative enough results, as The Economist did in the chart below, created by collecting, over time, Google Places “busy times” for major points of interest in major cities. Simple, effective, but based on nowhere close to “complete” data.
Sometimes missing data should make us reflect. In one of my projects while working for public healthcare in the UK, a team of dermatologists came asking if my team could develop an AI algorithm to grade a type of skin condition. The intent was very positive: in their clinical research, they had realised that human medics were biased, resulting in less accurate grading for people who are not white, and they were looking to AI to help correct that bias. We found that the collection of images we could find of this condition was itself biased, so any AI model trained on it would not have addressed the issue. The image below shows what dermatologists call the Fitzpatrick scale—the standard classification of skin tone.