UK Biobank health data keeps ending up on GitHub

Why This Matters

The repeated exposure of UK Biobank's sensitive health data on public platforms like GitHub highlights ongoing challenges in data privacy and security within the tech industry. It underscores the importance of robust data governance and the need for better safeguards to prevent unauthorized sharing of confidential research information, protecting both participants and institutions.

Key Takeaways

To build this webpage, I used data from the github/dmca repository, where GitHub publishes the full text of every DMCA takedown notice it receives. When a rights holder asks GitHub to remove content that infringes their copyright, the notice is posted publicly as a Markdown file in this repository. According to The Guardian, UK Biobank has used this process to request the removal of files or repositories that contain (or that it believes contain) participant data covered by its data access agreements.

To identify UK Biobank-related notices, I match filenames containing the slug "uk-biobank" (the convention GitHub uses when naming notice files). As a fallback, I also search the full text of every other notice file for the phrases "UK Biobank" or "UKBiobank" (case-insensitive) to catch notices filed under different slugs, such as those submitted on behalf of UK Biobank. From each matching notice, I extract the filing date (parsed from the filename, which follows GitHub's YYYY-MM-DD-slug.md convention) and all GitHub repository URLs mentioned in the notice body. URLs pointing to GitHub's own infrastructure (e.g. github.com/contact or github.com/site) are excluded.
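The matching and extraction steps above could be sketched roughly as follows. This is a minimal Python illustration, not the script behind this page: the function names, the regular expressions, and the set of excluded path prefixes are all assumptions made for the example.

```python
import re
from datetime import date

# Assumed filename shape: YYYY-MM-DD-slug.md, per the convention described above.
FILENAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-(.+)\.md$")
# Assumed URL shape: github.com/<owner>[/<repo>]; real notices may need a looser pattern.
GITHUB_URL_RE = re.compile(
    r"https?://github\.com/([A-Za-z0-9-]+)(?:/([A-Za-z0-9._-]+))?"
)
# Matches "UK Biobank" and "UKBiobank", case-insensitively.
PHRASE_RE = re.compile(r"uk ?biobank", re.IGNORECASE)
# Hypothetical exclusion list for GitHub's own pages (github.com/contact, github.com/site).
EXCLUDED_OWNERS = {"contact", "site"}


def is_ukb_notice(filename, body):
    """True if the filename carries the slug or the body mentions UK Biobank."""
    return "uk-biobank" in filename.lower() or PHRASE_RE.search(body) is not None


def filing_date(filename):
    """Parse the filing date from the filename; None if the name does not fit."""
    match = FILENAME_RE.match(filename)
    return date.fromisoformat(match.group(1)) if match else None


def repo_urls(body):
    """Return the sorted, de-duplicated GitHub URLs found in a notice body."""
    urls = set()
    for match in GITHUB_URL_RE.finditer(body):
        owner, repo = match.group(1), match.group(2)
        if owner.lower() in EXCLUDED_OWNERS:
            continue  # skip links to GitHub's own infrastructure
        repo = (repo or "").rstrip(".")  # drop sentence-ending punctuation
        if repo:
            urls.add(f"https://github.com/{owner}/{repo}")
        else:
            urls.add(f"https://github.com/{owner}")
    return sorted(urls)
```

Taking the date from the filename rather than the notice body keeps the parser independent of how each notice happens to be worded.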

For each unique GitHub username found in the notices, I query the GitHub REST API (GET /users/{username}) to retrieve the user's public profile, specifically the self-reported location field. This is a free-text string that users enter voluntarily. It may be a city, a country, a university name, or left blank entirely. Deleted accounts return a 404 and are excluded from further analysis.
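A profile lookup of this kind might look like the sketch below, which uses only the Python standard library. The function name, the injectable opener parameter (included so the function can be exercised without network access), and the User-Agent string are assumptions for illustration. An unauthenticated client is also subject to GitHub's rate limits, which this sketch does not handle.

```python
import json
import urllib.error
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_location(username, opener=urllib.request.urlopen):
    """Return (found, location) for a GitHub username.

    found is False when the API answers 404 (for example, a deleted account).
    location is None when the profile field is empty or unset.
    """
    request = urllib.request.Request(
        f"{API_ROOT}/users/{username}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "location-sketch",  # hypothetical client name
        },
    )
    try:
        with opener(request) as response:
            profile = json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False, None
        raise  # other errors (rate limiting, server faults) are left to the caller

    location = profile.get("location")
    if isinstance(location, str) and location.strip():
        return True, location.strip()
    return True, None
```

Returning a pair keeps "account not found" distinct from "account exists but has no location", which matters because the two cases are handled differently above.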

I derive countries from the raw location strings by hand. When a user's GitHub profile does not include a location, I try to determine their country by inspecting the rest of their profile and any associated email address domains. This process is inherently imperfect: some locations are ambiguous (e.g. "Cambridge" could refer to the UK or the US), and many users do not provide any location at all. Of the 170 unique developers in the dataset, only 75 (about 44%) could be resolved to a country.

The data is regularly refreshed by re-running the collection script against the latest state of the github/dmca repository. This page does not make any claims about the content of the targeted repositories, including whether they contained actual participant data, derived datasets, analysis code, or just documentation. It reports only what is visible in the public DMCA notices filed by UK Biobank.