Earlier this year, Monadic DNA kicked off an experiment to demonstrate that people can access and analyze their genetic data with anonymity and privacy. Monadic DNA collected saliva samples from thirty encrypted genomics pioneers at an event in Denver. These participants later used a Web app to claim their genotyping results using a unique kit ID and a self-selected PIN. The app guided users through uploading their data to encrypted storage powered by Nillion's multi-party computation (MPC) technology. From there, users could generate insights without exposing their raw genomic data. Computations were performed server-side using MPC, without decryption, and the results were then processed and revealed locally. This experiment offers a practical model for handling sensitive genetic information in a way that prioritizes privacy, security, and data sovereignty.

Sample Collection

The sample collection event happened in late February during the ETHDenver conference. We chose Terminal Bar at Union Station for its central location and copious outdoor seating during some excellent weather. A Luma event during a busy conference and some social media posts were enough to draw a decent crowd. We underestimated the turnout, so we didn't end up having enough sample collectors. It was good to have our own designated area so we wouldn't unduly worry anyone with our medical equipment. 🙂

The lucky thirty who made it on time were given a standard sample collector and asked to spit into it until they hit a marked level. We provided latex gloves to everyone involved, and all refuse was collected in a bag for proper disposal. Before providing a sample, each participant had to fill in a form agreeing to some legalese, record their kit ID, and pick a four-digit PIN for the later data claim. They also had the option to provide an email address, Signal ID or Telegram ID in case they wanted to receive updates.

Participants couldn't eat or drink for thirty minutes before providing a sample but, for humane reasons, were given a beverage of their choice right after they were done with their sample collectors. Those who wished could also claim a POAP to mark their participation in the event. Surprisingly, most participants were highly interested in the POAP, and some went to decent lengths afterwards to get theirs once they realized they had missed the bit about the virtual collectible.

Passersby couldn't help being interested in the event after seeing the equipment and general fanfare. Unfortunately, we had not accounted for this in advance and hadn't budgeted enough collectors to include walk-ins. It was good to have a team of three organizing the event, as things would have become chaotic otherwise.

Interestingly, most participants had never had their DNA analyzed before due to privacy and legal concerns. Even though this was run as an experimental event, it was heartening to see these individuals trusting the process enough to participate.

After all was said and done, it was time to gather the samples into a box and ship them to the lab. In spite of the "biological materials" label on the box, UPS was more than happy to ship it for under fifty dollars. During transit both ways, the box of sample collectors had a tracker in it so that if anything obviously untoward were to happen, we'd know and could disclose it to the participants.

Lab Processing

We partnered in advance with Autogen for the lab portion of the exercise.
We had spoken to a cross-section of labs, from large commercial providers to major university labs to smaller operations, but Autogen was the easiest to work with for our batch size and timelines. Interestingly, every provider we spoke to was happy to work with anonymized data. Most had to retain the actual samples for some period of time to comply with laws and regulations, but none strictly needed metadata beyond a manifest file with kit IDs and customer IDs.

The anonymization was important because some degree of assumed trust in the lab is unavoidable. When biology meets code, or generally when meatspace meets cyberspace, perfect guarantees are no longer possible. A rogue lab could try to deanonymize data, especially by colluding with genomics apps, so a set of social and legal mitigations will have to be put in place for the longer haul.

We opted to work with the Global Screening Array for genotyping since it fit best with our budget. This gave each participant ~500k markers. With a slightly higher budget, we could have gone with the Global Diversity Array for ~1.5m markers each, but the marginal benefit wasn't significant for a limited exercise. Note that we were not going for full genome sequencing, which reads all ~3 billion base pairs of a participant's genome. The cost would be significantly higher, not just for sequencing but also for storage and analysis, and there would not be a clear benefit for a consumer product given our knowledge of the human genome today. Even the much more reasonably priced blended genome-exome sequencing offered by Broad Clinical Labs would have been excessive for now.

The actual processing took about two months, as our sample size was apparently too small to be batched quickly. In a production setting, a two to three week turnaround is possible with enough ongoing volume. Nevertheless, even after some passage of time, the participants were still excited to receive their data and analysis.

Data Claim and Analysis

The data itself was shared by the lab as a collection of raw data files along with some metadata files. Each data file is tagged with its kit ID for identification. The industry standard seems to be sharing data with regular encryption over Amazon S3 or Google Drive. At a small scale, there isn't yet any appetite for calling custom APIs or employing any nuanced encryption.

The raw data consists of rows of values for known sites on the genome called SNPs (single nucleotide polymorphisms). The RSID column gives a known identifier for each site, while the "result" column gives its actual genotype value. The chromosome and position columns give the location of the site on the genome. NIH's dbSNP has a directory of SNPs and their associated information.

After copying the data over to our own secured DigitalOcean Spaces storage bucket, we deployed a lightweight microservice in front of the data to let participants claim their data using their kit ID and PIN, which they were able to do through our Web app at https://ethdenver.monadicdna.com/.

While the kit ID and PIN combination provides decent anonymity, it does not prevent people from trying to brute force PINs. Since kit IDs are generally sequential, an attacker who knows at least one kit ID (maybe their own) could try to pull others' data, thus weakening anonymity. If we had more time, we would have used something more sophisticated than a Google form so that users could instead use cryptographic keys to claim their data.
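To make the raw format concrete, here is a minimal sketch of parsing a claimed data file. It assumes a tab-separated layout with RSID, chromosome, position and result columns as described above; the exact headers, comment lines and no-call markers in a real lab export may differ.

```python
import csv
from dataclasses import dataclass

@dataclass
class SnpRecord:
    rsid: str        # known identifier for the site, e.g. "rs713598"
    chromosome: str  # chromosome the site sits on
    position: int    # position of the site on that chromosome
    result: str      # genotype call at the site, e.g. "AG"

def parse_raw_file(path: str) -> dict[str, SnpRecord]:
    """Parse a tab-separated raw genotyping file into a dict keyed by RSID.

    Column names here are assumptions for illustration; a real lab export
    may use different headers or include extra metadata lines.
    """
    records: dict[str, SnpRecord] = {}
    with open(path, newline="") as f:
        reader = csv.DictReader(
            (line for line in f if not line.startswith("#")),  # skip comment lines
            delimiter="\t",
        )
        for row in reader:
            rec = SnpRecord(
                rsid=row["RSID"],
                chromosome=row["chromosome"],
                position=int(row["position"]),
                result=row["result"],
            )
            records[rec.rsid] = rec
    return records
```

With ~500k markers per participant, the whole file fits comfortably in memory, which keeps the claim-and-upload flow simple.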
Once participants claimed their data, they could upload it back into the app for secure storage and computation. This video shows the overall user flow:

The data is stored and analyzed under encryption with multi-party computation (MPC) on Nillion's Private Storage platform. Once the data is encrypted on the user's device, it is uploaded, stored and queried in encrypted form without any need for decryption. Nillion Private Storage works much like MongoDB, with the data stored under a known schema and queried accordingly.

For the purposes of this exercise, the genotypes were encrypted locally before the data was sent to a dedicated microservice. This encrypted data was then sent using app-specific keys to a Private Storage cluster run by Nillion and its partners. The intermediate microservice was needed because, at the time of this exercise, an app-specific key allowed one to query all data stored under that key. Therefore, user-specific keys on each device were used to encrypt sensitive data while the app-specific keys were secured on the intermediary microservice. This prevented users from being able to query out each other's data while still keeping the data encrypted at all times.

User keys are stored in the browser's site-specific local storage for the app. While good enough for this experimental phase, we could have used a custom plugin or a hardware device to better secure the keys. The MPC cluster on which the data was secured needs just one honest node to prevent the data from being decrypted or handled maliciously, so a three-node cluster seemed fit for our purposes.

The analysis itself consisted of simple lookups on single genotypes. As this is still an experimental app, we did not want to calculate polygenic risk scores or do anything with a health and wellness angle without the proper legal and infrastructural guardrails in place. The app simply fetched SNP values from the encrypted database and accordingly gave the user some insight about their sense of taste, sleep, etc. Those who are curious can look at entries on SNPedia for examples of how combinations of SNPs can be used to make research-based inferences about people.

Discerning readers may wonder why we didn't simply run all storage and computation locally, without MPC. The fundamental purpose of this exercise was to prove that MPC is workable for genomic analysis and that it can be done asynchronously, without any active participation by the user. This particular app happens to be synchronous, but it can easily be extended to be fully asynchronous, with a fresh batch of insights calculated in the background by our back-end as well as by third parties. When the user opens the app, they then see all the insights together without any delay.

Once we announced that the data and app were ready, the participants were able to use the app and receive their insights without any usability issues. They were largely satisfied with the flow and the responsiveness and were eager for further insights. We are pleased to conclude that the experiment was a success!
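To summarize the flow above in code, here is a minimal, stand-in sketch: genotype values are encrypted client-side under a per-user key, and an insight is produced by fetching one record and mapping its genotype to a short description. Ordinary symmetric encryption (Fernet) is used purely as a placeholder for Nillion's MPC-backed Private Storage, where the lookup actually runs over encrypted data server-side, and the rsID and insight texts are placeholders rather than the SNPs and interpretations the app really uses.

```python
from cryptography.fernet import Fernet

# Placeholder mapping; see SNPedia for real genotype-to-trait associations.
TRAIT_TABLE = {
    "rs713598": {
        "CC": "taste insight for genotype C/C",
        "CG": "taste insight for genotype C/G",
        "GG": "taste insight for genotype G/G",
    },
}

def encrypt_genotypes(genotypes: dict[str, str], user_key: bytes) -> dict[str, bytes]:
    """Encrypt each genotype value under the user's key before upload.

    In the real app this payload is forwarded by an intermediary microservice
    (holding the app-specific keys) to the Private Storage cluster; here we
    simply return the ciphertexts keyed by rsID.
    """
    box = Fernet(user_key)
    return {rsid: box.encrypt(call.encode()) for rsid, call in genotypes.items()}

def insight(store: dict[str, bytes], rsid: str, user_key: bytes) -> str:
    """Fetch one record and map its genotype to a human-readable insight.

    Stand-in only: with MPC-backed storage the lookup happens on encrypted
    data server-side, and only the result is revealed locally.
    """
    box = Fernet(user_key)
    call = box.decrypt(store[rsid]).decode()
    call = "".join(sorted(call))  # normalize allele order, e.g. "GC" -> "CG"
    return TRAIT_TABLE.get(rsid, {}).get(call, "no interpretation available")

# Example usage
user_key = Fernet.generate_key()             # per-user key, kept in local storage
store = encrypt_genotypes({"rs713598": "GC"}, user_key)
print(insight(store, "rs713598", user_key))  # -> "taste insight for genotype C/G"
```

Keying the store by rsID mirrors the MongoDB-like schema that Private Storage queries run against.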
Future Work

It's hard to have perfect anonymity and privacy when physical labs and instruments are involved, at least until at-home sequencing becomes possible. We can, however, try to get the strongest possible guarantees. Projects such as geneinfosec use "wetware" solutions like molecular cryptography to protect genetic data. These technologies can be combined with software cryptography for better end-to-end anonymity.

For compliance and legal protections, people have to sign legal agreements with apps that carry some linkage to their real identities, creating an avenue for de-anonymization. Protocols using zero-knowledge proofs can be constructed to prove that a given individual signed a given agreement under a certain identifier without giving away extraneous sensitive information.

With active cooperation from labs, genetic data can be encrypted from near-source through storage and analysis. Keys can be generated upfront on user devices and then submitted along with user samples for end-to-end encryption (a rough sketch of this idea closes out this post).

Finally, our iOS app uses fully homomorphic encryption (FHE) to keep data encrypted and maintain privacy through storage and computation. By combining FHE and the MPC described above, we hope to create the most robust possible architecture for our apps. Stay tuned to learn more soon!
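As promised, here is a minimal sketch of the near-source encryption idea, assuming PyNaCl's sealed boxes: the participant generates a keypair on their device and submits only the public key with their sample, and the lab seals the results to that key so that only the participant can ever read them. The flow is hypothetical, not an existing Monadic DNA or lab API.

```python
from nacl.public import PrivateKey, SealedBox

# On the participant's device, before the sample is shipped:
user_private_key = PrivateKey.generate()       # never leaves the device
user_public_key = user_private_key.public_key  # submitted with the sample (hypothetical flow)

# At the lab, after genotyping (hypothetical step):
raw_results = b"rs713598\t7\t1234567\tGC\n"    # illustrative row, not real lab output
sealed = SealedBox(user_public_key).encrypt(raw_results)
# `sealed` can now pass through any storage or courier without exposing the data.

# Back on the participant's device:
plaintext = SealedBox(user_private_key).decrypt(sealed)
assert plaintext == raw_results
```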