
S3 Files and the changing face of S3


Almost everyone at some point in their career has dealt with the deeply frustrating process of moving large amounts of data from one place to another, and if you haven't, you probably just haven't worked with large enough datasets yet. For Andy Warfield, one of those formative experiences was at UBC, working alongside genomics researchers who were producing extraordinary volumes of sequencing data but spending an absurd amount of their time on the mechanics of getting that data where it needed to be: forever copying data back and forth and managing multiple inconsistent copies. It is a problem that has frustrated builders across every industry, from scientists in the lab to engineers training machine learning models, and it is exactly the type of problem that we should be solving for our customers.

In this post, Andy writes about the solution his team came up with, S3 Files: the hard-won lessons, a few genuinely funny moments, and at least one ill-fated attempt to name a new data type. It is a fascinating read that I think you'll enjoy.

–W

Part 1: The Changing Face of S3

First, some botany

It turns out that sunflowers are a lot more promiscuous than humans.

About a decade ago, just before joining Amazon, I had wrapped up my second startup and was back teaching at UBC. I wanted to explore something that I didn’t have a lot of research experience with and decided to learn about genomics, and in particular the intersection of computer systems and how biologists perform genomics research. I wound up spending time with Loren Rieseberg, a botany professor at UBC who studies sunflower DNA—analyzing genomes to understand how plants develop traits that let them thrive in challenging environments like drought or salty soils.

The botanists’ joke about promiscuity (the one that opens this post) was one reason why Loren’s lab was so much fun to work with. Their explanation was that human DNA has about 3 billion base pairs, and any two humans are 99.9% identical at a genomic level—all of our DNA is remarkably similar. Sunflowers, being flowers and not at all monogamous, have both larger genomes (about 3.6 billion base pairs) and roughly ten times more genetic variation between individuals.

One of my PhD grads at the time, JS Legare, decided to join me on this adventure and went on to do a postdoc in Loren’s lab, exploring how we might move these workloads to the cloud. Genomic analysis is an example of what some researchers have called “burst parallel” computing: analyzing DNA lends itself to massive amounts of parallel computation, and when you run it that way, each burst of work tends to be relatively short. This means that local hardware in a lab can be a poor fit, because you often don’t have enough compute to run fast analysis when you need it, and the compute you do have sits idle when you aren’t doing active work. Our idea was to explore using S3 and serverless compute to run tens or hundreds of thousands of tasks in parallel, so that researchers could run complex analysis very quickly and then scale down to zero when they were done.
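To make the burst-parallel shape concrete, here is a minimal sketch of that fan-out pattern using boto3. This is an illustration, not the system the lab actually built: the bucket, prefix, and Lambda function name are hypothetical, and a real pipeline would also need to collect results and handle failures.

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

def list_inputs(bucket, prefix):
    """Yield every object key under a prefix, handling pagination."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def launch_task(bucket, key):
    """Fire one asynchronous Lambda invocation for a single input file."""
    lam.invoke(
        FunctionName="genome-analysis-task",  # hypothetical function name
        InvocationType="Event",               # async: return without waiting
        Payload=json.dumps({"bucket": bucket, "key": key}).encode(),
    )

# Fan out: launch one task per sequencing file, then scale back to zero.
with ThreadPoolExecutor(max_workers=64) as pool:
    for key in list_inputs("sequencing-data", "runs/"):
        pool.submit(launch_task, "sequencing-data", key)
```

The appeal of this shape is that each async invocation returns immediately, so launching tens of thousands of tasks takes seconds, and once they finish there is no idle cluster left running in the lab.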

The biologists worked in Linux with an analytics framework called GATK4—a genomic analysis toolkit with integration for Apache Spark. All of their data lived on a shared NFS filer. In bridging to the cloud, JS built a system he called “bunnies” (another promiscuity joke) to package analyses in containers and run them at scale against data in S3, which was a real win for velocity, repeatability, and performance through parallelization. But a standout lesson was the friction at the storage boundary.
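That friction is easiest to see in the staging step itself. The sketch below (with hypothetical paths and bucket name; this is not the lab’s actual tooling) mirrors files from an NFS mount into S3 with boto3, the kind of copy the researchers had to do before any cloud analysis could start, and the kind that breeds inconsistent copies.

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")

NFS_ROOT = Path("/mnt/lab-filer/sequencing")  # hypothetical shared NFS mount
BUCKET = "lab-genomics-staging"               # hypothetical staging bucket

for path in NFS_ROOT.rglob("*.bam"):
    key = str(path.relative_to(NFS_ROOT))
    # Every run pays this copy cost up front, and every copy is one more
    # version of the data that can drift out of sync with the filer.
    s3.upload_file(str(path), BUCKET, key)
```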
