Cohorts
UK Biobank
The UKB is a prospective cohort of approximately 500,000 individuals from the UK aged 40–70 years at enrolment, described in more detail elsewhere (ref. 54). The resource includes a vast array of information on each individual, including hospital inpatient diagnoses, questionnaires and surveys collected at the time of enrolment, and several aliquots of peripheral blood. These blood samples are used to obtain estimates of blood cell composition, quantitative levels of blood biomarkers and genomic data. The final tranche of whole-genome sequencing (WGS) data was recently released, so that WGS data now cover virtually everyone in the cohort. These data are in addition to the whole-exome sequencing (WES) and genotyping array data that also exist for each participant. Pertinent to the present study, the final tranche of WGS data was generated using Illumina NovaSeq 6000 sequencing machines at Sanger and the deCODE facility to an average coverage of 32.5× (ref. 55). All WGS and WES data were accessed using the UKB Research Analysis Platform (RAP) under application 31063.
All of Us
AoU is a large longitudinal cohort intended to capture the diversity of people living in the USA and includes physical measurements, biospecimen collection and survey data on enrolment, as well as capture of electronic health records (ref. 56). For this study, participant data were accessed from the ‘Genetic determinants of mitochondrial DNA phenotypes (v7)’, ‘Genetic determinants of mitochondrial DNA phenotypes (v7) second 40k’, and ‘Genetic determinants of mitochondrial DNA phenotypes (v7) finish pipeline’ workspaces, all of which use the Controlled Data Repository v.7. This data release incorporated WGS data from 245,394 individuals, generated from peripheral blood samples and sequenced at Baylor College of Medicine, the Broad Institute and the University of Washington. DNA extraction was performed using one of two kits: Autogen or Chemagen.
mtSwirl v2: an updated mtDNA coverage estimation and variant-calling approach
To infer mtCN and accurately call variation on mtDNA, we used mtSwirl, a pipeline we have previously described in more detail (ref. 4). In brief, mtSwirl is a scalable pipeline intended for deployment on cloud platforms for analysis of hundreds of thousands of whole-genome sequences. The pipeline first extracts unmapped reads, as well as reads from a set of 385 nuclear regions in the GRCh38 reference genome with high homology to mtDNA, obtained previously through BLAST (putative nuclear regions of mitochondrial homology, NUMT) (ref. 4). These reads are used to perform an initial pass of variant calling with GATK v.4.2.6.0, using Mutect2 for mtDNA and HaplotypeCaller for nucDNA. High-confidence homoplasmic (mtDNA) and homozygous (nucDNA) variants are then used to construct a per-individual consensus sequence to which all extracted reads are realigned. Alignment is then repeated against a shifted version of the consensus sequence to avoid artificial coverage depression at the edges of the linearized circular mtDNA sequence. mtDNA variant calling is then repeated using Mutect2, and a custom pipeline is used to map all variant calls and per-base coverage estimates back to GRCh38 for output. In parallel, Haplogrep (ref. 57) is used to perform haplogroup inference.
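The shifted-reference step can be illustrated with a minimal Python sketch. It is not the mtSwirl implementation: the function names and the `shift` and `edge` parameters are hypothetical, and the toy coverage values are invented purely for illustration. The sketch rotates a circular sequence, maps positions on the rotated sequence back to the original coordinates, and takes per-base coverage near the original sequence ends from the shifted alignment, where those positions lie in the interior.

```python
MT_LENGTH = 16569  # length of the human mtDNA reference (chrM in GRCh38)


def shift_reference(seq, shift):
    """Rotate a circular sequence so that 0-based index `shift` becomes index 0."""
    return seq[shift:] + seq[:shift]


def shifted_to_original(pos, shift, length=MT_LENGTH):
    """Map a 1-based position on the shifted reference to the original reference."""
    return (pos - 1 + shift) % length + 1


def merge_coverage(cov_original, cov_shifted, shift, edge):
    """Combine per-base coverage from the two alignments.

    Positions within `edge` bases of either end of the original sequence are
    taken from the shifted alignment; all others come from the original one.
    Both inputs are lists indexed in their own (0-based) coordinate system.
    """
    n = len(cov_original)
    merged = list(cov_original)
    for i in range(n):
        if i < edge or i >= n - edge:
            j = (i - shift) % n  # index of original base i on the shifted reference
            merged[i] = cov_shifted[j]
    return merged


# Toy example (hypothetical values): a 10-base circular genome with uniform
# true depth 9, where each linear alignment loses depth at its own ends.
toy_original = [1, 5, 9, 9, 9, 9, 9, 9, 5, 1]
toy_shifted = [1, 5, 9, 9, 9, 9, 9, 9, 5, 1]
toy_merged = merge_coverage(toy_original, toy_shifted, shift=5, edge=2)
```

The design choice illustrated here is that neither alignment is trusted near its own linearization breakpoint, so each position is read from whichever alignment places it away from an edge.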
mtSwirl is intended to be usable on any platform that supports the WDL pipeline language. Owing to substantial limitations of parallelization on UKB RAP, we previously released mtSwirlMulti, which runs multiple samples on individual larger machines. As part of this study, to boost efficiency on the UKB RAP, we implemented parallelization within each machine for the most computationally intensive tasks in mtSwirl (SubsetBam, ProcessBamAndRevert, HaplotypeCaller, AlignToMtRegShiftedAndMetrics and LiftoverVCFAndGetCoverage). In effect, our updated version of mtSwirlMulti now implements two layers of parallelization: sets of samples are distributed across machines, and each multicore machine then processes its set of samples in parallel. We ran this new version of mtSwirlMulti on approximately 2,000 samples from the previous release and found that the results were identical to those obtained previously. mtSwirlSingle is unchanged from before. We also developed new submission pipelines to efficiently automate job submissions in the UKB RAP and AoU. More details are available at our GitHub repository (https://github.com/rahulg603/mtSwirl).
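The two-layer scheme described above can be sketched in Python as follows. This is an illustration only: mtSwirlMulti itself is written in WDL, and `make_batches`, `process_sample`, `run_batch`, the batch size and the worker count are hypothetical stand-ins. The outer layer splits the sample list into batches (one per machine), and the inner layer processes the samples of one batch concurrently on that machine.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(samples, batch_size):
    """Outer layer: split the sample list into batches, one batch per machine."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def process_sample(sample_id):
    """Placeholder for the per-sample work (hypothetical; returns a status string)."""
    return f"{sample_id}:done"


def run_batch(batch, n_workers):
    """Inner layer: process one batch concurrently on a single multicore machine.

    `pool.map` returns results in the same order as the input batch.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(process_sample, batch))


samples = [f"sample{i}" for i in range(1, 6)]
batches = make_batches(samples, batch_size=2)
# In a real deployment each batch would be submitted to a separate machine;
# here the batches are simply run one after another.
results = [run_batch(batch, n_workers=2) for batch in batches]
```

A thread pool is used here only to keep the sketch short; CPU-bound work would more typically use separate processes or, as in the pipeline itself, separate command-line tool invocations.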
When used on platforms with a Google Cloud backend (such as AoU Researcher Workbench), the pipeline accesses only the portions of the WGS file relevant for the analysis. Despite this, if WGS data are stored in cold storage (for instance, Nearline), data access charges can be substantial.
We deployed mtSwirlMulti on UKB RAP, processing all of the approximately 300,000 new samples in around 3 weeks given the limit of 100 concurrent machines. We ran mtSwirlSingle across approximately 150,000 new samples in AoU over about 1 week. Within each biobank, new results were merged with mtDNA callsets generated in our previous effort across approximately 250,000 individuals (ref. 4), producing a final, per-biobank mtDNA callset of approximately 500,000 individuals in UKB and approximately 250,000 individuals in AoU.
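As a rough back-of-envelope check on the UKB figures above, the following Python sketch converts the approximate totals into a per-machine throughput. It assumes, for illustration only, that all 100 machines were in continuous use for the full 21 days; the function name is hypothetical and the inputs are the rounded values quoted in the text, so the result is an order-of-magnitude estimate and not a measured benchmark.

```python
def samples_per_machine_day(n_samples, n_days, n_machines):
    """Average number of samples completed per machine per day."""
    return n_samples / (n_days * n_machines)


# Approximate UKB RAP values from the text: ~300,000 samples, ~3 weeks,
# at most 100 concurrent machines (assumed fully utilized throughout).
ukb_rate = samples_per_machine_day(300_000, 21, 100)

# Equivalent machine-minutes per sample. Because each machine handles several
# samples at once, this is a throughput figure, not a per-sample runtime.
ukb_minutes_per_sample = 24 * 60 / ukb_rate

print(f"{ukb_rate:.1f} samples per machine per day")
print(f"{ukb_minutes_per_sample:.2f} machine-minutes per sample")
```

Under these assumptions the estimate is on the order of 140 samples per machine per day, or roughly 10 machine-minutes per sample.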