Sample preparation and exome sequencing
Exome sequencing was performed at the Regeneron Genetics Center using a custom automated sample preparation approach (Supplementary Information 9). Samples from MAYO-CLINIC and part of the CNCD (n = 25,580) cohorts were captured with Twist Comprehensive Exome probes and the remaining cohorts were captured with IDT xGen v1 and sequenced using the Illumina HiSeq 2500-v4 or Illumina NovaSeq instrument, with 75-bp paired-end reads and two index reads. For the MAYO-CLINIC cohort sequenced with Twist, probes also included the Twist Diversity SNP panel, for which multipoint refinement was conducted using GLIMPSE before further genotype QC and imputation61.
Study populations
This study included 1,020,833 participants from seven cohorts: 469,662 participants from the UKB, 173,585 participants from the GHS cohort, 140,996 participants from the MCPS cohort, 116,345 participants from the Mayo-Clinic Biobank, 44,970 participants from the CNCD cohort, 44,027 participants from the University of Pennsylvania Penn Medicine Biobank and 31,248 participants from the Mount Sinai cohort (Supplementary Table 10 and Supplementary Information 1).
Genetic relatedness analysis
To estimate the proportion of identical-by-descent (IBD) genomic regions shared between pairs of individuals in each study, we first obtained a set of high-quality common SNPs from the exome variant set by excluding all indels and SNPs with minor allele frequency < 10% or genotype missingness > 5%. Furthermore, variants with abnormal heterozygosity rates, defined by empirically determined cut-offs on the difference between observed (obs) and expected (exp) heterozygosity (obs − exp > 0.01 or exp − obs > 0.1), were excluded from further analysis. The asymmetry in the cut-off values was selected to account for the Wahlund effect. IBD estimates were calculated among individuals within the same ancestral superclass, as determined in the ancestry assignment, using PLINK with a minimum PI_HAT cut-off of 0.1875 to capture out to second-degree relationships; this generates the ancestry–version IBD estimates. A separate IBD estimation was calculated among all individuals using a minimum PI_HAT cut-off of 0.3 to identify the first-degree relationships among all samples and to generate first-degree family networks, which are connected components of individuals (nodes) and first-degree relationships (edges). Each first-degree family network was analysed using the prePRIMUS pipeline built into PRIMUS62 with the default settings to produce improved IBD estimates for the relationships within each family network and to capture close relationships that span more than one ancestral superclass and were therefore not captured in the ancestry–version IBD estimates. The two versions of IBD estimates were combined in the form of a PLINK .genome file and subjected to summary analysis after removing overly related samples with more than 100 close relatives (PI_HAT > 0.1875) or 25,000 relatives (PI_HAT > 0.08). All of the samples in the predicted first- and second-degree relationships were removed to generate the maximum unrelated data set for further analysis.
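As an illustration only, the heterozygosity filter and the construction of first-degree family networks described above could be sketched as follows. The cut-offs (0.01, 0.1 and 0.3) are those quoted in the text; the function names, the tuple-based input format and the use of a union-find structure are assumptions of this sketch and do not represent the pipeline actually used. Parsing of the PLINK .genome file and the prePRIMUS step are omitted.

```python
from collections import defaultdict

# Cut-offs quoted in the text; all names below are illustrative.
MAX_HET_EXCESS = 0.01      # exclude a variant if obs - exp > 0.01
MAX_HET_DEFICIT = 0.1      # exclude a variant if exp - obs > 0.1 (looser, for the Wahlund effect)
FIRST_DEGREE_PI_HAT = 0.3  # minimum PI_HAT for a first-degree edge


def passes_heterozygosity_filter(obs_het, exp_het):
    """Return True if a variant's heterozygosity lies within the stated cut-offs."""
    return (obs_het - exp_het <= MAX_HET_EXCESS) and (exp_het - obs_het <= MAX_HET_DEFICIT)


def family_networks(pairs, min_pi_hat=FIRST_DEGREE_PI_HAT):
    """Group samples into connected components of first-degree relationships.

    pairs: iterable of (sample_a, sample_b, pi_hat) tuples.
    Assumption: the minimum cut-off is inclusive (pi_hat >= min_pi_hat).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, pi_hat in pairs:
        if pi_hat >= min_pi_hat:
            parent[find(a)] = find(b)  # merge the two components

    components = defaultdict(set)
    for sample in list(parent):
        components[find(sample)].add(sample)
    # Sort for a deterministic result: members within a network, then networks.
    return sorted((sorted(c) for c in components.values()), key=lambda c: c[0])


if __name__ == "__main__":
    example_pairs = [("A", "B", 0.5), ("B", "C", 0.45), ("D", "E", 0.25), ("F", "G", 0.5)]
    print(family_networks(example_pairs))
```

In this sketch, pairs below the cut-off never create an edge, so samples with no first-degree relative do not appear in any network.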
Ancestry assignment
We used array data released by the UKB study to determine continental ancestry super-groups (AFR, AMR, EAS, EUR and SAS) by projecting each sample onto reference PCs calculated from the HapMap3 reference panel. In brief, we merged our samples with HapMap3 samples and retained only SNPs in common between the two datasets. We further excluded SNPs with minor allele frequency < 10%, genotype missingness > 5% or Hardy–Weinberg equilibrium test P < 1 × 10⁻⁵. We calculated PCs for the HapMap3 samples and projected each of our samples onto those PCs. To assign a continental ancestry group to each non-HapMap3 sample, we trained a kernel density estimator (KDE) using the HapMap3 PCs and used the KDEs to calculate the likelihood of a given sample belonging to each of the five continental ancestry groups. When the likelihood for a given ancestry group was greater than 0.3, the sample was assigned to that ancestry group. When two ancestry groups had a likelihood of greater than 0.3, we arbitrarily assigned AFR over EUR, AMR over EUR, AMR over EAS, SAS over EUR, and AMR over AFR. Samples were excluded from analysis if no ancestry likelihoods were greater than 0.3, or if more than three ancestry likelihoods were greater than 0.3.
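The assignment rule described above can be illustrated with a minimal sketch. The 0.3 threshold and the five tie-break pairs are taken from the text; the function name and the dictionary input are hypothetical. The text does not state how pairs outside the listed five, or samples with exactly three likelihoods above the threshold, were handled, so this sketch simply leaves them unassigned, which is an assumption rather than the documented behaviour.

```python
THRESHOLD = 0.3

# Tie-break pairs listed in the text, as (preferred, other).
PRECEDENCE = [
    ("AFR", "EUR"),
    ("AMR", "EUR"),
    ("AMR", "EAS"),
    ("SAS", "EUR"),
    ("AMR", "AFR"),
]
_PREFERRED = {frozenset(pair): pair[0] for pair in PRECEDENCE}


def assign_ancestry(likelihoods):
    """Return an ancestry label, or None if the sample is excluded or unresolved.

    likelihoods: dict mapping group label (AFR, AMR, EAS, EUR, SAS) to the
    KDE-derived likelihood for that group.
    """
    above = [group for group, value in likelihoods.items() if value > THRESHOLD]
    if len(above) == 1:
        return above[0]
    if len(above) == 2:
        # Pairs not listed in the text return None (assumption of this sketch).
        return _PREFERRED.get(frozenset(above))
    # Zero, or more than three, groups above the threshold are excluded per the text.
    # Exactly three is not described in the text and is left unassigned here.
    return None


if __name__ == "__main__":
    sample = {"AFR": 0.50, "AMR": 0.02, "EAS": 0.01, "EUR": 0.45, "SAS": 0.02}
    print(assign_ancestry(sample))
```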
The samples analysed in this study include 763,174 participants of EUR ancestry, 149,124 participants of AMR ancestry, 57,196 participants of SAS ancestry, 41,405 participants of AFR ancestry and 5,313 participants of EAS ancestry (Supplementary Table 10). Within the UKB cohort, repeats called from WGS data were available for 465,021 samples, of which 441,379 were of EUR ancestry, 848 of AMR ancestry, 10,085 of SAS ancestry, 9,173 of AFR ancestry and 2,287 of EAS ancestry.
Repeat expansion genotyping and QC