Estimation and mapping of the missing heritability of human phenotypes

Ethics declaration

This research used data from participants in the UKB study for discovery and from Vanderbilt University’s biorepository of DNA (BioVU), linked to de-identified medical records, for replication of specific results. Written informed consent was obtained from every participant in the UKB study. The BioVU study was designed as an opt-out biobank. The UKB study received ethics approval from the North West Centre for Research Ethics Committee (no. 11/NW/0382) and the BioVU study from the Vanderbilt Institutional Review Board.

Research collaboration framework

This study mainly used WGSs called with DRAGEN 3.7.8 from 490,542 UKB (refs. 2,3) participants (data field 24310). Further analyses used SNP-array genotypes and genotypes imputed from two reference panels: Haplotype Reference Consortium (HRC) plus UK10K, and TOPMed. This work is the result of a collaboration between teams at Illumina Inc. (UKB application ID 33751) and The University of Queensland (UKB application ID 12505). All WGS analyses were performed under Illumina’s application. All analyses requiring WGS data or SNP-array data imputed with the TOPMed reference panel were performed on the DNAnexus platform, whereas analyses that did not require individual-level data or were not cloud-restricted were performed on local computing clusters.

Selection of samples of European ancestry using SNP-array data

The first sample selection was performed using principal component loadings computed from 1000 Genomes (1KG) for 207,965 autosomal SNPs. We projected the 488,377 samples with SNP-array data available and selected those within 3 s.d. of the mean of the 1KG reference European ancestry population for the first 10 principal components (455,516 samples of European ancestry retained). We then selected samples that had both SNP-array and WGS data available and had consented to data use, yielding 452,618 samples of European ancestry for the GWAS analyses.
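As an illustration only, the 3 s.d. selection rule can be sketched as follows. This is a minimal sketch that assumes the rule is applied to each principal component separately, so that a sample is retained only if it lies within 3 s.d. of the reference mean on all 10 components. The function name `select_within_sd` and the simulated inputs are hypothetical and do not come from the study pipeline.

```python
import numpy as np


def select_within_sd(sample_pcs, ref_pcs, n_sd=3.0):
    """Return a boolean mask of samples lying within n_sd standard deviations
    of the reference-population mean on every principal component.

    sample_pcs: (n_samples, n_pcs) projected scores of the study samples.
    ref_pcs:    (n_ref, n_pcs) scores of the reference ancestry group.
    Assumption: the rule is applied per component, not as a joint distance.
    """
    ref_mean = ref_pcs.mean(axis=0)
    ref_sd = ref_pcs.std(axis=0, ddof=1)
    z = np.abs(sample_pcs - ref_mean) / ref_sd
    return np.all(z <= n_sd, axis=1)


# Simulated data for illustration (not real genotype-derived scores).
rng = np.random.default_rng(0)
ref_pcs = rng.normal(0.0, 1.0, size=(500, 10))
sample_pcs = rng.normal(0.0, 1.0, size=(1000, 10))
sample_pcs[:50] += 10.0  # shift 50 samples far from the reference cluster

keep = select_within_sd(sample_pcs, ref_pcs)
print(f"retained {keep.sum()} of {keep.size} samples")
```

Applying the threshold per component is stricter than a single joint distance, because a sample is dropped as soon as it deviates on any one of the 10 components.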

Processing of raw WGS data

We defined stringent quality control steps to ensure high quality of the retained genotypic information. We first processed each of the 136,477 autosomal chunks of raw Binary Variant Call Format data individually. In the first step, we kept all samples and removed variants meeting any of the following conditions: minor allele count (MAC) < 30, non-‘PASS’ filter status or more than 200 alleles. Multi-allelic variants were split into separate rows and allele names longer than 100 characters were renamed. We merged the chunks into a single file containing all autosomal variants with MAC ≥ 30 and all WGS samples (about 130 million variants). We then applied a second quality control step, keeping only the European ancestry samples identified previously, normalizing variants on the GRCh38 reference genome and removing variants with genotype missingness above 0.1 or Hardy–Weinberg equilibrium \(P < {10}^{-8}\), and samples with missingness above 0.05. This left a total of \(M_{\mathrm{WGS}} = 40{,}575{,}204\) SNPs and indels. These samples were used in GWAS analyses. We computed a genomic relationship matrix (GRM) for these 452,618 samples from 583,191 genotyped SNPs of MAF > 0.01. We extracted a sparse GRM with non-zero entries for pairs of relatives with a genomic relationship coefficient (calculated as an allelic correlation, equation (1)) above 0.05 and used it to estimate pedigree-based heritability. We also extracted a set of 347,630 unrelated European ancestry samples for the GREML analyses, for which we generated a new set of WGS genotypes. Finally, we computed allele frequencies from the full set of 452,618 samples and LD scores from the smaller set of 347,630 unrelated samples, with a block size of 1 Mb and an overlap of 500 kb between blocks.
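The two variant-level filtering steps described above can be sketched as simple predicates. This is an illustrative sketch, not the pipeline used in the study: the `VariantRecord` fields, the function names and the example records are hypothetical, and the direction of the Hardy–Weinberg filter (removing variants with P below the threshold) is an assumption.

```python
from dataclasses import dataclass

# Thresholds quoted in the text.
MIN_MAC = 30                   # variants with MAC < 30 are removed
MAX_ALLELES = 200              # variants with more than 200 alleles are removed
MAX_VARIANT_MISSINGNESS = 0.1  # second step: genotype missingness
MIN_HWE_P = 1e-8               # second step: assumed to remove P below this value


@dataclass
class VariantRecord:
    """Hypothetical per-variant summary; field names are illustrative."""
    variant_id: str
    mac: int
    filter_status: str
    n_alleles: int
    missingness: float = 0.0
    hwe_p: float = 1.0


def passes_first_step(v: VariantRecord) -> bool:
    """First step: MAC, 'PASS' status and allele-count filters."""
    return (
        v.mac >= MIN_MAC
        and v.filter_status == "PASS"
        and v.n_alleles <= MAX_ALLELES
    )


def passes_second_step(v: VariantRecord) -> bool:
    """Second step: genotype missingness and Hardy-Weinberg filters."""
    return v.missingness <= MAX_VARIANT_MISSINGNESS and v.hwe_p >= MIN_HWE_P


# Made-up records, one per filter outcome.
records = [
    VariantRecord("var_common", 5000, "PASS", 2, 0.01, 0.4),
    VariantRecord("var_rare", 12, "PASS", 2),
    VariantRecord("var_failed", 800, "LowQual", 2),
    VariantRecord("var_hwe", 900, "PASS", 2, 0.02, 1e-12),
    VariantRecord("var_missing", 900, "PASS", 2, 0.25, 0.5),
]

first_pass = [v for v in records if passes_first_step(v)]
retained = [v.variant_id for v in first_pass if passes_second_step(v)]
print(retained)
```

In practice the second step would be computed on the European ancestry subset only, because missingness and Hardy–Weinberg statistics depend on which samples are included.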

Grouping variants for GREML-LDMS and covariates processing

To compute MAF- and LD-partitioned GRMs, each variant was assigned to one of four MAF bins (0.01–0.1%, 0.1–1%, 1–10%, 10–50%) and further assigned to an LD bin (on the basis of the median LD score statistic within each MAF bin) (Supplementary Table 2). LD score statistics were calculated for each SNP as the sum of squared correlations between allele counts at that SNP and those at all nearby SNPs within a 1-Mb window. Sample relatedness between individuals i and k was computed using the following estimator (ref. 47):
