UKB resource
The protocols for the UKB are overseen by the UKB Ethics Advisory Committee; for more information, see https://www.ukbiobank.ac.uk/ethics and https://www.ukbiobank.ac.uk/wp-content/uploads/2025/01/Ethics-and-governance-framework.pdf. Informed consent was obtained for all participants. The Northwest Research Ethics Committee reviewed and approved UKB’s scientific protocol and operational procedures (REC reference no. 06/MRE08/65). Data for the present study were obtained and research was conducted under UKB application license nos. 24898 and 68574.
WGS, sample selection and CNV discovery
Whole-genome-sequencing (WGS) data for 490,560 UKB participants were generated by deCODE Genetics and the Wellcome Trust Sanger Institute as part of a public–private partnership involving AstraZeneca, Amgen, GlaxoSmithKline, Johnson & Johnson, Wellcome Trust Sanger, UK Research and Innovation, and the UKB. The sequencing and quality control methods have been previously described24. Briefly, genomic DNA underwent paired-end sequencing on Illumina NovaSeq 6000 instruments with a read length of 2 × 151 bp and an average coverage of 32.5×. The DRAGEN 3.7.8 variant-calling pipeline has also been described in detail in ref. 24, including the running mode, the parameters used in DRAGEN and the computational requirements.
A set of filters was applied to remove samples with (1) consent withdrawals; (2) less than 2% match with whole-exome sequencing and array data; (3) sequences identified as sample duplicates with multiple birth events; (4) more than 4% contaminated sequences according to VerifyBAMID (that is, samples were retained only if verifybamid_freemix ≤ 0.04); (5) low CCDS coverage (fewer than 94.5% of CCDS r22 bases covered with at least ten-fold coverage); or (6) outlying coverage uniformity (coverage uniformity > 0.433, more than six times the interquartile range). We used peddy53 and 1000 Genomes Project data to classify ancestries (peddy_prob ≥ 0.9) and the gnomAD classifier52 to subdivide European-ancestry samples into those of non-Finnish European (NFE) and Ashkenazi Jewish (ASJ) ancestries. We performed further quality control on NFE-ancestry samples using peddy-derived principal components, removing samples that fell outside four standard deviations from the mean across the first four principal components. The sample-filtering workflow is shown in Supplementary Fig. 1.
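The threshold-based filters and the principal-component outlier step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study. The record type `SampleQC`, its field names and both function names are hypothetical. The sketch covers only the filters with an explicit numeric cut-off, and it assumes that a sample is removed if it is an outlier on any one of the first four principal components.

```python
from dataclasses import dataclass
from statistics import mean, stdev

# Cut-offs quoted in the text; all names below are hypothetical.
MAX_FREEMIX = 0.04               # VerifyBamID contamination estimate
MIN_CCDS_FRAC_10X = 0.945        # fraction of CCDS bases covered at >= 10x
MAX_COVERAGE_UNIFORMITY = 0.433  # coverage uniformity outlier threshold
N_PCS = 4                        # number of leading principal components used
MAX_PC_SD = 4.0                  # allowed distance from the mean, in s.d.


@dataclass
class SampleQC:
    sample_id: str
    consent_withdrawn: bool
    freemix: float
    ccds_frac_10x: float
    coverage_uniformity: float
    pcs: tuple  # peddy-derived principal components, PC1 first


def passes_sample_filters(s: SampleQC) -> bool:
    """Return True if the sample survives the threshold-based filters."""
    return (
        not s.consent_withdrawn
        and s.freemix <= MAX_FREEMIX
        and s.ccds_frac_10x >= MIN_CCDS_FRAC_10X
        and s.coverage_uniformity <= MAX_COVERAGE_UNIFORMITY
    )


def remove_pc_outliers(samples, n_pcs=N_PCS, max_sd=MAX_PC_SD):
    """Keep samples within max_sd standard deviations of the cohort mean
    on every one of the first n_pcs components (assumed interpretation)."""
    centers = [mean([s.pcs[k] for s in samples]) for k in range(n_pcs)]
    spreads = [stdev([s.pcs[k] for s in samples]) for k in range(n_pcs)]
    return [
        s for s in samples
        if all(abs(s.pcs[k] - centers[k]) <= max_sd * spreads[k]
               for k in range(n_pcs))
    ]
```

In this sketch the outlier step would be run only on samples already assigned to NFE ancestry, so that the mean and standard deviation are computed within that group.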
DRAGEN version 3.7.8 was used for whole-genome CNV calling with self-normalization. Input was provided as a BAM file aligned to the hg38 reference genome with alt-aware support. Quality control included coverage analysis across multiple genomic regions and cross-contamination checks. The CNV calling pipeline consists of five key modules. First, the TargetCounts module bins read counts and other alignment signals. Next, the BiasCorrection module adjusts for systematic biases. The Normalization module then determines normal ploidy levels and normalizes the sample. After normalization, the Segmentation module detects breakpoints in the signal. Finally, the Calling/Genotyping module scores, qualifies and filters putative events to identify CNVs. The final call file contains segments of duplication, deletion and copy-neutral events. In the present study, we only examined regions with copy number loss or gain. Full details of DRAGEN variant calling have been previously described24.
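The restriction to copy number losses and gains can be illustrated with a short sketch. It assumes a simplified, hypothetical segment record (`Segment`) carrying an integer copy number and compares it with an expected ploidy of 2; it does not reproduce the DRAGEN output format, and regions with a different expected ploidy (for example, sex chromosomes in males) would need their own expected value.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    # Hypothetical simplified record; not the DRAGEN output schema.
    chrom: str
    start: int
    end: int
    copy_number: int


def classify(seg: Segment, expected_ploidy: int = 2) -> str:
    """Label a segment as LOSS, GAIN or NEUTRAL relative to expected ploidy."""
    if seg.copy_number < expected_ploidy:
        return "LOSS"
    if seg.copy_number > expected_ploidy:
        return "GAIN"
    return "NEUTRAL"


def keep_cnvs(segments, expected_ploidy: int = 2):
    """Drop copy-neutral segments and return (segment, label) pairs."""
    labelled = [(seg, classify(seg, expected_ploidy)) for seg in segments]
    return [(seg, label) for seg, label in labelled if label != "NEUTRAL"]
```

For example, passing segments with copy numbers 1, 2 and 3 to `keep_cnvs` would return the first labelled `LOSS` and the third labelled `GAIN`, with the copy-neutral segment dropped.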
Olink proteogenomics study cohort
The Olink proteomics study cohort has been described in detail in previous publications21,22,23. In the present study, we analysed 49,736 samples with available paired genome sequence data. In total, 2,941 plasma proteins were included in the analyses.
Phenotypes
We studied two main phenotypic categories, binary and quantitative traits, taken from the April 2023 data release, which was accessed on 28 June 2024 as part of UKB applications 68601 and 65851. To parse the UKB phenotypic data, we used our previously described PEACOK package, available at https://github.com/astrazeneca-cgr-publications/PEACOK. A detailed description can be found in a previous publication3. In total, we studied 13,336 binary and 1,911 quantitative phenotypes. As previously described, for all binary phenotypes, we matched controls by sex when the percentage of female cases was significantly different (Fisher’s exact two-sided P < 0.05) from the percentage of available female controls. This included sex-specific traits for which, by design, all controls would be the same sex as cases3. All phenotypes and their corresponding chapter mappings are provided in Supplementary Table 1; generation of the composite phenotypes (prefixed ‘unions’) has been described by Wang et al.3. Seven quantitative traits related to leukocyte telomere length are marked with tl in the field_id column and described in the plain_english_name column of Supplementary Table 1 (ref. 31).
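The decision rule for sex matching can be sketched with a two-sided Fisher’s exact test on the 2 × 2 table of sex by case status. The function names are hypothetical, and the two-sided P value follows the common convention of summing the probabilities of all tables no more likely than the observed one. The sketch covers only the decision to match; how matched controls are then selected is not described here and is not shown.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


def needs_sex_matching(female_cases: int, male_cases: int,
                       female_controls: int, male_controls: int,
                       alpha: float = 0.05) -> bool:
    """True if the female fraction differs between cases and controls."""
    p = fisher_exact_two_sided(female_cases, male_cases,
                               female_controls, male_controls)
    return p < alpha
```

A sex-specific trait, in which every case is of one sex, yields a very small P value whenever the available controls are mixed, which is consistent with such traits always receiving same-sex controls. At biobank scale an optimized implementation would be preferable to this exact-integer sketch.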