Analysis of 173,303 exomes and genomes in the Pakistan Genome Resource

Ethics statement

The institutional review board (IRB) at the Center for Non-Communicable Diseases (CNCD) (IRB: 00007048, IORG0005843 and FWAS00014490) approved the study. This study has also been approved by the National Bioethics Committee for Research Pakistan (reference number NBC-756). All participants gave written informed consent, which includes publication of relevant indirect identifiers.

Recruitment

Cases and control individuals were recruited at both public and private hospitals across Pakistan. A breakdown of participants is outlined in Supplementary Table 1. Phenotypes were defined based on confirmed clinical diagnoses established by primary physicians, supported by relevant laboratory investigations and standardized diagnostic criteria. Participants were identified at the time of hospital presentation and enrolled after diagnostic confirmation, once written and oral informed consent had been obtained. Cases were identified according to predefined inclusion and exclusion criteria. Control individuals were selected from the same hospitals and were typically unrelated or distantly related case attendants who met control criteria. To minimize selection bias, enrolment was carried out across both urban and suburban hospitals that also serve surrounding rural populations.

PGR sequencing

Exome sequencing was performed in multiple batches at two different locations: the Broad Institute, as described previously (ref. 4), and the Regeneron Genetics Center (RGC), as described previously, using either IDT probes or the Twist human comprehensive exome panel (refs. 75,76,77). Whole-genome sequencing (WGS) was performed as part of the TOPMed consortium (ref. 78).

PGR sample quality control, variant quality control and variant annotation

Samples were removed if there was discordance between their reported and computationally predicted sex, or if they were predicted to be genetic duplicates within or across sequencing batches. In the RGC datasets, samples were additionally removed if fewer than 80% of bases were covered at 20× or if >5% human contamination was observed (ref. 79). For a subset of the RGC datasets, whole plates were removed if self-reported gender was discordant with computationally predicted sex for >10% of samples. After filtering, a total of 166,625 whole-exome sequencing (WES) and 6,678 WGS samples were analysed.

Variant-level quality control was applied separately to each sequencing batch. Before applying quality control filters, multiallelic sites were decomposed and all variants were left-aligned and normalized. In the WES dataset sequenced at the Broad Institute, individual variant calls were set to missing if they did not meet all of the following criteria: Quality by Depth (QD) ≥ 2 for insertion–deletion mutations (indels), or QD ≥ 3 for single nucleotide variants (SNVs); Variant Quality Score Recalibration (VQSR) = “PASS” or (VQSR = “VQSRTrancheSNP99.60to99.80” and allele count = 1); 0.25 ≤ allele balance ≤ 0.75 for heterozygous calls, or allele balance ≥ 0.9 for homozygous calls; Depth (DP) ≥ 10. For a batch of WGS data sequenced at the Broad Institute, calls with DP < 10 were removed. In the datasets sequenced at the RGC, individual variant calls were set to missing if they did not meet all of the following criteria: 0.25 ≤ allele balance ≤ 0.75 for heterozygous calls, or allele balance ≥ 0.9 for homozygous calls; DP ≥ 7 for SNVs, or DP ≥ 10 for indels. After applying the above variant quality control steps, variants were removed entirely if they had a call rate <0.9. All variant quality control steps were performed using Hail. Variants that were filtered in gnomAD v4.1 were removed from PGR.

Variants were annotated using Variant Effect Predictor (VEP) v107 and LOFTEE (refs. 13,80). All variant annotations are reported relative to the Ensembl canonical transcript v107.
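The genotype-level and variant-level thresholds described above can be summarized in a short sketch. This is plain Python written for illustration only; it is not the Hail code used in the study, and the `Call` record, its field names, and the helper functions are hypothetical. The precise definition of allele balance for homozygous calls is also an assumption (taken here as the read fraction supporting the called allele).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Call:
    """One genotype call (hypothetical record; field names are illustrative)."""
    is_indel: bool                      # True for indels, False for SNVs
    is_het: bool                        # True for heterozygous, False for homozygous
    allele_balance: float               # assumed: read fraction as defined by the pipeline
    depth: int                          # DP
    qd: Optional[float] = None          # Quality by Depth (Broad WES filter only)
    vqsr: Optional[str] = None          # VQSR filter label (Broad WES filter only)
    allele_count: Optional[int] = None  # allele count of the variant in the batch


def allele_balance_ok(call: Call) -> bool:
    # 0.25 <= AB <= 0.75 for heterozygous calls, AB >= 0.9 for homozygous calls
    if call.is_het:
        return 0.25 <= call.allele_balance <= 0.75
    return call.allele_balance >= 0.9


def passes_broad_wes(call: Call) -> bool:
    # QD >= 2 for indels or QD >= 3 for SNVs, as stated in the text
    qd_min = 2.0 if call.is_indel else 3.0
    qd_ok = call.qd is not None and call.qd >= qd_min
    # VQSR must be PASS, or the 99.60-99.80 SNP tranche when allele count is 1
    vqsr_ok = call.vqsr == "PASS" or (
        call.vqsr == "VQSRTrancheSNP99.60to99.80" and call.allele_count == 1
    )
    return qd_ok and vqsr_ok and allele_balance_ok(call) and call.depth >= 10


def passes_rgc(call: Call) -> bool:
    # DP >= 7 for SNVs or DP >= 10 for indels, plus the allele balance rule
    dp_min = 10 if call.is_indel else 7
    return allele_balance_ok(call) and call.depth >= dp_min


def keep_variant(call_passed: List[bool], min_call_rate: float = 0.9) -> bool:
    # Calls that fail a filter are set to missing; the variant is removed
    # entirely when the fraction of non-missing calls falls below 0.9.
    if not call_passed:
        return False
    return sum(call_passed) / len(call_passed) >= min_call_rate
```

Under these assumptions, a heterozygous SNV call with allele balance 0.5 and DP = 8 would be retained by the RGC rule but set to missing by the Broad WES rule, because the two batches use different depth thresholds.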

Admixture analysis

A reference panel was created using the individuals of specific subgroups from the 1KG dataset: SAS, n = 492; EAS, n = 515; AFR, n = 671; and EUR, n = 521 (ref. 15); 752 Central or South Asian, European, Middle Eastern, or African participants from the HGDP (refs. 16,17); and 200 unrelated individuals from each self-reported ethnicity within PGR. Data from 1KG were downloaded from https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ and https://www.internationalgenome.org/data-portal/sample. HGDP VCF files were downloaded from https://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/. The datasets were merged after filtering to variants with minor allele frequency (MAF) > 1% in each of 1KG, HGDP, and each PGR sequencing tranche, and then linkage disequilibrium-pruned using the PLINK (v2.00a3.7) flag ‘--indep-pairwise’ with a window size of 200 kb and an r² threshold of 0.5 (ref. 81). Unsupervised clustering analysis using ADMIXTURE (ref. 82) was run on this set of reference samples for k = 3 to k = 25 using fivefold cross-validation. A value of k = 17 was selected because it minimized the cross-validation error. The remaining PGR samples were then projected into the k = 17 results using the ADMIXTURE ‘-P’ option.
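The choice of k by minimum cross-validation error can be illustrated with a minimal sketch. It assumes that each ADMIXTURE run writes a log line of the form "CV error (K=17): 0.51234" and that these lines have been collected into one string; the function names and the example values are hypothetical and are not results from this study.

```python
import re
from typing import Dict

# Assumed log line format: "CV error (K=<k>): <error>"
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9]*\.?[0-9]+)")


def parse_cv_errors(log_text: str) -> Dict[int, float]:
    """Map each k to its reported cross-validation error."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}


def best_k(cv_errors: Dict[int, float]) -> int:
    """Return the k with the lowest cross-validation error."""
    if not cv_errors:
        raise ValueError("no CV error lines found")
    return min(cv_errors, key=cv_errors.get)


# Illustrative values only (not taken from the study)
example_log = """\
CV error (K=3): 0.5120
CV error (K=4): 0.4980
CV error (K=5): 0.5030
"""
selected_k = best_k(parse_cv_errors(example_log))
```

In the analysis described above, the same comparison across k = 3 to k = 25 led to the selection of k = 17.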
