The 1000 Chinese Pangenome empowers medical and population genetics

Sample collection

The first phase of the 1KCP project enrolled 1,379 participants aged 18–76 years (737 men and 642 women) from the Health Examination Center of the Second Affiliated Hospital of Wenzhou Medical University. Peripheral blood samples were collected for library preparation and sequencing. All samples underwent Illumina short-read WGS and RNA-seq. Among these participants, 55 samples were selected for high-coverage PacBio HiFi long-read WGS and Hi-C sequencing, and 1,099 were selected for modest-coverage PacBio long-read WGS. Self-reported ancestry information was not explicitly collected and was therefore not available for this study. Informed consent was obtained from all participants. The study was approved by the Ethics Committee of Westlake University (approval no. 20201127YJ001) and the Ethics Committee of Wenzhou Medical University (approval no. 2019-107).

Short-read WGS

Genomic DNA (gDNA) was extracted from peripheral blood using a gDNA extraction kit and an HF24 nucleic acid purification instrument (Concert). DNA concentration was measured using a Qubit 4.0 fluorometer (Thermo Fisher). A total of 1 µg of DNA was treated with ribonuclease A (Takara) and purified using CleanNGS-R beads (Vdobiotech). Fragmentation, end-repair and A-tailing were performed using 5× WGS fragmentation mix and 10× fragmentation buffer (Enzymatics), followed by ligation with TruSeq index adaptors using T4 DNA ligase and 5× rapid ligation buffer (Enzymatics). Purification, size selection and PCR amplification produced a sequencing library with a target size of 450–550 bp, which was normalized to 3 ng µl⁻¹. Short-read WGS was performed on an Illumina NovaSeq-6000 platform.
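The normalization step is a standard C1·V1 = C2·V2 dilution. The sketch below is illustrative only: the function name and the example stock concentration and volume are hypothetical, and only the 3 ng µl⁻¹ target comes from the protocol.

```python
def diluent_volume_ul(stock_conc_ng_per_ul: float,
                      stock_volume_ul: float,
                      target_conc_ng_per_ul: float = 3.0) -> float:
    """Volume of diluent (µl) to add so a library reaches the target concentration.

    Uses C1 * V1 = C2 * V2. The default target of 3 ng/µl is the value
    stated in the protocol; everything else is an illustrative assumption.
    """
    if stock_conc_ng_per_ul < target_conc_ng_per_ul:
        raise ValueError("library is already below the target concentration")
    final_volume_ul = stock_conc_ng_per_ul * stock_volume_ul / target_conc_ng_per_ul
    return final_volume_ul - stock_volume_ul


# Hypothetical example: 10 µl of library at 12 ng/µl.
print(diluent_volume_ul(12.0, 10.0))
```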

High-coverage long-read WGS

High-quality DNA samples (absorbance ratios of A260/A280 = 1.8–2.0 and A260/A230 > 2.0, concentration of ≥100 ng µl⁻¹ and fragment size of ≥40 kb) were purified with Ampure PB beads and fragmented using a Megaruptor2 system (Diagenode). Libraries were prepared using a SMRTbell Prep kit 2.0, following steps similar to those used for modest-coverage sequencing and targeting a library size of 15 kb. Long-read WGS was performed on a PacBio Sequel IIe platform using 8M SMRT Cell and Sequencing plates (v.2.0), with a pre-extension time of 4 h and a video time of 30 h. Primary CCS reads (accuracy >0.88) were generated using CCS (v.6.2.0; https://github.com/PacificBiosciences/ccs) with the --all option. Subreads were aligned to their corresponding primary CCS reads using actc (v.0.60; https://github.com/PacificBiosciences/actc). Finally, HiFi reads (accuracy >0.99) were generated using DeepConsensus (v.1.2.0)⁶⁰.
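For orientation, the two accuracy thresholds quoted above can be expressed on the Phred scale with the standard relation Q = −10·log10(1 − accuracy). The helper below is a minimal sketch. Its name is hypothetical and it is not part of CCS, actc or DeepConsensus.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a predicted read accuracy in [0, 1) to a Phred quality score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


# Thresholds taken from the text: primary CCS reads (>0.88) and HiFi reads (>0.99).
for label, threshold in [("primary CCS", 0.88), ("HiFi", 0.99)]:
    print(f"{label}: accuracy > {threshold} is roughly Q > {accuracy_to_phred(threshold):.1f}")
```

Under this conversion, the HiFi cutoff of 0.99 corresponds to about Q20 and the primary CCS cutoff of 0.88 to about Q9.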

Modest-coverage long-read WGS

DNA quality was assessed using a Nanodrop spectrophotometer (Thermo Fisher), a Qubit 4.0 fluorometer (Thermo Fisher) and a Mapper XA Chiller system (Bio-Rad). Samples meeting quality criteria (absorbance ratios of A260/A280 = 1.8–2.0 and A260/A230 > 2.0, concentration of ≥70 ng µl⁻¹ and fragment size of ≥40 kb) were purified with Ampure PB beads and fragmented using an Eppendorf centrifuge. Libraries were prepared using a SMRTbell Prep kit 2.0 (PacBio), which involved DNA damage repair, end-repair, adapter ligation and size selection. The libraries were then bound with Sequencing Primer v.3 and Bind Polymerase 2.0, followed by purification with 1.2× AMPure PB beads. Long-read WGS was performed on a PacBio Sequel IIe platform using 8M SMRT Cell and Sequencing plates v.2.0, with a pre-extension time of 4 h and a video time of 30 h. With the --all option, CCS generates one representative read for each zero-mode waveguide (ZMW), either by generating a consensus sequence when sufficient subreads are available or by selecting the median-length subread otherwise (Extended Data Fig. 1a). This process produces two types of read: HiFi reads and non-HiFi reads. Although non-HiFi reads have lower accuracy than HiFi reads, their inclusion resulted in a 2.7-fold increase in total sequencing coverage compared with using HiFi reads alone (Extended Data Fig. 1b).
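The per-ZMW selection rule described above can be illustrated with a toy sketch. This is not the CCS implementation. The `min_passes` cutoff, the `build_consensus` callback, the tie-breaking rule for an even number of subreads and the example numbers are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def representative_read(subreads: List[str],
                        min_passes: int,
                        build_consensus: Callable[[List[str]], str]) -> Tuple[str, str]:
    """Return one representative read for a ZMW and how it was obtained.

    min_passes is a hypothetical stand-in for "sufficient subreads";
    build_consensus is a placeholder for a real consensus caller.
    """
    if not subreads:
        raise ValueError("a ZMW needs at least one subread")
    if len(subreads) >= min_passes:
        return build_consensus(subreads), "consensus"
    # Otherwise fall back to the median-length subread
    # (lower median when the count is even, by assumption).
    by_length = sorted(subreads, key=len)
    return by_length[(len(by_length) - 1) // 2], "median-length subread"


def fold_increase(hifi_bases: float, non_hifi_bases: float) -> float:
    """Coverage gain from adding non-HiFi reads, relative to HiFi reads alone."""
    return (hifi_bases + non_hifi_bases) / hifi_bases


# Toy usage with a trivial placeholder consensus (returns the first subread).
read, source = representative_read(["ACGT", "ACG", "ACGTA"], min_passes=5,
                                   build_consensus=lambda reads: reads[0])
print(read, source)
# Hypothetical yields chosen only to show how a 2.7-fold gain would arise.
print(fold_increase(hifi_bases=10.0, non_hifi_bases=17.0))
```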

Hi-C sequencing

Peripheral blood samples were treated with 4 ml of 37% formaldehyde solution (Sigma-Aldrich) for crosslinking, followed by centrifugation to remove the supernatant. The pellets were washed twice with DPBS (Solarbio) and treated with a red blood cell lysis solution (Miltenyi) to remove erythrocytes. After additional washes with DPBS, the samples were resuspended in deionized water. gDNA was released from the suspension using 10% SDS and 10% Triton X-100 (Sigma), followed by digestion with the MboI enzyme (NEB). The resulting DNA fragments were end-repaired using Klenow fragment (Enzymatics) with Biotin-16-dCTP (Trilink) and ligated to adjacent fragments with T4 DNA ligase (Enzymatics). The ligated product was de-crosslinked using proteinase K (Qiagen) and 10% SDS and purified with SPRIselect beads (Beckman Coulter). DNA concentration was quantified using a Qubit 3.0 fluorometer (Thermo Fisher). Approximately 500 ng of purified DNA was subjected to fragmentation, end-repair and A-tailing using 5× WGS fragmentation mix and 10× fragmentation buffer (Enzymatics). The fragmented DNA was ligated to TruSeq index adaptors and purified using CleanNGS-R beads (Vdobiotech). Biotin-labelled target fragments were captured with M270 magnetic beads (Invitrogen) and amplified using KAPA HiFi HotStart ReadyMix (Kapa Biosystems). The PCR product was purified and normalized to 3 ng µl⁻¹, producing a sequencing library with a target size range of 500–600 bp. Hi-C sequencing was performed on an Illumina NovaSeq-6000 platform.
