Rationale of EBV detection
The 171,823-nucleotide EBV genome (NC_007605.1) was first included in the human reference in December 2013 (hg38 version GCA_000001405.15) as a sink for off-target reads that are often present in sequencing libraries, in particular the pervasive EBV reads that result from the immortalization of LCLs (as in the 1000 Genomes Project and related consortia). Importantly, WGS in the UKB and AOU consortia was performed on whole blood18,60, meaning that any EBV reads detected would derive from viral DNA from past infections.
WGS data and cohort analyses in the UKB
For the UKB, we obtained the per-base abundance of EBV DNA for the 490,560 WGS libraries by extracting reads aligning to chrEBV in the hg38 human genome reference that had a read mapping quality (MAPQ) ≥ 30 (q30) via the SAMtools view command61. To quantify EBV DNA abundance for each position, we summed the coverage of each base in the EBV genome across all libraries (per-base abundance). The resulting coverage across the viral contig was approximately flat, supporting the interpretation that the EBV reads detected by WGS reflect real viral DNA, with two key exceptions (Fig. 1b). First, a total of 27,692 positions had low to no coverage (per-base abundance ≤ 10) due to low mappability of those parts of the EBV contig. Second, two regions (positions 36,390–36,514 and 95,997–96,037) had orders-of-magnitude higher coverage (per-base abundance ≥ 10³ at these 166 positions). On further examination, the sequences in these two regions were highly repetitive. Hence, we reasoned that these two regions may confound EBV DNA quantification. To assess this, we calculated EBV DNA abundance per person before and after masking, by summing MAPQ ≥ 30 coverage either across all J = 171,823 bases or only across the remaining J′ = 143,965 well-covered bases (10 < per-base abundance < 10³ for each base). That is, the unmasked per-individual sum was taken over all J bases, and the masked sum over only the J′ bases.
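The masking step described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names and the six-position toy coverage vectors are hypothetical, and only the thresholds (10 and 10³) and the position counts come from the text.

```python
# Position counts stated in the text: J minus the low-coverage and
# repetitive positions should equal J'.
assert 171_823 - 27_692 - 166 == 143_965

def well_covered_mask(per_base_abundance, low=10, high=10**3):
    """Return True for positions with low < per-base abundance < high (the J' set)."""
    return [low < a < high for a in per_base_abundance]

def per_person_abundance(coverage, mask=None):
    """Sum one library's per-base coverage over all bases, or only over kept bases."""
    if mask is None:
        return sum(coverage)
    return sum(c for c, keep in zip(coverage, mask) if keep)

# Hypothetical toy data: 3 libraries x 6 positions of MAPQ >= 30 coverage.
# Position 2 mimics a repetitive region; positions 1 and 4 mimic low mappability.
libraries = {
    "A": [15, 0, 40, 12, 1, 20],
    "B": [20, 0, 900, 30, 2, 25],
    "C": [0, 0, 500, 0, 0, 0],   # reads only in the repetitive region
}

# Per-base abundance: coverage of each position summed across all libraries.
per_base = [sum(column) for column in zip(*libraries.values())]
mask = well_covered_mask(per_base)

unmasked = {name: per_person_abundance(cov) for name, cov in libraries.items()}
masked = {name: per_person_abundance(cov, mask) for name, cov in libraries.items()}
```

In this toy example, library C counts as EBV-positive before masking (sum of 500) but not after (sum of 0), which is the kind of reclassification that masking is meant to produce.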
We then used a two-sided Fisher’s exact test to test for association between EBV DNA presence (EBV DNA coverage > 0) and EBV serostatus, recorded in the UKB as ‘EBV seropositivity for Epstein–Barr Virus’ (data field 23053). Before masking, EBV DNA presence had only a weak, nominally significant positive association with EBV seropositivity (odds ratio = 1.2, P = 0.03). By contrast, after masking these repetitive regions and recomputing donor detection status, the association between EBV DNA detection and seropositivity was much stronger (odds ratio = 14.6, P = 1.7 × 10⁻²⁶) (Fig. 1c). These analyses demonstrate that masking highly repetitive regions in the viral contig is required for valid inference from whole-genome sequencing data, as evidenced by the concordance with EBV serostatus.
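For readers who want to see the mechanics of the test, the following is a minimal standard-library sketch of a two-sided Fisher's exact test on a 2 × 2 table. The counts are hypothetical and are not the UKB data; the function name is ours; and the odds ratio returned is the simple sample odds ratio (ad/bc), which may differ from the estimator used in the study.

```python
from math import exp, lgamma

def _log_comb(n, k):
    """log of the binomial coefficient C(n, k), via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample odds ratio, P value). The P value sums the hypergeometric
    probabilities of all tables, with the same margins, that are no more
    likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x):
        # log-probability that the top-left cell equals x under fixed margins
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - _log_comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    log_obs = log_p(a)
    # Small tolerance so that tables tied with the observed one are included.
    p = sum(exp(log_p(x)) for x in range(lo, hi + 1) if log_p(x) <= log_obs + 1e-7)
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, min(p, 1.0)

# Hypothetical counts (NOT the study data):
# rows = EBV DNA detected / not detected, columns = seropositive / seronegative.
odds_ratio, p_value = fisher_exact_two_sided(90, 2, 400, 60)
```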
Contig mappability analyses
To confirm that regions of the EBV contig that were not detected were attributable to poor mapping quality of those regions, we generated synthetic reads of length 101 bases by tiling the reference EBV contig. Next, each synthetic read was aligned using bowtie2 v.2.5.1 (ref. 62). We define the mappability of a position as the percentage of reads overlapping that position that had a mapping quality score (MAPQ) exceeding 10. This analysis reproduced the regions depleted from the pseudobulk abundance (Extended Data Fig. 1a), indicating that low detection in these regions was due to homology in the hg38 reference rather than variable DNA presence from past infection.
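The tiling and mappability definition above can be sketched as follows. This is an illustration under stated assumptions: the tiling step of 1 base is our assumption (the text does not give it), reads are tracked by the position they were tiled from, and the MAPQ values in the toy example are hypothetical stand-ins for aligner output.

```python
def tile_reads(reference, read_length=101, step=1):
    """Yield (start, sequence) for each window of read_length bases (0-based starts)."""
    for start in range(0, len(reference) - read_length + 1, step):
        yield start, reference[start:start + read_length]

def mappability(contig_length, aligned, read_length=101, min_mapq=10):
    """Per-position percentage of overlapping synthetic reads with MAPQ > min_mapq.

    aligned: iterable of (origin_start, mapq) pairs, one per synthetic read.
    Positions overlapped by no read are reported as 0.0.
    """
    total = [0] * contig_length
    good = [0] * contig_length
    for start, mapq in aligned:
        for pos in range(start, min(start + read_length, contig_length)):
            total[pos] += 1
            if mapq > min_mapq:
                good[pos] += 1
    return [100.0 * g / t if t else 0.0 for g, t in zip(good, total)]

# Toy example: a 12-base "contig" tiled with 4-base reads.
toy_reference = "ACGTACGTACGT"
toy_reads = list(tile_reads(toy_reference, read_length=4))

# Hypothetical MAPQ values: reads from the first 4 starts map uniquely,
# the rest map ambiguously.
toy_aligned = [(start, 42 if start < 4 else 1) for start, _ in toy_reads]
toy_mappability = mappability(len(toy_reference), toy_aligned, read_length=4)
```

In practice the synthetic reads would be written to FASTQ or FASTA, aligned, and the MAPQ of each read parsed from the alignment output before calling a function like `mappability`.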
EBV DNA copy number estimation, simulation and thresholding
To calculate EBV DNA abundance per person, we summed the coverage over the well-covered, unbiased bases (J′). We normalized this value against the effective EBV genome size (143,965 bases) to obtain an estimate of the coverage per EBV genome. Next, we used the 30× human WGS coverage and accounted for the diploid human genome to compute an estimate of EBV DNA copy number per human cell, which resulted in approximately 1 EBV genome per 1,000–10,000 cells in individuals with detectable EBV DNA (that is, our limit of detection was approximately 1 EBV genome per 10,000 cells). To contextualize these values, the upper range of EBV copy numbers in healthy individuals measured using qPCR was 10³ EBV genomes per 1 µg DNA, or 1 EBV genome per 200 cells23. The latter number was estimated with the assumption that 10⁵ cells produce 0.5 µg DNA. Although a previous study similarly used EBV reads in a cohort of ~8,000 donors, that analysis did not correct for the repetitive, biased DNA abundances that significantly skewed the resulting quantification4. After quantifying per-person EBV DNA abundance, 85.7% of individuals in the UKB had no detectable EBV DNA.
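The arithmetic in this paragraph can be written out as below. This is our reading of the text, not code from the study: we assume that 30× coverage of a diploid genome corresponds to 15 cell equivalents, and the 150-base read used in the usage note is a hypothetical value chosen only to illustrate the limit of detection.

```python
EFFECTIVE_EBV_GENOME_SIZE = 143_965  # J', well-covered bases (from the text)
HUMAN_WGS_COVERAGE = 30.0            # nominal human coverage (from the text)
HUMAN_PLOIDY = 2                     # diploid human genome

def ebv_copies_per_cell(masked_coverage_sum):
    """Estimate EBV genome copies per human cell from the masked coverage sum.

    Assumption: 30x coverage of a diploid genome corresponds to
    30 / 2 = 15 sequenced cell equivalents.
    """
    ebv_coverage = masked_coverage_sum / EFFECTIVE_EBV_GENOME_SIZE
    cell_equivalents = HUMAN_WGS_COVERAGE / HUMAN_PLOIDY
    return ebv_coverage / cell_equivalents

def cells_per_ebv_genome_qpcr(genomes_per_ug=1e3, cells=1e5, ug_from_cells=0.5):
    """Convert the qPCR figure (EBV genomes per ug DNA) into cells per EBV genome."""
    cells_per_ug = cells / ug_from_cells  # 1e5 cells per 0.5 ug = 2e5 cells per ug
    return cells_per_ug / genomes_per_ug

# A single hypothetical 150-base read contributes 150 to the coverage sum.
cells_per_copy_one_read = 1 / ebv_copies_per_cell(150)
qpcr_cells_per_copy = cells_per_ebv_genome_qpcr()
```

Under these assumptions, a single read corresponds to roughly 1 EBV genome per 14,000 cells, in line with the stated limit of detection of approximately 1 per 10,000 cells, and the qPCR conversion reproduces the figure of 1 EBV genome per 200 cells.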
In the UKB cohort, over 90% of individuals are seropositive, yet only 14.3% of individuals have non-zero EBV DNA levels detected. Therefore, we conducted a simulation study to better characterize this discrepancy. Using maximum likelihood estimation, we estimated the mean and standard deviation of a log-normal distribution to initialize the simulation and subsequently modified these values to (1) account for a mixture including 10% zeros (representing the individuals who were not infected with EBV) and (2) round the mean to an interpretable number. In the final simulation (Extended Data Fig. 1d), values were set to zero for 50,000 individuals, whereas the remaining 450,000 individuals were simulated via a log-normal distribution with a mean of 0.2 EBV genome copies per 10,000 cells, a standard deviation of 0.62 and a censored value of 0.71. We emphasize that this simulation does not test an explicit statistical question but is designed primarily for illustrative purposes, to show that a single underlying component can explain many features of the empirical data (rather than requiring a second condition).
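A simulation of this shape can be sketched as follows. The parameterization is an assumption on our part, because the text does not state it: we treat 0.2 as the median on the original scale, 0.62 as the standard deviation on the log10 scale, and 0.71 as a detection threshold below which values are recorded as zero. Other readings of these parameters would give different detectable fractions.

```python
import math
import random

def simulate_ebv_loads(n_total=500_000, n_uninfected=50_000, center=0.2,
                       sd_log10=0.62, censor_below=0.71, seed=0):
    """Simulate observed EBV loads (copies per 10,000 cells) for n_total individuals.

    ASSUMPTIONS (not stated in the source): `center` is the median of the
    log-normal, `sd_log10` is the SD on the log10 scale, and loads below
    `censor_below` are observed as zero (undetected).
    """
    rng = random.Random(seed)
    mu = math.log10(center)
    loads = [0.0] * n_uninfected  # never-infected individuals
    for _ in range(n_total - n_uninfected):
        value = 10 ** rng.gauss(mu, sd_log10)
        loads.append(value if value >= censor_below else 0.0)
    return loads

loads = simulate_ebv_loads()
fraction_detected = sum(1 for v in loads if v > 0) / len(loads)
```

The point of the sketch is the structure: one infected population with a single log-normal load distribution, plus censoring at the detection limit, is enough to produce a large share of seropositive individuals with no detectable EBV DNA.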