The DNA virome varies with human genes and environments

Ethics

This research complies with all relevant ethical regulations. The study protocol (NHSR-8429) was determined to be not human subject research by the Broad Institute Office of Research Subject Protection as all data analysed were previously collected and de-identified. Use of SPARK data for this research was approved by SFARI (project 3350.2).

UK Biobank, All of Us, and SPARK WGS data

All whole-genome sequencing (WGS) data analysed in this work were generated in previous studies. For all cohorts, library preparation used PCR-free methods and sequencing was performed on Illumina NovaSeq 6000 machines. WGS from UK Biobank (UKB; ref. 23) was performed on libraries prepared with the NEBNext Ultra II PCR-free kit (New England Biolabs) using blood-derived DNA from 490,401 individuals, at the deCODE facility in Reykjavik, Iceland, and the Wellcome Sanger Institute (Sanger), Cambridge, UK. Sequencing reads were aligned to the human reference build GRCh38 graph genome with Illumina DRAGEN Bio-IT Platform Germline Pipeline v.3.7.8. Samples were sequenced to an average coverage of 32.5×.

WGS from the NIH All of Us (AoU) v8 cohort (ref. 24) was performed on blood- and saliva-derived DNA from 414,817 individuals (365,918 blood-derived and 48,899 saliva-derived samples). Sequencing reads from libraries prepared with the PCR-Free Kapa HyperPrep library construction kit were aligned to human reference build GRCh38 with Illumina DRAGEN Bio-IT Platform Germline Pipeline v.3.4.12. Samples were sequenced to an average coverage of 37.9×.

WGS from the SPARK cohort of the Simons Foundation Autism Research Initiative (SFARI) (ref. 25) was performed on saliva-derived DNA from 12,519 individuals. Sequencing reads from libraries prepared with the Illumina DNA PCR-Free Library Prep kit were aligned to human reference build GRCh38 with BWA-MEM by the New York Genome Center (NYGC) using Centers for Common Disease Genomics project standards. Samples were sequenced to an average coverage of 42×. Details of saliva sample collection, DNA extraction and sequencing were described in ref. 68, which analysed data from sequencing waves WGS1–3 of the SPARK integrated WGS (iWGS) v.1.1 dataset; here we analysed WGS1–5, which add samples from subsequent sequencing waves.

Genotypes of genome-wide human genetic variants (SNPs and insertion–deletions) were previously generated for all three datasets. For UKB, we analysed genotypes previously imputed into UKB SNP-array data (ref. 55) using the TOPMed reference panel (ref. 69), because the final UKB WGS genotype call set was not available at the time of analysis. For AoU and SPARK, we analysed genotypes previously called from WGS using DRAGEN (AoU; ref. 24) and DeepVariant (SPARK; ref. 25).
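As a bookkeeping aid, the cohort details given in this section can be collected into a small lookup structure. The Python sketch below is illustrative only: the `Cohort` class, its field names and the helper function are hypothetical and are not part of the study's pipeline. The values are copied from the cohort descriptions in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cohort:
    """Hypothetical record type summarizing one WGS cohort."""
    name: str
    n_individuals: int
    dna_source: str
    alignment: str
    mean_coverage: float
    genotype_source: str


# Values transcribed from the cohort descriptions in this section.
COHORTS = [
    Cohort("UKB", 490_401, "blood",
           "DRAGEN v.3.7.8, GRCh38 graph genome", 32.5,
           "imputed into SNP-array data (TOPMed panel)"),
    Cohort("AoU", 414_817, "blood and saliva",
           "DRAGEN v.3.4.12, GRCh38", 37.9,
           "called from WGS with DRAGEN"),
    Cohort("SPARK", 12_519, "saliva",
           "BWA-MEM, GRCh38", 42.0,
           "called from WGS with DeepVariant"),
]

# AoU sample counts by DNA source, as reported in this section.
AOU_BY_SOURCE = {"blood": 365_918, "saliva": 48_899}


def total_individuals(cohorts):
    """Sum the number of sequenced individuals across cohorts."""
    return sum(c.n_individuals for c in cohorts)


if __name__ == "__main__":
    # Consistency check: the AoU per-source counts should add up
    # to the reported AoU cohort size.
    aou = next(c for c in COHORTS if c.name == "AoU")
    assert sum(AOU_BY_SOURCE.values()) == aou.n_individuals
    for c in COHORTS:
        print(f"{c.name}: n={c.n_individuals:,}, {c.mean_coverage}x, {c.dna_source}")
```

Keeping the DNA source alongside each cohort is a deliberate choice in this sketch, since blood- and saliva-derived samples differ between cohorts and within AoU.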

Selection of viral reference genomes

We selected a panel of 31 viruses for which to profile viral DNA load (Supplementary Table 1) based on previous work that identified the most prevalent viruses observed in blood-derived DNA sequencing data from 8,240 individuals (ref. 9). Additional viral reference genomes were included to more comprehensively represent viral families previously observed. For example, adenovirus was represented by human adenovirus type 7 and human adenovirus E (type 4) reference genomes, as these are two of the most commonly observed in adults (ref. 70), and polyomavirus was represented by a set of common types (1/BK, 2/JC, 3/KI, 4/WU, 5/MC, 6, 7) (refs. 71,72). Undetected herpesviruses (HSV-2 and VZV) were added given the high general prevalence of EBV and HHV-7. To maximize representation of the diversity of anellovirus genomes while limiting the size of the viral reference panel, we selected a representative from each of the five recently described anellovirus clades with available NCBI reference genomes (ref. 73) (Extended Data Fig. 1a).

We prioritized a smaller set of viruses to profile (rather than attempting to comprehensively characterize viral diversity) because our main goal was to identify effects of human genetics, age, sex and environmental exposures (such as smoking) on viral DNA load, and we only had statistical power to detect such effects for commonly observed viruses. Working with a smaller set of common viruses also allowed us to perform careful quality control (QC) on WGS-based quantifications of viral DNA load, which was important given the potential for read alignment artefacts.
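To make the power argument concrete: the expected number of individuals with detectable viral DNA scales linearly with prevalence, so a rarely observed virus leaves few carriers to analyse even in a very large cohort. The sketch below uses the UKB sample size reported above; the function name and the prevalence values are hypothetical, chosen only for illustration and not taken from the study.

```python
def expected_carriers(n_individuals: int, prevalence: float) -> float:
    """Expected number of individuals in whom a virus is detected.

    Assumes each individual is independently positive with probability
    `prevalence` (a simplifying assumption made for this illustration).
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    return n_individuals * prevalence


UKB_N = 490_401  # UKB sample size reported in this section

if __name__ == "__main__":
    # Hypothetical prevalences, for illustration only.
    for prevalence in (0.0001, 0.01, 0.1):
        n = expected_carriers(UKB_N, prevalence)
        print(f"prevalence {prevalence:.2%}: about {n:,.0f} expected carriers")
```

Under these assumptions, association analyses for a virus seen in a tiny fraction of samples would rest on tens of carriers, which is consistent with the decision to focus on commonly observed viruses.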
