Ethics
This research complies with all relevant ethical regulations. The study protocol was determined to be not human subjects research by the Broad Institute Office of Research Subject Protection as all data analysed were previously collected and de-identified.
Quantification of microbial relative abundances from saliva WGS
We analysed saliva-derived WGS data previously generated for 12,519 individuals from the SPARK cohort of the Simons Foundation Autism Research Initiative (SFARI)25. In brief, DNA extracted from saliva samples was prepared with PCR-free methods for 150 bp paired-end sequencing on Illumina NovaSeq 6000 machines. Reads were aligned to human reference build GRCh38 by the New York Genome Center using Centers for Common Disease Genomics project standards. Details of saliva sample collection, DNA extraction and sequencing were described in ref. 26 (which analysed data from sequencing waves WGS1–3 of the SPARK integrated WGS (iWGS) v.1.1 dataset; here we analysed WGS1–5, which adds samples from subsequent sequencing waves).
From the CRAM files previously aligned to GRCh38, we extracted all unmapped reads for subsequent generation and analysis of oral microbiome phenotypes. This retrieved a median of 67.5 million unmapped reads per sample (quartiles: 35.9 million, 126.1 million), comparable with the total number of reads used for previous metagenomic characterization of human microbiomes71 and consistent with previous analyses of SPARK samples for oral microbiome profiling26,27. Unmapped reads were converted to compressed FASTQ with samtools (v.1.15.1) and then used as input for microbiome profiling using MetaPhlAn (v.4.0.6) with the vOct22 reference database. To evaluate the robustness of these oral microbiome profiles, relative abundances of species in SPARK samples were compared with those from samples in the Human Microbiome Project72 by first subsetting to species profiled in both cohorts and then performing principal coordinate analysis using Bray–Curtis distance (Extended Data Fig. 1b).
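The cross-cohort comparison can be illustrated with a minimal Python sketch of the distance step only: subset to species seen in both cohorts, then compute pairwise Bray–Curtis dissimilarities. The function names, sample IDs and toy abundances are hypothetical, sample IDs are assumed to be unique across cohorts, and abundances are not renormalized after subsetting (the text does not say whether this was done). Principal coordinate analysis would then be run on the resulting matrix.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    if len(u) != len(v):
        raise ValueError("profiles must cover the same species")
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den > 0 else 0.0


def shared_species_distance_matrix(profiles_a, profiles_b):
    """Subset to species present in both cohorts, then compute all pairwise distances.

    Each argument maps sample ID -> {species name: relative abundance}.
    Assumes sample IDs do not collide between the two cohorts.
    """
    species_a = {sp for prof in profiles_a.values() for sp in prof}
    species_b = {sp for prof in profiles_b.values() for sp in prof}
    shared = sorted(species_a & species_b)
    samples = {**profiles_a, **profiles_b}
    ids = list(samples)
    # Species absent from a sample's profile are treated as zero abundance.
    vectors = {s: [samples[s].get(sp, 0.0) for sp in shared] for s in ids}
    matrix = [[bray_curtis(vectors[i], vectors[j]) for j in ids] for i in ids]
    return ids, shared, matrix


# Toy data (hypothetical): two samples from cohort A, one from cohort B.
cohort_a = {
    "a1": {"sp1": 70.0, "sp2": 30.0},
    "a2": {"sp1": 20.0, "sp2": 60.0, "sp3": 20.0},
}
cohort_b = {"b1": {"sp1": 50.0, "sp2": 50.0}}

ids, shared, matrix = shared_species_distance_matrix(cohort_a, cohort_b)
```

Here `sp3` is dropped because it appears in only one cohort, so every distance is computed over the same two species.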
From the relative abundance phenotype generated for each species by MetaPhlAn, we estimated the fractions of variance explained by covariates (age, sex, ASD and genetic ancestry PCs; Fig. 1c) using analysis of variance (finding the sum-of-squares for each covariate and dividing by the total sum-of-squares).
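The sum-of-squares ratio can be sketched as follows for a single covariate at a time. This is a simplified illustration under stated assumptions, not the analysis code: the text does not specify whether covariates were fitted jointly or sequentially, and the function names and toy values are hypothetical. A categorical covariate (such as sex) uses the between-group sum-of-squares; a continuous covariate (such as age) uses the regression sum-of-squares from a one-variable linear fit.

```python
from statistics import fmean


def total_sum_of_squares(y):
    """Sum of squared deviations of y from its mean."""
    mean_y = fmean(y)
    return sum((v - mean_y) ** 2 for v in y)


def categorical_fraction(y, groups):
    """SS_between / SS_total for one categorical covariate."""
    grand = fmean(y)
    by_group = {}
    for value, g in zip(y, groups):
        by_group.setdefault(g, []).append(value)
    ss_between = sum(
        len(vals) * (fmean(vals) - grand) ** 2 for vals in by_group.values()
    )
    return ss_between / total_sum_of_squares(y)


def continuous_fraction(y, x):
    """SS_regression / SS_total for one continuous covariate (simple linear fit)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    ss_regression = sxy ** 2 / sxx
    return ss_regression / total_sum_of_squares(y)


# Toy example (hypothetical values): one species, four samples.
abundance = [1.0, 2.0, 3.0, 4.0]
sex = ["F", "F", "M", "M"]
age = [10.0, 20.0, 30.0, 40.0]

frac_sex = categorical_fraction(abundance, sex)
frac_age = continuous_fraction(abundance, age)
```

In this toy case the two covariates are strongly correlated, so their single-covariate fractions overlap and do not sum to at most 1; a joint model would be needed to partition variance among them.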
Generation of mPCs
Relative abundance measures of all microbes were first filtered to 439 entries corresponding to microbial species found in at least 10% of SPARK DNA samples. Rank-based inverse normal transformation across individuals was then performed for the abundance of each species. This transformation did not always produce values with mean 0 and variance 1 (due to large fractions of samples with zero abundance), so the transformed values were then centred and scaled for use as input to PC analysis, yielding 439 microbial abundance principal components (mPCs) representing orthogonal axes of microbial variation.
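A small Python sketch shows why the extra centring and scaling step is needed. It assumes tied values share their average rank and uses the rank offset (rank − 0.5)/n; both choices are assumptions, since the text does not state the tie-handling rule or offset, and the function names and toy abundances are hypothetical. The sketch stops before the PC analysis.

```python
from statistics import NormalDist, fmean, pstdev


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def inverse_normal_transform(values, offset=0.5):
    """Map ranks to standard-normal quantiles: z = Phi^-1((rank - offset) / n)."""
    n = len(values)
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / n) for r in average_ranks(values)]


def standardize(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mean, sd = fmean(values), pstdev(values)
    return [(v - mean) / sd for v in values]


# Toy species (hypothetical): absent from 6 of 10 samples.
abundance = [0.0] * 6 + [0.1, 0.4, 0.9, 2.5]

transformed = inverse_normal_transform(abundance)
scaled = standardize(transformed)
```

Because the six zero-abundance samples all receive the same tied rank, `transformed` has a non-zero mean and a variance different from 1; `scaled` restores mean 0 and unit variance.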
Genotyping and quality control of human genetic variants in SPARK
Variant calling in SPARK was previously performed using DeepVariant (v.1.3.0) to produce sample-level VCFs from reads aligned to GRCh38 followed by GLnexus (v.1.4.1) to call variants jointly across the cohort. We performed a series of QC steps on the joint call set, starting by converting half-calls to missing and then excluding variants with >10% missingness using plink2 (v.2.00a3.6LM). Variants were further excluded if they had a minor allele frequency <1% or if they had a Hardy–Weinberg equilibrium exact test P < 1 × 10⁻⁶ with mid-P adjustment for excessive heterozygosity, leaving 12,525,098 common variants. Genetic ancestry PCs were generated by LD-pruning variants in 500 kb windows with r² > 0.1 and then running plink2 --pca approx. No individuals were filtered for outlier heterozygosity after inspection in each genetic ancestry group. Variants were then filtered to those present in the TOPMed-r3 imputation panel to exclude those in regions of poor mappability to produce a final set of 9,618,621 common variants to test for association with oral microbiome phenotypes.
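The logic of the first QC filters (half-calls set to missing, missingness above 10% excluded, minor allele frequency below 1% excluded) can be illustrated with a toy Python sketch. This is not the plink2 implementation: the genotype encoding, function names and example variants are hypothetical, sites are assumed biallelic and diploid, and the Hardy–Weinberg exact test is omitted.

```python
def normalise_call(call):
    """Turn a half-call such as (0, None) into a fully missing genotype."""
    return None if call is None or None in call else call


def missing_rate(calls):
    """Fraction of samples with a missing genotype."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from non-missing diploid calls (0 = reference, 1 = alternate)."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt = sum(sum(c) for c in observed) / (2 * len(observed))
    return min(alt, 1 - alt)


def passes_qc(raw_calls, max_missing=0.10, min_maf=0.01):
    """Keep a variant if missingness <= 10% and MAF >= 1% after half-call handling."""
    calls = [normalise_call(c) for c in raw_calls]
    return (
        missing_rate(calls) <= max_missing
        and minor_allele_frequency(calls) >= min_maf
    )


# Toy variants across 20 samples (hypothetical).
variant_ok = [(0, 1)] * 4 + [(0, 0)] * 15 + [(0, None)]            # 5% missing
variant_missing = [(0, 1)] * 4 + [(0, 0)] * 13 + [(0, None)] * 3   # 15% missing
variant_rare = [(0, 0)] * 20                                       # monomorphic

results = [passes_qc(v) for v in (variant_ok, variant_missing, variant_rare)]
```

Converting half-calls before computing missingness matters here: `variant_missing` fails only because its three half-calls count as missing genotypes.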