Genome-wide sweeps create ecological units in the human gut microbiome

Isolate genome collection

A total of 19,837 publicly available isolate genomes were collected, comprising the entire Unified Human Gastrointestinal Genome (UHGG) catalogue45 (v.1.0, 10,648 genomes) and the genomes from 4 large-scale culturomic studies28,46,47,48. Furthermore, 186 isolates from Austrian individuals were newly collected and sequenced. A total of 50 patients undergoing colorectal cancer (CRC) screening colonoscopy at the Vienna General Hospital were enrolled in the study, comprising 24 individuals with irritable bowel syndrome, 5 with ulcerative colitis (UC) and 21 healthy participants. None of the participants were found to have carcinomas during the colonoscopy.

For bacterial isolation from participants in the Austrian study, brush samples and biopsy samples were collected during colonoscopy from the ileal or caecal mucosa or the ascending colon with or without an endoscopically visible biofilm and were immediately processed49. Brush and biopsy samples were vortexed or homogenized in 0.6 ml of 0.9% NaCl, and the suspensions were subsequently plated under one of the following six culture conditions: Columbia agar with 5% sheep blood, MacConkey agar, Columbia CNA agar with 5% sheep blood or CPS agar (Becton Dickinson) under aerobic conditions at 37 °C; Brucella agar with 5% horse blood or Schaedler KV agar with 5% sheep blood (Becton Dickinson) under anaerobic conditions at 37 °C. Aerobic cultures were assessed after 18 h and 48 h, and anaerobic cultures were assessed after 48 h and 72 h. Colonies identified as Bacteroides or Parabacteroides by matrix-assisted laser desorption ionization time-of-flight mass spectrometry analysis on a MALDI Biotyper MBT smart instrument (Bruker) or by 16S rRNA gene sequencing on a capillary sequencer (SeqStudio Genetic Analyzer, Applied Biosystems by Thermo Fisher Scientific) were saved as glycerol stocks at −80 °C.

Glycerol stocks of Bacteroides and Parabacteroides were cultured in brain heart infusion medium with supplements for 24 h before DNA isolation. DNA isolation was performed on a KingFisher Flex instrument (Thermo Fisher Scientific) using a MagMAX DNA Multi-Sample Ultra 2.0 kit (Thermo Fisher Scientific), which included an initial proteinase K digestion step and an RNase treatment step. Sequencing libraries were prepared using a NEBNext Ultra II FS DNA Library Prep kit, with NEBNext Multiplex Oligos for Illumina barcodes. Sequencing was performed on an Illumina NovaSeq 6000 platform using SP flow cells (300 cycles, 2 × 150 bp paired-end reads). Reads were trimmed, filtered and merged with BBMap50 (v.38.90; ktrim=r k=21 mink=11 hdist=2 minlen=125 qtrim=r trimq=15), and de novo genome assembly was performed using SPAdes (v.3.15.5)51 under isolate mode.

Isolate quality filtering and taxonomic assignment

A total of 20,023 genomes were collected and evaluated using CheckM (v.1.2.2)52 to screen for genomes that met the following criteria: >85% genome completeness, <5% contamination and N50 > 50 kb. This selection process produced a total of 16,864 genomes that were of sufficient quality for downstream analyses. We assigned each genome to an SGB according to the MetaPhlAn4 reference genome database (v.Jan 2022)23. Each genome was assigned to the SGB it was most closely related to based on FastANI (v.1.33)53, with the centroid genome as the primary reference or, if unavailable, a representative genome. To account for potential mis-assignments in species with boundaries slightly lower than the commonly used threshold (95%)54, we adopted a more relaxed species boundary of 94% average nucleotide identity (ANI) for SGB assignment. Specifically, SGB assignments were only made for genomes that were less than 6% divergent from at least one reference genome with over 30% sequence alignment. As a result, the total divergence in each SGB could be as high as 10–12%. Genomes failing to meet the 94% ANI cutoff with their closest relatives were converted to synthetic fastq reads (ART-2016.06.05, -ss HS25)55 and assigned by MetaPhlAn4 (v.4.0.3)23. Genomes that could not be assigned by either method were excluded. We further checked whether the ANI-based and MetaPhlAn4-based SGB assignments were generally consistent by converting a random subset of genomes in each ANI-based SGB to synthetic fastq reads (ART-2016.06.05, -ss HS25)55 and assigning them by MetaPhlAn4. Although the majority of ANI-based and MetaPhlAn4-based SGB assignments were congruent, certain ANI-based SGBs had all genomes assigned to another SGB in MetaPhlAn4. All genomes in these SGBs were reassigned to their corresponding MetaPhlAn4-assigned SGB (Supplementary Table 4).
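
The quality thresholds and the ANI-based assignment rule described above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the record types `GenomeQC` and `AniHit`, the function names and the representation of "sequence alignment" as a fraction between 0 and 1 are all assumptions made for illustration, and the MetaPhlAn4 fallback for unassigned genomes is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenomeQC:
    """Hypothetical per-genome quality summary (e.g. parsed from CheckM output)."""
    genome_id: str
    completeness: float   # percent
    contamination: float  # percent
    n50: int              # base pairs


@dataclass
class AniHit:
    """Hypothetical comparison of one genome against one SGB reference genome."""
    sgb: str
    ani: float               # percent average nucleotide identity
    aligned_fraction: float  # assumed to be a fraction between 0 and 1


def passes_qc(g: GenomeQC) -> bool:
    # Thresholds from the text: >85% completeness, <5% contamination, N50 > 50 kb.
    return g.completeness > 85.0 and g.contamination < 5.0 and g.n50 > 50_000


def assign_sgb(hits: List[AniHit],
               min_ani: float = 94.0,
               min_aligned: float = 0.30) -> Optional[str]:
    # Keep only references less than 6% divergent with over 30% alignment,
    # then pick the most closely related one; None means "no ANI-based assignment".
    eligible = [h for h in hits
                if h.ani > min_ani and h.aligned_fraction > min_aligned]
    if not eligible:
        return None
    return max(eligible, key=lambda h: h.ani).sgb
```

In this sketch, a return value of `None` corresponds to the genomes that would then be passed to the read-based assignment step.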

Isolate genome filtering based on metadata

A key step to ensure unbiased genome-wide sweep identification is to filter the genomes so that each SGB only contains isolates that originate from different individuals. For each isolate, we retrieved information on human participant identifiers, age, sex, health status, country, year and creator of collection, and BioProject accession number from the UHGG database45, as well as from the text or supplementary materials of the respective publications. For each SGB, we only retained isolates that originated from a different human participant based on either a unique identifier, country of sample or BioProject number (representing a different study; studies with multiple BioProject numbers were checked for and manually corrected). For studies with more than ten genomes but no human participant identifier or country information, we created a human participant identifier as a combination of the following five factors: age (or age group when the exact age was not available), sex, health status of the participant, year and creator of collection. Human participants with different combination-based identifiers were considered as different individuals. When multiple genomes from a single SGB were isolated from the same individual, we chose the genome with the highest quality score according to dRep (v.3.4.1)56. This procedure resulted in a final collection of 6,411 high-quality isolate genomes that originated from different human participants (Supplementary Table 5). These isolates spanned 995 SGBs, of which 176 contained more than 5 genomes. As one SGB, SGB10068, assigned as Escherichia coli, contained 1,053 genomes and was larger than any other SGB, we randomly subsampled this SGB to 25% of its original size so that it became comparable in size to the second-largest SGB. This SGB was renamed as SGB10068s to indicate the subsampling process. Each SGB was assigned to its corresponding family, genus and species-level taxonomy in the MetaPhlAn4 database.
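
As a rough illustration of the per-participant filtering and the subsampling step, the following Python sketch keeps one genome per SGB and participant and draws a random 25% subsample. It is a simplified sketch, not the authors' implementation: the dictionary keys (`sgb`, `participant_id`, `score` and the five metadata fields), the function names and the random seed are hypothetical, `score` stands in for a dRep-style quality score, and the country- and BioProject-based checks are omitted.

```python
import random

# The five factors combined into a participant identifier when none is available.
COMPOSITE_FIELDS = ("age", "sex", "health_status", "year", "creator")


def participant_key(rec):
    """Return a hashable key standing in for one human participant."""
    if rec.get("participant_id"):
        return ("id", rec["participant_id"])
    # Fall back to the combination-based identifier described in the text.
    return ("composite",) + tuple(rec.get(f) for f in COMPOSITE_FIELDS)


def one_genome_per_participant(records):
    """Keep the highest-scoring genome for each (SGB, participant) pair."""
    best = {}
    for rec in records:
        key = (rec["sgb"], participant_key(rec))
        if key not in best or rec["score"] > best[key]["score"]:
            best[key] = rec
    return list(best.values())


def subsample(genome_ids, fraction=0.25, seed=0):
    """Randomly keep a fraction of the genomes (the seed is an arbitrary assumption)."""
    rng = random.Random(seed)
    k = round(len(genome_ids) * fraction)
    return rng.sample(list(genome_ids), k)
```

With these assumptions, subsampling an SGB of 1,053 genomes at 25% retains 263 genomes.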

Estimation of recombined and clonal genome fractions via mixture modelling

To estimate the recombined and clonal fractions in a pair of genomes, we developed a method that uses a combination of maximum-likelihood estimations (MLEs) and hidden Markov models (HMMs). This method is conceptually similar to a previously published method22, but with several technical adjustments (detailed at the end of the model validation section below). The major rationale behind the method is that SNPs introduced by mutations between a pair of genomes should be randomly distributed across the genomes, whereas recombination generates regions in the genomes that have an increased or decreased number of SNPs depending on whether the recombined genome fragment stems from a distant or close relative (Extended Data Fig. 1). It is therefore possible to partition genome alignments into regions that have been vertically inherited or recombined based on SNP distributions across the alignment.
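
The underlying idea of partitioning an alignment by local SNP density can be illustrated with a deliberately simplified Python sketch: a two-state Poisson HMM decoded with the Viterbi algorithm over windowed SNP counts. This is not the method developed in this study, which combines MLEs with HMMs and is validated separately. The window-based counting, the fixed per-window rates, the switching probability and all function names are assumptions, and the sketch only captures regions with elevated SNP density (recombination with a distant relative), not depleted ones.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of observing k events under a Poisson(lam) distribution."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def window_snp_counts(snp_positions, genome_length, window):
    """Count SNPs (0-based positions) in consecutive non-overlapping windows."""
    counts = [0] * math.ceil(genome_length / window)
    for p in snp_positions:
        counts[p // window] += 1
    return counts


def viterbi_two_state(counts, clonal_rate, recombined_rate, switch_prob=0.01):
    """Most likely state path: 0 = clonal window, 1 = recombined window."""
    rates = (clonal_rate, recombined_rate)
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob)
    # Uniform prior over the two states for the first window.
    score = [math.log(0.5) + poisson_logpmf(counts[0], r) for r in rates]
    back = []
    for c in counts[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            stay = score[s] + log_stay
            move = score[1 - s] + log_switch
            prev, best = (s, stay) if stay >= move else (1 - s, move)
            new_score.append(best + poisson_logpmf(c, rates[s]))
            ptr.append(prev)
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Under these assumptions, the recombined fraction of a genome pair would be approximated as `sum(path) / len(path)`, and the clonal fraction as its complement.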
