Bacterial genomic datasets compiled for this study
Several bacterial genomic datasets were compiled for this study, and different selections of these datasets were used, as appropriate, for the analyses described herein. We refer to these datasets as: ‘NCBI complete genomes’, ‘NCBI long-read assemblies’, ‘NCBI short-read E. faecium assemblies’, ‘SHC long-read Enterococcus isolates’ and ‘SHC long-read metagenomes’. We also define the ‘high-contiguity Enterococcus genomes’, which consist of Enterococcus genomes from (1) NCBI complete genomes, (2) NCBI long-read assemblies, and (3) SHC long-read Enterococcus isolates. Datasets were combined without deduplication.
To aid readers in interpreting our manuscript and for transparency, Supplementary Table 1 contains (1) a detailed description of bacterial genomic datasets compiled for this study, (2) descriptions of any additional filtering for specific analyses, and (3) a description of how these datasets are used in each main figure and Extended Data figure.
NCBI long-read assemblies: long-read dataset curation, preprocessing, assembly and quality control
Candidate isolate long-read sequencing datasets were identified on NCBI with the following search criteria on 29 February 2024: ((((Enterococcus[Organism]) OR (Staphylococcus[Organism]) OR (Klebsiella[Organism]) OR (Acinetobacter[Organism]) OR (Pseudomonas[Organism]) OR (Enterobacter[Organism]) OR (Escherichia[Organism]) OR (Salmonella[Organism]) OR (Streptococcus[Organism])) AND ‘oxford nanopore’[Platform])). Datasets for Shigella and Bordetella were queried separately on 28 October 2024 in the same way. Metadata for these accessions were downloaded with the Sequence Read Archive (SRA) Run Selector interface. Datasets with less than 50 Mb of total sequencing or an average read length of less than 2 kb were excluded from further analysis. Raw reads were downloaded using SRA tools prefetch and fasterq-dump, and assembled with flye56 (v.2.9) using the --nano-raw method. Genome taxonomy was verified with GTDB-Tk57 (v.2.4) with database release v.220.
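The read-level thresholds above can be illustrated with a minimal sketch. This is not the authors' code (see ‘Code Availability’ for that); the field names total_bases and total_reads, the run labels, and the reading of 1 Mb as 10^6 bases are assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_TOTAL_BASES = 50_000_000   # 50 Mb of total sequencing (assuming 1 Mb = 10^6 bases)
MIN_MEAN_READ_LENGTH = 2_000   # 2 kb average read length


@dataclass
class RunMetadata:
    accession: str     # hypothetical run label, not a real SRA accession
    total_bases: int   # hypothetical field: total sequenced bases in the run
    total_reads: int   # hypothetical field: number of reads in the run


def mean_read_length(run):
    """Average read length in bases; 0.0 for a run with no reads."""
    return run.total_bases / run.total_reads if run.total_reads else 0.0


def passes_read_filter(run):
    """Keep runs with at least 50 Mb in total and at least 2 kb mean read length."""
    return (run.total_bases >= MIN_TOTAL_BASES
            and mean_read_length(run) >= MIN_MEAN_READ_LENGTH)


runs = [
    RunMetadata("RUN_A", 400_000_000, 80_000),   # 5 kb mean read length: kept
    RunMetadata("RUN_B", 30_000_000, 5_000),     # under 50 Mb in total: dropped
    RunMetadata("RUN_C", 120_000_000, 100_000),  # 1.2 kb mean read length: dropped
]
kept = [run.accession for run in runs if passes_read_filter(run)]
```

Because the text excludes datasets with "less than" each threshold, the sketch keeps runs that sit exactly on a threshold.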
Assemblies were then filtered for contiguity (≤20 contigs), genome size (within ±2 s.d. of the mean for the genus), coverage (≥40×) and the presence of a species-level taxonomic assignment. We excluded the GTDB-Tk taxonomy result for the Shigella datasets because the GTDB-Tk database does not include Shigella genomes and thus cannot classify them beyond E. coli. Finally, we retained only species represented by more than ten samples across this dataset and the NCBI complete genomes (see ‘NCBI complete genomes’ below) combined. The final set of genomes with assembly and taxonomy information is available in Supplementary Table 2. We refer to this dataset in the manuscript as NCBI long-read assemblies. Relevant source code and data can be found in our repository (‘Code Availability’).
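The assembly-level filters can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the dictionary keys are hypothetical, the size bounds are assumed to use the sample standard deviation computed over all input assemblies of a genus before any other filter is applied, and the text does not state the order in which the filters were run.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

MAX_CONTIGS = 20       # contiguity threshold (<=20 contigs)
MIN_COVERAGE = 40.0    # coverage threshold (>=40x)
SIZE_SD_LIMIT = 2.0    # genome size within +/-2 s.d. per genus
MIN_SPECIES_COUNT = 10 # species must have MORE than this many samples


def genus_size_bounds(assemblies):
    """Return {genus: (low, high)} as mean +/- 2 s.d. of genome size."""
    sizes = defaultdict(list)
    for asm in assemblies:
        sizes[asm["genus"]].append(asm["genome_size"])
    bounds = {}
    for genus, values in sizes.items():
        if len(values) < 2:
            # s.d. is undefined for one genome; leave unbounded (assumption)
            bounds[genus] = (float("-inf"), float("inf"))
            continue
        centre, spread = mean(values), stdev(values)
        bounds[genus] = (centre - SIZE_SD_LIMIT * spread,
                         centre + SIZE_SD_LIMIT * spread)
    return bounds


def filter_assemblies(assemblies):
    """Apply the contiguity, coverage, genome size and taxonomy filters."""
    bounds = genus_size_bounds(assemblies)
    kept = []
    for asm in assemblies:
        low, high = bounds[asm["genus"]]
        if (asm["n_contigs"] <= MAX_CONTIGS
                and asm["coverage"] >= MIN_COVERAGE
                and low <= asm["genome_size"] <= high
                and asm["species"]):  # empty or None means no species-level call
            kept.append(asm)
    return kept


def filter_by_species_count(long_read, complete):
    """Keep long-read assemblies of species with >10 samples across both sets."""
    counts = Counter(asm["species"] for asm in long_read + complete)
    return [asm for asm in long_read
            if counts[asm["species"]] > MIN_SPECIES_COUNT]
```

With the sample standard deviation, a single genome can fall outside ±2 s.d. only when the genus has at least six assemblies, so the size filter has no effect on very small genera under this reading.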
NCBI complete genomes
The NCBI command line tool ncbi-datasets-cli (v.16.22.1)58 was used to download all available NCBI genomes with the ‘complete’ assembly level from the following taxa on 3 July 2024: Enterobacter, Staphylococcus, Klebsiella, Acinetobacter, Pseudomonas, Enterococcus, Escherichia, Salmonella, Streptococcus, Shigella and Bordetella. For example, the following command was used to download Enterococcus assemblies: datasets download genome taxon enterococcus --assembly-level complete --include genome,gff3 --dehydrated --filename enterococcus.zip. Note that, because of the --dehydrated option, this command collects only references to the relevant genomic data; the data themselves must then be downloaded using the ‘rehydrate’ command. After rehydrating, duplicate assemblies were removed, preferring RefSeq over GenBank as the source database. Assemblies were then filtered according to the same contiguity, size, coverage and sample count thresholds as above, and the final set is likewise available in Supplementary Table 2. We refer to this dataset in the manuscript as NCBI complete genomes. Relevant source code and data can be found in our repository (‘Code Availability’).
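One way to implement the RefSeq-over-GenBank preference is sketched below. The text does not say how duplicates were identified; this sketch assumes that a paired GenBank (GCA_) and RefSeq (GCF_) assembly share the numeric part of their accession. The accessions in the usage note are made-up placeholders.

```python
import re

# Assembly accessions look like GCF_<digits>.<version> or GCA_<digits>.<version>
ACCESSION_RE = re.compile(r"^(GCF|GCA)_(\d+)\.(\d+)$")


def deduplicate_assemblies(accessions):
    """Keep one accession per assembly, preferring RefSeq (GCF_) over GenBank (GCA_).

    Assumption: two accessions are duplicates when their numeric parts match.
    If several versions from the same database are present, the first one
    encountered is kept.
    """
    chosen = {}
    for accession in accessions:
        match = ACCESSION_RE.match(accession)
        if match is None:
            raise ValueError(f"unrecognized assembly accession: {accession}")
        prefix, number, _version = match.groups()
        current = chosen.get(number)
        # Take the first accession seen, or upgrade a GenBank entry to RefSeq
        if current is None or (prefix == "GCF" and current.startswith("GCA")):
            chosen[number] = accession
    return sorted(chosen.values())
```

For example, deduplicate_assemblies(["GCA_000000001.1", "GCF_000000001.1", "GCA_000000002.1"]) keeps GCF_000000001.1 for the first assembly and GCA_000000002.1 for the second, which has no RefSeq counterpart.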
SHC long-read Enterococcus isolates: collection and sequencing
A total of 279 bloodstream infection isolates were collected under a protocol approved by the Stanford University Institutional Review Board (Protocol no. 64591; Principal Investigator: A. Bhatt, Committee IRB-61). These isolates had been identified as E. faecalis (n = 170) or E. faecium (n = 109) by the Stanford Health Care Clinical Microbiology Laboratory, using biotyping with a MALDI-TOF based Bruker Biotyper, during two periods: 1 October 2020 to 31 December 2021 and 1 January 2023 to 31 December 2023. Informed consent was obtained from all human research participants. An additional eight isolates collected in 2024 and identified as atypical Enterococcus by the same method were stored as glycerol stocks. Metadata, including date, site of collection and relevant patient information, were extracted. Each isolate was cultured aerobically for approximately 16 h with shaking at 37 °C in 3 ml Brain Heart Infusion (BHI) broth (Sigma-Aldrich). Then, a 1:100 subculture into 10 ml BHI broth was prepared. Subcultures were allowed to reach an OD of approximately 1.0. The subculture was centrifuged at 3,000g for 10 min at room temperature, resuspended in PBS, then centrifuged again. The PBS was poured off and the resulting pellet was resuspended in 1 ml 1× Zymo DNA/RNA Shield (Zymo, catalogue no. R1100-250) and shipped to Plasmidsaurus for Oxford Nanopore Technologies (ONT) ‘Standard Bacterial Genome with DNA extraction’ sequencing. Inspection of the assemblies showed that biotyping had mischaracterized one E. faecium isolate (EF_B_40) as E. faecalis and one E. faecalis isolate (FM_B_40) as E. faecium. Of the 170 E. faecalis isolates, 167 were confirmed as E. faecalis. Of the 109 E. faecium isolates, 107 were confirmed as E. faecium with standard quality control performed by Plasmidsaurus. Finally, assemblies were filtered according to the same contiguity and coverage thresholds as above, resulting in 166 E. faecalis and 100 E. faecium genomes.
We refer to this dataset in the manuscript as SHC long-read Enterococcus isolates.