Dominant contribution of Asgard archaea to eukaryogenesis

Database curation

For prokaryotes, 75 million sequences were curated from 47,545 fully sequenced prokaryotic genomes obtained from the NCBI GenBank (https://ftp.ncbi.nlm.nih.gov/genomes/) in November 2023 (Prok2311) and supplemented with 441,150 protein sequences from 63 Asgard metagenomes (refs. 4,5) to give Prok2311As; the metagenomes were selected to represent all clades in these two publications where available. Protein sequences were either taken directly from GenBank annotations or produced by Prodigal v2.6.3 (ref. 75) trained on a set of 12 complete or chromosome-level assembled Asgard genomes. To include only those sequences that are widely present within prokaryotic families, and to minimize the possibility of post-LECA horizontal transfer from eukaryotes, soft-core pangenomes were constructed for each of our 26 curated taxonomic groups, conforming mostly to the NCBI taxonomy rank ‘class’ (see below and Extended Data Fig. 1). To construct the soft-core pangenomes, all sequences from each taxonomic group were clustered individually using mmseqs2 (ref. 37) with ‘mmseqs cluster -s 7 -c 0.9 --cov-mode 0’. All clusters that did not contain sequences from at least 50% of all bacterial species (here defined by the lowest-rank taxonomic ID within the NCBI taxonomy), or 20% of eukaryotic sequences (see below), were rejected from the database. The effects of alternative pangenome definitions, such as 10%, 25%, 50% and 67% proteins per class, are shown in Extended Data Fig. 1d.
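The soft-core filter can be illustrated with a minimal sketch. It assumes the clustering output has already been parsed into a mapping from cluster ID to member sequence IDs, and that every species of the taxonomic group appears in the sequence-to-species mapping. The function name, argument names and toy data are hypothetical and are not taken from the original pipeline.

```python
def soft_core_clusters(cluster_members, seq_to_species, min_fraction=0.5):
    """Return IDs of clusters whose members cover >= min_fraction of species.

    cluster_members: dict, cluster ID -> list of sequence IDs (assumed input)
    seq_to_species:  dict, sequence ID -> species-level taxonomic ID
    """
    # Assumption: every species in the group has at least one sequence here.
    all_species = set(seq_to_species.values())
    kept = []
    for cluster_id, members in cluster_members.items():
        species_in_cluster = {seq_to_species[s] for s in members}
        if len(species_in_cluster) / len(all_species) >= min_fraction:
            kept.append(cluster_id)
    return kept


# Toy group with four species (hypothetical identifiers).
seq_to_species = {"a1": "sp1", "a2": "sp2", "a3": "sp3", "a4": "sp4", "b1": "sp1"}
cluster_members = {"c1": ["a1", "a2", "a3"], "c2": ["a4"], "c3": ["b1", "a2"]}
print(soft_core_clusters(cluster_members, seq_to_species))
```

Raising or lowering `min_fraction` corresponds to the alternative pangenome definitions explored in Extended Data Fig. 1d.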

The eukaryotic database was constructed starting from a curated set of 72 genomes available from the NCBI GenBank. To provide a more accurate representation of eukaryotic diversity, this dataset was supplemented with EukProt v.3 (ref. 36), a curated database aiming to provide a sparse representation of eukaryotic biology, to create Euk72Ep, which contains 30.3 million sequences. As sequences included in EukProt v.3 come from a diverse range of sources, some of which are known to contain contaminant prokaryotic sequences, the database was screened for prokaryotic contamination. This screen was carried out by first constructing candidate prokaryotic and eukaryotic HMM databases as described below and then querying both databases against the eukaryotic sequence database using hhblits (ref. 43). All eukaryotic sequences whose top-scoring alignment came from a prokaryotic HMM were removed from the database before initial clustering.
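The decision rule of the contamination screen can be sketched as follows, assuming the search output has been reduced to (sequence ID, source database, alignment score) triples. The record layout, the ‘prok’ and ‘euk’ labels and the tie-handling (first hit seen wins) are assumptions made for illustration only.

```python
def flag_contaminants(hits):
    """Return eukaryotic sequence IDs whose best-scoring hit is prokaryotic.

    hits: iterable of (sequence_id, source_db, score), where source_db is
    'prok' or 'euk' (hypothetical labels for the two HMM databases).
    """
    best = {}
    for seq_id, source_db, score in hits:
        # Keep only the highest-scoring hit per sequence; ties keep the first.
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (source_db, score)
    return {seq_id for seq_id, (db, _) in best.items() if db == "prok"}


# Toy hit table (hypothetical scores).
hits = [
    ("e1", "euk", 250.0), ("e1", "prok", 90.0),
    ("e2", "prok", 310.0), ("e2", "euk", 120.0),
    ("e3", "euk", 75.0),
]
print(flag_contaminants(hits))
```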

Curation of taxonomic labels

To provide a compromise between taxonomic specificity and accuracy of clade assignment, we curated a set of taxonomic labels for both Euk72Ep and Prok2311As to act as representatives. To construct this set, we started by manually assigning all species in EukProt to their closest relatives in the NCBI taxonomy. Then, each species from Prok2311As and Euk72Ep was assigned a taxonomic label corresponding to its ‘Class’ rank (for example, Alphaproteobacteria, Thermococci, Mammalia) based on the NCBI Taxonomy of November 2023 using tools from the ete3 v.3.1.3 toolkit (ref. 76) as well as custom scripts. For species without a corresponding ‘Class’ rank, the closest relevant rank was assigned manually. This procedure yielded an initial list of 317 unique taxonomic class identifiers. As the rank ‘Class’ does not partition biological diversity evenly, identifiers were further mapped manually to a curated set of 45 eukaryotic and 26 prokaryotic taxonomic superclasses, each present in the NCBI taxonomy (for example, Metazoa, Streptophyta, TACK archaea; Extended Data Fig. 1). Due to the partially unresolved taxonomic relationships within Asgard archaea, all candidate Asgard sequences from Prok2311 or from refs. 6,7 were assigned to the class label ‘Asgard’. Following the recent reclassification of Deltaproteobacteria into Myxococcota, Desulfobacteriota, Bdellovibrionota and SAR324, species that could not be mapped into any of these clades using existing taxonomic annotation were classified as Deltaproteobacteria (0.025% of total data).
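The labelling logic can be sketched without the taxonomy database, assuming each species lineage is available as a root-to-tip list of (rank, name) pairs. The lineage table, the manual fallback table and the superclass table below are small hypothetical stand-ins for the curated files, and the function names are illustrative.

```python
# Hypothetical toy lineages: (rank, name) pairs ordered from root to tip.
LINEAGES = {
    "Homo sapiens": [("kingdom", "Metazoa"), ("phylum", "Chordata"), ("class", "Mammalia")],
    "Escherichia coli": [("phylum", "Pseudomonadota"), ("class", "Gammaproteobacteria")],
    "unclassified_isolate_1": [("phylum", "Pseudomonadota")],  # no 'class' rank
}
# Hypothetical manual curation tables.
MANUAL_LABELS = {"unclassified_isolate_1": "Pseudomonadota"}
SUPERCLASS = {"Mammalia": "Metazoa"}


def class_label(species):
    """Return the 'class' rank name, or the manually assigned fallback."""
    for rank, name in LINEAGES[species]:
        if rank == "class":
            return name
    return MANUAL_LABELS[species]


def superclass_label(species, asgard_species=frozenset()):
    """Map a species to its curated superclass; Asgard candidates share one label."""
    if species in asgard_species:
        return "Asgard"
    label = class_label(species)
    # Labels without an entry in the superclass table are kept unchanged.
    return SUPERCLASS.get(label, label)


print(superclass_label("Homo sapiens"))
print(superclass_label("unclassified_isolate_1"))
```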

Sequence clustering and profile database generation

To transform Euk72Ep and Prok2311As into profile databases suitable for sensitive HMM–HMM searches, we implemented an unsupervised, cascaded sequence-profile clustering pipeline using the tools available in the mmseqs2 software suite (ref. 37). In brief, each sequence database was initially clustered at 90% sequence identity and 80% pairwise coverage using ‘mmseqs linclust’, and cluster representatives were chosen using ‘mmseqs result2repseq’. This procedure provides a non-redundant set of 6.3 million prokaryotic and 25.1 million eukaryotic sequences for further analysis. The non-redundant sequences are further collapsed into a set of initial profile and consensus-sequence pairs using ‘mmseqs cluster’, ‘mmseqs result2profile’ and ‘mmseqs profile2consensus’. All consensus sequences are queried against their profiles using ‘mmseqs search’ and the results are clustered using ‘mmseqs clust’. Clusters are mapped to their original non-redundant sequences using ‘mmseqs mergeclusters’, and profiles and consensus sequences are constructed once again. To grow clusters based on their sequence-profile alignments, this procedure of ‘mmseqs profile2consensus - mmseqs search - mmseqs clust - mmseqs result2profile’ is iterated until cluster sizes converge. An 80% pairwise coverage is maintained throughout all steps of the cascaded clustering protocol to avoid homologous overextension. The final clustered databases contain 14.1 million and 91,000 clusters for Euk72Ep and Prok2311As, respectively.
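The bookkeeping behind the cascade (mapping clusters of representatives back to their original members and stopping once cluster sizes no longer change) can be sketched in a few lines. This is not a wrapper around mmseqs2: the profile search and clustering round is replaced by a hypothetical `recluster` callable, and all names and toy data are assumptions for illustration.

```python
def merge_clusters(first, second):
    """Compose two clustering levels.

    first:  dict, representative -> original member IDs
    second: dict, new representative -> first-level representatives it absorbs
    """
    merged = {}
    for new_rep, old_reps in second.items():
        members = []
        for rep in old_reps:
            members.extend(first[rep])
        merged[new_rep] = members
    return merged


def size_profile(clusters):
    """Sorted list of cluster sizes, used as the convergence criterion."""
    return sorted(len(m) for m in clusters.values())


def iterate_until_converged(clusters, recluster, max_rounds=20):
    """Repeat recluster + merge until the cluster sizes stop changing."""
    for _ in range(max_rounds):
        merged = merge_clusters(clusters, recluster(clusters))
        if size_profile(merged) == size_profile(clusters):
            return merged
        clusters = merged
    return clusters


def toy_recluster(clusters, links=(("A", "B"), ("A", "C"))):
    """Stand-in for one search-and-cluster round: merge linked representatives."""
    grouping = {rep: [rep] for rep in clusters}
    for keep, absorb in links:
        if keep in grouping and absorb in grouping:
            grouping[keep].extend(grouping.pop(absorb))
    return grouping


start = {"A": ["a1"], "B": ["b1", "b2"], "C": ["c1"], "D": ["d1"]}
print(iterate_until_converged(start, toy_recluster))
```

In the toy run the first round merges A, B and C, and the second round changes nothing, so the loop stops.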

To transform the unsupervised clusters into a set of HMM profiles, we used a mixed MSA construction strategy coupled with automatic data reduction to create accurate alignments for a wide range of cluster sizes and sequence diversities. To construct HMMs, all non-singleton sequence clusters are first aligned using FAMSA v.2.0.1 (ref. 77) to produce an initial alignment. Using the initial alignment, we then reduce the sequence space by calculating an approximate tree using FastTree v.2.1.1 (ref. 78) with ‘FastTree -gamma’, removing long-branch outliers as described in ‘EPOC tree construction and processing’ and iteratively removing the closest pairwise leaves using ete3 (ref. 76), keeping the leaf closest to the root, until the desired sequence set size is reached. For profile generation, we retain no more than 300 sequences for both Prok2311As and Euk72Ep. This reduced sequence set is then realigned with muscle5 v.5.1.linux64 (ref. 42) using ‘muscle -diversified’ with five replicates, and the maximum column confidence alignment is extracted using ‘muscle -maxcc’. This alignment is then trimmed to keep columns with more than 0.2 bits of Shannon information, defined as the difference between log2(20) and the Shannon entropy of the column amino acid distribution. This process of data reduction, alignment and trimming is referred to as ‘prune-and-align’ and is used later for downstream analysis. Trimmed alignments are turned into an HMM database using HH-suite with ‘hhconsensus -M 50’, hhmake and ‘cstranslate -f -x 0.3 -c 4 -I a3m’. The final HH-suite databases contained 480,000 and 1.6 million profiles for prokaryotes and eukaryotes, respectively.
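The column-trimming criterion can be written out as a short sketch. Column information is computed as log2(20) minus the Shannon entropy of the residue frequencies, and columns are kept when it exceeds 0.2 bits. How gaps enter the calculation is not stated in the text; this sketch ignores gap characters, which is an assumption, as are the function names.

```python
import math
from collections import Counter

MAX_BITS = math.log2(20)  # maximum information for a 20-letter alphabet


def column_information(column):
    """Shannon information (bits) of one alignment column.

    Assumption: gap characters ('-' and '.') are ignored; an all-gap column
    is assigned 0 bits.
    """
    residues = [c for c in column if c not in "-."]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return MAX_BITS - entropy


def trim_alignment(rows, min_bits=0.2):
    """Keep only columns carrying more than min_bits of information."""
    keep = [i for i in range(len(rows[0]))
            if column_information([r[i] for r in rows]) > min_bits]
    return ["".join(r[i] for i in keep) for r in rows]


# Toy alignment: column 0 is fully conserved, column 1 is uniform over all
# 20 amino acids and therefore carries no information.
rows = ["M" + aa for aa in "ACDEFGHIKLMNPQRSTVWY"]
print(trim_alignment(rows)[:3])
```

Because gaps are ignored here, a column with a single non-gap residue would score the maximum; a production filter would likely combine this criterion with a gap-fraction cutoff.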

Cluster annealing

As the initial mmseqs2-based clustering was carried out in an unsupervised manner, we considered it necessary to ensure that the results were not contingent on our unsupervised partitioning of the data. To this end, a further cluster annealing step was developed, which aims to redefine clusters based on the notion of clades as well-separated monophyletic groups on phylogenetic trees, rather than directly by sequence similarity and coverage. Starting with the existing clustered databases Euk72Ep and Prok2311As, HMM profiles were calculated for each non-singleton cluster and all-versus-all HMM–HMM searches were performed using HHblits v.3.3.0 (HHblits -n 1 -p 80), retaining hits with at least 80% HHblits probability and 50% pairwise profile coverage. The resulting hits were then clustered using greedy set clustering, as described for mmseqs2, implemented in Python. These superclusters represent our limit of detection for sequence-based methods, but might involve overclustering, possibly merging several similar gene families into one set. To mitigate the potential effect of such overclustering, cluster partitioning was performed based on phylogenetic tree analysis.
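A minimal sketch of greedy set clustering over the filtered hit list is given here. It repeatedly picks the unassigned profile with the most unassigned neighbours as a cluster centre and assigns those neighbours to it. The sketch assumes that hits are treated as symmetric and that ties are broken by name; neither detail is stated in the text, and this is not the implementation used in the study.

```python
def greedy_set_cluster(hits):
    """Greedy set-cover clustering of profiles.

    hits: iterable of (query, target) pairs that already passed the
    probability and coverage thresholds (assumed input format).
    Returns a dict: centre -> sorted list of cluster members.
    """
    neighbours = {}
    for query, target in hits:
        # Assumption: a hit in either direction links the two profiles.
        neighbours.setdefault(query, set()).add(target)
        neighbours.setdefault(target, set()).add(query)

    unassigned = set(neighbours)
    clusters = {}
    while unassigned:
        # Choose the profile covering the most unassigned neighbours;
        # sorting first makes ties resolve to the smallest name.
        centre = max(sorted(unassigned),
                     key=lambda n: len(neighbours[n] & unassigned))
        members = {centre} | (neighbours[centre] & unassigned)
        clusters[centre] = sorted(members)
        unassigned -= members
    return clusters


# Toy hit list (hypothetical profile names).
hits = [("p1", "p2"), ("p1", "p3"), ("p3", "p4"), ("p5", "p6")]
print(greedy_set_cluster(hits))
```

In the toy example p4 ends up as a singleton because its only neighbour, p3, was already claimed by the cluster centred on p1. This illustrates that greedy set cover assigns each profile to exactly one supercluster.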
