Data retrieval, curation and proteome database reconstruction
eTOLDB construction
To guide dataset selection and LECA reconstruction, we used the phylogenetic backbone of the eTOL and the classification into nine eukaryotic supergroups depicted in Supplementary Fig. 1a. This represented a consensus between recent eTOL reconstructions54,55,56,57,58 and an operational definition of two stems and nine supergroups that largely reflected previous classifications59. Further details are provided in the Supplementary Methods.
We manually selected and retrieved 276 eukaryotic proteomes (Supplementary Table 1; see Zenodo repository24 and Supplementary Information) downloaded from NCBI60, EukProt61, P10K62, UniProt63, Ensembl64 and SGD65, aiming to cover all supergroups represented in our reference phylogenetic backbone (Supplementary Fig. 1). We aimed for a balanced representation of divisions and prioritized the most complete and highest-quality genomes for each division on the basis of information available in the hosting databases. For each proteome, we kept only the longest isoform per gene, as indicated in the GFF files or FASTA headers. The isoform-free proteomes were then analysed for completeness using BUSCO v.5.5 (ref. 66) with default parameters and eukaryotic_odb10, and we extracted basic statistics using seqkit stats v.2.8.2 (ref. 67). To assess the presence of uncertain amino acids (Xs) and low-complexity regions in the proteomes, we ran SEG68 with default parameters. We used an in-house script (‘Code availability’) to trim the low-complexity ends of the proteins and to calculate the coverage of low-complexity regions for each protein. If low-complexity regions covered more than 25% of a protein, we excluded that protein from downstream analyses. Moreover, we calculated the proportion of low-complexity proteins in the whole proteome as a measure of the quality of the protein set, and we also discarded proteins that were longer than 10,000 amino acids or shorter than 50 amino acids. This resulted in a set of clean proteomes. To remove redundancy and recent duplications (with respect to eukaryotic evolution), we clustered each proteome individually using MMseqs2 (ref. 69) with strict thresholds: minimum amino acid identity of 80%, minimum target coverage of 0.5 and coverage mode 2 (--min-seq-id 0.8 -c 0.5 --cov-mode 2). The representative sequences of the clusters constituted the final proteome.
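The per-protein cleaning rules above can be sketched as follows. This is a minimal illustration, not the in-house script: it assumes that low-complexity residues are marked in lowercase in the SEG-masked sequence, that the ends are trimmed before coverage and length are evaluated, and that the thresholds are inclusive. The function names are hypothetical.

```python
# Thresholds taken from the text; everything else is an illustrative assumption.
MAX_LC_FRACTION = 0.25        # exclude proteins with >25% low-complexity coverage
MIN_LEN, MAX_LEN = 50, 10_000  # exclude proteins outside this length range


def trim_low_complexity_ends(masked: str) -> str:
    """Strip masked residues from both ends (assumes lowercase = masked)."""
    start, end = 0, len(masked)
    while start < end and masked[start].islower():
        start += 1
    while end > start and masked[end - 1].islower():
        end -= 1
    return masked[start:end]


def low_complexity_fraction(masked: str) -> float:
    """Fraction of residues that are masked as low complexity."""
    if not masked:
        return 0.0
    return sum(c.islower() for c in masked) / len(masked)


def keep_protein(masked: str) -> bool:
    """Apply the length and low-complexity filters to one masked protein."""
    trimmed = trim_low_complexity_ends(masked)
    if not (MIN_LEN <= len(trimmed) <= MAX_LEN):
        return False
    return low_complexity_fraction(trimmed) <= MAX_LC_FRACTION
```

Whether length and coverage are measured before or after trimming is not stated in the text; the order used here is one plausible reading.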
To assess the effects of the cleaning and clustering steps, we again ran BUSCO v.5.5 and seqkit stats v.2.8.2. We discarded proteomes that showed a combined reduction in number of proteins greater than 50%, those with a proportion of low-complexity proteins greater than 50% and those with BUSCO completeness reduction greater than 30%. This resulted in exclusion of 30 proteomes, with 256 proteomes remaining.
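The three exclusion criteria can be expressed as a single check per proteome. The sketch below is illustrative only: the `ProteomeStats` fields are hypothetical names, and it assumes that the BUSCO reduction is measured in percentage points and that the thresholds are exclusive ("greater than").

```python
from dataclasses import dataclass


@dataclass
class ProteomeStats:
    n_raw: int          # proteins before cleaning and clustering
    n_final: int        # proteins after cleaning and clustering
    lc_fraction: float  # proportion of low-complexity proteins (0 to 1)
    busco_raw: float    # BUSCO completeness (%) before cleaning
    busco_final: float  # BUSCO completeness (%) after cleaning


def passes_qc(s: ProteomeStats) -> bool:
    """Return True if the proteome survives all three exclusion criteria."""
    if s.n_raw <= 0:
        return False
    protein_loss = 1 - s.n_final / s.n_raw
    # Assumption: reduction is in percentage points, not relative change.
    busco_drop = s.busco_raw - s.busco_final
    return protein_loss <= 0.5 and s.lc_fraction <= 0.5 and busco_drop <= 30
```

For example, `passes_qc(ProteomeStats(20000, 9000, 0.1, 95.0, 90.0))` fails because more than half of the proteins were lost.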
We built three eTOL datasets of 100 proteomes each, which we named eTOLDBA, eTOLDBB and eTOLDBC. For this, we first classified the 256 proteomes on the basis of taxonomy and distributed them among 9 supergroups and 28 divisions (Supplementary Fig. 1). When fewer than four species were available for a division, we included all of them in the three eTOLDB databases. For the remaining divisions, we selected different, partially overlapping species sets for each eTOLDB. eTOLDBA contained, whenever possible, the most complete proteomes and the fewest parasites. Less complete proteomes were added in turn to eTOLDBB and eTOLDBC. Proteomes for the model species Homo sapiens, Arabidopsis thaliana and Saccharomyces cerevisiae were included in all three databases for later annotation purposes, despite being part of large taxonomic groups. The list of chosen species for each dataset can be found in Supplementary Table 1. In total, 185 species were present across the three eTOLDBs, with overlaps of 58 species between eTOLDBA and eTOLDBB, and 53 species between eTOLDBA and eTOLDBC and between eTOLDBB and eTOLDBC. This resulted in an average protein overlap of 46% (Supplementary Fig. 1).
BroadDB construction
To ensure even sampling of the prokaryotic genome repertoire while minimizing the impact of intradomain HGT as a confounding factor of prokaryotic ancestry, we downsampled available prokaryotic genomes using a pangenome approach. For this, we took all species representatives in GTDB r207 (ref. 22) (62,291 bacteria and 3,412 archaea; see Zenodo repository24) and selected one genome per genus (sensu GTDB), choosing the one with the maximal CheckM70 completeness and, in cases of ties, minimal CheckM contamination. The median CheckM completeness was 94.50%, and the median contamination was 0.990%. The proteomes of the genus representatives were then grouped at order level and clustered using MMseqs2 easy-linclust (ref. 69) (--min-seq-id 0.3 -c 0.5 --cov-mode 0 --cluster-mode 2 -e 0.001). We kept a sequence representative for clusters present in 15% or more of genus representatives in that order, as 15% is the cutoff for the ‘cloud’ component of the pangenome71. To this dataset we added representative sequences from 1,319,927 viral protein clusters at 95% identity available at RVDB v.25.0 (ref. 23) (see Zenodo repository24). For taxonomic assignment of the sequences, we followed the GTDB and RVDB taxonomies.
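The genus-level downsampling and the 15% cluster-prevalence cutoff can be sketched as below. This is a hedged illustration rather than the pipeline used: the input structures (dictionaries with `id`, `genus`, `completeness` and `contamination` keys, and a mapping from cluster to the genomes contributing members) and the function names are assumptions made for the example.

```python
def pick_genus_representatives(genomes):
    """Pick one genome per genus: highest completeness, ties broken by
    lowest contamination. `genomes` is an iterable of dicts (assumed layout)."""
    best = {}
    for g in genomes:
        key = (-g["completeness"], g["contamination"])
        cur = best.get(g["genus"])
        if cur is None or key < (-cur["completeness"], cur["contamination"]):
            best[g["genus"]] = g
    return best


def retained_clusters(cluster_to_genomes, n_genus_reps, min_percent=15):
    """Keep clusters found in at least `min_percent`% of the genus
    representatives of one order. Integer arithmetic avoids float rounding."""
    return {
        cluster
        for cluster, genome_ids in cluster_to_genomes.items()
        if len(set(genome_ids)) * 100 >= min_percent * n_genus_reps
    }
```

In an order with 20 genus representatives, a cluster would need members from at least 3 of them to be retained under this reading of the cutoff.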
Reconstruction of LECA
LECA-OG reconstruction
We ran OrthoFinder v.2.5.5 (ref. 72), using BLAST v.2.15.0+ (ref. 73) for the homology search with the low-complexity filter enabled (-seg yes). Before the clustering step, BLAST results were filtered to remove hits with an e-value greater than 1 × 10⁻⁵ and an overlap less than 0.5. We also removed hits with disparate sizes (when the hit was more than 2.5 times or less than 0.5 times the length of the query) to avoid artefactual clustering of families due to domain sharing. The clustering part of OrthoFinder was executed using two different inflation parameters (I = 1.5 and I = 3.0). Comparison between the two datasets showed that I = 3.0 gave 30,000 more groups on average than I = 1.5. When we filtered to retain only LECA groups, I = 3.0 gave only 990 more groups on average than I = 1.5. Given the small difference, we chose to continue with the larger groups (I = 1.5), as they could be split on the basis of tree topology at a later point.
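The hit-level filters applied before clustering can be illustrated with a small predicate. This is a sketch under stated assumptions: overlap is taken here as alignment length divided by query length, the size ratio as hit length divided by query length, and the function and parameter names are hypothetical. The text does not specify how overlap was computed.

```python
def keep_hit(evalue, aln_len, query_len, subject_len,
             max_evalue=1e-5, min_overlap=0.5,
             min_ratio=0.5, max_ratio=2.5):
    """Return True if a BLAST hit passes the e-value, overlap and size filters."""
    if evalue > max_evalue:
        return False
    # Assumption: overlap is the aligned fraction of the query.
    if aln_len / query_len < min_overlap:
        return False
    # Assumption: size disparity is judged by hit length relative to query length.
    ratio = subject_len / query_len
    return min_ratio <= ratio <= max_ratio
```

A 100-residue query aligned over 90 residues to a 300-residue hit would be dropped by the size filter even with a strong e-value, which is the domain-sharing case the filter targets.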