Taxon sampling and HG inference
We compiled 154 genomes from published projects deposited in UniProt62, NCBI63, Ensembl64 and other resources (Supplementary Table 1). The dataset comprised 151 metazoan genomes and 3 genomes of unicellular organisms, containing 3,934,362 predicted proteins. Canonical proteins were extracted from the original data using a script provided with OrthoFinder (primary_transcript.py) and CD-HIT v.4.8.1 (ref. 65) with a similarity threshold of 1.00. The quality of the canonical protein sets of the 154 genomes was assessed with BUSCO v.5.4.7 (ref. 66) (Supplementary Table 1). Genomes with completeness greater than 85% and fragmentation less than 15% were preferred. Because we considered the habitat and phylogenetic position of each species in addition to genome completeness, we also retained some genomes that did not fully meet these thresholds. We then inferred HGs with OrthoFinder v.2.5.5 (ref. 67), which relied on MAFFT v.7.505 (ref. 68) and DIAMOND v.2.1.8 (ref. 69). The HGs were then used to analyse gene content.
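The quality screen described above can be sketched in Python. This is a minimal illustration, not the authors' pipeline: the 85% and 15% thresholds come from the text, whereas the BuscoSummary record, the select_genomes function and the manual_keep set (standing in for genomes retained on habitat or phylogenetic grounds) are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class BuscoSummary:
    """Hypothetical record holding the BUSCO percentages of one genome."""
    species: str
    complete_pct: float    # percentage of complete BUSCOs
    fragmented_pct: float  # percentage of fragmented BUSCOs


def passes_quality(summary, min_complete=85.0, max_fragmented=15.0):
    """Return True if completeness > 85% and fragmentation < 15%."""
    return (summary.complete_pct > min_complete
            and summary.fragmented_pct < max_fragmented)


def select_genomes(summaries, manual_keep=frozenset()):
    """Keep genomes that pass the thresholds, plus any listed in manual_keep.

    manual_keep is an assumption: it models species retained for their
    habitat or phylogenetic position despite lower BUSCO scores.
    """
    return [s.species for s in summaries
            if passes_quality(s) or s.species in manual_keep]
```

With this sketch, a genome scoring 80% completeness is excluded unless its name is passed in manual_keep, which mirrors the exception the text describes.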
Guide tree
We constructed a guide tree for the analyses of gene expansions and contractions and for the later time tree. The guide tree, based on species positions inferred from previous literature70,71,72,73, was used to build the metazoan phylogeny through the following steps. Although some nodes are controversial (namely the positions of sponges, ctenophores and acoels), they are far removed from the terrestrial nodes of interest. First, we started from the conserved single-copy genes in the Metazoa_odb10 dataset of BUSCO v.5.4.7 (ref. 66). Homo sapiens contains 943 of these genes, which were extracted to serve as reference orthologues. The identified conserved protein sequences were aligned using MAFFT v.7.505 (ref. 68) and trimmed using trimAl v.1.4.rev.15 (ref. 74) to remove poorly aligned regions. Next, we concatenated the trimmed alignments into a single supermatrix using FASconCAT-G v.1.05.1 (ref. 75). Finally, the concatenated supermatrix was used to build the phylogeny with IQ-TREE v.2.2.2.6 (ref. 76), using the C60 + G + I model, the guide tree as a constraint and 1,000 bootstrap replicates. The resulting phylogeny, with branch lengths representing genetic change, was subsequently used in CAFE5 (see the following methods).
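To illustrate the concatenation step, the following Python sketch joins per-gene alignments into a supermatrix. It shows the general idea only and does not describe the internals of FASconCAT-G: the function name concatenate_alignments is hypothetical, and padding a taxon that lacks a gene with gap characters is an assumption made for this example.

```python
def concatenate_alignments(alignments, taxa, gap="-"):
    """Concatenate per-gene alignments into one sequence per taxon.

    alignments: list of dicts mapping taxon name -> aligned sequence;
                all sequences within one gene must have equal length.
    taxa:       taxon names to include in the supermatrix.
    A taxon missing from a gene is padded with gap characters
    (an assumption of this sketch).
    """
    parts = {taxon: [] for taxon in taxa}
    for index, alignment in enumerate(alignments):
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"alignment {index} is empty or ragged")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

Each taxon ends up with a sequence whose length is the sum of the individual alignment widths, so every column of the supermatrix remains aligned across taxa.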
Gene content analysis
Novel HGs: HGs that are present in at least one species within the clade defined by the LCA of a lineage (hereafter referred to as a node) and absent in all outgroup species.
Novel core HGs: HGs that are present in all species within a node (or absent in only one species, for nodes containing more than three species) and absent in all outgroup species. For nodes with two species, novel HGs are equivalent to novel core HGs.
Lost HGs: HGs that are absent in all species within a node but present in the sister groups and in other outgroup species.
Expanded HGs: HGs in which the number of gene copies has increased, often owing to gene duplication events.
Contracted HGs: HGs in which the number of gene copies has decreased.
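The presence/absence definitions above (novel, novel core and lost HGs) can be expressed as set operations. The Python sketch below is one possible reading of those definitions, not the authors' implementation: all function and argument names are hypothetical, and "present in the sister groups and other outgroup species" is interpreted here as presence in at least one sister-group species and at least one other outgroup species, which is an assumption. Expanded and contracted HGs depend on copy-number changes along the tree and are not covered.

```python
def is_novel(present, node, outgroup):
    """Present in at least one node species and in no outgroup species."""
    return bool(present & node) and not (present & outgroup)


def is_novel_core(present, node, outgroup):
    """Present in all node species (one absence tolerated if the node has
    more than three species) and in no outgroup species."""
    if present & outgroup:
        return False
    if len(node) == 2:
        # The text states that for two-species nodes novel HGs are
        # equivalent to novel core HGs.
        return is_novel(present, node, outgroup)
    allowed_missing = 1 if len(node) > 3 else 0
    return len(node - present) <= allowed_missing


def is_lost(present, node, sister, other_outgroup):
    """Absent from every node species but present in the sister group and
    in other outgroup species (assumed: at least one species in each)."""
    return (not (present & node)
            and bool(present & sister)
            and bool(present & other_outgroup))
```

Here present is the set of species in which an HG has at least one gene copy, while node, outgroup, sister and other_outgroup are sets of species names. For a four-species node, an HG found in three of the four species and in no outgroup species counts as both novel and novel core.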