Assessing phylogenetic confidence at pandemic scales

SPRTA and aBayes

Given an estimated rooted phylogenetic tree T and data D in the form of a multiple sequence alignment, our aim is to assign confidence scores to branches b of T.

We take inspiration from the approximate Bayes (aBayes) approach17. aBayes assigns to b a probability score Pr(b∣D, T\b) based on the ratio of the likelihood Pr(D∣T) of the estimated binary tree T versus the likelihoods of the trees ${T}_{i}^{b}$ obtained using nearest neighbour interchange36 (NNI) moves centred around b (which comprise $T={T}_{1}^{b}$ itself in addition to two tree topologies not containing the clade in T descending from b):

$${\rm{aBayes}}(b)=\Pr (b| D,T\backslash b)=\frac{\Pr (D| T)}{{\sum }_{i=1}^{3}\Pr (D| {T}_{i}^{b})}.$$ (2)

These NNI moves perform small changes to T adjacent to branch b. One of the appeals of aBayes is that it can score not only T, but also the two alternative topologies considered, ${T}_{2}^{b}$ and ${T}_{3}^{b}$, not containing the clade defined by b:

$$\Pr ({T}_{j}^{b}| D,T\backslash b)=\frac{\Pr (D| {T}_{j}^{b})}{{\sum }_{i=1}^{3}\Pr (D| {T}_{i}^{b})}.$$ (3)

aBayes has a topological focus, with score aBayes(b) for branch b interpreted as the support for the existence of the clade of T containing all descendants of b. aBayes(b) is in effect an approximate Bayesian posterior score for this clade, where a flat tree prior is assumed. In contrast to typical Bayesian phylogenetics, however, instead of integrating over branch lengths we define $\Pr (D| {T}_{i}^{b})$ as the maximum-likelihood score of topology ${T}_{i}^{b}$ over all possible branch lengths, and we only consider the alternative topologies obtainable through a single NNI move17 starting from T. Although computationally much faster than Felsenstein’s bootstrap, aBayes can still be too demanding for large genomic epidemiological datasets if implemented within classical maximum-likelihood phylogenetic methods (for example, see Fig. 2). Also, aBayes is defined based only on NNI moves, an approach insufficiently comprehensive for pandemic-scale data9: owing to the existence of many phylogenetic topologies with similar likelihood29, the two alternative topologies obtainable through NNI moves, ${T}_{2}^{b}$ and ${T}_{3}^{b}$, might represent only a very small subset of plausible alternative topologies not containing the clade defined by b.

Here we address these limitations and define a new measure of branch support, SPRTA, that is particularly useful in the context of large-scale genomic epidemiology, but is also applicable more generally in phylogenetics. First, to address the problem of computational demand, we consider trees estimated using methods suitable for pandemic-scale datasets, such as MAPLE9. In the following, we assume that the likelihood of alternative tree topologies is also calculated with MAPLE or a similar method.

We define the SPRTA support of branch b by considering a certain number I b of possible alternative topologies ${T}_{i}^{b}$ (1 ⩽ i ⩽ I b ), obtained by performing single subtree prune and regraft36 (SPR) moves that relocate S b , the subtree of T containing all descendants of b, as a descendant of other parts of T (Fig. 1). Again, we assume for simplicity that ${T}_{1}^{b}=T$ corresponds to the null SPR move. Compared to NNI moves, SPR moves are much more numerous, and can cause long-range changes to the tree; as such, SPR moves create a much more comprehensive set of alternative evolutionary histories than NNI moves. For each branch b, the number of possible SPR moves is linear in the number of sequences in the considered dataset, which means if we do not select which of these SPR moves we focus on, I b can become too large, and the calculation of SPRTA scores too computationally demanding. However, to obtain an accurate evaluation we only need to consider topologies that have non-negligible likelihood scores compared to T.

For this reason, we first perform an initial, approximate evaluation of alternative tree topologies obtained through SPR relocations of S b using fixed branch lengths (we leave the length of the placement branch equal to the length of b, and we assess placements only halfway along branches). From this, we retain only topologies with an initial log-likelihood score difference from T within a threshold corresponding approximately to one extra mutation event (the natural logarithm of the genome length, which for SARS-CoV-2 is about 10.3). The likelihoods of all I b topologies passing this threshold are then more deeply evaluated by optimizing branch lengths. This two-step approach is similar to the ‘baseball’ heuristic of the metagenomic query mapper pplacer19. More precisely, when we consider an alternative placement of S b on branch ${b}^{^{\prime} }$, let us use ${n}^{^{\prime} }$ to denote the new node at the conjunction of ${b}^{^{\prime} }$ with S b . For this placement, the three branches whose length we optimize are the one connecting ${n}^{^{\prime} }$ with S b (length l 1 ), the one within ${b}^{^{\prime} }$ below ${n}^{^{\prime} }$ (length l 2 ), and the one within ${b}^{^{\prime} }$ above ${n}^{^{\prime} }$ (length l 3 ), similarly to the ‘lazy subtree rearrangement’ approach of RAxML37 (see Extended Data Fig. 10a). $\Pr (D| {T}_{i}^{b})$ is defined as the maximum likelihood obtained by optimizing these three branches, while leaving all other model parameters and branch lengths unaltered. This approximation of $\Pr (D| {T}_{i}^{b})$ is the same one made by MAPLE, and is justified by the fact that, at the low levels of divergence typically considered in genomic epidemiology, changes in topology and branch lengths usually only affect nodes near those directly affected by the changes9.

... continue reading