Targeted enzyme discovery using metal-coordination mining

Certain protein architectures have been found to be highly recurrent, arising multiple times in evolution. These superfolds are used by many distinct families with diverse functions. The cupin superfold is found in all domains of life and exhibits an exceptionally wide range of biochemical functions, on par with the versatility of the triose phosphate isomerase (TIM) barrel superfold28. Its evolutionary history is complex, but seems to be centred around a 3His-1Glu motif located within a characteristic thermostable double-stranded β-helix (DSBH or jelly roll) barrel fold29. It has been proposed that cupins were originally part of a group of small-molecule-binding domains, recognizing sugars and nucleotides using the 3His-1Glu motif, before diverging into metal-independent members such as transcription factors and sugar epimerases and isomerases, as well as those with the ability to recruit metal ions using the same consensus motif. However, metal-binding abilities are also thought to have arisen repeatedly through convergent evolution within the metal-independent lineage30,31. Thus, cupins contain a broad array of non-catalytic members, as well as metalloenzymes with different metals (Fe, Mn, Zn, Ni, Co, Mg or Cu) and coordination environments. As with many evolvable protein folds, this non-linear evolutionary path leads to a disconnect in which extremely close cupin members can be phylogenetically grouped but remain functionally distinct, with respect not only to their catalytic function, but also to their metal-binding properties32.

The FeII/αKG-dependent enzymes use the cupin superfold and participate in a broad array of metabolic transformations, from the modification of proteins to key steps in the biosynthesis of antibiotics and other natural products. These enzymes use a mononuclear non-haem FeII cofactor and the αKG co-substrate to bind to and activate O2 and generate a reactive FeIV-oxo intermediate33 (Fig. 1a). However, this shared intermediate can catalyse a wide range of reactions, such as hydroxylation, desaturation, cyclization and halogenation33 (Extended Data Fig. 1). The prototypical members of this family bind to FeII using a cupin-derived 2His-1Asp/Glu ‘facial triad’ motif and the αKG co-substrate to form the primary coordination sphere of the metal33. The radical halogenase members are distributed at low frequency within various families of FeII/αKG-dependent enzymes, but are mechanistically distinct with regard to their ability to transfer an Fe-bound anion to the substrate5. This step necessitates the presence of an open coordination site for halide coordination to FeII, which is observed as the conversion of the 2His-1Asp/Glu motif to the non-canonical 2His-1Gly/Ala motif in all previously reported members (Fig. 1a and Extended Data Fig. 2). Thus, we reasoned that this structure–function relationship, in which metal coordination dictates reaction outcomes, could serve as a bioinformatic marker for the identification of radical halogenases.

Sequence-based methods typically rely on pairwise comparisons that scale as N² (where N is the number of sequences) and thus lack both the sensitivity and the scalability to search for a single amino acid change within a large superfamily where N > 10⁵. To address these limitations, we turned to mechanism-guided structural bioinformatics, with the rationale of using the atomic-level 3D features of the metal coordination in radical halogenases: (i) the 2His coordination motif, which is shared across all members in the superfamily, as a simple structural representation to identify the Fe-binding site; and (ii) the absence of additional coordinating ligands, such as the canonical Asp or Glu residues, for the functional classification of a metal-binding structure as a radical halogenase (Fig. 1a). This approach addresses both the challenge of sensitivity, by using geometric clustering rules in 3D space to improve on the lower detection signal from untargeted alignment scores in one-dimensional (1D) space, and that of scalability, by using a structural motif search that scales as N¹.
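The scaling argument above can be made concrete with a short calculation. The following is only an illustrative sketch: the function names are hypothetical, and it assumes that an all-versus-all comparison visits every unordered pair once, whereas a per-structure motif search visits each entry once.

```python
def pairwise_comparisons(n: int) -> int:
    """Unordered pairs visited by an all-versus-all comparison (grows as N^2)."""
    return n * (n - 1) // 2


def motif_scans(n: int) -> int:
    """Structures visited by a per-structure motif search (grows as N)."""
    return n


if __name__ == "__main__":
    # N = 10^5 is the superfamily size threshold quoted in the text.
    n = 10**5
    print(f"N = {n}")
    print(f"pairwise comparisons: {pairwise_comparisons(n)}")
    print(f"motif scans:          {motif_scans(n)}")
    print(f"ratio:                {pairwise_comparisons(n) / motif_scans(n):.1f}")
```

Under these assumptions, the cost gap between the two strategies widens in proportion to N, which is why a linear motif scan remains tractable at superfamily scale.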

A structural database of the FeII/αKG-dependent enzyme family was compiled by first retrieving protein sequences annotated to contain the cupin domain (n = 1.8 million) from the approximately 220 million sequences in the InterPro database (Fig. 1b and Supplementary Table 2). After an initial filtering step to remove fragments, duplicates and non-enzymes, the corresponding structural models from the AF2 Protein Structure Database were retrieved (n = 530,814) and mined for the presence of a 2His binding site using the unique distance and orientation relationship that would allow two intra-strand His residues to act as a metal-binding site34. We found that three constraints were sufficient to describe this site, with two H-bonds between the backbone amide and carbonyl groups from His A and His B defining their location on adjacent β-strands, and the side-chain Nτ–Nτ distance defining their preorganization for metal coordination (Fig. 1a). Most of these predicted structures contained a 2His metal-binding site with an additional Asp/Glu residue in coordination proximity to the metal site (n = 456,585); a small minority lacked a nearby coordinating residue and instead contained an Ala/Gly residue (n = 946), suggesting radical halogenase activity (Fig. 1b).
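The three-constraint site definition and the subsequent classification can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Residue` type, the function names and all distance cutoffs are assumptions, as the text gives no numerical thresholds. It also assumes that the two backbone H-bonds are the reciprocal N···O contacts between His A and His B, that Nτ corresponds to the PDB atom name NE2, and that the metal position can be approximated by the midpoint of the two Nτ atoms.

```python
from dataclasses import dataclass
from math import dist

Coord = tuple[float, float, float]


@dataclass
class Residue:
    name: str                # three-letter code, e.g. "HIS"
    atoms: dict[str, Coord]  # atom name -> (x, y, z) in angstroms


# Assumed cutoffs in angstroms; illustrative values, not taken from the source.
HBOND_MAX = 3.5   # backbone N...O distance for an H-bond
NTAU_MAX = 4.5    # side-chain NE2...NE2 distance for a preorganized 2His site
LIGAND_MAX = 5.0  # carboxylate O to estimated metal position

CARBOXYLATE_ATOMS = {"ASP": ("OD1", "OD2"), "GLU": ("OE1", "OE2")}


def is_2his_site(a: Residue, b: Residue) -> bool:
    """Apply the three constraints: two backbone H-bonds and the Ntau-Ntau distance."""
    if a.name != "HIS" or b.name != "HIS":
        return False
    hbond_1 = dist(a.atoms["N"], b.atoms["O"]) <= HBOND_MAX
    hbond_2 = dist(b.atoms["N"], a.atoms["O"]) <= HBOND_MAX
    ntau = dist(a.atoms["NE2"], b.atoms["NE2"]) <= NTAU_MAX
    return hbond_1 and hbond_2 and ntau


def classify_site(a: Residue, b: Residue, neighbours: list[Residue]) -> str:
    """Label a His pair by whether an Asp/Glu carboxylate lies near the metal site."""
    if not is_2his_site(a, b):
        return "no 2His site"
    # Approximate the metal position as the midpoint of the two Ntau atoms.
    metal = tuple((p + q) / 2 for p, q in zip(a.atoms["NE2"], b.atoms["NE2"]))
    for res in neighbours:
        for atom in CARBOXYLATE_ATOMS.get(res.name, ()):
            if atom in res.atoms and dist(res.atoms[atom], metal) <= LIGAND_MAX:
                return "canonical 2His-1Asp/Glu"
    return "candidate radical halogenase"


if __name__ == "__main__":
    # Synthetic coordinates chosen only to satisfy the assumed cutoffs.
    his_a = Residue("HIS", {"N": (0, 0, 0), "O": (3, 0, 0), "NE2": (0, 0, 5)})
    his_b = Residue("HIS", {"N": (3, 3, 0), "O": (0, 3, 0), "NE2": (3, 0, 5)})
    asp = Residue("ASP", {"OD1": (1.5, 2, 5), "OD2": (1.5, 3, 6)})
    gly = Residue("GLY", {"CA": (1.5, 2, 5)})
    print(classify_site(his_a, his_b, [asp]))
    print(classify_site(his_a, his_b, [gly]))
```

Each structure is examined independently, so a scan of this kind runs once per model. In practice the cutoffs would need to be calibrated against experimentally characterized structures.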

The sequence-similarity network (SSN) of these predicted radical halogenases includes all previously known and experimentally verified radical halogenases5,25,35,36 (Fig. 1c), which serves as an initial positive validation of our methodology. This SSN captures radical halogenase clusters (greater than 30% sequence identity) with much higher coverage than can be achieved using sequence-based methods alone, as illustrated by dechloroacutamine halogenase (DAH), a rare case of a eukaryotic halogenase37. The previously reported sequence homology searches identified only a single DAH homologue37, consistent with our own BLAST search coupled with multiple sequence alignment, which identified only three homologues (Supplementary Fig. 1). By contrast, our metal-coordination mining approach defined a DAH cluster comprising 40 putative radical halogenases from both eukaryotic (fungal and plant) and bacterial origins (Fig. 1c). More broadly, the SSN for the complete bioinformatic atlas of radical halogenases contains 70 clusters that seem to be completely unexplored, with no previous radical halogenase annotation, and which span a vast taxonomic and phylogenetic protein sequence space, including several new families of eukaryotic halogenases (Fig. 1c and Extended Data Fig. 3). This expanded and refined bioinformatic atlas of radical halogenases (Supplementary Data 1) provides a valuable resource for the wider scientific community that can enable novel chemical transformations to be investigated in unique biological contexts and can accelerate the discovery of new biocatalysts with existing experimental high-throughput pipelines38.
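The grouping of sequences into SSN clusters at a 30% identity threshold can be illustrated with a small sketch. It assumes that clusters correspond to connected components of a graph whose edges link sequences with more than 30% pairwise identity, and that the identities have already been computed; the function name, the input format and the example identifiers are hypothetical and do not reflect the tooling used for the SSN described here.

```python
from collections import defaultdict


def cluster_by_identity(ids, identities, threshold=30.0):
    """Group sequence IDs into connected components of an identity graph.

    ids        -- iterable of sequence identifiers
    identities -- dict mapping (id_a, id_b) to percent identity
    threshold  -- an edge is drawn only if identity exceeds this value
    """
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), pct in identities.items():
        if pct > threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


if __name__ == "__main__":
    # Hypothetical identifiers and identities, for illustration only.
    ids = ["seqA", "seqB", "seqC", "seqD"]
    identities = {
        ("seqA", "seqB"): 45.0,
        ("seqB", "seqC"): 31.0,
        ("seqC", "seqD"): 12.0,
    }
    for cluster in cluster_by_identity(ids, identities):
        print(cluster)
```

Because membership is transitive, two sequences with low direct identity can still share a cluster through intermediates. This is one way in which a cluster can come to span distant taxa.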