Tech News
← Back to articles

Computational design of metallohydrolases

read original related products more articles

Metallohydrolases catalyse some of the most difficult hydrolysis reactions in biology by using their bound metal ions to activate a water molecule positioned adjacent to the substrate bond to be cleaved16,17,18. Engineering new metallohydrolases is currently of considerable interest for degrading human-generated environmental pollutants, for which there has not been sufficient time for efficient hydrolytic enzymes to evolve19,20,21. Protein engineering has expanded the scope of substrates that can be hydrolysed by metallohydrolases, but this often requires initial promiscuous activity22,23. De novo enzyme design has been used to generate new metallohydrolases6,10,24, but these have had relatively low activity and efficiency, and have required extensive directed evolution to match the activity and efficiency of native enzymes24. Given an ideal metallohydrolase active site, de novo enzyme design seeks to identify or generate a protein scaffold that positions the catalytic residues, metals, and substrates in optimal catalytic geometries with high accuracy25,26. RFdiffusion has been used successfully to scaffold active sites, but the search has been limited by the need to specify the sequence positions and conformations of the catalytic residues8,9,27.

We reasoned that a generative artificial intelligence design method that only required the specification of side-chain functional group positions around a reaction transition state, and was capable of sampling over all possible sequence positions and conformations of these residues, could more readily satisfy complex catalytic constraints14,15,28,29. We set out to develop such an approach, and used it to design new metallohydrolases starting from a quantum chemistry-generated active site description with a bound metal cofactor.

To enable sequence-position and side-chain rotamer-agnostic enzyme design, we developed a generative artificial intelligence flow-matching model called RFdiffusion230. RFdiffusion2 extends the capabilities of RFdiffusion to generate scaffolds that position a set of functional residues (a ‘motif’) in two key ways. First, it enables atomic substructure scaffolding: RFdiffusion can only scaffold backbone-level motifs (with the side-chain and backbone atom N-Cα-C=O positions specified), whereas RFdiffusion2 can scaffold arbitrary atom-level motifs (any subset of amino acid heavy atoms). This is important for enzyme design because it allows users to specify only the positions of the key functional groups that interact with the reaction transition state, rather than the full side-chain and backbone conformation. Second, RFdiffusion2 enables sequence-position-agnostic scaffolding: RFdiffusion requires specification of the primary sequence positions of the motif residues, but RFdiffusion2 can scaffold motifs whose primary sequence positions are unknown. RFdiffusion2 replaces diffusion with flow matching31,32 and achieves sequence-position-agnostic atomic substructure scaffolding by providing randomly selected native atomic coordinates (but not their sequence positions) during training in addition to the partially noised, sequence-labelled atomic coordinates. With these improvements, RFdiffusion2 generates diverse proteins starting directly from catalytic configurations that consist of input functional group positions and substrate coordinates. Allowing the model to resolve the a priori unknown degrees of freedom (that is, the primary sequence positions and side-chain rotamer conformations of the catalytic residues) is considerably more effective at generating self-consistent design solutions than randomly sampling those degrees of freedom before inference, because the space is far too large to enumerate, as was necessitated with RFdiffusion. A detailed description of RFdiffusion2 training and benchmarking results for a wide range of active site scaffolding problems is described elsewhere30.

As an initial test of RFdiffusion2, we chose to design a zinc metallohydrolase for the hydrolysis of a fluorogenic ester, 4-methylumbelliferyl phenylacetate (4MU-PA), as a target reaction (Fig. 1a). We began by using density functional theory (DFT) to identify the transition-state geometry of the rate-determining Zn(II)-OH nucleophilic attack on the substrate ester. Four distinct catalytic arrangements based on the stereochemistry of the tetrahedral intermediate and the nature of the oxyanion hole were considered (Fig. 1b, Supplementary Figs. 1 and 2, Supplementary Data 1 and Supplementary Methods 4.1). These calculations provide the coordinates of the three Zn(II)-binding imidazole rings, the metal, and the transition state. Our previous RFdiffusion approach required the backbone coordinates and residue positions as inputs, which would require upfront sampling of the rotameric states and sequence position for each histidine. This cannot be done exhaustively: even with relatively coarse sampling around the side-chain chi angles χ 1 , χ 2 , and the backbone torsion angle ψ, there are on the order of 1018 possible choices for the side-chain conformations and sequence placements of our full catalytic site (Fig. 1c and Extended Data Fig. 1). Whereas each RFdiffusion run has to be initialized with a specific (and generally randomly selected) choice from this enormous set of combinations, RFdiffusion2 as described above searches the entire space in each trajectory.

Fig. 1: RFdiffusion2 design method. a, Hydrolysis of 4MU-PA yields phenylacetic acid and a fluorescent coumarin product. b, Example theozyme for Zn(II)-hydroxide nucleophilic attack on the 4MU-PA ester. Two-dimensional representation (left) and 3D DFT model (right). Arrows on the 3D model represent sampled conformational flexibility. c, Comparison of scaffold generation around an input theozyme using previous backbone centric RFdiffusion (top row) versus interaction functional group centric RFdiffusion2. RFdiffusion requires explicit upfront sampling of side-chain conformations and residue sequence positions, whereas RFdiffusion2 only requires the transition-state complex and the catalytic side-chain functional groups, implicitly sampling sequence space and rotameric space during inference. d, Snapshots of the global structure and active site from model X T during an RFdiffusion2 inference trajectory. The coordinates of the input transition-state complex and catalytic functional groups stay fixed during inference while the backbone structure, sequence positions, and unspecified atoms of the catalytic side chains are sampled by RFdiffusion2. The Cα atoms that host the catalytic histidines at the end of the trajectory are retrospectively highlighted as red spheres; these Cα atoms are not predetermined but rather move into the frame to host the fixed side chains as the global structure forms around the motifs. Full size image

RFdiffusion2 inference trajectories were used to build protein scaffolds housing the DFT-generated minimal active site configurations, referred to as theozymes2,33. Several snapshots from a representative trajectory are shown in Fig. 1d, transforming random noise on the left into the final backbone on the right (Supplementary Video 1). The Cα atoms of each residue (shown as coloured spheres representing final sequence position) are initially sampled from a Gaussian distribution, and the target functional atom positions (shown in sticks) stay fixed. As the trajectory proceeds from left to right, the global structure takes shape around the motif, with the fixed histidine side chains eventually connecting to Cα atoms of the protein backbone at sequence positions of the network’s choosing. A total of 5,120 RFdiffusion2 inference trajectories were carried out starting from different random seeds and for each of the resulting protein scaffolds, sequences were generated using ProteinMPNN34. The catalytic geometry and interactions with the transition state of those designs for which the AlphaFold235 predicted structure was close to the design model were further optimized using iterative LigandMPNN36 and constrained Rosetta repacking and minimization37 (Extended Data Fig. 2 and Supplementary Methods 4.1). Designs containing a proposed general base positioned to activate the water molecule (that is, Glu, Asp or His within hydrogen bonding distance of the Zn(II)-bound water) and side-chain hydrogen bonds stabilizing the transition-state oxyanion (if applicable), and that AlphaFold2 predicted to adopt the target structure, were characterized with PLACER12 to assess active site preorganization. A total of 96 designs were selected for experimental characterization on the basis of predicted active site geometry and preorganization (Supplementary Fig. 3, Supplementary Data 2 and 3 and Supplementary Methods 4.1).

Linear DNA fragments encoding the 96 designs were cloned into a plasmid encoding a C-terminal Strep-tag and used to transform Escherichia coli, and the resulting proteins were purified using Strep-tag affinity chromatography. Eighty-six out of ninety-six designs were expressed and soluble as judged by SDS–PAGE analysis of the eluants (Supplementary Fig. 4). Purified designs were supplemented with zinc sulfate, and hydrolysis of 4MU-PA was monitored by fluorescence. Five designs (A1, A8, B9, C4 and F7) had activity well above background (Fig. 2b and Supplementary Fig. 5). Sequence-verified single clones for each of these were expressed and purified by affinity chromatography followed by size-exclusion chromatography to obtain pure, monomeric protein fractions (Supplementary Figs. 6 and 7 and Supplementary Table 1). Michaelis–Menten kinetic characterization of the purified variants revealed a k cat /K M of 16,000 ± 2,000 M−1 s−1 for A1, the most active design, and k cat /K M values in the range of 35–140 M−1 s−1 for the other four designs (Fig. 2c,d, Extended Data Fig. 3 and Extended Data Table 1). For comparison, the k cat /K M of previously designed metallohydrolases24 ranged from 3 to 60 M−1 s−1 (Supplementary Table 2). A1 is also a relatively robust enzyme, and retains activity for at least 1,000 turnovers (Fig. 3e and Supplementary Fig. 8). A1 differs considerably from previously described proteins: the most similar structures identified through template modelling (TM) alignment with the Protein Data Bank (PDB) and AlphaFold Protein Structure Database (AFDB) have TM scores38 of 0.41 and 0.49, respectively, and do not have analogous arrangements of catalytic residues (Extended Data Fig. 4a,b). We refer to A1 as zinc metalloesterase 1 (ZETA_1) throughout the remainder of the text.

Fig. 2: Activity characterization and PLACER preorganization assessment. a, Design models of the most active designs. Sequence length (in amino acids (aa)) and the secondary structure harbouring each catalytic histidine are indicated below. b, Reaction progress curves. The dashed black line is the buffer background. c, Michaelis–Menten characterization of A1 (ZETA_1). The y axis shows the initial rate v 0 divided by the total enzyme concentration ([E] 0 ). d, Michaelis–Menten parameters of most active designs. e,f, Distribution of PLACER active site preorganization ensemble metrics for the ordered designs. Average design-prediction substrate r.m.s.d. (e) and average catalytic and binding residue design-prediction r.m.s.d. (f) across all predicted ensembles generated by PLACER for each design. g, For ZETA_1, the substrate position in PLACER ensembles is close to the design model, whereas in the inactive design H7, the substrate position fluctuates widely. h, For ZETA_1, the side chains surrounding the active site are largely fixed in positions close to the original design model, whereas in the inactive H8 design, the side-chain positions vary considerably. Note that only the first five randomly generated, unranked ensemble predictions are shown for PLACER in g,h. Data represent the mean ± s.d. of three independent measurements of initial velocity (c) and Michaelis–Menten parameters (d). Full size image

Fig. 3: Characterization of ZETA_1 activity. a, ZETA_1 design model (left) with close-up view of the active site showing the catalytic residues (middle) and a surface view of the designed pocket revealing high shape-complementarity to the substrate (right). b, Size-exclusion chromatogram of ZETA_1 showing a single peak corresponding to monomeric protein. c, Circular dichroism spectra of ZETA_1 recorded every 10 °C from 25 °C to 95 °C (viridis colour gradient), and after recooling to 25 °C (grey). The spectra suggest that ZETA_1 has an α-helical secondary structure and that it can refold after heating and partial unfolding. MRE, mean residue ellipticity. d, Circular dichroism signal at 222 nm measured every 1 °C and plotted as a function of temperature. e, [Product]:[enzyme] progress curve shows that ZETA_1 hydrolyses more than 1,000 4MU-PA molecules per enzyme. Note that the background reaction has been subtracted from the spectra so that every turnover can be attributed to the enzyme (Supplementary Fig. 8; further details in Supplementary Information, section 4.3). f, Reaction progress curves for the Zn(II) holo- and zinc-free apo ZETA_1 proteins showing zinc-dependent activity. Adding excess Zn(II) to the apo ZETA_1 sample after 30 min re-establishes the activity, demonstrating that zinc is essential for the catalytic mechanism of ZETA_1. WT, wild type. g, Zinc affinity of wild-type and mutant ZETA_1, measured as the dissociation constant (K D ), where a lower value indicates tighter binding. h, Fluorescence progress curves comparing the activity of wild-type and mutant ZETA_1. i, k cat /K M for active ZETA_1 mutants compared with the wild type. Data represent the mean ± s.d. of three independent measurements of turnover number (e), fluorescence progress curves (f,h), Zn(II)-binding dissociation constant (g) and Michaelis–Menten parameters (i). Full size image

Design ZETA_1 not only has remarkably high activity but was also the top-ranked design in our in silico ranking. The structure in the absence of substrate was predicted to be very close to the design model by AlphaFold2 (Extended Data Fig. 5a and Supplementary Figs. 9 and 10), and the designed active site of ZETA_1 was predicted to be highly preorganized by PLACER, with the catalytic side chains fixed in place and the substrate held closely in its designed position, adjacent to the proposed Zn(II) site. PLACER12 is a deep neural network that, given a protein backbone containing a substrate, fully randomizes the positions of the substrate and all side chains within a 600-atom sphere, and then generates new coordinates for these groups12; repeated PLACER trajectories generate an ensemble of possible side-chain conformations and small molecule docks. Design ZETA_1 stood out from the other designs in both the extent of catalytic site preorganization (the catalytic side chains were largely fixed in space in catalytically competent conformations) and the positioning of the substrate–transition state in the active site (in the ZETA_1 ensemble, the substrate remained largely fixed in space in the active site, whereas in the inactive designs H7 and H8, it fluctuated considerably) (Fig. 2e–h and Supplementary Videos 2–5). Seven designs based on the same ZETA_1 backbone family were initially filtered out during the design selection phase, as they had suboptimal PLACER metrics; we retrospectively expressed and purified these designs and found that they had very low or no activity, further highlighting the utility of PLACER ensemble calculations for identifying active designs (Supplementary Fig. 11). These findings suggest that combining global structure prediction with detailed PLACER modelling of the active site provides a powerful approach to assessing the catalytic machinery and substrate binding geometry for design selection (Supplementary Fig. 10).

... continue reading