
Language model-guided anticipation and discovery of mammalian metabolites


Training dataset

A training dataset of known human metabolites was obtained from the HMDB, the largest and most comprehensive database of human metabolism21. Chemical structures were downloaded from the HMDB website in XML format (version 4.0; file ‘hmdb_metabolites.xml’). The HMDB assigns each metabolite to one of four classes: quantified, detected, expected, or predicted. Of the 114,222 metabolites recorded in this XML file, the vast majority fell into the ‘expected’ or ‘predicted’ classes (n = 95,202 and 9,929, respectively), indicating that they had not actually been experimentally detected in human tissues or biofluids. These classes instead include structures identified in cell or tissue cultures or in other species, structures predicted based on rule-based enzymatic derivatizations of known human metabolites, and structures predicted based on combinatorial enumeration (for instance, of lipid head groups and acyl/alkyl chains). To avoid conflating predictions made by our chemical language model with an orthogonal set of predictions based on chemical reaction rules or combinatorial enumeration, we trained our language model exclusively on metabolites annotated as ‘detected’ or ‘quantified.’ Moreover, we found that the vast majority (6,791) of the 8,970 detected or quantified metabolites were lipids. Because comprehensive profiling of lipids generally relies on analytical approaches distinct from those used to comprehensively profile small (polar) metabolites, and because the preponderance of lipids led language models trained on this dataset to almost exclusively generate structurally simple molecules with long acyl chains, we excluded lipids from the training set. This was achieved by removing structures assigned to the ClassyFire superclass ‘Lipids and lipid-like molecules’61. These filters yielded a training set of 2,046 small-molecule metabolites that had been experimentally detected in human tissues or biofluids.
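As a rough illustration of this filtering step, the sketch below streams the HMDB XML and retains only metabolites annotated as ‘detected’ or ‘quantified’ that fall outside the lipid superclass; the element names and namespace are assumptions about the HMDB schema rather than details taken from the paper.

# Sketch of the HMDB filtering step; tag names and namespace are assumed.
import xml.etree.ElementTree as ET

NS = "{http://www.hmdb.ca}"  # assumed HMDB XML namespace

def iter_detected_metabolites(xml_path):
    """Yield SMILES of metabolites annotated as detected or quantified,
    excluding the ClassyFire superclass 'Lipids and lipid-like molecules'."""
    keep_status = {"detected", "quantified"}
    for _, elem in ET.iterparse(xml_path):
        if elem.tag != f"{NS}metabolite":
            continue
        status = (elem.findtext(f"{NS}status") or "").lower()
        superclass = elem.findtext(f"{NS}taxonomy/{NS}super_class") or ""
        smiles = elem.findtext(f"{NS}smiles")
        if status in keep_status and superclass != "Lipids and lipid-like molecules" and smiles:
            yield smiles
        elem.clear()  # release memory while streaming the large file

training_smiles = list(iter_detected_metabolites("hmdb_metabolites.xml"))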

The SMILES strings for these 2,046 metabolites were parsed using the RDKit, and stereochemistry was removed. Salts and solvents were removed by splitting molecules into fragments and retaining only the heaviest fragment containing at least three heavy atoms, using code adapted from the Mol2vec package62. Charged molecules were neutralized using code adapted from the RDKit Cookbook, after which duplicate SMILES (for instance, stereoisomers or alternatively charged forms of the same molecule) were discarded. Molecules with atoms other than Br, C, Cl, F, H, I, N, O, P or S were removed, and molecules were converted to their canonical SMILES representations. The resulting canonical SMILES were then tokenized by splitting each SMILES string into its constituent characters, except for atomic symbols composed of two characters (Br, Cl) and environments within square brackets (such as [nH]). Any SMILES containing tokens found in 10 or fewer structures was removed, on the basis that a language model was unlikely to learn how to use these tokens correctly from such a small number of training examples. Metabolites were subsequently split into ten training folds, each with 10% of the structures withheld, and data augmentation was then performed on each fold by enumeration of 30 non-canonical SMILES for each canonical SMILES string63. This approach takes advantage of the fact that a single chemical structure can be represented by multiple different SMILES strings, and was used here on the basis of previous studies showing that this data augmentation procedure led to more robust chemical language models, particularly when training these models on small datasets64,65.
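The sketch below illustrates this preprocessing and tokenization using the RDKit; the fragment selection and tokenizer are simplified stand-ins for the Mol2vec- and RDKit Cookbook-derived code used by the authors, and the charge-neutralization step is omitted for brevity.

# Simplified sketch of the SMILES preprocessing and tokenization pipeline.
import re
from rdkit import Chem

ALLOWED_ATOMS = {"Br", "C", "Cl", "F", "H", "I", "N", "O", "P", "S"}
# One token per bracket environment (e.g. [nH]) or two-character atom (Br, Cl);
# every other character is its own token.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def preprocess(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    Chem.RemoveStereochemistry(mol)                       # drop stereochemistry
    frags = [f for f in Chem.GetMolFrags(mol, asMols=True)
             if f.GetNumHeavyAtoms() >= 3]                # strip salts/solvents
    if not frags:
        return None
    mol = max(frags, key=lambda f: f.GetNumHeavyAtoms())  # keep heaviest fragment
    if any(a.GetSymbol() not in ALLOWED_ATOMS for a in mol.GetAtoms()):
        return None
    return Chem.MolToSmiles(mol)                          # canonical SMILES

def tokenize(smiles):
    return list(TOKEN_RE.findall(smiles))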

To prospectively evaluate DeepMet predictions, we obtained the structures of a further 313 experimentally detected metabolites that were added to version 5.0 of the HMDB26. These metabolites were extracted by applying the same filters as described above (with the exception of lipid removal) to the metabolite XML file from version 5.0, and then removing structures also found in version 4.0. We additionally removed several thousand exogenous and largely synthetic compounds that had been identified through a text-mining approach66.

Language model architecture and training

Our approach to generating metabolite-like chemical structures was based on the use of a language model to generate textual representations of molecules in the SMILES format22, a paradigm that has been extensively explored in the setting of molecular design over the past decade. Although recent efforts have introduced generative models based on transformers67,68, state-space models69, and other architectures70, here, as in previous work5,6,71,72, we trained a recurrent neural network (RNN) on the SMILES strings of the molecules in our training set. SMILES were tokenized as described above, such that the vocabulary consisted of all unique tokens detected in the training data, as well as start-of-string and end-of-string characters that were prepended and appended to each SMILES string, respectively. The language model was then trained in an autoregressive manner to predict the next token in the sequence of tokens for any given SMILES, beginning with the start-of-string token. Language models based on the LSTM architecture were selected on the basis of their excellent performance in previous studies, in which they were found to outperform both alternative RNN-based models (e.g., gated recurrent units) and models based on the transformer architecture65,68,73. LSTMs were implemented in PyTorch, adapting code from the REINVENT package74. The architecture consisted of a three-layer LSTM with a hidden layer of 1,024 dimensions, an embedding layer of 128 dimensions, and a linear decoder layer. Models were trained to minimize the cross-entropy loss of next-token prediction using the Adam optimizer with default parameters, a batch size of 64, and a learning rate of 0.001. Ten percent of the molecules in the training set were reserved as a validation set and used for early stopping with a patience of 50,000 minibatches.
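A minimal PyTorch sketch of a model with these hyperparameters (three LSTM layers, a 128-dimensional embedding, a 1,024-dimensional hidden state, a linear decoder, and Adam with a learning rate of 0.001) is shown below; it is an illustrative re-implementation rather than the REINVENT-derived code used in the study.

# Illustrative SMILES language model with the hyperparameters described above.
import torch
import torch.nn as nn

class SmilesLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=1024, n_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, n_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) integer-encoded SMILES beginning with a start token
        out, hidden = self.lstm(self.embedding(tokens), hidden)
        return self.decoder(out), hidden  # logits over the next token at each position

model = SmilesLSTM(vocab_size=40)  # vocabulary size depends on the training data
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def training_step(batch):
    """One minibatch (size 64 in the paper) of next-token prediction."""
    inputs, targets = batch[:, :-1], batch[:, 1:]  # shift the sequence by one token
    logits, _ = model(inputs)
    loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()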

To further address the data-limited context of the human metabolome, we employed a strategy that we reasoned would first allow our model to learn the syntax of the SMILES representation and subsequently adapt this understanding to the generation of new metabolite-like chemical structures. In particular, we first pretrained the LSTM until convergence on drug-like small molecules from the ChEMBL database, using the same early stopping criterion as above75. ChEMBL (version 28) was obtained from ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_28_chemreps.txt.gz and processed as described above, except with a single round of non-canonical SMILES enumeration rather than 30. After pretraining, the model was fine-tuned on the HMDB training set, without freezing any layers. Across models trained on each of the ten splits, this approach generated valid SMILES at a rate of 98.9 ± 0.19% and novel SMILES at rates of 34.3 ± 7.7% (with respect to the HMDB training set), 49.7 ± 3.0% (with respect to the ChEMBL pretraining set), and 28.0 ± 6.4% (with respect to both sets).
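The sketch below illustrates how SMILES might be sampled from such a model and how validity and novelty can be computed with the RDKit; the sampling loop and token bookkeeping (start/end token indices, index-to-token vocabulary) are assumptions layered on the model sketched above rather than the authors' code.

# Sketch of autoregressive sampling and the validity/novelty metrics.
import torch
from rdkit import Chem

@torch.no_grad()
def sample_smiles(model, vocab, sos, eos, max_len=200):
    """Sample one SMILES string token by token from the trained model."""
    tokens, hidden = [sos], None
    for _ in range(max_len):
        logits, hidden = model(torch.tensor([[tokens[-1]]]), hidden)
        probs = torch.softmax(logits[0, -1], dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        if nxt == eos:
            break
        tokens.append(nxt)
    return "".join(vocab[t] for t in tokens[1:])

def validity_and_novelty(sampled, training_set):
    """Fraction of parseable SMILES, and fraction of those absent from training."""
    canonical = [Chem.MolToSmiles(m) for m in map(Chem.MolFromSmiles, sampled) if m]
    validity = len(canonical) / len(sampled)
    novelty = sum(s not in training_set for s in canonical) / max(len(canonical), 1)
    return validity, novelty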

Metabolite likeness of generated molecules

We carried out a series of analyses to first establish that the language model had indeed learned to generate metabolite-like structures. To this end, we first trained a chemical language model as described above on a single training split of the HMDB, then sampled 500,000 SMILES strings from the trained model and removed those corresponding to known metabolites from the training set. Duplicate structures were likewise removed. No additional filters were imposed on the generated molecules to explicitly remove those falling outside the chemical space of the training set. To visualize the areas of chemical space occupied by generated molecules and known metabolites, we implemented an approach based on nonlinear dimensionality reduction. Briefly, we computed a continuous, 512-dimensional representation of each molecule using the Continuous and Data-Driven Descriptors (CDDD) package76 (available from http://github.com/jrwnter/cddd). These descriptors are derived from a machine translation task in which RNNs are used to translate between enumerated and canonical SMILES in a sequence-to-sequence modelling framework, a task that forces the latent space to encode the information required to reconstruct the complete chemical structure of the input molecule. We then sampled CDDD descriptors for an equal number of known metabolites and generated molecules, and embedded the descriptors for both sets of molecules into two dimensions with UMAP77, using the implementation provided in the R package uwot with the n_neighbors parameter set to 5.
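A rough Python equivalent of the embedding step is sketched below, using the umap-learn package in place of the R uwot implementation used in the paper and assuming the 512-dimensional CDDD descriptors have already been computed as NumPy arrays (the variable names cddd_metabolites and cddd_generated are hypothetical).

# Sketch of the chemical-space visualization; umap-learn stands in for R's uwot,
# and cddd_metabolites / cddd_generated are assumed (n, 512) descriptor arrays.
import numpy as np
import umap

descriptors = np.vstack([cddd_metabolites, cddd_generated])
labels = np.array([0] * len(cddd_metabolites) + [1] * len(cddd_generated))

embedding = umap.UMAP(n_neighbors=5).fit_transform(descriptors)  # (n, 2) coordinates
# embedding[labels == 0]: known metabolites; embedding[labels == 1]: generated molecules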

To more quantitatively evaluate the chemical similarity of generated molecules to known metabolites, we used a supervised machine learning approach to test whether the two sets of molecules could be distinguished from one another on the basis of their structures. This was achieved by again sampling an equal number of known metabolites and generated molecules (after removing known metabolites and duplicate structures from the latter), computing extended-connectivity fingerprints with a diameter of 3 and a length of 1,024 bits, and then splitting the resulting fingerprints into training and test sets in an 80/20 ratio. A random forest classifier was then trained to distinguish between known metabolites and generated molecules, using the implementation in scikit-learn. The performance of the classifier was measured using the area under the receiver operating characteristic curve (AUROC). To ensure that the observed failure of the classifier to separate known metabolites from generated molecules could not be trivially attributed to a poor classifier, an identical model was trained to separate the known metabolites in the training set from an equal number of structures derived from ChEMBL (version 28) containing only the atoms C, H, N, O, P and S; this control model accurately differentiated the two groups of structures.
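The sketch below illustrates this classification analysis with the RDKit and scikit-learn; the Morgan radius used to approximate the reported fingerprint diameter, and the input variable names, are assumptions rather than details from the paper.

# Sketch of the metabolite-likeness classifier (1,024-bit Morgan fingerprints,
# random forest, AUROC on a held-out 20% split).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def fingerprint(smiles, radius=2, n_bits=1024):
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))

# known_smiles and generated_smiles are assumed to be equal-sized lists of SMILES
X = np.array([fingerprint(s) for s in known_smiles + generated_smiles])
y = np.array([1] * len(known_smiles) + [0] * len(generated_smiles))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier().fit(X_train, y_train)
auroc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"AUROC = {auroc:.3f}")  # values near 0.5 mean the two sets are hard to distinguish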
