
Semantic design of functional de novo genes from a genomic language model


Evo 1.5 pretraining

Evo 1.5 was generated by extending the pretraining of the Evo 1 model. Evo 1 was originally trained at a sequence context of 8,192 tokens with an initial learning rate of 0.003 following a warmup of 1,200 iterations, then a cosine decay schedule with a maximum decay iteration of 120,000 and a minimum learning rate of 0.0003. Training used a global batch size of 4,194,304 tokens for 75,000 iterations, processing a total of 315 billion tokens. Other details on hyperparameters related to the model architecture and optimizer can be found in ref. 1. Pretraining, including all model states, optimizer states, the data-loading schedule and the learning rate schedule, was resumed from 75,000 iterations and continued to 112,000 iterations (processing a total of 470 billion tokens). The model trained up to 112,000 iterations is referred to as Evo 1.5.
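The warmup-plus-cosine-decay schedule described above can be sketched as a small function of the iteration number. The hyperparameter values are taken from the text; the linear shape of the warmup is an assumption, as the paper does not specify it.

```python
import math

def learning_rate(step, warmup=1200, max_decay=120000,
                  lr_max=0.003, lr_min=0.0003):
    """Learning rate at a given iteration: linear warmup (assumed shape)
    to lr_max, cosine decay to lr_min by max_decay, constant afterwards.
    Hyperparameter defaults follow the Evo 1 / Evo 1.5 pretraining setup."""
    if step < warmup:
        return lr_max * step / warmup          # assumed linear warmup
    if step >= max_decay:
        return lr_min                          # floor after decay ends
    progress = (step - warmup) / (max_decay - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

Under this sketch, resuming from iteration 75,000 simply continues the same decay curve, which is consistent with resuming all schedule state as described.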

As in Nguyen et al.1, Evo 1.5 was trained on the OpenGenome dataset, a comprehensive collection of prokaryotic genomic sequences. In brief, the dataset consists of approximately 80,000 prokaryotic genomes from bacteria and archaea and over 2 million phages and plasmid sequences, totalling approximately 300 billion nucleotides. OpenGenome was carefully curated to provide diverse, non-redundant genomic sequences from three primary sources: (1) bacterial and archaeal genomes from the Genome Taxonomy Database61, (2) prokaryotic viral sequences from the IMG/VR database62, and (3) plasmid sequences from the IMG/PR database63. To reduce redundancy, only representative genomes for each species, viral operational taxonomic unit, or plasmid taxonomic unit were retained. The dataset was further filtered to exclude potential eukaryotic viruses and sequences with poor taxonomic specificity. The training data include both positive and negative strands of the genomic sequences, allowing the model to learn the complementary nature of DNA. A detailed description of the dataset curation process is provided in Nguyen et al.1.

Autoregressive sampling

To sample from Evo, a standard low-temperature autoregressive sampling algorithm was used. Sampling code (https://github.com/evo-design/evo/) based on the reference implementation (https://github.com/togethercomputer/stripedhyena), which leverages KV-caching of Transformer layers and the recurrent formulation of Hyena layers64,65, was used to achieve efficient, low-memory autoregressive generation. Parameter optimization was performed across temperatures (0.1–1.5, increments of 0.1), top-k values (1–4) and top-p values (0.1–1.0, increments of 0.1) using the Evo 1.5 model. For each parameter combination, 100 1,000-nt sequences were generated from a test set of five gene prompts encoding 50% of a highly conserved protein. Following identification of ORFs in generated sequences using Prodigal (v2.6.3) with default parameters in metagenome mode (-p meta)66, generated proteins were aligned against the full-length prompt protein sequence using MAFFT (v7.526)67 for sequence identity calculations. For evaluating sequence degeneracy, DustMasker (v2.14.1+galaxy2)68,69,70 was run across the full-length generations using default parameters and the proportion of masked nucleotides was calculated. Final parameters (temperature = 0.7, top-k = 4, top-p = 1.0) were selected to maximize sequence completion accuracy while maintaining DustMasker proportions below 0.2, a value chosen to be slightly higher than the typical frequency of non-coding DNA in prokaryotic genomes71,72. All code for sampling and downstream analysis using Evo was written in Python (v3.11.8).
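A single step of the temperature/top-k/top-p sampling swept above can be sketched as follows. This is an illustrative NumPy implementation of the standard truncation logic, not the authors' code (which lives in the evo-design repository linked above).

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=4, top_p=1.0, rng=None):
    """One autoregressive sampling step with temperature scaling,
    top-k truncation and nucleus (top-p) truncation. Defaults match the
    final parameters reported in the text; illustrative sketch only."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k: zero out (via -inf) everything below the k-th highest logit.
    if top_k is not None:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass >= top_p.
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1
        kept = np.zeros_like(probs)
        kept[order[:cutoff]] = probs[order[:cutoff]]
        probs = kept / kept.sum()
    # Sample one token index from the truncated, renormalized distribution.
    return int(rng.choice(len(probs), p=probs))
```

With top-k = 1 this reduces to greedy decoding; with top-p = 1.0 (the selected value) the nucleus step is a no-op and only temperature and top-k shape the distribution.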

Sequence completion prompt compilation and evaluation

Sequences of highly conserved genes from across prokaryotic biology were downloaded in FASTA format from NCBI GenBank73. Selected genes included rpoS from E. coli K-12 (GenBank: NC_000913.3, coordinates c2866559–2867551), gyrA from S. enterica LT2 (GenBank: NC_003197.2, coordinates c2373710–2376422) and ftsZ from H. volcanii DS2 (GenBank: NC_013967.1, coordinates 643397–644536). Prompts were prepared by extracting 30%, 50% and 80% of the sequence length from the 5′ end. For the analysis of the gene completion ability of Evo 1.5 on sequences with varying levels of conservation, the moderately conserved genes gloA from E. coli K-12 (GenBank: U57363.1, coordinates 1–408) and pilA from Pseudomonas aeruginosa (GenBank: AE004091.2, coordinates c5069082–5069531), and the poorly conserved genes tnsA from E. coli O3 (GenBank: NZ_JALKIH010000010.1, coordinates 27218–28039) and yagL from E. coli K-12 (GenBank: U73857.1, coordinates c1018–1716), were used.
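The prompt preparation step is a simple 5′-end slice at each target fraction; a minimal sketch (function name and rounding convention are illustrative assumptions):

```python
def make_prompts(cds, fractions=(0.30, 0.50, 0.80)):
    """Extract prompts of the given fractional lengths from the 5' end of a
    coding sequence. Rounding to the nearest nucleotide is an assumption;
    the paper does not state how fractional lengths are rounded."""
    return {f: cds[: round(len(cds) * f)] for f in fractions}
```

For example, a 408-nt gene such as gloA would yield prompts of roughly 122, 204 and 326 nt.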

Sequence completion performance was evaluated across varying prompt lengths (30%, 50% and 80% of input sequence) using the optimal sampling parameters (temperature = 0.7, top-k = 4, top-p = 1) for the Evo 1.5, Evo 1 8K (previous version of Evo trained with a context length of 8,192 tokens) and Evo 1 131K (previous version of Evo trained with a context length of 131,072 tokens) models1. For each prompt, 500 sequences of length 2,500 nt were generated and filtered to remove generations with DustMasker proportions above 0.2. Prompts were subsequently appended to the start of each generated sequence and ORFs were identified using Prodigal (v2.6.3) with default parameters in metagenome mode (-p meta). Generated proteins were then aligned against their corresponding natural sequences using MAFFT (v7.526) with default parameters for sequence identity calculations.
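The sequence identity calculation on a MAFFT pairwise alignment can be sketched as below. The paper does not state the exact identity convention; this version counts matches over all columns that are not gaps in both sequences, which is one common choice among several (identity over alignment length, or over the shorter sequence, are alternatives).

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two gapped, equal-length aligned sequences
    (e.g. a MAFFT pairwise alignment). Convention assumed: matches divided
    by columns where at least one sequence has a residue."""
    assert len(aligned_a) == len(aligned_b), "alignment rows must be equal length"
    pairs = list(zip(aligned_a.upper(), aligned_b.upper()))
    matches = sum(a == b for a, b in pairs if a != "-" and b != "-")
    columns = sum(1 for a, b in pairs if a != "-" or b != "-")
    return 100.0 * matches / columns
```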

Operon completion prompt compilation and evaluation

Sequences encoding the trp operon and modABC operon from E. coli K-12 (GenBank: NC_000913.3) were downloaded in FASTA format from NCBI GenBank. For modABC, prompts were prepared from the full coding sequences for modA (coordinates 795089–795862), modB (coordinates 795862–796551), modC (coordinates 796554–797612) and acrZ (coordinates 794773–794922). For trp, prompts were prepared from the full coding sequences for trpE (coordinates c1321384–1322946), trpD (coordinates c1319789–1321384), trpC (coordinates c1318427–1319788), trpB (coordinates c1317222–1318415) and trpA (coordinates c1316416–1317222). For testing the ability of the model to generate sequences on the antisense strand, the reverse complement of each sense strand-derived prompt sequence was generated using Biopython74.
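The reverse-complement step uses Biopython's `Seq.reverse_complement()`; an equivalent stdlib-only sketch makes the operation explicit for readers without Biopython installed:

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string, mirroring what Biopython's
    Seq.reverse_complement() computes for the antisense-strand prompts."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]
```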
