For all that scientists talk of ‘decoding the genome’, the messy reality is that the genome isn’t even written in a single language. Scientists are fluent in the three-nucleotide codons that make up the protein-coding genes in DNA, but these represent only about 2% of the genomic text. The remainder is written in an entirely distinct language, which researchers are yet to untangle.
“Whenever we sequence a human individual, we get about 3.5 million variants, and only 0.6% of those will be in coding regions,” says Nadav Ahituv, a geneticist at the University of California, San Francisco. That fraction is relatively easy to interpret, Ahituv says, but for the rest, “we really don’t understand what it’s doing — we don’t have a regulatory code”.
But researchers are making progress on decoding the regulatory region of the genome, and learning the underlying grammar of the elements that govern when and where genes are turned on and off. To do so, they have turned to a suite of methods known as massively parallel reporter assays (MPRAs). These tools measure how millions of isolated genetic elements or sequence variants influence the expression of a hand-picked ‘reporter’ gene. That helps researchers to identify the genome’s control knobs and untangle their function without being overwhelmed by all the other parts of the genome. “It’s reducing it down to this synthetic system,” says Ryan Tewhey, a geneticist at the Jackson Laboratory in Bar Harbor, Maine. “But you’re still maintaining enough complexity to probe [genomic] space that we don’t fully understand, and that’s kind of the sweet spot.”
MPRAs could help to clarify the genetic foundations of disease, reveal the changes wrought by evolution and guide development of next-generation therapeutics. They could even be used to train artificial-intelligence systems to design genetic circuits, with applications in health care and other sectors. Customized regulatory elements could, for example, tighten the control of gene-based therapies and ensure that the treatments activate only in specific tissues and under particular conditions, minimizing the ever-present threat of off-target effects. “We’re really trying to engineer things that could be turned on very easily and simply, and even not by a drug,” says Ahituv.
The non-coding genome is not a complete black box, of course. Researchers have identified thousands of proteins, called transcription factors, that have established roles in gene expression, and have pinpointed the DNA sequences to which they bind.
By mapping these sequences across the genome, scientists can predict which ones initiate gene expression — these are known as promoters — and which act as volume knobs, or enhancers, to amplify that expression under specific conditions. They can also look for signatures of regulatory elements by probing genomic regions in which the DNA is exposed and ready for transcription. Chromosomal DNA is wrapped around proteins, forming material called chromatin. Elements in densely packed chromatin are generally inaccessible to regulatory proteins and thus inactive.
But validating these predictions has historically entailed a painstaking process of testing how different mutations in individual enhancers or promoters affect nearby genes. “I realized that if we wanted to ever computationally really analyse and predict enhancers, we would not need a handful of enhancers — we would need hundreds of thousands,” says Alexander Stark, a computational biologist at the Research Institute of Molecular Pathology in Vienna. Over the past 15 years or so, Stark and others have developed a range of methods, known generally as MPRAs, that enable such functional assessments at a previously unimaginable scale.
The core principles of these assays were established in 2009 (ref. 1). Jay Shendure and his colleagues at the University of Washington in Seattle used cloning to generate libraries of small, circular pieces of DNA known as plasmids in which a reporter gene was physically linked to hundreds of variations of a particular promoter. These variants represented every possible single-base mutation in that regulatory sequence, allowing the researchers to survey the role of each nucleotide individually. For identification purposes, each variant was linked to a unique DNA ‘barcode’. The researchers incubated their library in a test tube with the molecular components required for transcription and then sequenced the transcribed RNAs to determine the level of expression triggered by each promoter variant from the abundance of its associated barcode.
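For readers who think in code, the logic of that experiment can be sketched in a few lines of Python. This is an illustrative toy, not the researchers' actual pipeline: the six-base promoter, the barcode names (`BC000` and so on) and the function names are all hypothetical, and the readout here is simply each variant's share of the sequenced barcodes, with no further normalization.

```python
from collections import Counter

BASES = "ACGT"


def single_base_variants(seq):
    """Return every sequence that differs from `seq` at exactly one position."""
    variants = []
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                variants.append(seq[:i] + alt + seq[i + 1:])
    return variants


def expression_by_variant(rna_barcode_reads, barcode_to_variant):
    """Tally sequenced RNA barcodes per variant and return each variant's read share."""
    counts = Counter()
    for barcode in rna_barcode_reads:
        variant = barcode_to_variant.get(barcode)
        if variant is not None:  # ignore reads whose barcode is not in the library
            counts[variant] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {variant: n / total for variant, n in counts.items()}


# Hypothetical toy promoter; real regulatory sequences are far longer.
promoter = "TATAAT"
library = single_base_variants(promoter)

# Assumption for illustration: one made-up barcode per variant.
barcode_to_variant = {f"BC{i:03d}": v for i, v in enumerate(library)}

# Made-up sequencing reads: three for the first variant, one for the second,
# and one unrecognized barcode that is discarded.
reads = ["BC000", "BC000", "BC000", "BC001", "BCXXX"]
expression = expression_by_variant(reads, barcode_to_variant)
```

A sequence of length n has 3n single-base variants, so the toy promoter yields 18 library members. A barcode that turns up more often among the RNA reads points to a variant that drove more transcription.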