UTR-LM – a semi-supervised 5′ UTR language model for decoding untranslated regions of mRNA and function predictions

The untranslated regions (UTRs) of messenger RNA (mRNA) molecules often hold the key to understanding how genes are translated into proteins. These regions, located at the beginning and end of mRNA sequences, regulate the translation process and influence the expression levels of proteins—a fundamental aspect of cellular function. Now, a groundbreaking study by researchers at Princeton University has harnessed the power of language models to unlock the secrets hidden within the 5′ UTR, shedding new light on protein expression regulation.

Meet the UTR-LM—a specialized language model trained specifically to decipher the complex language of 5′ UTRs. Just like how language models have revolutionized natural language processing tasks, the UTR-LM has been trained on vast datasets of endogenous 5′ UTRs from various species, allowing it to understand the intricacies of these regulatory regions. But what sets the UTR-LM apart is its ability to go beyond mere sequence analysis—it is augmented with additional information such as secondary structure and minimum free energy, providing a more comprehensive understanding of 5′ UTR function.

Overview of the UTR-LM model for 5′ UTR function prediction and design

Fig. 1

a, The input to the proposed pretrained model is the 5′ UTR sequence, which is fed into the transformer layers via a randomly initialized 128-dimensional embedding for each nucleotide plus a special [CLS] token. The pretraining phase combines masked nucleotide (MN) prediction, 5′ UTR secondary structure (SS) prediction and 5′ UTR minimum free energy (MFE) prediction. b, Following pretraining, the [CLS] token is used for downstream task-specific training. c, The UTR-LM is fine-tuned for downstream tasks such as predicting mean ribosome loading (MRL), translation efficiency (TE), mRNA expression level (EL) and internal ribosome entry sites (IRES). d, Design of an in-house library of 5′ UTRs with high predicted TE, and wet-laboratory experimental validation using mRNA transfection and luciferase assays.
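To make the pretraining setup concrete, the sketch below shows how a 5′ UTR sequence might be prepared for the masked nucleotide (MN) objective: a [CLS] token is prepended, each nucleotide is mapped to an integer id, and a random subset of positions is masked for the model to recover. This is a minimal illustration, not the authors' code; the token vocabulary, the 15% mask rate, and the function names are assumptions.

```python
import random

# Hypothetical vocabulary: special tokens plus the four RNA nucleotides.
VOCAB = {"[CLS]": 0, "[MASK]": 1, "A": 2, "C": 3, "G": 4, "U": 5}

def tokenize(utr_seq):
    """Prepend a [CLS] token and map each nucleotide to an integer id."""
    return [VOCAB["[CLS]"]] + [VOCAB[nt] for nt in utr_seq]

def mask_tokens(token_ids, mask_rate=0.15, rng=None):
    """Randomly replace nucleotide positions with [MASK].

    Returns the masked sequence and a dict of {position: original id}
    that the model is trained to predict (the MN objective).
    """
    rng = rng or random.Random(0)
    masked, labels = list(token_ids), {}
    for i in range(1, len(token_ids)):  # position 0 is [CLS]; never mask it
        if rng.random() < mask_rate:
            labels[i] = token_ids[i]
            masked[i] = VOCAB["[MASK]"]
    return masked, labels

seq = "GGACUCCAAGCGGAGUAUG"   # toy 5' UTR fragment
ids = tokenize(seq)
masked_ids, labels = mask_tokens(ids)
```

In UTR-LM each id would then be looked up in a learned 128-dimensional embedding table before entering the transformer; the SS and MFE objectives are additional regression/classification heads trained alongside MN prediction.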

In a series of rigorous tests, the UTR-LM proved its mettle in predicting key aspects of mRNA translation and expression. It outperformed existing benchmark models by significant margins, accurately predicting mean ribosome loading, translation efficiency and mRNA expression levels. Moreover, the UTR-LM demonstrated its versatility by identifying previously unannotated internal ribosome entry sites within 5′ UTRs—a discovery with far-reaching implications for our understanding of translation initiation.

But the true test of the UTR-LM’s capabilities came in the form of practical application. The researchers leveraged the model to design a library of 211 novel 5′ UTRs optimized for translation efficiency—a critical factor in protein production. Wet-laboratory assays confirmed the efficacy of these designs, with the top performers achieving a remarkable 32.5% increase in protein production compared to established 5′ UTRs.

This groundbreaking study represents a significant leap forward in our ability to understand and manipulate protein expression at the molecular level. By harnessing the predictive power of language models, these researchers have unlocked new possibilities for optimizing gene expression, with potential applications ranging from biotechnology to therapeutics.

In the ever-evolving field of molecular biology, the UTR-LM stands as a testament to the power of interdisciplinary collaboration and innovation. As scientists continue to push the boundaries of what is possible, we can expect even greater insights into the intricate workings of the cellular machinery—and perhaps, new avenues for addressing complex diseases and unlocking the secrets of life itself.

Availability – The code is freely available at https://github.com/a96123155/UTR-LM

Chu Y, Yu D, Li Y, et al. (2024) A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions. Nat Mach Intell [Epub ahead of print]. [abstract]
