MarkTechPost@AI · October 16, 2024
Orthrus: A Mamba-based RNA Foundation Model Designed to Push the Boundaries of RNA Property Prediction

 

Orthrus is an RNA foundation model pre-trained with a contrastive learning objective and biological augmentations. Drawing on diverse data sources and methods, it performs strongly on RNA property prediction tasks, achieves high performance even in low-data settings, and opens a new avenue for RNA research.

🧬 Orthrus is an RNA foundation model pre-trained with a contrastive learning objective and biological augmentations, using data from multiple species to maximize the similarity between related RNA transcripts.

🎯 The model excels in low-data settings, reaching state-of-the-art RNA property prediction with minimal fine-tuning and significantly outperforming existing genomic models on mRNA property prediction tasks.

🔍 The study applies contrastive learning to RNA splicing and orthology, introduces several augmentation strategies, uses a Mamba encoder to learn from RNA data, and validates performance across multiple evaluation tasks.

💪 Orthrus uses contrastive learning to build structured representations of RNA transcripts, increasing the similarity of functionally related sequences, and outperforms other self-supervised models at predicting RNA properties.

Despite the vast accumulation of genomic data, the RNA regulatory code remains poorly understood. Genomic foundation models, pre-trained on large datasets, can adapt RNA representations for biological prediction tasks. However, current models rely on training strategies such as masked language modeling and next token prediction, borrowed from domains like text and vision, which lack biological insight. Experimental methods such as eCLIP and ribosome profiling help study RNA regulation but are expensive and time-consuming. Machine learning models trained on genetic sequences provide an efficient, cost-effective alternative, predicting essential cellular processes such as alternative splicing and RNA degradation.

Recent research proposes foundation models for genomics that use self-supervised learning (SSL) to train on unlabeled data, aiming to generalize across tasks with fewer labeled samples. Genomic sequences present challenges due to low diversity and high mutual information, as they are constrained by evolutionary forces. Consequently, SSL models often reconstruct non-informative parts of the genome, leading to ineffective representations for RNA prediction tasks. Despite improvements in model scaling, the performance gap between SSL-based approaches and supervised learning remains wide, indicating the need for better strategies in genomic modeling.

Researchers from institutions including the Vector Institute and the University of Toronto have introduced Orthrus, an RNA foundation model pre-trained using a contrastive learning objective with biological augmentations. Orthrus maximizes the similarity between RNA transcripts from splice isoforms and orthologous genes across species, using data from 10 model organisms and over 400 mammalian species in the Zoonomia Project. By leveraging functional and evolutionary relationships, Orthrus significantly outperforms existing genomic models on mRNA property prediction tasks. The model excels in low-data environments, requiring minimal fine-tuning to achieve state-of-the-art performance in RNA property predictions.

The study employs contrastive learning to analyze RNA splicing and orthology using modified InfoNCE loss. RNA isoforms and orthologous sequences are paired to identify functional similarities, and the model is trained to minimize the loss. The research introduces four augmentations—alternative splicing across species, orthologous transcripts from over 400 species, gene identity-based orthology, and masked sequence inputs. The Mamba encoder, a state-space model optimized for long sequences, is used to learn from RNA data. Evaluation tasks include RNA half-life, ribosome load, protein localization, and gene ontology classification, using various datasets for performance comparison.
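To make the pairing idea concrete, the snippet below is a minimal sketch of an InfoNCE-style contrastive loss over a batch of paired transcript embeddings, written in PyTorch. The function name, tensor shapes, and temperature value are illustrative assumptions for this sketch, not details from the Orthrus codebase; row i of the two batches holds a positive pair, for instance two splice isoforms of the same gene or orthologous transcripts from two species.

```python
# Illustrative InfoNCE-style contrastive loss over paired transcript embeddings.
# Names, shapes, and the temperature value are assumptions for this sketch,
# not details taken from the Orthrus codebase.
import torch
import torch.nn.functional as F

def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z_a[i] and z_b[i] embed a positive pair (e.g. two splice isoforms of the
    same gene, or orthologous transcripts from two species); all other rows in
    the batch act as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature            # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0))           # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 8 transcript pairs embedded in a 256-dimensional space.
loss = info_nce_loss(torch.randn(8, 256), torch.randn(8, 256))
```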

Orthrus employs contrastive learning to build a structured representation of RNA transcripts, increasing the similarity between functionally related sequences while minimizing it for unrelated ones. The pre-training dataset is constructed by pairing transcripts based on alternative splicing and orthologous relationships, assuming these pairs are functionally closer than random ones. Orthrus processes RNA sequences through the Mamba encoder and applies a decoupled contrastive learning (DCL) loss to distinguish between related and unrelated pairs. Results show Orthrus outperforms other self-supervised models in predicting RNA properties, demonstrating its effectiveness in tasks such as RNA half-life prediction and gene classification.
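The decoupled contrastive learning loss differs from standard InfoNCE mainly in that the positive pair is excluded from the normalizing denominator, which reduces the dependence on very large batches. The sketch below illustrates that difference in PyTorch; as with the earlier snippet, the function name, shapes, and temperature are assumptions, not the actual Orthrus implementation.

```python
# Illustrative decoupled contrastive learning (DCL) loss: unlike InfoNCE, the
# positive pair is excluded from the log-sum-exp denominator. All names and
# shapes are assumptions for this sketch, not the Orthrus implementation.
import torch
import torch.nn.functional as F

def dcl_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    sim = z_a @ z_b.T / temperature               # (batch, batch) similarity matrix
    pos = sim.diagonal()                          # similarity of each positive pair
    eye = torch.eye(sim.size(0), dtype=torch.bool)
    neg = torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1)  # negatives only
    return (neg - pos).mean()

loss = dcl_loss(torch.randn(8, 256), torch.randn(8, 256))
```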

In conclusion, Orthrus leverages an evolutionary and functional perspective to capture RNA diversity by using contrastive learning to model sequence similarities from speciation and alternative splicing events. Unlike prior self-supervised models focused on token prediction, Orthrus effectively pre-trains on evolutionarily related sequences, reducing reliance on genetic diversity. This approach enables strong RNA property predictions like half-life and ribosome load, even in low-data scenarios. While the method excels in capturing shared functional regions, potential limitations arise in cases where isoform variation minimally impacts certain RNA properties. Orthrus demonstrates superior performance over reconstruction-based methods, paving the way for improved RNA representation learning.
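As a rough illustration of the low-data regime, the sketch below fits a simple linear probe on frozen embeddings for a regression target such as RNA half-life. The embedding dimension, dataset size, random placeholder data, and training loop are all assumptions made for illustration, not the paper's evaluation protocol.

```python
# Minimal sketch of low-data adaptation: fit a linear probe on frozen embeddings
# from a pre-trained encoder for a regression target such as RNA half-life.
# Dimensions, placeholder data, and hyperparameters are assumptions.
import torch
import torch.nn as nn

embed_dim, n_labeled = 256, 200                   # small labeled set
embeddings = torch.randn(n_labeled, embed_dim)    # stand-in for frozen encoder outputs
half_life = torch.randn(n_labeled, 1)             # stand-in for measured half-lives

probe = nn.Linear(embed_dim, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(100):                              # a few hundred steps on a small set
    optimizer.zero_grad()
    loss = loss_fn(probe(embeddings), half_life)
    loss.backward()
    optimizer.step()
```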


Check out the Paper, Model on HF, and GitHub. All credit for this research goes to the researchers of this project.




Related tags

Orthrus · RNA property prediction · Contrastive learning · Mamba encoder