MarkTechPost@AI, December 5, 2024
EvolutionaryScale Releases ESM Cambrian: A New Family of Protein Language Models which Focuses on Creating Representations of the Underlying Biology of Protein

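EvolutionaryScale has released ESM Cambrian, a new language model trained on protein sequences that aims to provide a deeper understanding of protein structure and function. The model applies machine learning to millions of protein sequences spanning Earth's biodiversity in order to uncover the patterns and relationships underlying proteins. Just as large language models have changed our understanding of human language, ESM Cambrian focuses on protein sequences, which are fundamental to biological processes. It is intended as a general-purpose model that can predict protein structure and function and support new discoveries across species and protein families, with potential impact on fields such as drug discovery and bioengineering.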

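🤔 **ESM Cambrian is a protein language model based on the Transformer architecture; it uses self-attention to identify complex relationships within protein sequences and can be applied to predicting protein folding and discovering new functions.** Developed by EvolutionaryScale, the model comes in several sizes, including ESM C 300M, ESM C 600M, and ESM C 6B, for use by researchers and commercial users.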

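📊 **ESM Cambrian was trained in two stages: the first used a context length of 512, with metagenomic data making up 64% of the training set; the second increased the context length to 2048 and reduced the share of metagenomic data to 37.5%.** This staged approach let the model learn effectively from a diverse set of protein sequences and improved its ability to generalize.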

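🔬 **Early testing suggests that ESM Cambrian's predictions of protein structure and function are comparable to traditional experimental methods while saving substantial time and cost.** The model performs well on unsupervised learning of protein tertiary structure and generalizes well to understudied protein families, helping researchers uncover hidden relationships in sequences that are hard to analyze.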

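🌍 **ESM Cambrian is offered on AWS SageMaker, with a launch on NVIDIA BioNeMo planned, making it easier for commercial users to integrate it into existing workflows.** EvolutionaryScale has also released the weights of ESM C 300M and ESM C 600M, encouraging open science and collaboration toward a deeper understanding of the fundamentals of life on Earth.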

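🚀 **ESM Cambrian is expected to play a key role in protein engineering, drug discovery, and the deeper understanding of biological systems, advancing protein research.** Its release marks an important milestone for computational biology and protein science and shows the potential of AI in biological research.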

Understanding protein sequences and their functions has always been a challenging aspect of protein research. Proteins, often described as the building blocks of life, are made up of long, complex sequences that determine their roles in biological systems. Despite advancements in computational biology, making sense of these sequences in a meaningful way is still a difficult task. Traditional methods for analyzing proteins are both time-consuming and expensive. Even with recent technological progress, researchers struggle to map the vast diversity of protein structures and their functional variations found in nature. This gap between available data and practical insights remains a significant hurdle in developing new therapeutics, bioengineering solutions, and tackling broader challenges in health and environmental sciences. The need for a comprehensive tool to analyze proteins at an unprecedented scale has never been more urgent.

EvolutionaryScale has released ESM Cambrian, a new language model trained on protein sequences at a scale that captures the diversity of life on Earth. ESM Cambrian represents a major step forward in bioinformatics, using machine learning techniques to better understand protein structures and functions. The model has been trained on millions of protein sequences, covering an immense range of biodiversity, to uncover the underlying patterns and relationships in proteins. Just as large language models have transformed our understanding of human language, ESM Cambrian focuses on protein sequences, which are fundamental to biological processes. It aims to be a versatile model capable of predicting structure and function and of facilitating new discoveries across different species and protein families.

Technical Details

The technical foundation of ESM Cambrian is as impressive as its goals. EvolutionaryScale has released different versions of the model, including ESM C 300M and ESM C 600M, with the weights openly available for the research community. These models strike a balance between scale and practicality, enabling scientists to make powerful predictions without the infrastructure challenges that come with very large models. The largest variant, ESM C 6B, is available on EvolutionaryScale Forge for academic research and on AWS SageMaker for commercial use, with plans to launch on NVIDIA BioNeMo soon. These platforms make it easy for users in both academic and industrial settings to access this tool.

The model, based on the transformer architecture, uses self-attention mechanisms to identify complex relationships within protein sequences, making it well-suited for tasks like predicting protein folding or discovering novel functions. One of the main benefits of ESM Cambrian is its ability to generalize knowledge across different proteins, potentially speeding up the discovery of new drugs and synthetic biology applications.
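To make the self-attention idea concrete, here is a minimal sketch of single-head scaled dot-product attention over a short amino-acid sequence. It is an illustration only, not ESM Cambrian's implementation: the residue embeddings are fixed random vectors rather than learned ones, there are no learned query/key/value projections, and the names `embed` and `self_attention` and the example sequence are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def embed(sequence, dim=8, seed=0):
    """Map each residue to a fixed random vector (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    table = {aa: [rng.gauss(0.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}
    return [table[aa] for aa in sequence]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x (assumption: no projections)."""
    dim = len(x[0])
    weights = []
    for q in x:
        # Similarity of this residue to every residue in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all residue vectors.
    outputs = [
        [sum(w * v[d] for w, v in zip(row, x)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs


weights, outputs = self_attention(embed("MKTAYIAKQR"))
```

Row `i` of `weights` describes how strongly residue `i` attends to every other position, which is how a transformer can relate residues that are far apart in the sequence.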

ESM Cambrian was trained in two stages to achieve its high performance. In Stage 1, for the first 1 million training steps, the model used a context length of 512, with metagenomic data making up 64% of the training dataset. In Stage 2, the model underwent an additional 500,000 training steps, during which the context length was increased to 2048, and the proportion of metagenomic data was reduced to 37.5%. This staged approach allowed the model to learn effectively from a diverse set of protein sequences, improving its ability to generalize across different proteins.
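The two-stage schedule described above can be written down as a small configuration. The sketch below only encodes the figures quoted in this article (step counts, context lengths, metagenomic share); the `Stage` class and `stage_at` helper are hypothetical names for illustration, not part of any EvolutionaryScale code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int                   # number of training steps in this stage
    context_length: int          # maximum sequence length seen by the model
    metagenomic_fraction: float  # share of metagenomic data in the training mix


SCHEDULE = (
    Stage("stage_1", 1_000_000, 512, 0.64),
    Stage("stage_2", 500_000, 2048, 0.375),
)


def stage_at(step, schedule=SCHEDULE):
    """Return the stage active at a 0-indexed global training step."""
    start = 0
    for stage in schedule:
        if start <= step < start + stage.steps:
            return stage
        start += stage.steps
    raise ValueError(f"step {step} is past the end of the schedule")


total_steps = sum(stage.steps for stage in SCHEDULE)
```

With these figures the full run spans 1.5 million steps, and the switch to the longer context happens at step 1,000,000.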

Early Results and Insights

Early testing of ESM Cambrian has shown promising results. The model’s ability to predict the structure and function of protein sequences is comparable to traditional experimental methods, offering significant savings in both time and cost. Evaluations were conducted using the methodology of Rao et al. to measure the unsupervised learning of protein tertiary structure through contact maps. A logistic regression was used to identify contacts, and the precision of the top L contacts (P@L) was evaluated for proteins of length L, with a sequence separation of 6 or more residues. The average P@L was computed on a temporally held-out set of protein structures (with a cutoff date of May 1, 2023) for scaling laws and on the CASP15 benchmark for performance evaluation. Initial insights suggest that ESM Cambrian performs well in generalizing across poorly studied protein families, helping researchers uncover hidden relationships in sequences that are otherwise difficult to analyze. Its predictive accuracy also opens new possibilities in enzyme engineering, where understanding the subtle nuances of protein activity is crucial.
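The P@L metric mentioned above can be sketched in a few lines. This is an illustrative implementation, not the evaluation code used by EvolutionaryScale: it assumes the contact probabilities (for example, from the logistic regression) are already available as an L x L matrix, and the function name, the toy scores, and the choice to divide by the number of pairs actually ranked are assumptions made here.

```python
def precision_at_l(contact_probs, true_contacts, min_separation=6):
    """P@L: fraction of the L highest-scoring residue pairs that are true contacts.

    contact_probs: L x L nested list of predicted contact probabilities.
    true_contacts: set of (i, j) index pairs with i < j that are in contact.
    Only pairs with j - i >= min_separation are ranked.
    """
    length = len(contact_probs)
    candidates = [
        (contact_probs[i][j], i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    top = candidates[:length]  # keep the L most confident pairs
    if not top:
        return 0.0
    hits = sum((i, j) in true_contacts for _, i, j in top)
    return hits / len(top)


# Toy example: 12 residues, with scores that grow with the residue indices.
L = 12
probs = [[(i * L + j) / (L * L) for j in range(L)] for i in range(L)]
truth = {(5, 11), (4, 10), (3, 9), (0, 6)}
print(precision_at_l(probs, truth))  # 0.25
```

In the toy example, three of the twelve top-ranked pairs are true contacts; the pair (0, 6) is a real contact but scores too low to make the top L, so it does not count toward precision.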

The availability of ESM Cambrian on platforms like AWS SageMaker and NVIDIA BioNeMo will make it easier for commercial users to integrate machine learning tools into their existing workflows. EvolutionaryScale’s decision to release open weights for ESM C 300M and ESM C 600M reflects a commitment to open science, encouraging collaboration to better understand the fundamentals of life on Earth.

Conclusion

The release of ESM Cambrian by EvolutionaryScale marks an important milestone in computational biology and protein science. By providing a model that can analyze protein sequences at a scale that captures Earth’s biodiversity, EvolutionaryScale has shown the potential of applying AI in biological research and opened up numerous opportunities for accelerating discovery and innovation. ESM Cambrian is set to play a key role in protein engineering, drug discovery, and gaining a deeper understanding of biological systems. As the scientific community begins to explore the applications of this model, it is clear that protein research is evolving, with tools like ESM Cambrian leading the way.


