MarkTechPost@AI · October 1, 2024
AMPLIFY: Leveraging Data Quality Over Scale for Efficient Protein Language Model Development

AMPLIFY is an efficient protein language model that sharply reduces training and deployment costs by improving data quality rather than scaling up model size. The model was evaluated and refined across several strategies; its development spans data processing, model architecture, and more, and it has been open-sourced. The study also examines the factors that shape model performance and related questions.

🎯 AMPLIFY is a protein language model developed by researchers at Chandar Lab, Mila, and Amgen. It sharply reduces training and deployment costs and outperforms some earlier large-scale models while using 43 times fewer parameters.

📋 To create the validation sequence sets for the pLM, reference proteome sequences were combined with sequences from the Observed Antibody Space (OAS) and Structural Classification of Proteins (SCOP) databases, enabling task-specific validation.

💻 AMPLIFY's training data comes from the UniRef, OAS, SCOP, and UniProt databases, processed to remove sequences containing ambiguous amino acids or resembling the validation sets; its architecture incorporates improvements borrowed from large language models in natural language processing.

🔍 The study compares the effect of scaling pLM size against factors such as training dataset content, size, and training duration, finding that data curation markedly improves model performance, while larger models risk overfitting and call for periodic retraining.

Protein language models (pLMs), trained on protein sequence databases, aim to capture the fitness landscape for property prediction and design tasks. While scaling these models has become common, doing so assumes that the source databases accurately reflect the fitness landscape, which may not be true. Understanding protein function was historically tied to predicting structure with physics-based models, but as machine learning techniques have matured, they have proven more effective at modeling dynamic protein behavior. By treating protein sequences like natural language, pLMs can capture structural insights without relying solely on structure databases, revealing deeper functional relationships.
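To make the "sequences as language" framing concrete, the sketch below shows BERT-style masked-token preparation on a protein sequence, the training setup most pLMs build on. The 15% mask rate, helper names, and example sequence are illustrative assumptions, not taken from AMPLIFY's released code.

```python
# Illustrative sketch: hide a fraction of residues so a language model can be
# trained to predict them from the surrounding sequence context.
import random

MASK = "<mask>"

def mask_sequence(seq: str, mask_rate: float = 0.15):
    """Randomly replace ~15% of residues with a mask token (BERT-style)."""
    tokens = list(seq)
    targets = {}
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            targets[i] = tokens[i]   # remember the true residue
            tokens[i] = MASK         # hide it from the model
    return tokens, targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
masked, targets = mask_sequence(seq)
print(masked)    # sequence with some residues replaced by <mask>
print(targets)   # positions the model is trained to recover
```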

Researchers from Chandar Lab, Mila, and Amgen developed AMPLIFY, an efficient pLM that significantly reduces the cost of training and deployment compared to previous models. Unlike large-scale models like ESM2 and ProGen2, AMPLIFY focuses on improving data quality rather than model size, achieving superior performance with 43 times fewer parameters. The team evaluated three strategies—data quality, quantity, and training steps—finding that improving data quality alone can create state-of-the-art models. AMPLIFY has been open-sourced, including its codebase, data, and models, to make pLM development more accessible.

The validation data sequence sets for the pLM were created by combining reference proteome sequences with sequences from the Observed Antibody Space (OAS) and the Structural Classification of Proteins (SCOP) database. The aim was to enable task-specific validation, particularly for complementarity-determining regions of antibody sequences and sequence-to-structure tasks. High-quality reference proteomes were selected based on their BUSCO completeness scores, ensuring representation across Bacteria, Archaea, and Eukarya. Sequences lacking experimental validation or containing non-canonical amino acids were excluded. The final validation sets included 10,000 randomly selected sequences from each source after clustering to reduce redundancy.
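The steps above amount to a filter, deduplicate, and sample pipeline. The following sketch, assuming plain Python over already-parsed sequences, illustrates its shape; the clustering step is stubbed out as exact deduplication (the paper's pipeline would rely on a dedicated clustering tool), and the helper names are hypothetical.

```python
# Minimal sketch of validation-set construction: drop sequences containing
# non-canonical/ambiguous residues, reduce redundancy, then sample 10,000.
import random

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical amino acids

def is_canonical(seq: str) -> bool:
    """True if the sequence contains only the 20 canonical amino acids."""
    return set(seq.upper()) <= CANONICAL

def build_validation_set(sequences, n=10_000, seed=0):
    filtered = [s for s in sequences if is_canonical(s)]
    unique = list(dict.fromkeys(filtered))   # stand-in for redundancy reduction
    random.Random(seed).shuffle(unique)
    return unique[:n]

# Usage: pass sequences parsed from a reference proteome, OAS, or SCOP FASTA file.
# val_set = build_validation_set(parsed_sequences)
```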

For training data, the UniRef, OAS, SCOP, and UniProt databases were processed to remove sequences with ambiguous amino acids and those similar to validation set sequences. The training dataset specifically used paired heavy- and light-chain antibody sequences formatted with a chain break token. The AMPLIFY architecture incorporates recent improvements from large language models in natural language processing, including a SwiGLU activation function and a memory-efficient attention mechanism. Optimization used an enhanced AdamW optimizer with a cosine annealing scheduler, and training was run at lower precision with tooling such as DeepSpeed. The vocabulary was streamlined to better accommodate multi-chain proteins, and sequences longer than 512 residues were truncated during training to improve efficiency. After initial training, the context length was extended to 2,048 residues, followed by additional training steps for both AMPLIFY models.
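As an illustration of one of the architectural choices mentioned, here is a minimal SwiGLU feed-forward block in PyTorch; the layer names and dimensions are illustrative assumptions and are not taken from the released AMPLIFY code.

```python
# A SiLU-gated (SwiGLU) feed-forward block of the kind modern transformer
# language models use in place of a plain MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate the up-projection with SiLU, then project back to d_model.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example: a batch of residue embeddings passed through the block.
x = torch.randn(1, 512, 640)           # (batch, sequence length, d_model)
ffn = SwiGLUFeedForward(d_model=640, d_hidden=2560)
print(ffn(x).shape)                     # torch.Size([1, 512, 640])
```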

The study compared the impact of adjusting pLM size with factors like training dataset content, size, and duration. The authors improved their validation dataset by using sequences from UniRef100, antibody pairs from OAS, and SCOP domains, aiming for a more representative sample. They found that data curation significantly enhances model performance, independent of model size or training duration. Contrary to previous findings, they observed that performance improved beyond 500K updates, suggesting that using diverse training data is crucial. Additionally, larger models risk overfitting, indicating the need for regular retraining to adapt to evolving data quality and quantity.

Recent advancements in ML have focused on scaling neural networks, particularly in language models for text and proteins. This trend has made training state-of-the-art models prohibitively expensive for many researchers, often leading to restricted access. However, this study suggests that expertise from protein scientists can enhance the curation process, yielding competitive performance without the need for massive scale. Effective curation relies on a community-wide understanding of proteins, which remains limited. The study emphasizes the importance of collaborative expertise and advocates for open-source methods to facilitate iterative data curation and model development, ultimately aiding therapeutic advancements.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.



