MarkTechPost@AI, January 7
Researchers from USC and Prime Intellect Released METAGENE-1: A 7B Parameter Autoregressive Transformer Model Trained on Over 1.5T DNA and RNA Base Pairs

 

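METAGENE-1 is a 7-billion-parameter autoregressive Transformer model, introduced by researchers at the University of Southern California and other institutions, designed specifically for analyzing metagenomic sequences. It is trained on more than 1.5 trillion DNA and RNA base pairs drawn from human wastewater, using advanced sequencing technology and a byte-pair encoding (BPE) tokenization strategy to capture the complex genomic diversity in metagenomic data. METAGENE-1 performs well on tasks such as pathogen detection, anomaly detection, and species classification, and its open-source release encourages collaboration and further innovation, providing a powerful tool for biosurveillance and outbreak response.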

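🧬 METAGENE-1 is built on the Transformer architecture and uses a causal language modeling objective to predict the next token in a sequence. Its training data contains sequences from tens of thousands of species, covering the microbial and viral diversity present in human wastewater.

🧮 The model uses a BPE tokenization strategy that lets it process novel nucleic acid sequences efficiently, and it was trained stably on a large-scale dataset in a distributed training setup. It supports tasks such as pathogen detection, anomaly detection, and species classification, making it valuable for metagenomic research and public health.

📊 METAGENE-1 performs well on several benchmarks. For example, on pathogen detection based on human wastewater samples it reaches an average Matthews correlation coefficient (MCC) of 92.96, significantly outperforming other models, and on the Gene-MTEB benchmark it achieves a global average score of 0.59, highlighting its adaptability in zero-shot and fine-tuning settings.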

In a time when global health faces persistent threats from emerging pandemics, the need for advanced biosurveillance and pathogen detection systems is increasingly evident. Traditional genomic analysis methods, while effective in isolated cases, often struggle to address the complexities of large-scale health monitoring. A significant challenge is identifying and understanding the genomic diversity in environments such as wastewater, which contains a rich mix of microbial and viral DNA and RNA. The rapid advancements in biological research have further emphasized the importance of scalable, accurate, and interpretable models to analyze vast amounts of metagenomic data, aiding in the prediction and mitigation of health crises.

Researchers from the University of Southern California, Prime Intellect, and the Nucleic Acid Observatory have introduced METAGENE-1, a metagenomic foundation model. This 7-billion-parameter autoregressive transformer model is specifically designed to analyze metagenomic sequences. METAGENE-1 is trained on a dataset comprising over 1.5 trillion DNA and RNA base pairs derived from human wastewater samples, utilizing next-generation sequencing technologies and a tailored byte-pair encoding (BPE) tokenization strategy to capture the intricate genomic diversity present in these datasets. The model is open-sourced, encouraging collaboration and further advancements in the field.

Technical Highlights and Benefits

METAGENE-1’s architecture draws on modern transformer models, including the GPT and Llama families. It is a decoder-only transformer trained with a causal language modeling objective: at each position it predicts the next token from the tokens that precede it. Its key features include:

    Dataset Diversity: The training data encompasses sequences from tens of thousands of species, representing the microbial and viral diversity found in human wastewater.
    Tokenization Strategy: The use of BPE tokenization enables the model to process novel nucleic acid sequences efficiently.
    Training Infrastructure: Advanced distributed training setups ensured stable training on large datasets despite hardware limitations.
    Applications: METAGENE-1 supports tasks like pathogen detection, anomaly detection, and species classification, making it valuable for metagenomic studies and public health research.
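
To make the tokenization point concrete, here is a toy byte-pair encoding routine applied to nucleotide strings. It is a minimal sketch and not METAGENE-1's actual tokenizer: the function names, the two example reads, and the merge count are all hypothetical, chosen only to show how BPE builds multi-base tokens from frequent adjacent pairs.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn an ordered list of merge rules from nucleotide strings (toy version)."""
    # Start from single-character tokens such as A, C, G, T.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself so the
        # result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Tokenize a new sequence by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Hypothetical example reads, not real wastewater data.
merges = learn_bpe_merges(["ACGTACGT", "ACGACG"], num_merges=2)
print(merges)                      # [('C', 'G'), ('A', 'CG')]
print(tokenize("TTACGT", merges))  # ['T', 'T', 'ACG', 'T']
```

Because any sequence can always fall back to single-base tokens, a read the tokenizer has never seen is still encoded without loss, which is one reason BPE suits novel nucleic acid sequences.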

These features enable METAGENE-1 to generate high-quality sequence embeddings and adapt to specific tasks, enhancing its utility in the genomic and public health domains.
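
The causal language modeling objective mentioned earlier in this section can be written as an average next-token cross-entropy. The sketch below uses plain Python rather than a deep learning framework; `causal_lm_loss`, the four-token vocabulary, and the example logits are illustrative assumptions, not details taken from the METAGENE-1 training code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_lm_loss(token_ids, logits_per_position):
    """Average next-token cross-entropy for one sequence.

    logits_per_position[t] holds the model's scores over the vocabulary after
    it has seen token_ids[0..t]; the training target is token_ids[t + 1].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Hypothetical 4-token vocabulary and a 4-token sequence.
tokens = [0, 1, 2, 3]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
print(causal_lm_loss(tokens, uniform_logits))  # equals log(4) for a uniform guess
```

A model that assigns higher scores to the true next token drives this loss toward zero, which is the quantity a decoder-only transformer minimizes during pretraining.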

Results and Insights

The capabilities of METAGENE-1 were assessed on multiple benchmarks. In a pathogen detection benchmark based on human wastewater samples, the model achieved an average Matthews correlation coefficient (MCC) of 92.96, significantly outperforming other models. Because the raw MCC ranges from -1 to 1, this figure is evidently reported as MCC multiplied by 100. METAGENE-1 also showed strong results in anomaly detection tasks, effectively distinguishing metagenomic sequences from other genomic data sources.
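
For readers unfamiliar with the metric, the following is a minimal sketch of the binary Matthews correlation coefficient computed from a confusion matrix. The labels and predictions are made up for illustration and are unrelated to the benchmark data; multiplying by 100 is shown only because a score such as 92.96 cannot be a raw MCC, which lies between -1 and 1.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from 0/1 labels; returns 0.0 when the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical labels: 1 = pathogen read, 0 = non-pathogen read.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print(round(100 * mcc, 2))  # 33.33
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are imbalanced.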

In embedding-based genomic analyses, METAGENE-1 excelled on the Gene-MTEB benchmark, achieving a global average score of 0.59. This performance underscores its adaptability in both zero-shot and fine-tuning scenarios, reinforcing its value in handling complex and diverse metagenomic data.
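
Embedding benchmarks of this kind score a fixed-length vector per sequence. One common way to obtain such a vector from a transformer is to average its per-token hidden states, then compare sequences by cosine similarity. The sketch below assumes mean pooling and uses tiny made-up vectors; the source does not say which pooling METAGENE-1 uses, so treat both the method and the names `mean_pool` and `cosine_similarity` as illustrative.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into a single sequence embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-dimensional hidden states for two short reads.
read_a = mean_pool([[1.0, 0.0], [0.0, 1.0]])
read_b = mean_pool([[2.0, 0.0], [0.0, 2.0]])
print(read_a)                             # [0.5, 0.5]
print(cosine_similarity(read_a, read_b))
```

In a zero-shot setting the pooled embeddings are used as they are, for example in nearest-neighbor classification or clustering, while fine-tuning updates the model weights for a specific task.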

Conclusion

METAGENE-1 represents a thoughtful integration of artificial intelligence and metagenomics. By leveraging transformer architectures, the model offers practical solutions for biosurveillance and pandemic preparedness. Its open-source release invites researchers to collaborate and innovate, advancing the field of genomic science. As challenges related to emerging pathogens and global pandemics continue, METAGENE-1 demonstrates how technology can play a crucial role in addressing public health concerns effectively and responsibly.


Check out the Paper, Website, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.


The post Researchers from USC and Prime Intellect Released METAGENE-1: A 7B Parameter Autoregressive Transformer Model Trained on Over 1.5T DNA and RNA Base Pairs appeared first on MarkTechPost.
