MarkTechPost@AI · February 22
Meet Baichuan-M1: A New Series of Large Language Models Trained on 20T Tokens with a Dedicated Focus on Enhancing Medical Capabilities

Baichuan Inc. has introduced Baichuan-M1, a series of large language models designed specifically for medical applications. Unlike traditional models that improve existing architectures through additional pretraining or post-training, Baichuan-M1 is built from scratch with a focus on medical expertise. Trained on 20 trillion tokens of both general and medical-specific data, the model balances broad language understanding with domain-specific precision. It performs well on general tasks such as coding and mathematics as well as on medical applications such as diagnosis and treatment recommendations. With an optimized Transformer architecture, Baichuan-M1 sets a new benchmark for AI-driven advances in healthcare.

🏥 Baichuan-M1 is a series of large language models designed specifically for medical applications. Unlike traditional models, it is built from scratch with a focus on medical expertise.

🧠 Baichuan-M1 is trained on 20 trillion tokens of both general and medical-specific data, balancing broad language understanding with domain-specific precision. It excels at general tasks such as coding and mathematics as well as medical applications such as diagnosis and treatment recommendations.

⚙️ The Baichuan-M1 architecture follows the Llama framework, combining pre-norm RMSNorm, SwiGLU in the FFN layers, and rotary position embeddings. The work also integrates global and sliding-window attention to optimize inference efficiency and increases the head dimension to 256.

📊 Baichuan-M1-14B-Base's code and mathematical abilities were evaluated against the Qwen2.5 series of models across various benchmarks. The results show that Baichuan-M1-14B-Base is competitive on certain tasks, demonstrating its strengths in code generation and mathematical reasoning.

While LLMs have shown remarkable advancements in general-purpose applications, their development for specialized fields like medicine remains limited. The complexity of medical knowledge and the scarcity of high-quality, domain-specific data make creating highly efficient medical LLMs challenging. Although models like GPT-4 and DeepSeek-R1 have demonstrated impressive capabilities across industries, their adaptation to the medical domain is hindered by the intricate nature of medical terminology, diverse disciplines, and constantly evolving literature. Unlike general applications, medical AI must interpret highly technical language and provide precise, contextually relevant responses, which traditional LLMs struggle to achieve.

One major obstacle in building effective medical LLMs is the limited accessibility of high-quality training data, which is restricted due to privacy concerns and regulatory barriers. Medical datasets consist of structured and unstructured information, including clinical notes, textbooks, and research articles, making comprehensive model training difficult. While approaches like fine-tuning general LLMs on medical datasets and applying transfer learning have been explored, these methods often fail to grasp the depth of medical knowledge fully. As a result, such models may perform well on specific tasks but lack the nuanced understanding necessary for complex medical inquiries, highlighting the need for more refined training strategies.

Researchers at Baichuan Inc. introduced Baichuan-M1, a specialized large language model series designed specifically for medical applications. Unlike traditional models that refine existing architectures through additional pretraining or post-training, Baichuan-M1 is built from scratch with a strong focus on medical expertise. Trained on 20 trillion tokens, including both general and medical-specific data, the model balances broad language understanding with domain-specific precision. It excels in general tasks like coding and mathematics and in medical applications such as diagnostics and treatment recommendations. With an optimized Transformer architecture, Baichuan-M1 sets a new benchmark for AI-driven advancements in healthcare.

The model architecture follows Llama and similar frameworks, incorporating pre-norm RMSNorm, SwiGLU in the FFN layer, and rotary position embeddings. The study integrates global and sliding-window attention to optimize inference efficiency, increasing the head dimension to 256 for global layers. Additionally, temporal short convolutions on key-value attention enhance in-context learning. The model employs a hybrid tokenizer for medical and general text, a curriculum-based training strategy with progressive data complexity, and adaptive gradient clipping for stability. Supervised fine-tuning refines general reasoning and medical-specific tasks, ensuring robust language understanding, medical reasoning, and long-document handling capabilities while maintaining inference efficiency.
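
To make those components concrete, here is a minimal PyTorch sketch of a pre-norm decoder block combining RMSNorm, a SwiGLU feed-forward layer, and rotary position embeddings on the attention queries and keys. It only illustrates the building blocks named above, not Baichuan-M1's actual implementation: the dimensions are placeholders, and the global/sliding-window attention mix and key-value convolutions are omitted.

```python
# Illustrative pre-norm decoder block: RMSNorm + RoPE attention + SwiGLU FFN.
# Hyperparameters are placeholders, not Baichuan-M1's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(gate(x)) * up(x), projected back to the model dimension.
        return self.down(F.silu(self.gate(x)) * self.up(x))


def rotary_embed(x, base: float = 10000.0):
    # Apply rotary position embeddings to x of shape (batch, heads, seq, head_dim).
    _, _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(t, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class DecoderBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 4, head_dim: int = 128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.attn_norm = RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * n_heads * head_dim, bias=False)
        self.proj = nn.Linear(n_heads * head_dim, dim, bias=False)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden=4 * dim)

    def forward(self, x):
        b, t, _ = x.shape
        # Pre-norm causal attention sub-layer with RoPE on queries and keys.
        h = self.attn_norm(x)
        q, k, v = self.qkv(h).view(b, t, 3, self.n_heads, self.head_dim).unbind(dim=2)
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))
        q, k = rotary_embed(q), rotary_embed(k)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, t, -1))
        # Pre-norm SwiGLU feed-forward sub-layer with residual connection.
        return x + self.ffn(self.ffn_norm(x))


if __name__ == "__main__":
    block = DecoderBlock()
    print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```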

Using various benchmarks, Baichuan-M1-14B-Base's code and mathematical abilities were evaluated against the Qwen2.5 series models. Code generation performance was tested with the EvalPlus framework and BigCodeBench, while mathematical proficiency was assessed using the MATH and CMATH datasets. Although the 14B-Instruct variant still lags behind proprietary models like Claude-3.5-Sonnet and GPT-4o, the gap has narrowed significantly. The results demonstrate that Baichuan-M1-14B-Base performs competitively in certain tasks, showcasing its code generation and mathematical reasoning strengths compared to other advanced models.
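
For readers who want to try the released checkpoints themselves, below is a minimal sketch of prompting the instruct model with Hugging Face transformers. The repository id and the trust_remote_code requirement are assumptions about how the checkpoint is distributed; check the model card before running it.

```python
# Minimal generation sketch with Hugging Face transformers.
# The repo id below is an assumption; verify it on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baichuan-inc/Baichuan-M1-14B-Instruct"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Chat-style prompt for a medical question; apply_chat_template builds the
# prompt in the format defined by the model's tokenizer configuration.
messages = [
    {"role": "user", "content": "What are common first-line treatments for type 2 diabetes?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```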

In conclusion, traditional methods for adapting LLMs to specialized fields often involve fine-tuning existing models. However, experiments suggest that continued training of pre-existing models struggles to deliver domain-specific improvements without sacrificing general performance. In the medical domain, fine-tuning general models with domain-specific data may be less effective than training from scratch. Baichuan-M1 was developed with this approach, using 20 trillion tokens to enhance medical expertise while maintaining general capabilities. Open-sourcing Baichuan-M1-14B enables further research, though challenges remain in rare disease diagnosis and real-world applications. Its continued evolution could significantly advance AI-driven medical decision-making.


Check out the Paper, Baichuan-M1-14B-Base and Baichuan-M1-14B-Instruct. All credit for this research goes to the researchers of this project.


