MarkTechPost@AI June 25, 16:25
ByteDance Researchers Introduce Seed-Coder: A Model-Centric Code LLM Trained on 6 Trillion Tokens

ByteDance researchers have introduced Seed-Coder, a family of model-centric code LLMs designed to reduce human intervention by using LLMs to filter and curate large-scale code data from GitHub and the web. Seed-Coder comprises base, instruction, and reasoning models that excel at code generation, editing, and multi-step reasoning, even surpassing larger models. The models are trained on 6 trillion tokens and further optimized through instruction tuning and LongCoT techniques. Seed-Coder's open-source release supports further community research and development on code LLMs.

💻 Seed-Coder adopts a model-centric approach that reduces human intervention by using LLMs to evaluate and filter code data. This avoids time-consuming, bias-prone hand-crafted rules and improves data quality.

📊 Seed-Coder's training data is drawn from GitHub code, commit histories, and code-related websites, totaling roughly 6 trillion tokens. During pretraining, basic filtering first removes files with syntax errors or inappropriate content, after which LLMs evaluate and score the remainder to ensure data quality.

🚀 After pretraining, Seed-Coder is further optimized through instruction tuning and LongCoT. The instruction model is supervised fine-tuned to better understand and follow human instructions, while the reasoning model uses LongCoT reinforcement learning to strengthen its ability to handle multi-step coding problems.

Reframing Code LLM Training through Scalable, Automated Data Pipelines

Code data plays a key role in training LLMs, benefiting not just coding tasks but also broader reasoning abilities. While many open-source models rely on manual filtering and expert-crafted rules to curate code datasets, these approaches are time-consuming, prone to bias, and hard to scale across languages. Proprietary models like Claude 3.7 and OpenAI o3 excel at coding tasks but do not share details about their data, and even open-source models like DeepSeek and Qwen2.5 still depend heavily on human-designed filters. This reliance limits progress, echoing "The Bitter Lesson": real breakthroughs come from scalable, data-driven methods, not handcrafted heuristics.

Seed-Coder’s Model-First Pipeline Minimizes Human Dependency in Pretraining

Researchers at ByteDance introduce Seed-Coder, a family of 8B open-source LLMs including base, instruction, and reasoning models, designed to reduce human involvement in code data curation. Instead of relying on manual rules, their model-centric pipeline utilizes LLMs to score and filter large-scale code data from sources such as GitHub and code-related websites, resulting in a 6-trillion-token dataset. The instruction model is fine-tuned using synthetic data and preference optimization, while the reasoning model enhances multi-step code logic via Long-Chain-of-Thought reinforcement learning. Seed-Coder achieves top performance for its size, often surpassing larger models, and is openly shared to encourage further research and development. 

6-Trillion Token Corpus Built with LLM Quality Filters across GitHub and Web Data

Seed-Coder is trained using a model-driven approach that minimizes manual intervention. The pretraining corpus comprises approximately 6 trillion tokens drawn from GitHub code, commit histories, and code-related web data. Initially, basic filtering removes files with syntax issues or inappropriate content; large language models then evaluate and score the remaining code, ensuring high-quality data without relying on hand-crafted rules. Pretraining proceeds in two stages: first on core code and web data, and later on more complex structures such as full repositories and long-context tasks like fill-in-the-middle, to strengthen the model's coding capabilities.
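Concretely, this kind of curation can be thought of as a two-stage pass: a cheap syntactic pre-filter followed by an LLM-assigned quality score. The sketch below is only an illustration of the idea, not Seed-Coder's actual pipeline; the prompt, the 0-10 scale, the threshold, and the `score_with_llm` helper are all hypothetical.

```python
# Hypothetical sketch of an LLM-based code-quality filter (not Seed-Coder's actual pipeline).
# The prompt, 0-10 scale, threshold, and the llm object's .generate() API are illustrative assumptions.

import re
from typing import Iterable, Iterator

QUALITY_PROMPT = (
    "Rate the following code file from 0 to 10 for readability, correctness, "
    "and educational value. Reply with a single integer.\n\n{code}"
)

def score_with_llm(code: str, llm) -> float:
    """Ask a scoring LLM to grade one code file; fall back to 0 if the reply is unparsable."""
    reply = llm.generate(QUALITY_PROMPT.format(code=code[:8000]))  # truncate very long files
    match = re.search(r"\d+", reply)
    return float(match.group()) if match else 0.0

def basic_filter(code: str) -> bool:
    """Cheap syntactic pre-filter before spending LLM calls (Python files only in this sketch)."""
    try:
        compile(code, "<candidate>", "exec")
    except (SyntaxError, ValueError):
        return False
    return len(code.strip()) > 0

def curate(files: Iterable[str], llm, threshold: float = 6.0) -> Iterator[str]:
    """Two-stage curation: rule-light syntactic check, then model-based scoring."""
    for code in files:
        if basic_filter(code) and score_with_llm(code, llm) >= threshold:
            yield code
```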

Post-Training via Instruction Tuning and LongCoT Enables Multi-Step Code Understanding

After pretraining, Seed-Coder undergoes further refinement through two post-training stages. First, the instruction model is trained using supervised fine-tuning on a diverse set of synthetic instruction data generated and filtered by LLMs, helping it better understand and follow human prompts. Then, its performance is enhanced using direct preference optimization (DPO), which aligns model responses more closely with human preferences. For complex reasoning tasks, the reasoning model is improved using LongCoT reinforcement learning, which strengthens its ability to handle multi-step coding challenges. These steps significantly boost Seed-Coder’s performance across various code generation and reasoning tasks. 
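For the DPO step, the objective being optimized is the standard preference loss over chosen and rejected responses. Below is a minimal PyTorch sketch of that loss computed from precomputed sequence log-probabilities; the β value and tensor names are illustrative and are not taken from Seed-Coder's training setup.

```python
# Minimal PyTorch sketch of the DPO objective on precomputed sequence log-probs.
# beta and all tensor names are illustrative; this is not Seed-Coder's training code.

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi(y_w | x) under the policy being trained
    policy_rejected_logps: torch.Tensor,  # log pi(y_l | x) under the policy being trained
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x) under the frozen reference model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x) under the frozen reference model
    beta: float = 0.1,
) -> torch.Tensor:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)), averaged over the batch."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```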

Seed-Coder Excels in Code Generation, Editing, and Multi-Step Reasoning Benchmarks

The evaluation reveals that the three Seed-Coder models (Base, Instruct, and Reasoning) perform exceptionally well across a range of coding tasks. The Base model outperforms other open-source models of similar size on code generation tasks, achieving strong scores on benchmarks like HumanEval and MultiPL-E. The Instruct model excels in tasks requiring code editing and instruction-following, leading in evaluations such as CodeEditorBench and FullStack. The Reasoning model, trained with long-chain-of-thought techniques, demonstrates outstanding multi-step problem-solving skills, particularly on challenging benchmarks like LiveCodeBench and Codeforces, even surpassing models several times its size.
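Benchmarks such as HumanEval and MultiPL-E score models by executing generated completions against reference tests. The simplified pass@1 check below illustrates the idea under assumed problem fields and helper names; real harnesses run untrusted code in a sandbox.

```python
# Simplified sketch of an execution-based pass@1 check, HumanEval-style.
# The Problem fields and helpers are illustrative; real harnesses sandbox untrusted code.

from dataclasses import dataclass

@dataclass
class Problem:
    prompt: str       # function signature + docstring shown to the model
    test_code: str    # defines check(candidate) containing assert-based tests
    entry_point: str  # name of the function under test

def passes(problem: Problem, completion: str) -> bool:
    """Execute prompt + completion + tests in one namespace; any exception counts as failure."""
    program = problem.prompt + completion + "\n" + problem.test_code
    program += f"\ncheck({problem.entry_point})"
    try:
        exec(program, {"__name__": "__candidate__"})  # WARNING: untrusted code, sandbox in practice
        return True
    except Exception:
        return False

def pass_at_1(problems: list[Problem], completions: list[str]) -> float:
    """Fraction of problems whose single sampled completion passes all tests."""
    results = [passes(p, c) for p, c in zip(problems, completions)]
    return sum(results) / len(results)
```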

Open-Source Release Encourages Community-Driven Advancements in Code LLMs

In conclusion, Seed-Coder is a family of efficient, high-performing open-source language models designed specifically for coding tasks. These models stand out by relying largely on LLMs rather than humans to filter and curate training data, significantly reducing manual effort. Despite being trained on fewer tokens than some larger models, Seed-Coder exhibits exceptional performance in tasks such as code generation, completion, editing, and reasoning. However, its general language understanding remains limited by the absence of broad web data and mathematical content in its training corpus. Future updates aim to expand the model family and improve its capabilities across different model sizes.
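Since the checkpoints are openly released, trying the instruction model locally should amount to a standard Hugging Face Transformers load-and-generate loop, as in the sketch below; the repository id shown is an assumption and should be checked against the official release.

```python
# Sketch of loading a released Seed-Coder checkpoint with Hugging Face Transformers.
# The repo id below is assumed, not verified against the official release; adjust as needed.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ByteDance-Seed/Seed-Coder-8B-Instruct"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```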


Check out the Paper, Model Series, GitHub Page and Project Page. All credit for this research goes to the researchers of this project.
