MarkTechPost@AI, July 12, 2024
Microsoft Research Introduces AgentInstruct: A Multi-Agent Workflow Framework for Enhancing Synthetic Data Quality and Diversity in AI Model Training

 

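A Microsoft Research team has introduced AgentInstruct, a multi-agent workflow framework aimed at the quality and diversity problems of synthetic data used in AI model training. By generating high-quality synthetic data automatically, AgentInstruct reduces human intervention and improves both the efficiency of data generation and the quality of the data, which in turn improves the performance and reliability of AI models.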

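🤔 AgentInstruct is a multi-agent workflow framework that uses content transformation, instruction generation, and refinement flows to generate diverse, high-quality synthetic data. It takes raw data sources such as text documents and code files as seeds and generates data automatically, reducing the need for human intervention.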

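🚀 AgentInstruct uses powerful models and tools, such as search APIs and code interpreters, to generate high-quality prompts and responses, so that the resulting synthetic data is both complex and diverse. The framework can generate data covering a wide range of skills, including text editing, creative writing, tool use, coding, and reading comprehension.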

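📈 The researchers used AgentInstruct to create a synthetic post-training dataset of 25 million pairs and used it to train the Orca-3 model. Compared with other instruction-tuned models built on the same base model, Orca-3 performed strongly on multiple benchmarks, for example improving by 40% on AGIEval, 54% on GSM8K, and 45% on AlpacaEval.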

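💡 By generating high-quality synthetic data automatically, AgentInstruct addresses the problems of manual data curation and data quality and provides stronger support for AI model training. Its successful application opens up new possibilities for AI model training and lays a solid foundation for future AI development.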

Large language models (LLMs) have been instrumental in various applications, such as chatbots, content creation, and data analysis, due to their capability to process vast amounts of textual data efficiently. The rapid advancement in AI technology has heightened the demand for high-quality training data, which these models need both to function effectively and to improve.

One of the significant challenges in AI development is ensuring that the synthetic data used to train these models is diverse and of high quality. Synthetic data generation often requires extensive human effort for curation and filtering to ensure it meets the necessary standards. Without this quality control, there is a substantial risk of model collapse, where the models degrade over time due to the lack of variety and quality in the training data. This can lead to ineffective learning outcomes and biased results, limiting the models’ applicability in real-world scenarios.

Generating synthetic data involves using powerful models, such as GPT-4, to create responses to a set of prompts. Although effective, this method still necessitates significant human intervention to ensure the data’s relevance and quality. Researchers have developed techniques like step-by-step instructions and complex prompting to improve the quality of the generated data. Despite these efforts, the process remains labor-intensive and prone to inconsistencies.

Researchers from Microsoft Research introduced a novel framework known as AgentInstruct to address these challenges. This agentic framework automates the creation of diverse and high-quality synthetic data using raw data sources like text documents and code files as seeds. By leveraging advanced models and tools, AgentInstruct significantly reduces the need for human curation, streamlining the data generation process and enhancing the overall quality and diversity of the training data.

AgentInstruct employs a multi-agent workflow comprising content transformation, instruction generation, and refinement flows. This structured approach allows the framework to autonomously produce a wide variety of data, ensuring the generated content is complex and diverse. The system can create prompts and responses using powerful models and tools like search APIs and code interpreters. This method ensures high-quality data and introduces significant variety, which is crucial for comprehensive training.

The researchers demonstrated the efficacy of AgentInstruct by creating a synthetic post-training dataset of 25 million pairs to teach various skills to language models. These skills included text editing, creative writing, tool usage, coding, and reading comprehension. The dataset was used to post-train a model called Orca-3, based on the Mistral-7b model. Compared with other instruction-tuned models built on the same base model, Orca-3 improved on multiple benchmarks: 40% on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval. Additionally, the model showed a 31.34% reduction in hallucinations across various summarization benchmarks, highlighting its enhanced accuracy and reliability.
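The percentages above are easier to interpret with the arithmetic spelled out. The article does not say whether each figure is a relative gain over a baseline score or an absolute change in points, so the sketch below assumes relative change. The function names and the scores in the usage lines are hypothetical placeholders, not results from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percent gain of `new` over `baseline`, for metrics where higher is better."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (new - baseline) / baseline


def relative_reduction(baseline: float, new: float) -> float:
    """Percent drop from `baseline` to `new`, for metrics where lower is better
    (for example, a hallucination rate)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


# Placeholder scores chosen only to illustrate the calculation.
gain = relative_improvement(50.0, 70.0)  # score rising from 50 to 70
drop = relative_reduction(20.0, 15.0)    # rate falling from 20 to 15
print(f"relative improvement: {gain:.2f}%")
print(f"relative reduction: {drop:.2f}%")
```

Under this reading, a 40% improvement means the new score is 1.4 times the baseline score, which is a different claim from a gain of 40 percentage points.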

The content transformation flow within AgentInstruct converts raw seed data into intermediate representations that simplify the creation of specific instructions. The seed instruction generation flow then takes these transformed seeds and generates diverse instructions following a comprehensive taxonomy. Finally, the instruction refinement flow iteratively enhances the complexity and quality of these instructions, ensuring the generated data’s robustness and applicability.
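The three flows described above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the agents are plain Python functions standing in for LLM calls, and every name here (`Instruction`, `run_pipeline`, `toy_transform`, `toy_refine`, the two-entry taxonomy) is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an LLM-backed agent: text in, text out.
Agent = Callable[[str], str]


@dataclass
class Instruction:
    skill: str
    prompt: str
    refinement_rounds: int = 0


def content_transformation(seed: str, transform: Agent) -> str:
    """Flow 1: turn a raw seed into an intermediate representation."""
    return transform(seed)


def seed_instruction_generation(intermediate: str, taxonomy: List[str]) -> List[Instruction]:
    """Flow 2: emit one instruction per taxonomy entry for the transformed seed."""
    return [
        Instruction(skill=skill, prompt=f"[{skill}] Based on: {intermediate}")
        for skill in taxonomy
    ]


def instruction_refinement(instr: Instruction, refine: Agent, rounds: int) -> Instruction:
    """Flow 3: rewrite the prompt `rounds` times to raise its complexity."""
    prompt = instr.prompt
    for _ in range(rounds):
        prompt = refine(prompt)
    return Instruction(instr.skill, prompt, instr.refinement_rounds + rounds)


def run_pipeline(seeds: List[str], taxonomy: List[str],
                 transform: Agent, refine: Agent, rounds: int = 2) -> List[Instruction]:
    """Chain the three flows over every seed."""
    dataset: List[Instruction] = []
    for seed in seeds:
        intermediate = content_transformation(seed, transform)
        for instr in seed_instruction_generation(intermediate, taxonomy):
            dataset.append(instruction_refinement(instr, refine, rounds))
    return dataset


# Toy agents so the sketch runs without any model or network access.
def toy_transform(seed: str) -> str:
    return "SUMMARY(" + seed.strip() + ")"


def toy_refine(prompt: str) -> str:
    return prompt + " Add one constraint."


dataset = run_pipeline(
    seeds=["doc a", "doc b"],
    taxonomy=["reading comprehension", "text editing"],
    transform=toy_transform,
    refine=toy_refine,
    rounds=2,
)
for item in dataset:
    print(item.skill, "|", item.prompt)
```

In a real system each toy agent would be replaced by a model call, possibly with access to tools such as a search API or code interpreter. The point of the sketch is only the shape of the data flow: one seed fans out into several instructions, and each instruction is refined independently.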

Orca-3, trained with the AgentInstruct dataset, significantly outperformed other instruction-tuned models built on the same base model. It also consistently showed better results than models such as LLAMA-8B-instruct and GPT-3.5-turbo. These benchmark results indicate the substantial advancements made possible by AgentInstruct in synthetic data generation.

In conclusion, AgentInstruct represents a breakthrough in generating synthetic data for AI training. By automating the creation of diverse, high-quality data, it addresses the critical issues of manual curation and data quality, leading to significant improvements in the performance and reliability of large language models. The gains observed in the Orca-3 model, such as the 40% improvement on AGIEval and the 54% improvement on GSM8K, underscore the effectiveness of this framework.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post Microsoft Research Introduces AgentInstruct: A Multi-Agent Workflow Framework for Enhancing Synthetic Data Quality and Diversity in AI Model Training appeared first on MarkTechPost.

