MarkTechPost@AI — September 19, 2024
Source2Synth: A New AI Technique for Synthetic Data Generation and Curation Grounded in Real Data Sources

Source2Synth is a new technique developed by researchers from Meta, the University of Oxford, and University College London, designed to enhance the capabilities of large language models (LLMs) by generating synthetic data grounded in real data sources. The technique targets the difficulties LLMs face on complex tasks, such as solving problems with tools, handling structured data, or performing complex multi-step reasoning. Source2Synth achieves this by generating synthetic data that simulates real scenarios and reasoning processes, avoiding dependence on expensive and time-consuming manual annotation.

🤔 Source2Synth uses real data sources (such as tables or related articles from the web) to create synthetic instances with intermediate reasoning steps. Because these instances are grounded in real data, the diversity, realism, and factual accuracy of the synthetic data are ensured. The method's core step is to create a seed topic (an entity or a factual statement) and then expand it into a complete example. Each example contains the task instructions, the reasoning steps needed to solve the problem, and the solution. In this way, Source2Synth generates complex, realistic data points that model how LLMs should handle structured data or carry out multi-step tasks.

🧐 Source2Synth applies a filtering strategy to improve dataset quality, since not all generated data points are equally valuable. To address this, Source2Synth decides whether to keep each synthetic instance based on its answerability: for example, if a generated instance fails to yield the correct answer within a set number of attempts, it is discarded. This quality-control procedure ensures that only high-quality examples (those that help the LLM learn the required skills) are retained for the final fine-tuning stage.

🚀 Source2Synth has been applied in two distinct and challenging domains: multi-hop question answering (MHQA) and tabular question answering (TQA). In multi-hop question answering, LLMs must analyze and synthesize data from multiple sources to answer a single question. When Source2Synth was evaluated on HotPotQA (a dataset built for multi-hop reasoning), it achieved a 22.57% improvement over baseline models fine-tuned with conventional methods. In tabular question answering, LLMs must interact with tables using SQL queries to answer questions about them. Tested on WikiSQL, a dataset focused on answering questions about tables with SQL, Source2Synth achieved a 25.51% improvement over the baseline.

🌟 The results show that Source2Synth can improve LLM performance on challenging tasks without relying on large manually annotated datasets. For domains requiring complex reasoning and tool use, Source2Synth offers a scalable approach by generating examples grounded in real-world sources and rigorously filtering the dataset to ensure high quality.

🌟 Source2Synth is a distinctive approach for teaching LLMs new skills, particularly where manual annotation is not feasible. By ensuring that only high-quality examples are used for fine-tuning and by grounding synthetic data generation in real-world sources for validation, this strategy addresses the current limitations of LLMs on complex tasks such as multi-step reasoning and structured data manipulation.

Large Language Models (LLMs) have demonstrated impressive performance on tasks like natural language processing, generation, and text synthesis. However, they still encounter major difficulties in more complicated settings: tasks that require using tools to solve problems, handling structured data, or carrying out complex multi-step reasoning. For instance, although LLMs are adept at comprehending unstructured text, they struggle to use and interpret structured data such as spreadsheets, tables, and databases. They also frequently perform poorly on tasks like multi-hop question answering (MHQA), which requires combining information from several sources, and on tool-use tasks such as answering questions about tables with SQL.

To overcome these issues, researchers from Meta, Oxford University, and University College London have introduced a new technique called Source2Synth. Its primary benefit is the capacity to impart new skills to LLMs without expensive and time-consuming human annotation. Conventional approaches to improving LLM performance often require extensive manual annotation, which is costly and hard to scale, particularly for complicated tasks. Source2Synth removes this requirement by creating synthetic data that imitates real situations and reasoning processes.

To create synthetic instances with intermediate reasoning steps, Source2Synth starts from a real data source, such as tables from the web or related articles. Because the examples are grounded in real data, the synthetic data is guaranteed to be diverse, realistic, and factually correct. The method's main step is creating a seed topic, which may be an entity or a factual statement, and then developing it into a complete example. Each example contains the task instructions, the reasoning steps needed to solve the problem, and the solution. Through this procedure, Source2Synth generates intricate, realistic data points that model how LLMs ought to handle structured data or carry out multi-step tasks.
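The seed-to-example pipeline described above can be sketched as follows. The helper names (`call_llm`, `SyntheticExample`, `build_example`) and the prompt wording are illustrative assumptions, not the paper's actual implementation; a real pipeline would replace the stub with calls to an actual LLM.

```python
from dataclasses import dataclass

@dataclass
class SyntheticExample:
    instruction: str           # the task statement given to the model
    reasoning_steps: list      # intermediate steps that solve the task
    answer: str                # final solution

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned response here."""
    return "stub response"

def build_example(source_text: str) -> SyntheticExample:
    # 1. Extract a seed topic (an entity or factual statement) from the source.
    seed = call_llm(f"Pick one entity or fact from: {source_text}")
    # 2. Expand the seed into a full task instruction grounded in the source.
    instruction = call_llm(f"Write a question about '{seed}' answerable from: {source_text}")
    # 3. Generate the intermediate reasoning steps and the final answer.
    steps = call_llm(f"List the reasoning steps to answer: {instruction}").split("\n")
    answer = call_llm(f"Answer '{instruction}' using the steps: {steps}")
    return SyntheticExample(instruction, steps, answer)

example = build_example("Marie Curie won Nobel Prizes in Physics and Chemistry.")
print(type(example).__name__)  # SyntheticExample
```

The three numbered steps mirror the seed-then-expand structure described in the paragraph above; each stage conditions on the real source text so the result stays grounded.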

The method Source2Synth uses to enhance dataset quality is an essential component. Low-quality examples can degrade model performance, and not all generated data points are equally valuable. To address this, Source2Synth applies filtering strategies based on how answerable the synthetic instances are: for example, an instance is discarded if it does not yield the correct answer within a certain number of attempts. This quality-control procedure ensures that only strong examples, those that help the LLM acquire the necessary skills, are kept for the final round of fine-tuning.
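A minimal sketch of that answerability filter: keep a synthetic example only if some attempt within a budget of `k` tries reproduces the gold answer. `attempt_answer` is a deterministic stand-in for querying the model (an assumption for illustration); a real system would sample an LLM here.

```python
def attempt_answer(question: str, attempt: int) -> str:
    """Stand-in for querying the generating model; a real system would sample an LLM."""
    if "France" in question and attempt >= 1:
        return "Paris"
    return "unknown"

def is_answerable(question: str, gold: str, k: int = 3) -> bool:
    """Keep a synthetic example only if some attempt yields the gold answer."""
    return any(attempt_answer(question, i) == gold for i in range(k))

examples = [("Capital of France?", "Paris"), ("Capital of Atlantis?", "Atlantis")]
kept = [(q, a) for q, a in examples if is_answerable(q, a)]
print(len(kept))  # 1
```

The unanswerable instance is dropped, matching the quality-control behavior described above: examples that never produce the right response within the attempt budget never reach fine-tuning.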

The technique has been applied in two distinct and demanding domains:

    Multi-Hop Question Answering (MHQA): To answer a single question, the LLM analyzes and synthesizes data from several sources. When Source2Synth was evaluated on HotPotQA, a dataset created for multi-hop reasoning, it outperformed baseline models fine-tuned with conventional techniques by 22.57%.
    Tabular Question Answering (TQA): Answering questions over structured data, which frequently requires SQL queries to interact with tables. Tested on WikiSQL, a dataset focused on using SQL to answer questions about tables, Source2Synth achieved a 25.51% improvement over baseline models.
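To make the TQA setting concrete, here is a small self-contained example of the kind of tool use being trained: a natural-language question about a table answered by executing a SQL query. The table and query are invented for illustration, not drawn from WikiSQL, and in TQA the model itself would generate the SQL from the question.

```python
import sqlite3

# Build a tiny in-memory table resembling the structured data TQA targets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("Tokyo", "Japan", 13_960_000), ("Paris", "France", 2_140_000),
     ("Osaka", "Japan", 2_750_000)],
)

# Question: "How many listed cities are in Japan?"
# Here the SQL is hand-written; a TQA model would produce it from the question.
query = "SELECT COUNT(*) FROM cities WHERE country = 'Japan'"
(count,) = conn.execute(query).fetchone()
print(count)  # 2
```

Synthetic TQA examples pair a question like this with the SQL that answers it, so fine-tuning teaches the model to translate questions into executable queries.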

The results demonstrate that Source2Synth can increase LLM performance on challenging tasks without requiring large amounts of manually annotated data. For training LLMs in domains that demand sophisticated reasoning and tool use, Source2Synth offers a scalable method: it produces grounded, realistic examples and rigorously filters the dataset to ensure high quality.

In conclusion, Source2Synth is a distinctive method for imparting new knowledge to LLMs, particularly in situations where human annotation is not feasible. By guaranteeing that only high-quality examples are used for fine-tuning and by rooting synthetic data generation in real-world sources for validation, this strategy addresses the current limitations of LLMs on complicated tasks like multi-step reasoning and structured data manipulation.


Check out the Paper. All credit for this research goes to the researchers of this project.


