MarkTechPost@AI · August 17, 2024
Scaling LLM Outputs: The Role of AgentWrite and the LongWriter-6k Dataset

Current LLMs struggle to generate text longer than 2,000 words. This article introduces AgentWrite, a framework that uses off-the-shelf LLMs to produce longer, coherent outputs. The researchers also built LongWriter-6k, a dataset of 6,000 supervised fine-tuning data points with output lengths ranging from 2,000 to 32,000 words. Their results show that AgentWrite effectively extends LLMs' output capacity and yields high-quality ultra-long text.

🤔 **The AgentWrite framework**: To overcome existing language models' limits on long-form generation, the researchers propose AgentWrite. By decomposing an ultra-long generation task into subtasks, it enables off-the-shelf LLMs to produce coherent outputs of more than 20,000 words.

📚 **The LongWriter-6k dataset**: To improve training, the researchers built LongWriter-6k, which contains 6,000 supervised fine-tuning data points with output lengths from 2,000 to 32,000 words, covering diverse writing instructions that each specify a required output length.

📈 **The LongBench-Write benchmark**: The researchers also created LongBench-Write to evaluate AgentWrite's effectiveness. The benchmark contains diverse user writing instructions with specified output lengths. Using an LLM as the judge, they score output quality along several dimensions, including relevance, accuracy, coherence, and reading experience.

🚀 **Experimental results**: AgentWrite extended GPT-4o's output length from 2,000 to roughly 20,000 words. Models trained on the LongWriter-6k dataset achieved higher quality scores on the LongBench-Write benchmark, especially on tasks requiring 2,000 to 4,000-word outputs.

💡 **Future directions**: The researchers suggest expanding the AgentWrite framework, improving data quality, and tackling inference-efficiency challenges. They stress that existing long-context LLMs have untapped potential for larger output windows, which can be unlocked through strategic training on long-output data.

🧠 **Summary**: The paper addresses current LLMs' limits on long-form generation by proposing the AgentWrite framework. By incorporating long-output data (the LongWriter-6k dataset) during model alignment, the resulting LongWriter models generate high-quality outputs exceeding 10,000 words. Extensive experiments and ablation studies confirm the approach's effectiveness, marking a significant advance in ultra-long text generation.


Long-context LLMs require sufficient context windows for complex tasks, akin to human working memory. Research focuses on extending context length, enabling better handling of longer content. Zero-shot methods and fine-tuning enhance memory capacity. Despite advancements in input length (up to 100,000 words), existing LLMs have a 2,000-word output limit, highlighting a capability gap. Alignment training helps LLMs prioritize instructions and adhere to length constraints.

The underexplored area of aligning LLMs for ultra-long outputs represents a critical research gap. Previous work establishes the foundation for understanding long-context LLMs’ limitations and potential, setting the stage for advancements in ultra-long output generation. These studies have laid the groundwork for enhancing LLMs’ capabilities in generating extensive and coherent text, addressing the discrepancy between input and output lengths.

Current long-context LLMs process inputs of up to 100,000 tokens but struggle to generate outputs beyond 2,000 words, limiting applications that require extensive text generation. Analysis reveals a consistent failure to produce longer outputs across state-of-the-art models. User interaction logs indicate that over 1% of prompts request outputs exceeding 2,000 words, highlighting demand for models capable of generating longer texts. This limitation stems from the lack of long-output examples in existing supervised fine-tuning datasets.

To address this, AgentWrite, an agent-based pipeline, decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to produce coherent outputs exceeding 20,000 words. The authors construct LongWriter-6k, a dataset with 6,000 SFT data points ranging from 2,000 to 32,000 words. Their 9B parameter model, enhanced through DPO, achieves state-of-the-art performance on a new benchmark for ultra-long generation capabilities, demonstrating the potential of existing long-context LLMs with appropriate training data.
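The divide-and-conquer pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_llm` is a placeholder for whatever off-the-shelf chat model is used (the paper works with models such as GPT-4o), and the fixed five-section plan stands in for the outline the planner model would actually produce.

```python
# Sketch of AgentWrite's two-stage pipeline: a planner drafts a
# section-by-section outline with target word counts, then a writer
# generates each section in turn, conditioned on the instruction, the
# plan, and the text written so far.

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real chat-completion call (e.g. GPT-4o)."""
    return f"[model output for prompt of {len(prompt)} chars]"

def plan(instruction: str, n_sections: int = 5,
         words_each: int = 4_000) -> list[dict]:
    """Stage I: ask the model for an outline. The parsed plan is faked
    here with a fixed five-section structure for illustration."""
    _ = call_llm(f"Break this writing task into {n_sections} sections "
                 f"with word targets:\n{instruction}")
    return [{"title": f"Section {i + 1}", "target_words": words_each}
            for i in range(n_sections)]

def write_long(instruction: str) -> str:
    """Stage II: generate sections sequentially, feeding back the prior
    text so each new section stays coherent with what came before."""
    outline = plan(instruction)
    written: list[str] = []
    for step in outline:
        context = "\n\n".join(written)
        section = call_llm(
            f"Instruction: {instruction}\n"
            f"Plan item: {step['title']} (~{step['target_words']} words)\n"
            f"Text so far:\n{context}\n"
            f"Write this section only."
        )
        written.append(section)
    return "\n\n".join(written)
```

Because each writer call sees the accumulated draft, the per-call output stays within the model's comfortable generation length while the concatenated result can run to tens of thousands of words.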

The paper introduces the AgentWrite framework, designed to address the limitations of existing language models in producing lengthy text. This framework utilizes off-the-shelf LLMs to generate extended and coherent outputs. To enhance model training, the authors developed the LongWriter-6k dataset, consisting of 6,000 supervised fine-tuning data points with output lengths ranging from 2,000 to 32,000 words. They also created the LongBench-Write benchmark to evaluate the effectiveness of their approach, including diverse user writing instructions with specified output lengths.
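The shape of a single LongWriter-6k data point might look like the following. The field names here are hypothetical, not the paper's actual schema; the essential structure is a writing instruction that specifies a required length paired with a long reference output in the 2,000-to-32,000-word range.

```python
# Hypothetical SFT record pairing a length-specifying instruction with a
# long reference output (field names are illustrative, not the paper's).
record = {
    "instruction": "Write a 10,000-word travel guide to Kyoto.",
    "required_length": 10_000,  # words requested by the instruction
    "output": "Kyoto, the former imperial capital... (10k-word text)",
}

def in_range(rec: dict, lo: int = 2_000, hi: int = 32_000) -> bool:
    """Keep only records whose requested length falls in the
    LongWriter-6k range of 2,000 to 32,000 words."""
    return lo <= rec["required_length"] <= hi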

The methodology employs an LLM-as-a-judge approach, using GPT-4o to evaluate output quality across multiple dimensions such as relevance, accuracy, coherence, and reading experience. The process involves training existing models on the new LongWriter-6k dataset and using preference data to improve adherence to long writing instructions. By combining this data generation technique, a comprehensive evaluation benchmark, and these training strategies, the authors aim to significantly improve the long-output capabilities of LLMs, enabling the generation of high-quality content exceeding 10,000 words.
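The judging step can be sketched as below. This is an illustrative stub, assuming the judge returns a 1-10 rating per dimension; the paper's actual prompt and aggregation may differ, and a real implementation would call GPT-4o and parse its numeric ratings rather than return fixed values.

```python
# Minimal sketch of LLM-as-a-judge scoring across quality dimensions.
DIMENSIONS = ["relevance", "accuracy", "coherence", "reading experience"]

def judge(output_text: str) -> dict[str, int]:
    """Placeholder judge: a real version would prompt GPT-4o with a
    rubric for each dimension and parse the 1-10 rating it returns."""
    return {d: 8 for d in DIMENSIONS}  # stubbed ratings

def quality_score(ratings: dict[str, int]) -> float:
    """Average the per-dimension ratings and rescale to 0-100."""
    return sum(ratings.values()) / len(ratings) * 10
```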

The AgentWrite framework successfully extended the GPT-4o model’s output length from 2,000 to approximately 20,000 words, demonstrating its effectiveness on ultra-long generation tasks. Evaluation on the LongBench-Write benchmark showed a 5% increase in overall quality scores for the model trained with the LongWriter-6k dataset, particularly on tasks requiring 2,000 to 4,000-word outputs. The largest gain was in the “Breadth and Depth” dimension, an 18% absolute improvement over the ablated model.

Ablation studies revealed that while including a writing plan before content generation didn’t significantly enhance performance, training with the LongWriter-6k dataset was crucial for achieving longer outputs without compromising quality. The LongWriter-9B model outperformed the GLM-4-9B model on the LongBench-Write benchmark, highlighting the effectiveness of the proposed methodology in enhancing existing long-context LLMs. Overall, the experiments confirmed significant improvements in both output length and quality, demonstrating the potential of the LongWriter framework for ultra-long text generation tasks.

In conclusion, this paper addresses a significant limitation in current LLMs by proposing the AgentWrite framework to extend output capacity beyond the typical 2,000-word cap. The LongWriter models, trained on the LongWriter-6k data produced with this framework, successfully generate high-quality outputs exceeding 10,000 words by incorporating long-output data during model alignment. Extensive experiments and ablation studies demonstrate the effectiveness of this approach. The authors suggest future directions for expanding the framework, refining data quality, and addressing inference-efficiency challenges. They emphasize that existing long-context LLMs have untapped potential for larger output windows, which can be unlocked through strategic training with long-output data. This research marks a significant advancement in ultra-long text generation, providing a foundation for further developments in the field.

