少点错误 2024年09月30日
Building Safer AI from the Ground Up: Steering Model Behavior via Pre-Training Data Curation
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了一种新颖的数据整理方法,利用现有的大型语言模型(LLM)根据用户定义的原则来评估、过滤和修改预训练数据集。该方法假设通过在训练前移除或修改不良内容,我们可以生成天生不太可能表现出不良行为的基础模型。与传统的在预训练后应用的对齐技术不同,这种方法旨在完全防止有害能力的获取。该方法易于实施,适用于开源和专有模型。文章概述了测试这一假设的明确研究议程,并邀请合作以评估其在增强高级AI系统安全性方面的有效性。

🤔 **利用LLM进行预训练数据整理**:本文提出的方法旨在通过使用现有的LLM来整理预训练数据,从而增强LLM的安全性。核心思想是在训练新模型之前,根据用户定义的对齐原则来评估、过滤和修改预训练数据集。通过移除或修改表现出有害行为(如欺骗或寻求权力)的内容,目标是创造出天生更安全的模型,因为它们从未学习过这些有害行为。

💪 **预防胜过缓解**:与现有的主要关注模型级别对齐或使用基于启发式的过滤来移除毒性的方法不同,这种方法新颖之处在于它能够在预训练阶段使用用户定义的对齐原则来整理数据。这种主动方法有助于防止模型学习有害行为,从而解决了当前研究中的一个关键差距。

🚀 **潜在影响**:本文提出的方法旨在通过在AI模型训练的基础阶段(即预训练数据整理期间)整合对齐原则,从而创造出天生更安全的AI系统。通过主动塑造模型的训练数据,我们影响它所获取的行为和知识,从而降低有害行为的可能性。

📊 **研究议程**:为了验证这一假设,本文详细阐述了未来的实验指南,并邀请研究人员和从业人员合作测试这些假设。

🔍 **与现有方法的比较**:与现有的模型级别对齐方法以及数据整理方法相比,本文提出的方法具有以下优势:它从一开始就解决有害行为,防止模型学习不良模式;它创造出天生更能抵抗操纵性提示和对抗攻击的模型;它允许通过用户定义的原则来针对广泛的不良行为。

🌟 **对高级AI安全性的贡献**:通过在AI模型训练的基础阶段整合对齐原则,这种方法直接促进了高级AI系统的安全性。它通过以下方式解决AI安全中的核心问题:减少模型发展危险行为的风险,这些行为可能导致灾难性后果;通过限制模型在训练期间接触有害模式,使其不太容易出现失去控制的情况;从一开始就将模型的行为与人类价值观和道德考虑因素相一致。

📚 **背景和动机**:大型语言模型(LLM)的快速发展带来了巨大的益处,但也引发了对其安全性以及与人类价值观一致性的重大担忧。随着这些模型变得越来越强大,确保它们不表现出有害行为或产生不良输出是一个紧迫的挑战,尤其是在防止灾难性风险(如失去控制或意外有害行为)的背景下。

⚠️ **现有模型级别对齐技术的局限性**:传统的对齐方法通常侧重于训练后的干预措施:微调:使用额外的训练数据调整模型的参数,以鼓励期望的行为;从人类反馈中进行强化学习(RLHF):通过提供对其输出的反馈来训练模型遵循人类偏好;指令调优:通过在特定任务数据集上进行监督学习来教会模型遵循指令。虽然这些技术可以提高模型对期望行为的遵守程度,但它们可能无法完全消除在预训练期间学习的有害能力。当受到像越狱这样的对抗性攻击时,模型仍然可能表现出不良行为,因为有害知识潜伏在模型中。

🚧 **现有数据整理方法的局限性**:当前的预训练数据整理方法旨在提高模型的能力并减少毒性,但在解决高级AI安全问题时存在局限性:文档质量检查:目的:通过过滤掉低质量的内容来提高模型性能。方法:使用启发式方法,如符号与单词比率和n-gram重复检查。局限性:侧重于一般内容质量,而不是有害行为;不防止对模型的对抗性利用。

Published on September 29, 2024 6:48 PM GMT

Building Safer AI from the Ground Up: Steering Model Behavior via Pre-Training Data Curation

I completed this blog post as my final project for BlueDot Impact's AI Alignment course. While it's nowhere near as polished as I'd like it to be given the short-term nature of the course, I'd love to hear what others think of this idea, and better yet, find someone who can test its hypotheses experimentally based on the research agenda it proposes.

Abstract

How can we steer large language models (LLMs) to be more aligned from the very beginning? This post explores a novel data curation method that leverages existing LLMs to assess, filter, and revise pre-training datasets based on user-defined principles. I hypothesize that by removing or modifying undesirable content before training, we can produce base models that are inherently less capable of undesirable behaviors. Unlike traditional alignment techniques applied after pre-training, this approach aims to prevent the acquisition of harmful capabilities altogether. The method is straightforward to implement and applicable to both open-source and proprietary models. I outline a clear research agenda to test this hypothesis and invite collaboration to evaluate its effectiveness in enhancing the safety of advanced AI systems.

Summary

This post proposes a new method to enhance the safety of LLMs by curating their pre-training data using existing LLMs. The core idea is to assess, filter, and revise pre-training datasets based on user-defined alignment principles before training new models. By removing or changing content that shows harmful behaviors, like deception or power-seeking, the goal is to create models that are inherently safer because they never learn these harmful behaviors.

While existing methods typically focus on fine-tuning after models are trained or use heuristic-based filtering to remove toxicity, this approach is novel in its ability to curate data at the pre-training phase using user-defined alignment principles. This proactive method helps prevent models from ever learning harmful behaviors, addressing a key gap in current research.

Research Question: Can pre-training data curation using LLMs reduce harmful behaviors in base models more effectively than existing post-training alignment techniques?

Hypothesis: By removing or modifying content that exhibits undesirable behaviors before training, we can produce base models that are inherently less capable of harmful outputs.

Though time constraints prevented training new base models to validate this method, I’ve provided detailed guidelines for future experimentation. I invite researchers and practitioners to collaborate on testing these hypotheses, and I'm eager to engage with any experimental results.

Projected Impact

My theory of change is grounded in the belief that integrating alignment principles at the foundational level of AI model training—specifically during pre-training data curation—can lead to inherently safer AI systems. By proactively shaping the model's training data, we influence the behaviors and knowledge it acquires, reducing the likelihood of harmful behaviors.

Short-Term Impact:

Long-Term Impact:

By focusing on safety during the initial training phase, we add an extra layer of protection. This complements methods like fine-tuning, making AI systems safer overall.

Method Overview:

The data curation method consists of the following steps, with a simple example in this GitHub repository:

This method is flexible and can be adapted to target a range of behaviors, such as promoting transparency to counter deception or encouraging cooperation over power-seeking.

Advantages Over Existing Methods:

There are several existing methods for alignment at the model level as well as several existing data curation methods, summarized in Table 1. However, the proposed method has the following advantages:

TechniqueStage of ApplicationGoal/Target        Advantages        Limitations
Toxicity FiltrationPre-trainingRemove toxic content from training dataReduces the likelihood of toxic outputs in generated text.Often removes valuable data, lacks flexibility to target behaviors beyond toxicity.
Document Quality ChecksPre-trainingFilter out low-quality content (e.g., based on symbols, n-grams)Improves general model performance by filtering out poor data.Focuses on content quality, not harmful behaviors; doesn't prevent adversarial exploitation of models.
DeduplicationPre-trainingRemove duplicate dataReduces overfitting and training inefficiency.Does not address the presence of harmful or deceptive behaviors.
Fine-Tuning (FT)Post-trainingImprove model alignment with human preferencesEffective at improving specific behaviors after training; well-established technique.Does not prevent harmful behaviors from being learned during pre-training; requires significant resources.
Reinforcement Learning from Human Feedback (RLHF)Post-trainingAlign model behavior with human values through feedbackCan significantly improve model alignment when feedback is comprehensive.Time-consuming and costly; limited to behaviors the feedback system can detect.
Constitutional AIPost-trainingUse predefined ethical principles to guide model behaviorIntroduces explicit ethical constraints on model behavior; can address a broad range of undesirable actions.Applied during fine-tuning, so it doesn’t prevent harmful capabilities learned during pre-training.
Proposed Method (Pre-training Data Curation with LLMs)Pre-trainingRemove or revise harmful content before trainingAddresses harmful behaviors from the start, preventing the model from learning undesirable patterns.Requires significant inference resources and careful balancing between content revision and removal.

Table 1: Comparison of Safety/Alignment Techniques Across Pre-Training and Post-Training Phases

Despite the existence of considerable existing research on data curation and AI alignment, there is a notable gap concerning methods that use LLMs to curate pre-training data specifically for general alignment purposes. Existing data curation techniques often focus on improving model capabilities or reducing toxicity through heuristic-based filtering, which lacks flexibility and may not address a broader range of harmful behaviors.

Relation to Advanced AI Safety

By integrating alignment principles at the foundational level of AI model training, this method directly contributes to the safety of advanced AI systems. It addresses core concerns in AI safety by:

Background and Motivation

The rapid advancement of large language models (LLMs) has brought significant benefits but also raised critical concerns about their safety and alignment with human values. As these models become more capable, ensuring they do not exhibit harmful behaviors or produce undesirable outputs is a pressing challenge, especially in the context of preventing catastrophic risks such as loss of control or unintended harmful actions.

Existing Model-Level Alignment Techniques

Traditional alignment methods often focus on post-training interventions:

While these techniques can improve a model's adherence to desired behaviors, they may not fully eliminate harmful capabilities learned during pre-training. Models might still exhibit undesirable behaviors when subjected to adversarial attacks like jailbreaking, as the harmful knowledge remains latent within the model.

Limitations of Existing Data Curation Methods

Current pre-training data curation methods aim to improve model capabilities and reduce toxicity but have limitations when addressing advanced AI safety concerns:

    Document Quality Checks:
      Purpose: Enhance model performance by filtering out low-quality content.Methods: Use heuristics like symbol-to-word ratios and n-gram repetition checks.Limitation: Focuses on general content quality rather than specific undesirable behaviors.
    Deduplication:
      Purpose: Improve training efficiency by removing duplicate data.Methods: Identify duplicates through content similarity measures.Limitation: Does not address the presence of harmful or undesirable behaviors in unique content.
    Toxicity Filtration:
      Purpose: Reduce toxic outputs by removing toxic examples.Methods: Apply heuristics (e.g., bad-word lists) or classifiers to detect toxicity.Limitation: Often lacks flexibility to target a broader range of harmful behaviors and may result in loss of valuable data due to outright removal.

Limitations:

See Appendix 2 for a more detailed overview of the data curation literature.

How The Proposed Method Differs

Key Differences:

    Scope of Targeted Content:
      Existing Methods: Typically focus on general quality or toxicity.This Method: Allows for arbitrary, user-defined principles, enabling targeted steering away from specific undesirable behaviors.
    Content Revision vs. Removal:
      Existing Methods: Often involve outright removal of undesirable data.This Method: Can revise content to align with desired principles, preserving valuable information and maintaining dataset richness.

Gap in the Literature

Despite extensive research in data curation and AI alignment, there is a notable gap concerning methods that (1) aim to steer LLMs away from dangerous patterns of behavior during the pre-training phase (toxicity aside), or (2) use LLMs to curate pre-training data specifically for alignment purposes. Existing approaches do not offer a scalable and flexible way to steer base models away from specific undesirable behaviors during the pre-training phase.

To enhance the safety of advanced AI systems and address core concerns in AI safety, it is crucial to develop methods that:

Methodology

I propose using existing language models to improve the training data before we even train a new model. By assessing, filtering, and revising the data based on specific guidelines, we can guide the new model to be safer from the start.

Method Overview

The method consists of three main steps:

    Assessment
      Objective: Evaluate each data point in the pre-training dataset to determine if it exhibits undesirable behaviors or properties as defined by the user.Implementation:
        Use an existing LLM to analyze the content.Generate a probability score (e.g., between 0 and 100) indicating the likelihood of the content being undesirable.Customize the assessment using different prompts, few-shot examples, or multi-step reasoning.
    Filtering (Optional)
      Objective: Exclude data that strongly exhibits undesirable characteristics from the training dataset.Implementation:
        Define a FILTER_THRESHOLD for the probability score.Data points exceeding this threshold are removed.Filtering is optional and adjustable based on the desired balance between safety and data diversity.
    Revision
      Objective: Modify remaining data to better align with desired principles without losing valuable information.Implementation:
        Define a REVISE_THRESHOLD for the probability score.Data points exceeding this threshold but below the filter threshold are revised using an LLM.The LLM rewrites the content to remove or alter undesirable aspects while preserving the overall meaning.Tailor revision strategies using different prompts or multi-stage processes.

Implementation Details

Example Use Case

Customization and Flexibility

Advantages of the Method

Relation to Existing Work

This method is inspired by Anthropic's work on Constitutional AI, which uses LLMs to generate fine-tuning data based on a set of principles. Our approach differs by applying these principles at the pre-training data level, aiming to prevent the acquisition of harmful capabilities rather than mitigating them after they have been learned.

Implementation Proof-of-Concept

An open-source GitHub repository demonstrates a simple implementation of this method:

Drawbacks and Limitations

Below are answers to some common questions and concerns about the proposed data curation method. This section aims to address potential doubts and provide further clarity on how this approach contributes to AI safety and alignment.

1. Does this method actually work?

Initial testing confirms that the method can successfully assess, filter, and revise data based on user-defined guidelines. However, comprehensive experiments involving the training of base models are necessary to evaluate its actual effectiveness on the resulting models. The proposed research agenda outlines steps for conducting these experiments. By collaborating and sharing results, we can collectively determine the method's impact on AI safety and model performance.

2. Can this method be applied to other undesirable behaviors or properties beyond deception?

Yes, one of the strengths of this method is its flexibility. It is designed to be generic enough to steer language models toward or away from any user-defined behaviors or properties. Whether you're concerned with reducing biases, eliminating toxic language, or promoting specific values, you can customize the assessment and revision steps to address those specific areas. This adaptability makes the method broadly applicable to various alignment challenges within AI safety.

3. Can this method be misused to train models with harmful values?

While any technology has the potential for misuse, this method is intended to enhance safety by removing harmful capabilities, not to promote undesirable behaviors. Additionally, existing techniques like fine-tuning already steer models toward specific behaviors. By focusing on alignment from the pre-training phase, our method strengthens the model's resistance to adversarial manipulation. Ethical considerations and oversight are essential to ensure the technology is used responsibly.

4. Won't modifying the training data in general reduce the model's capabilities?
 

Modifying training data may lead to some loss in capabilities, similar to how reinforcement learning for human preferences can reduce certain abilities. The key is to balance the trade-off between safety and performance. Empirical testing is necessary to measure the extent of capability loss versus safety gains. In many applications (such as reinforcement learning), the benefits of increased safety and alignment may outweigh the drawbacks of reduced performance in specific areas.

5. Might modifying targeted portions of the training data result in the model not having specific capabilities?

In short, yes - the model may lose specific, desirable abilities related to the filtered or revised content. For instance, if the data curation process removes all instances of deceptive language, the model might become less effective at detecting deception in text. This could hinder applications where understanding or identifying such language is important, leading to potentially negative consequences in performance. Similar to the case for general capability reduction, empirical work is necessary to measure capability loss versus safety gains.

6. Is retraining base models practical every time new values or behaviors are desired?
 

Retraining base models is resource-intensive and may not be feasible for every change in desired behaviors. This method is intended for the development of new base models, supplementing other techniques like fine-tuning for existing models. While it doesn't solve all alignment challenges, it offers a proactive step towards safer AI systems by embedding desired behaviors from the ground up.

7. Is the cost of running inference on all training data prohibitive?

There is an added cost, but it represents a small fraction of the total training expense for large models. As detailed in Appendix 3, the cost of running inference for data curation is approximately 4.3% of the total training cost for a model like GPT-3. If the method significantly improves safety and robustness, the investment is justified. Additionally, initial implementations can be conducted on smaller scales to validate effectiveness before scaling up.

8. Your example repository's deception assessment and revision prompts seem inadequate. Can they be improved or replaced with other techniques?

Absolutely. The prompts provided in the proof-of-concept are intentionally minimalistic to illustrate the overall process of data curation via LLMs. They serve as a starting point rather than a definitive solution. I encourage others to develop more sophisticated assessment strategies and revision prompts tailored to specific undesirable behaviors or properties. Techniques such as multi-step prompting, integrating other LLMs, or combining LLM assessments with classifiers and heuristics can enhance effectiveness. Collaboration and iterative improvement are key to refining this method and maximizing its potential impact on AI safety.

Research Agenda

To empirically validate and refine the proposed method of pre-training data curation using LLMs, I outline a comprehensive research agenda with both short-term and long-term steps. This agenda is designed to test my hypothesis, enhance the methodology, and contribute significantly to the safety of advanced AI systems.

Short-Term Steps

    Literature Review and Expert Consultation
      Objective: Gather insights from existing work and practitioners to inform and refine our approach.Actions:
        Conduct a thorough review of academic papers, technical reports, and industry publications to identify any prior attempts at pre-training data curation for alignment purposes.Reach out to researchers and engineers who have experience training foundation models to seek feedback on the feasibility and potential challenges of the proposed method.Compile findings to understand why similar approaches may not have been widely adopted, identifying any technical or practical barriers.
    Define Specific Undesirable Behaviors
      Objective: Identify and prioritize behaviors that pose significant risks in advanced AI systems.Actions:
        Develop a list of instrumentally convergent and dangerous behaviors (e.g., deception, manipulation, power-seeking, self-preservation).Use AI safety literature to inform the selection, focusing on behaviors that could lead to catastrophic outcomes.Create clear definitions and examples for each behavior to guide the assessment process.
    Develop Evaluation Benchmarks
      Objective: Establish accurate and practical methods to evaluate whether models possess or avoid the targeted behaviors.Actions:
        Design specific tasks or prompts that can elicit the undesirable behaviors in models if they exist.Utilize existing benchmarks where applicable (e.g., TruthfulQA for assessing truthfulness, MACHIAVELLI for evaluating manipulative tendencies).Ensure the evaluation can be performed on smaller models to facilitate rapid iteration.
    Create Assessment and Revision Strategies
      Objective: Formulate effective methods for identifying and modifying undesirable content in the pre-training dataset.Actions:
        Develop assessment prompts using existing LLMs to detect the presence of undesirable behaviors in training data.Design revision prompts that can transform undesirable content into neutral or positively aligned content without significant loss of information.Test and refine these prompts on sample data to ensure reliability and scalability.
    Conduct Small-Scale Experiments
      Objective: Empirically test the effectiveness of the data curation method on a manageable scale.Actions:
        Select a small but representative dataset (e.g., a subset of OpenWebText).Apply the assessment and revision strategies to create a curated version of the dataset.Choose a feasible small model architecture for experimentation (e.g., GPT-2 small with 124 million parameters).Train two models from scratch:
          Model A: Trained on the uncurated dataset.Model B: Trained on the curated dataset.
        Evaluate both models using the benchmarks established in Step 3 to assess the presence of undesirable behaviors.Measure general language capabilities using standard benchmarks (e.g., perplexity on validation sets, performance on tasks like LAMBADA or the Winograd Schema Challenge).
    Analyze and Share Results
      Objective: Determine the impact of the data curation method and contribute findings to the AI safety community.Actions:
        Compare the performance of Model A and Model B in terms of both alignment and capabilities.Analyze any trade-offs between reducing undesirable behaviors and potential capability losses.Document the methodology, results, and insights gained.Share the findings through a detailed report or publication, including code and data where permissible, to encourage transparency and collaborative improvement.

Long-Term Steps

    Scaling Up Experiments
      Objective: Validate the method's effectiveness on larger models and more extensive datasets.Actions:
        Apply the data curation method to larger language models (e.g., GPT-2 medium or large).Use more comprehensive datasets (e.g., the full OpenWebText or subsets of the Pile).Evaluate models using advanced benchmarks relevant to AI safety and capabilities, such as:
          MMLU (Massive Multitask Language Understanding): To assess broad knowledge and reasoning abilities.HELLASWAG and ARC: For commonsense reasoning and problem-solving.
        Monitor the scalability of assessment and revision strategies, optimizing them for efficiency.
    Refinement of Techniques
      Objective: Enhance the effectiveness and efficiency of assessment and revision methods.Actions:
        Incorporate advanced prompting techniques, such as chain-of-thought or few-shot learning, to improve the accuracy of assessments.Experiment with automated or semi-automated tools to streamline the data curation process.Explore integrating multiple LLMs or specialized models for different types of undesirable content.
    Integration with Post-Training Alignment Methods
      Objective: Combine pre-training data curation with existing alignment techniques to maximize safety.Actions:
        Fine-tune models trained on curated data using reinforcement learning from human feedback (RLHF).Compare the safety and capability metrics of models that have undergone both pre-training curation and post-training alignment versus those with only one or neither.Investigate whether pre-training curation reduces the reliance on extensive fine-tuning.
    Theoretical Analysis and Formal Modeling
      Objective: Develop a theoretical understanding of why pre-training data curation affects model behaviors.Actions:
        Study the relationships between training data distributions and emergent model behaviors.Develop formal models or simulations to predict the impact of data curation on alignment and capabilities.Publish theoretical findings to contribute to the foundational understanding of AI alignment.
    Establishing Best Practices and Guidelines
      Objective: Create actionable guidelines for AI practitioners to implement pre-training data curation.Actions:
        Summarize lessons learned from experiments and theoretical work.Develop standardized protocols for assessment and revision that can be adopted in industry.Advocate for the inclusion of pre-training data curation in AI development standards and ethics guidelines.
    Community Building and Collaboration
      Objective: Foster a collaborative ecosystem to advance this area of research.Actions:
        Organize workshops, seminars, or conferences focused on pre-training data curation and AI safety.Create open-source repositories and platforms for sharing tools, datasets, and results.Engage with policymakers and stakeholders to raise awareness of the importance of pre-training alignment efforts.

Conclusion

Pre-training data curation using LLMs offers a novel and promising pathway for enhancing the safety and alignment of advanced AI systems. By proactively steering models away from undesirable behaviors at the foundational level, I aim to prevent the acquisition of harmful capabilities that could lead to catastrophic outcomes. This approach addresses core concerns in AI safety by making models inherently less prone to adversarial manipulation and unintended harmful actions.

This method fills a significant gap in existing work, which has primarily focused on post-training alignment techniques like fine-tuning and reinforcement learning from human feedback. By shifting the focus to the pre-training phase, I provide a complementary strategy that strengthens the overall robustness and reliability of AI systems. This proactive intervention could reduce the risk of scenarios such as loss of control or the emergence of misaligned objectives in highly capable models.

While theoretical considerations suggest that pre-training data curation can substantially enhance AI safety, empirical validation is essential. I have outlined a clear research agenda involving the training of base models on curated datasets and the evaluation of their behaviors compared to models trained on uncurated data. This agenda provides concrete next steps for the research community to assess the effectiveness and potential trade-offs of this method.

I invite researchers, practitioners, and stakeholders in the AI safety community to explore and expand upon this technique. By collaborating and sharing our findings, we can collectively advance our understanding and contribute to the development of AI systems that are not only highly capable but also aligned with human values and safe in their operation.

Acknowledgments

I would like to thank the staff and facilitators of BlueDot Impact's AI alignment course and the community members who provided valuable feedback and guidance. Their support was instrumental in shaping this project. In particular, I would like to thank Oliver De Candido, Aaron Scher, and Adam Jones for the feedback they provided in the ideation stages of this project.

Appendices

Appendix 1: Self-Reflection

I went into BlueDot's alignment course hoping to accomplish three things:
 

    Understand the State of AI Alignment: I gained foundational knowledge but realized the vastness of the field and the depth required to specialize.
     Assess Personal Fit for Alignment Work: I found the work intellectually stimulating but recognized the challenges of making significant contributions with limited time.
     Contribute to Reducing Existential Risk: While my impact may be small, I believe that even incremental advancements in safety techniques are valuable.

Despite these eye-openings, I truly believe that data curation with LLMs is a promising initiative for steering LLMs away from catastrophic behaviors. While it provides nowhere near a complete solution, it has the potential to reduce x-risk by a non-zero amount and its success can be tested with straightforward experiments.  It warrants a proper research investigation (not just a ~20H course project!) and I hope somebody can dedicate the time to it that it deserves.

Appendix 2: Novelty Investigation

Warning: This appendix is largely unorganized and unedited.

When I thought about this method, my first impression was that this has probably already been done. Several other people have told me this has probably already been done, however I haven't been able to find any publicly available research literature of somebody doing this, after hours of searching AlignmentForum, LessWrong, and Google. Perhaps it's being published and I just can't find it, or perhaps it's being done internally at AI companies, but either way, all of the evidence I've gathered first-hand indicates that nobody has ever published a paper, let alone a blog post on this approach or anything similar to it with an alignment focus.

Adjacent Work

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (May 2023) - This paper measures the effects of different pre-training data filtration strategies, specifically along the dimensions of "quality" and "toxicity". For each dimension, existing filtration strategies are grouped into two buckets: "heuristics" and "classifiers". The "Perspective API" seems to be a common classifier multiple papers use to assess toxicity, and the heuristics are generally the same across papers: filter documents with bad words in theme, assess n-grams/symbol and punctuation usage, etc.

Most interesting finding from this paper:

"Filtering using a toxicity classifier, we find a trade-off: models trained from heavily filtered pretraining datasets have the least toxic generation but also the worst toxicity identification (Figure 5, left). Similarly, Figure 7 shows the performance of QA tasks unrelated to toxicity are hurt by toxicity filtering, though this may be due to the overall decrease in training data. Ultimately, the intended behaviour of the model should inform the filtering strategy, rather than one size fits all. Most interesting of all, the strongest performance on toxicity identification for every dataset comes from the inverse toxicity filter. Practitioners optimizing for performance on toxic domains should intentionally apply inverse filters."

So in order for an LLM to accurately assess the harmfulness of content in domain X, it may need to be trained on a sufficiently large amount of harmful content within the same domain. This could imply that a base model trained on data curated to omit knowledge of deception would likely be a poor judge of deception.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling (Dec 2020) - This EleutherAI paper performs a handful of data filtration techniques on a wide variety of sources. These techniques are mostly similar to what's done for Falcon LLM, and are discussed further down in this document, so I won't discuss them here - they're mostly content-based heuristics like blank lines, symbol-to-character ratios, or n-gram ratios, and are orthogonal to harmfulness.

The main difference between this paper and the FalcolnLLM paper's filtration techniques is that from their Reddit training data (OpenWebText2), they:

When using GitHub data as a training source, they also do something similar (filter repos with less than 100 stars, etc).

In short, their filtration work is removing low-quality data to improve capabilities, not related to removing harmful data to improve alignment.

Robustifying Safety-Aligned Large Language Models through Clean Data Curation - This paper outlines an approach to curate training data by measuring the perplexity the model being trained exhibits to each labeled <Query, Response> pair, and using the model to repeatedly rewrite responses with high perplexity until they exhibit low perplexity. They use perplexity as they find it correlates with harmfulness, as harmless samples are far more common than harmful ones, resulting in harmfulness being more "surprising" to the model.

Note that though the paper specifies that their approach is used during "pre-training" of the LLM, a closer reading indicates that by "pre-training", they mean during any supervised learning performed by the model's creator after the base model has been created, not during the creation of the base model itself. For example, they say "​​In domains requiring expert knowledge, such as clinical decision-making [7, 48], LLMs are pre-trained with crowdsourced data to enhance their proficiency in complex tasks." Although there are definitely analogies to my approach, this approach targets a distinct phase of the training process and uses properties intrinsic to the model being trained to filter data (which would be much less tractable during the true pre-training phase where the model being trained is much less complete, hence the need for an external model).

One useful thing this paper has is easy-to-implement measurements for harmfulness of resulting trained model, however unless I actually end up with time and funding to train entire base models on the data from my method, these won't be necessary.

Deep Forgetting & Unlearning for Safely-Scoped LLMs  (Dec 2023) - This blog post outlines a research agenda for removing unwanted/harmful capabilities in LLMs. It explicitly calls out "curating training data" as a potential technique to prevent LLMs from ever developing harmful capabilities. However, it does not mention anyone who has explicitly tried filtering pre-training data for alignment purposes, despite linking to papers on adjacent topics (eg, that filtering pre-training data for images has been critical for image models). Moreover, the author mentions that this is "likely a very underrated safety technique". He also doesn't mention anything vaguely similar to the specific filtration technique I'm proposing, nor does he mention any other techniques.  I'm inferring that if the author knew about any direct work on this question, he would link it. Based on this absence and his statements of its underrated-ness, I'm interpreting this content as moderate evidence in favor of this being an unattempted problem.

Automated Data Curation for Robust Language Model Fine-Tuning (Mar 2024) - This paper applies an analogous approach for filtering out/rewriting data during the fine-tuning stage - they have an LLM make evaluations for each data point. This is somewhat analogous, but is during fine-tuning, not pre-training.

RefinedWeb Dataset for Falcon LLM (June 2023) - This DeepMind paper explicitly set out with the goal of producing a more capable model with less data via filtering + deduplication. They:

This results in 2x ~50% reductions in content from filtering and deduplication (~50% from non-English, ~50% from the subsequent filtering). Looking at their model and other filtered models, they find that filtering often, but not always, improves model performance, and deduplication always improves model performance.

That said, their filtering work is more about filtering non-useful text (eg: symbols, navigation links, repeated words, etc) - it doesn't even touch on filtration based on the semantic content of the assessed content, and is exclusively focused on capabilities as opposed to alignment.

Textbooks Are All You Need (June 2023): This Microsoft paper investigates using a combination of heavily filtered data and synthetically generated data for training an LLM. I researched it mostly as a side investigation for its follow-up paper, covered below.

My main takeaway was the section where they explain how they filter code samples:

The model in this (previous) paper only focuses on coding, so doesn't demonstrate filtering based on any other criterion. Still, it's a plausible demonstration of using a model to help with pre-training data filtration, albeit applied to capabilities rather than alignment.

Textbooks Are All You Need II: phi-1.5 technical report (Sept 2023) -  This Microsoft paper follows up from the previous one, and aims to train a powerful small model from primarily synthetically generated data, with the hope that generating their own training data will both improve the overall quality and reduce the occurrence of toxic content. They compare this against a model made from a subset of the Falcon LLM dataset (the first paper I read into).

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (April 2024) - Jumping ahead to Microsoft's most recent paper on Phi, they explicitly confirm that for pre-training data filtration, they now use both:

In general, their stated (and implied) research direction with this set of papers seems to be "how can we optimize and minimize pre-training data to create small models that are exceptionally capable for their size". They seem to be doing quite a good job at this, with their 3.8B parameter model outperforming Mixtral's 45B parameter model! But they aren't applying data filtration for alignment at all - their model's safety primarily comes from post-training (typical fine-tuning and DPO), as well pre-training on a significant amount of internally generated (and presumably light on harmfulness) synthetic data.

It also seems like a trend from their research is that their approach can't scale well to larger models because of its heavy reliance on synthetic data for pre-training, which I believe bottoms out in utility for training value after some amount, but I don't have a direct citation for this so might be wrong.

Pretraining Language Models with Human Preferences  (Feb 2023) - This paper investigates how to reduce various harmful model behaviors (such as "toxicity") at the pre-training stage by experimenting with different pre-training objectives.

That is, whereas typically language models are pre-trained to maximize the "maximum likelihood estimation" (MLE) (put simply, which word is most likely to come next, based on its pre-training data), instead explore other objectives that can be used during pre-training to steer the model towards less harmful behavior.

This approach, while potentially valuable for reducing harmfulness, is entirely orthogonal to operations such as filtration or augmentation of the pre-training data.

Internal independent review for language model agent alignment (Sept 2023) - This paper briefly mentions the general idea of data curation with a single sentence: "Curating the training data to exclude particularly unaligned text should also be a modest improvement to base alignment tendencies".

That's it, no other text related to data curation in this blog post. I link this blog post here because at this point in my research, despite how little this post touches on the topic, this is the most directly connected writing I can find. In other words, I'm fairly convinced that this is a novel contribution.

---

(1) "To remove low-quality data while minimizing potential for bias, we apply a number of simple, easily understood heuristic filters: we remove any document that does not contain between 50 and 100,000 words, or whose mean word length is outside the range of 3 to 10 characters; we remove any document with a symbol-to-word ratio greater than 0.1 for either the hash symbol or the ellipsis; and we remove any document with more than 90% of lines starting with a bullet point, or more than 30% ending with an ellipsis. We also require that 80% of words in a document contain at least one alphabetic character, and apply a "stop word" filter, to remove documents that do not contain at least two of the following English words: the, be, to, of, and, that, have, with; this adequately deals with ostensibly English documents that contain no coherent English text."

Appendix 3: Cost Estimation

A common concern with the proposed method is the additional cost associated with running inference over the entire pre-training dataset to assess and revise content. While this process is indeed resource-intensive, it constitutes a relatively small fraction of the total cost of training a large-scale language model.

To illustrate this, consider the example of GPT-3:

Assuming we process the larger dataset (500 billion tokens) and use a commercially available model for the assessment and revision steps (instead of an open-source model), we can estimate the inference cost as follows:

Even given the somewhat unreasonable assumption of paying for a commercial model for all of these inferences,, the inference cost would be roughly 4.3% of the total training cost.

While $187,500 is a non-trivial amount, it is relatively modest compared to the overall investment in training a state-of-the-art language model. If this expenditure leads to significant improvements in model safety, alignment, and robustness against adversarial attacks, it can be considered a worthwhile investment.

Moreover, initial implementation of this method can be conducted at smaller scales to demonstrate effectiveness before committing to full-scale application. This approach allows for cost-effective experimentation and validation.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI 安全 预训练数据 大型语言模型 模型对齐 数据整理
相关文章