MarkTechPost@AI · May 14, 12:00
Rethinking Toxic Data in LLM Pretraining: A Co-Design Approach for Improved Steerability and Detoxification

The study re-evaluates the question of data quality in LLM pretraining and explores a co-design approach that integrates pretraining with post-training. It finds that although pretraining on toxic data raises the base model's toxicity, it also strengthens the model's internal representation of toxicity, making toxicity easier to suppress during post-training. Using Olmo-1B models trained on different mixtures of clean and toxic data, the study shows that toxicity becomes more linearly separable and easier to control. Experimental results indicate that better detoxification is achievable without sacrificing general performance, suggesting that including toxic data can yield more controllable and robust language models.

🧪 To study the effect of toxic data on LLM pretraining, the researchers trained a series of Olmo-1B models with increasing proportions of toxic content (from 0% to 25%) while keeping the clean data constant. They found that moderate inclusion of toxic data improves general language capability (measured by MMLU) and toxicity detection (measured by ToxiGen).

🧠 Probing experiments show that models trained on toxic data form stronger, more separable internal representations of toxicity. Statistical analysis and token-level visualizations further confirm that these models identify toxic content more accurately, indicating that exposure to toxic examples enhances concept learning without significantly harming general performance.

🛡️ The study examines whether exposure to toxic data during pretraining improves a model's ability to be detoxified through post-training methods. Using inference-time intervention (ITI), prompting, supervised finetuning (SFT), and DPO, the researchers find that models trained with up to 10% toxic data (e.g., from 4chan) show better alignability.

🎯 These models respond better to detoxification techniques, achieving lower toxicity with minimal performance loss. Moreover, when tested against adversarial red-teaming attacks, models pretrained on toxic data and steered with ITI show greater robustness, suggesting that such exposure strengthens the model's internal representation of harmful content.

In the pretraining of LLMs, the quality of training data is crucial in determining model performance. A common strategy involves filtering toxic content out of the training corpus to minimize harmful outputs. While this approach aligns with the principle that neural networks reflect their training data, it introduces a tradeoff. Removing toxic content can reduce the diversity and richness of the data, potentially weakening the model’s ability to understand or identify toxicity and degrading performance in downstream tasks like question answering. This creates a dilemma: retaining too much toxic data increases harmful outputs, while excessive filtering restricts the model’s overall capabilities. However, with the growing emphasis on post-training interventions, fewer models are deployed directly after pretraining, suggesting that the balance between data quality and quantity may be managed more effectively in later stages.
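
The tradeoff above is easiest to see in the filtering step itself. Below is a minimal, hypothetical sketch of threshold-based corpus filtering (not the paper's pipeline): `score_toxicity` stands in for any document-level toxicity classifier, and the threshold directly trades harmful content against data diversity.

```python
# Hypothetical sketch: threshold-based toxicity filtering of a pretraining corpus.
# `score_toxicity` is a placeholder for any document-level toxicity classifier;
# the threshold controls the tradeoff between removing harmful text and
# discarding useful, diverse data.
from typing import Callable, Iterable, List


def filter_corpus(
    docs: Iterable[str],
    score_toxicity: Callable[[str], float],  # returns a score in [0, 1]
    threshold: float = 0.5,
) -> List[str]:
    """Keep only documents whose toxicity score falls below the threshold."""
    return [doc for doc in docs if score_toxicity(doc) < threshold]


# A stricter threshold removes more toxic text but shrinks and homogenizes the
# corpus; a looser one preserves diversity at the cost of keeping more toxic
# content in pretraining.
```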

Approaches to detoxifying LLMs typically fall into two categories: finetuning-based and decoding-based. Finetuning methods, such as reinforcement learning from human feedback (RLHF) and Direct Preference Optimization (DPO), align model behavior with human values or curated datasets. While effective, they often compromise the model’s original abilities and can be bypassed or undone through further training. Controlled generation techniques, on the other hand, adjust outputs during inference, using methods like vocabulary shifting, self-debiasing, or external expert models. These strategies may reduce toxicity but often incur high computational costs and impair language fluency. A newer line of work explores modifying internal representations, assuming linear structures in hidden states can be manipulated for specific behavioral outcomes.
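
As a concrete illustration of the decoding-based family, here is a minimal sketch of vocabulary shifting. The precomputed list of toxic token ids and the fixed penalty are illustrative assumptions rather than the setup of any specific method mentioned above.

```python
# Minimal sketch of decoding-time detoxification via vocabulary shifting:
# a fixed penalty is subtracted from the logits of tokens flagged as toxic
# before sampling. `toxic_token_ids` and `alpha` are illustrative assumptions.
import torch


def shift_logits(
    logits: torch.Tensor,           # (batch, vocab_size) next-token logits
    toxic_token_ids: torch.Tensor,  # (k,) long tensor of flagged token ids
    alpha: float = 5.0,             # penalty strength
) -> torch.Tensor:
    """Discourage (but do not forbid) flagged tokens at the next decoding step."""
    shifted = logits.clone()
    shifted[..., toxic_token_ids] -= alpha
    return shifted


# Applied at every generation step before softmax/sampling, a larger `alpha`
# lowers toxicity but, as noted above, tends to hurt fluency.
```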

Researchers from Harvard University re-evaluate data quality in LLM training by exploring a co-design approach that integrates pre- and post-training. They find that pretraining on toxic data, while increasing base model toxicity, enhances the model’s internal representation of toxicity, making it easier to suppress during post-training. Using Olmo-1B models trained on varied mixes of clean and toxic data, they show that toxicity becomes more linearly separable and easier to control. Experiments with prompting and inference-time intervention reveal improved detoxification without compromising general performance, suggesting that incorporating toxic data can lead to more controllable and robust language models. 

To study the effects of toxic data on LLM pretraining, researchers trained a series of Olmo-1B models with increasing proportions of toxic content (from 0% to 25%) while keeping clean data constant. They found that moderate toxic data inclusion improves general language capability (measured by MMLU) and toxicity detection (via ToxiGen). Probing experiments revealed that models trained with toxic data formed stronger, more separable internal representations of toxicity. Statistical analysis and token-level visualization further confirmed that such models identify toxic content more accurately, supporting the idea that exposure to toxic examples enhances concept learning without significantly harming general performance.
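
A simple way to run such a probing experiment is to fit a linear classifier on frozen hidden states. The sketch below assumes precomputed `hidden_states` and binary toxicity `labels`; the model, layer, and dataset choices are placeholders, not the authors' exact protocol.

```python
# Sketch of a linear probe for toxicity on frozen hidden states. The features
# (n_samples x hidden_dim) and binary labels are assumed to come from running
# a pretrained model over labeled toxic / non-toxic text.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_accuracy(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Higher held-out accuracy of a linear probe indicates a more linearly
    separable internal representation of toxicity."""
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)


# Comparing probe accuracy across models pretrained with 0%-25% toxic data
# quantifies how separable the toxicity concept is in each model.
```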

The study explores whether exposure to toxic data during pretraining can improve a model’s ability to be detoxified through post-training methods. Using Inference-Time Intervention (ITI), prompting, supervised finetuning (SFT), and DPO, the researchers find that models trained with up to 10% toxic data (e.g., 4chan) show improved alignability. These models respond better to detoxification techniques, achieving lower toxicity with minimal performance loss. Additionally, when tested against adversarial red-teaming attacks, models pretrained with toxic data and steered using ITI showed greater robustness, indicating that such exposure may enhance the model’s internal representation of harmful content.
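
For intuition, an ITI-style intervention can be sketched as shifting a layer's activations away from a learned toxicity direction during the forward pass. The layer index, direction vector, and strength below are illustrative assumptions; the published method intervenes on probe-selected attention heads rather than steering a whole layer.

```python
# Simplified ITI-style steering: subtract a scaled, learned toxicity direction
# from a transformer layer's hidden states at inference time. Layer choice,
# direction, and strength are illustrative, not the paper's exact procedure.
import torch


def make_steering_hook(direction: torch.Tensor, strength: float = 2.0):
    """Return a forward hook that shifts hidden states away from `direction`."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden - strength * unit.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook


# Hypothetical usage with a HuggingFace-style causal LM:
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(toxic_dir))
# ... generate as usual ...
# handle.remove()
```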

In conclusion, the study revisits the assumption that excluding toxic data during pretraining improves language model quality. Through theoretical and empirical analyses using Olmo-1B models, the authors show that increasing toxic data in pretraining leads to more disentangled representations of toxicity, making it easier to control during post-training. While base models trained on toxic data generate more harmful content initially, detoxification techniques like ITI are more effective on them. Results on benchmark datasets show a better balance between reducing toxicity and retaining general capabilities. The work suggests that some “bad” data can enhance model steerability and alignment. 


Check out the Paper. All credit for this research goes to the researchers of this project.


