MarkTechPost@AI · February 14
Meta AI Introduces CoCoMix: A Pretraining Framework Integrating Token Prediction with Continuous Concepts

Meta AI proposes CoCoMix, a pretraining framework that combines token prediction with continuous concept modeling. The method uses a sparse autoencoder (SAE) to extract high-level semantic representations from a pretrained model's hidden states and folds these concepts into the training process by interleaving them with token embeddings. This design preserves the benefits of token-level learning while strengthening the model's ability to recognize and work with broader conceptual structures. By enriching token learning with concept-level information, CoCoMix aims to improve reasoning efficiency and model interpretability. Experiments show that CoCoMix performs well across multiple benchmarks, with improved sample efficiency, stronger generalization, effective knowledge transfer, and greater interpretability.

💡 CoCoMix uses a sparse autoencoder (SAE) to extract latent semantic features from hidden states, capturing information that goes beyond individual tokens and thereby enabling concept extraction.

🎯 Not all extracted concepts are equally important: CoCoMix applies attribution methods to determine which concepts most influence predictions and retains only those.

🔗 CoCoMix compresses the selected concepts into a continuous vector and integrates it into the hidden states alongside token embeddings, letting the model use both token-level and concept-level information.

📈 Experimental results show that CoCoMix performs well on OpenWebText, LAMBADA, and other benchmarks, delivers consistent improvements on downstream tasks, and outperforms traditional knowledge distillation on knowledge transfer.

The dominant approach to pretraining large language models (LLMs) relies on next-token prediction, which has proven effective in capturing linguistic patterns. However, this method comes with notable limitations. Language tokens often convey surface-level information, requiring models to process vast amounts of data to develop deeper reasoning capabilities. Additionally, token-based learning struggles with capturing long-term dependencies, making tasks that require planning and abstraction more difficult. Researchers have explored alternative strategies, such as knowledge distillation and structured input augmentation, but these approaches have not fully addressed the limitations of token-based learning. This raises an important question: Can LLMs be trained in a way that combines token-level processing with conceptual understanding? Meta AI introduces Continuous Concept Mixing (CoCoMix) as a potential solution.

CoCoMix: A Different Approach to Pretraining

CoCoMix integrates token prediction with the modeling of continuous concepts derived from hidden states of a pretrained model. The method employs a Sparse Autoencoder (SAE) to extract high-level semantic representations, which are then incorporated into the training process by interleaving them with token embeddings. This design allows the model to maintain the benefits of token-based learning while enhancing its ability to recognize and process broader conceptual structures. By enriching the token-based paradigm with concept-level information, CoCoMix aims to improve reasoning efficiency and model interpretability.
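To make the concept-extraction step concrete, here is a minimal sketch of a TopK-style sparse autoencoder operating on transformer hidden states. The hyperparameters (hidden_dim, num_concepts, k) and the specific TopK sparsification are illustrative assumptions, not the paper's settings; in CoCoMix the SAE is pretrained on a reference model's hidden states rather than trained from scratch here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal TopK sparse autoencoder over hidden states (illustrative sizes)."""

    def __init__(self, hidden_dim: int = 768, num_concepts: int = 8192, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, num_concepts)
        self.decoder = nn.Linear(num_concepts, hidden_dim)
        self.k = k

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # Project hidden states into a large concept dictionary, then keep only
        # the top-k activations per position to enforce sparsity.
        acts = torch.relu(self.encoder(h))
        topk = torch.topk(acts, self.k, dim=-1)
        return torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)

    def forward(self, h: torch.Tensor):
        concepts = self.encode(h)        # (batch, seq, num_concepts), sparse
        recon = self.decoder(concepts)   # reconstruction of the hidden states
        return concepts, recon

# Usage: extract sparse concept activations from a batch of hidden states.
sae = SparseAutoencoder()
hidden_states = torch.randn(2, 16, 768)   # stand-in for a pretrained model's hidden states
concepts, recon = sae(hidden_states)
recon_loss = F.mse_loss(recon, hidden_states)
```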

Technical Details and Benefits

CoCoMix operates through three main components:

1. Concept Extraction via Sparse Autoencoders (SAEs): A pretrained SAE identifies latent semantic features from a model’s hidden states, capturing information that extends beyond individual tokens.
2. Concept Selection with Attribution Scoring: Not all extracted concepts contribute equally to predictions. CoCoMix employs attribution methods to determine which concepts are most influential and should be retained.
3. Interleaving Continuous Concepts with Token Representations: The selected concepts are compressed into a continuous vector and integrated into the hidden states alongside token embeddings, allowing the model to utilize both token-level and conceptual information (see the sketch after this list).
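The sketch below illustrates the second and third steps under simplifying assumptions. The module and method names (ConceptMixer, concept_head, compress, select_targets) and the binary concept-prediction loss are hypothetical stand-ins; the actual CoCoMix selection criterion and mixing scheme may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptMixer(nn.Module):
    """Sketch of attribution-based concept selection and concept/token interleaving."""

    def __init__(self, hidden_dim: int = 768, num_concepts: int = 8192, top_m: int = 64):
        super().__init__()
        self.concept_head = nn.Linear(hidden_dim, num_concepts)  # model predicts concepts
        self.compress = nn.Linear(num_concepts, hidden_dim)      # continuous concept vector
        self.top_m = top_m

    def select_targets(self, sae_concepts: torch.Tensor,
                       attribution_scores: torch.Tensor) -> torch.Tensor:
        # Keep only the top-m concepts by attribution score; low-attribution
        # concepts are masked out and never used as prediction targets.
        keep = torch.topk(attribution_scores, self.top_m).indices
        mask = torch.zeros_like(attribution_scores, dtype=torch.bool)
        mask[keep] = True
        return sae_concepts * mask

    def forward(self, h: torch.Tensor, target_concepts: torch.Tensor):
        # Predict the selected concepts from the hidden state (auxiliary loss),
        # compress the prediction into one continuous vector per position, and
        # interleave that vector with the token hidden states.
        pred = self.concept_head(h)                                  # (B, S, num_concepts)
        concept_loss = F.binary_cross_entropy_with_logits(
            pred, (target_concepts > 0).float())
        concept_vec = self.compress(torch.sigmoid(pred))             # (B, S, hidden_dim)
        mixed = torch.stack([h, concept_vec], dim=2).flatten(1, 2)   # (B, 2*S, hidden_dim)
        return mixed, concept_loss

# Usage with stand-in tensors for hidden states, SAE activations, and attributions.
mixer = ConceptMixer()
h = torch.randn(2, 16, 768)
sae_concepts = torch.relu(torch.randn(2, 16, 8192))
attribution = torch.rand(8192)
targets = mixer.select_targets(sae_concepts, attribution)
mixed_states, aux_loss = mixer(h, targets)   # mixed_states: (2, 32, 768)
```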

This approach improves sample efficiency, enabling models to achieve comparable performance with fewer training tokens. Additionally, CoCoMix enhances interpretability by making it possible to inspect and adjust the extracted concepts, offering a clearer view of how the model processes information.
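Because the concept dictionary is explicit, one way to exercise this interpretability is to inspect which concepts fire on an input and manually adjust an activation to steer the hidden state. The helper below is a minimal sketch that reuses the SparseAutoencoder from the earlier example; concept_id is an arbitrary illustrative index, not a documented concept.

```python
import torch

def steer_hidden_state(sae, h: torch.Tensor, concept_id: int,
                       value: float = 5.0) -> torch.Tensor:
    """Set one concept's activation and decode back into hidden-state space."""
    concepts = sae.encode(h)             # sparse concept activations (B, S, num_concepts)
    concepts[..., concept_id] = value    # manually amplify the chosen concept
    return sae.decoder(concepts)         # steered hidden state, fed onward through the model

# Usage (reusing `sae` and `hidden_states` from the SAE sketch above):
steered = steer_hidden_state(sae, hidden_states, concept_id=123)
```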

Performance and Evaluation

Meta AI evaluated CoCoMix across multiple benchmarks, including OpenWebText, LAMBADA, WikiText-103, HellaSwag, PIQA, SIQA, Arc-Easy, and WinoGrande. The findings indicate improved sample efficiency, stronger generalization, consistent gains on downstream tasks, and more effective knowledge transfer than conventional knowledge distillation.

Conclusion

CoCoMix presents an alternative approach to LLM pretraining by combining token prediction with concept-based reasoning. By incorporating structured representations extracted via SAEs, CoCoMix enhances efficiency and interpretability without disrupting the underlying next-token prediction framework. Experimental results suggest that this method provides a balanced way to improve language model training, particularly in areas requiring structured reasoning and transparent decision-making. Future research may focus on refining concept extraction methods and further integrating continuous representations into pretraining workflows.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
