MarkTechPost@AI, the day before yesterday, 13:58
How Much Do Language Models Really Memorize? Meta’s New Framework Defines Model Capacity at the Bit Level

 

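This article describes a new method, proposed by researchers from Meta, Google DeepMind, Cornell University, and NVIDIA, for estimating how much modern language models "memorize" specific datapoints and thereby measuring their capacity. The study splits memorization into two parts, unintended memorization and generalization, and computes total memorization to give an accurate estimate of model capacity, finding that GPT-family models have a capacity of about 3.6 bits per parameter. It also derives a set of scaling laws relating model capacity and dataset size, and examines memorization behavior across settings, to help understand and improve model evaluation, privacy, and interpretability.

🧠 The core of the study is separating memorization from generalization in language models, which conventional evaluation methods struggle to tell apart.

💡 The researchers developed a new method that decomposes memorization into two parts: unintended memorization (the information a model contains about a dataset) and generalization (information about the true data-generating process).

📊 The team trained hundreds of GPT-2-architecture models and found a storage capacity of about 3.5-3.6 bits per parameter, with capacity depending on training precision.

📈 As the training set grows, sample-level unintended memorization decreases, and there is a clear scaling relationship between model capacity and dataset size.

🔬 The results offer a new framework for understanding how Transformer models encode training data, which can help improve model evaluation, privacy, and interpretability.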

Introduction: The Challenge of Memorization in Language Models

Modern language models face increasing scrutiny regarding their memorization behavior. With models such as an 8-billion-parameter transformer trained on 15 trillion tokens, researchers question whether these models memorize their training data in a meaningful way. Common techniques, including data extraction and membership inference, fall short because they often cannot distinguish memorization from generalization.
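For a sense of scale, the arithmetic below combines the 8-billion-parameter, 15-trillion-token example with the roughly 3.6 bits-per-parameter capacity that this work reports for GPT-family models. It is a back-of-the-envelope sketch: applying the 3.6 figure to a model of this size is an assumption made here for illustration, and the function names are hypothetical.

```python
# Approximate capacity figure reported in this work for GPT-family models.
BITS_PER_PARAM = 3.6


def capacity_bits(num_params: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated raw storage capacity of a model, in bits."""
    return num_params * bits_per_param


def bits_per_training_token(num_params: float, num_tokens: float) -> float:
    """Capacity available per training token if it were spread evenly."""
    return capacity_bits(num_params) / num_tokens


if __name__ == "__main__":
    params, tokens = 8e9, 15e12  # the example from the text above
    total = capacity_bits(params)
    print(f"capacity: {total:.3g} bits ({total / 8 / 1e9:.2f} GB)")
    print(f"per token: {bits_per_training_token(params, tokens):.5f} bits")
```

Under these assumptions, the capacity available per training token is a small fraction of a bit, which is one way to see why verbatim memorization of the whole corpus is not possible at this scale.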

Limitations of Existing Approaches

Previous frameworks, such as extraction-based methods and differential privacy, operate at the dataset level and do not account for instance-specific memorization. Approaches that treat language modeling as compression, or that assess capacity through fact memorization (as in RNNs and quantized transformers), offer partial insight but lack scalability and precision, especially for deep transformer architectures.

A Novel Approach to Measuring Memorization

Researchers from FAIR at Meta, Google DeepMind, Cornell University, and NVIDIA propose a method for estimating how much a model “knows” about specific datapoints, and use it to measure the capacity of modern language models. They separate memorization into two components: unintended memorization, the information a model contains about a specific dataset, and generalization, the information it captures about the true data-generating process. Removing the generalization component from total memorization yields an estimate of model capacity; by this measure, GPT-family models have a capacity of approximately 3.6 bits per parameter. By training hundreds of transformer language models, the researchers also derived a series of scaling laws that relate model capacity and dataset size to membership inference.
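The article does not spell out how the two components are measured. One possible reading, assumed here purely for illustration, treats a model's likelihood as a compressor: the bits needed to encode a sample under a reference model (a stand-in for the true data-generating process) are compared with the bits needed under the trained model, and only the saving beyond the reference counts as unintended memorization. The function names and the clipping at zero are hypothetical choices for this sketch, not the authors' stated procedure.

```python
import math
from typing import Sequence


def code_length_bits(token_probs: Sequence[float]) -> float:
    """Bits needed to encode a sequence, given the probability a model
    assigned to each of its tokens."""
    return -sum(math.log2(p) for p in token_probs)


def unintended_memorization_bits(
    target_probs: Sequence[float],
    reference_probs: Sequence[float],
) -> float:
    """Bits the trained (target) model saves on this sample beyond what the
    reference model already explains. Clipped at zero (an assumption)."""
    saved = code_length_bits(reference_probs) - code_length_bits(target_probs)
    return max(0.0, saved)


if __name__ == "__main__":
    # Made-up per-token probabilities for a four-token sample.
    reference = [0.25, 0.25, 0.25, 0.25]  # 2 bits per token
    target = [0.5, 0.5, 0.5, 0.5]         # 1 bit per token
    print(unintended_memorization_bits(target, reference))
```

In this toy case the reference model needs 8 bits and the trained model needs 4, so 4 bits are attributed to the sample rather than to general knowledge of the data distribution.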

Experimental Framework and Training Methodology

Using the GPT-2 architecture, the team trained hundreds of models ranging from 100K to 20M parameters, with varying depths (1-8 layers) and hidden sizes (32-512). Training involved:

These models were trained on both synthetic sequences and deduplicated 64-token text sequences from the FineWeb dataset. The experiments ensured minimal interference from generalization through careful dataset construction.
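The article does not describe the preprocessing pipeline, so the following is only a minimal sketch of what producing deduplicated 64-token sequences could look like. It assumes a pre-tokenized stream of integer token IDs and exact-match deduplication; both the helper names and the exact-match criterion are assumptions, not the authors' procedure.

```python
from typing import Iterable, List, Sequence, Tuple

SEQ_LEN = 64  # sequence length stated in the article


def chunk_tokens(
    token_ids: Sequence[int], seq_len: int = SEQ_LEN
) -> List[Tuple[int, ...]]:
    """Split a token stream into non-overlapping chunks of seq_len tokens,
    dropping any shorter tail."""
    n_full = len(token_ids) // seq_len
    return [tuple(token_ids[i * seq_len:(i + 1) * seq_len]) for i in range(n_full)]


def deduplicate(chunks: Iterable[Tuple[int, ...]]) -> List[Tuple[int, ...]]:
    """Keep the first occurrence of each exact chunk, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique


if __name__ == "__main__":
    # Toy stream: one chunk repeated twice, one distinct chunk, a short tail.
    stream = list(range(64)) * 2 + list(range(100, 164)) + [1, 2, 3]
    chunks = chunk_tokens(stream)
    print(len(chunks), len(deduplicate(chunks)))
```

Deduplication matters for this kind of measurement because repeated sequences would otherwise be cheaper to store than their count suggests.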

Model Capacity Insights and Key Findings

Disentangling Memorization and Generalization

Switching from synthetic to real-text datasets, the team observed:

Membership Inference Scaling Laws

The researchers modeled the success rate (F1 score) of loss-based membership inference as a function of the ratio between model capacity and dataset size. Key observations:
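To make the measured quantity concrete, a loss-based membership inference attack can be sketched as follows: a sample is guessed to be a training member when the model's loss on it falls below a threshold, and the guesses are scored with F1. The fixed threshold, the function names, and the toy losses are assumptions for illustration; the article does not say how the researchers set the threshold.

```python
from typing import List, Sequence


def predict_members(losses: Sequence[float], threshold: float) -> List[bool]:
    """Guess 'member' when the model's loss on a sample is below the threshold."""
    return [loss < threshold for loss in losses]


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 of the 'member' class: harmonic mean of precision and recall."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up per-sample losses and ground-truth membership labels.
    losses = [0.5, 0.9, 2.1, 2.5, 1.0, 3.0]
    is_member = [True, True, False, False, False, True]
    guesses = predict_members(losses, threshold=1.5)
    print(f"F1 = {f1_score(guesses, is_member):.3f}")
```

Intuitively, when capacity is large relative to the dataset, members tend to have distinctly lower loss and such an attack separates them more easily.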

Conclusion: A Better Understanding of Model Behavior

This work establishes a principled framework for measuring memorization in language models. By introducing quantifiable metrics and scalable experiments, it deepens our understanding of how transformer models encode training data and draws a clear boundary between memorization and generalization. The resulting insights can guide future developments in model evaluation, privacy, and interpretability.


The post How Much Do Language Models Really Memorize? Meta’s New Framework Defines Model Capacity at the Bit Level appeared first on MarkTechPost.
