Distillation of Meta's Large Concept Models Paper

Meta has released a new type of language model called "Large Concept Models" (LCM), which aims to improve language modelling by making it more hierarchical. The LCM splits text into sentences, or "concepts", and uses a semantic embedding model such as SONAR to embed each sentence into a "concept vector". Because the model passes in and outputs semantic vectors, it may handle long contexts better. This post describes the LCM architectures, including the Base LCM, Quant LCM, and Diffusion LCM, and gives an initial look at their performance. The results show that the two Diffusion LCM variants outperform the other LCM approaches, but still fall short of a SmaLLAMA model of the same size.

💡 The core idea of the LCM is to break text into sentences, or "concepts", and use a semantic embedding model such as SONAR to turn each sentence into a "concept vector", so that the model operates directly on semantic information.

🧩 The LCM tries several architectures, including the Base LCM, Quant LCM, and Diffusion LCM. The Diffusion LCM iterates through a "denoising" process over many steps to arrive at a cleaner prediction, and outperforms the other approaches.

📊 Benchmarks show that on long-form text generation the two Diffusion LCM variants (One-Tower and Two-Tower) stand out, but there is still room for improvement compared with models such as SmaLLAMA.

🏢 The Two-Tower architecture consists of a 5-layer encoder and a 13-layer decoder: the encoder turns the semantic vectors into hidden states, and the decoder processes randomly initialised states via cross-attention to produce a new sentence embedding vector. Because it is less compute intensive, Meta chose the Two-Tower architecture for further study.

Published on March 4, 2025 5:33 PM GMT

Note: I had this as a draft for a while. I think it is accurate, but there may be errors. I am not in any way affiliated with the authors of the paper. 

Below I briefly discuss the "Large Concept Models" paper released by Meta, which tries to change part of the paradigm of language modelling. It has some limitations that normal language models don't have, but I spent the time to read the paper in relative depth, so I am here to provide a brief summary of it.

"Large Concept Models" (LCM) Paper

Large Concept Models aim to be a way to "improve language modelling" by "being more hierarchical". I think the easiest way to explain it is to compare it to a normal decoder-only language model.

A normal LLM works by passing in and getting out single tokens. The LCM instead passes in and gets out semantic vectors. The key model that makes this possible is SONAR, which is a text auto-encoder.
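
As a minimal sketch of the difference, the generation loop looks roughly like this (my own illustration; `sonar_encode`, `sonar_decode`, and `lcm_predict_next` are hypothetical stand-ins, not the real SONAR or LCM APIs):

```python
# Minimal sketch of the LCM inference loop (my own illustration, not Meta's code).
import numpy as np

EMBED_DIM = 1024  # SONAR sentence embeddings are fixed-size vectors


def sonar_encode(sentence: str) -> np.ndarray:
    """Placeholder: map a sentence to a single 'concept' vector."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(EMBED_DIM)


def sonar_decode(concept: np.ndarray) -> str:
    """Placeholder: map a concept vector back to text."""
    return "<decoded sentence>"


def lcm_predict_next(context: list[np.ndarray]) -> np.ndarray:
    """Placeholder for the LCM: given previous concept vectors, predict the next one."""
    return np.mean(context, axis=0)  # stand-in for a real transformer forward pass


# A normal LLM loops token by token; the LCM loops sentence by sentence.
document = ["The cat sat on the mat.", "It was warm in the sun."]
context = [sonar_encode(s) for s in document]   # text -> concept vectors
next_concept = lcm_predict_next(context)        # autoregression in embedding space
print(sonar_decode(next_concept))               # concept vector -> new sentence
```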

The main benefit I can see is that it is likely much better at long contexts, since it consumes one position per sentence rather than one per token. Otherwise, it comes with some disadvantages.

ARCHITECTURES

The key difficulty: how to output new sentence embedding vectors, since the output space is continuous. They try a few approaches: a baseline regression model (Base-LCM), quantised variants (Quant-LCM),[1] and diffusion-based models.

They find that the diffusion models work significantly better. They had two variations, "One-Tower" and "Two-Tower", that seemed to work equally well.
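
Conceptually, instead of directly regressing the next concept vector, the diffusion LCM starts from noise and refines it over many steps, conditioned on the preceding concepts. A rough sketch (my own illustration with a placeholder denoiser and an arbitrary step count; the paper uses a trained denoising network and a proper noise schedule):

```python
# Rough sketch of how a diffusion LCM produces the next concept vector.
import numpy as np

EMBED_DIM = 1024
NUM_STEPS = 40  # number of denoising iterations (illustrative; the paper tunes this)


def denoiser(noisy_vec, context, t):
    """Placeholder for the trained denoising network: predicts a cleaner vector
    given the noisy guess, the preceding concept vectors, and the timestep t."""
    target = np.mean(context, axis=0)      # stand-in for the model's learned prediction
    return noisy_vec + 0.1 * (target - noisy_vec)


def predict_next_concept(context: list[np.ndarray]) -> np.ndarray:
    x = np.random.standard_normal(EMBED_DIM)   # start from pure noise
    for t in reversed(range(NUM_STEPS)):       # iteratively "de-noise" toward a clean embedding
        x = denoiser(x, context, t)
    return x
```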

These two models performed very similarly, so they decided to focus on the "Two Towers" architecture as it's less compute intensive.[3]

In the rest of the paper, they try various methods for optimising hyperparameters and such, scale up the model, and compare it against normal LLMs.

 

BENCHMARKS

Disappointingly, they do not benchmark the model on any "normal benchmarks" like MMLU or anything similar. They state: "As longform text generation is the main challenge for LCM, our benchmarking is mainly focused on generative tasks". I will just provide two representative benchmark results from the paper.

First, they compare the different instruction fine-tuned variants of the Large Concept Model. For the 1.6B-size models, we see that the two diffusion models significantly outperform the other LCM methods. However, we also see that the "SmaLLAMA" model of the same size performs better still.

| Metric | Coherence ↑ | R-L ↑ |
| --- | --- | --- |
| What it measures | How naturally text flows and connects (0-1 scale) | Overlap between generated & reference text |
| BASE-LCM | 0.482 | 23.69 |
| QUANT-LCM | 0.847 | 30.87 |
| ONE-TOWER | 0.968 | 33.40 |
| TWO-TOWER | 0.938 | 33.64 |
| SMALLAMA | 0.984 | 34.88 |

 

They scale up the model to 8B to compare against other models. 

Here is a representative benchmark that they used, which compares summary quality from the LCM against some similarly sized models.

LCFO 10% - Summarise text in 10% of original length

| Model | Word Ratio | R-L (↑) | OVL-3 (↑) | REP-4 (↓) | CoLA (↑) | SH-4 (↑) | SH-5 (↑) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GEMMA-7B-IT | 0.150 | 29.25 | 0.164 | 6.427 | 0.667 | 0.377 | 0.194 |
| MISTRAL-7B-V0.3-IT | 0.549 | 25.00 | 0.537 | 6.289 | 0.848 | 0.660 | 0.306 |
| LLAMA-3.1-8B-IT | 0.128 | 42.85 | 0.243 | 3.804 | 0.907 | 0.486 | 0.310 |
| TWO-TOWER-7B-IT | 0.089 | 29.38 | 0.202 | 3.00 | 0.791 | 0.623 | 0.183 |

This table shows performance on the LCFO.10% task (long-document summarisation, where the output should be about 10% of the input length). I don't intuitively understand most of the metrics that well, but roughly: Word Ratio is the output length relative to the source document, R-L is ROUGE-L overlap with a reference summary, OVL-3 is 3-gram overlap with the source, REP-4 measures repeated 4-grams (lower is better), CoLA is a model-based fluency score, and SH-4/SH-5 are (I believe) model-based SEAHORSE quality scores.
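
As a rough illustration of the n-gram style metrics, here is approximately what OVL-3 and REP-4 capture (my own approximation, not the paper's exact implementation):

```python
# Approximate versions of the n-gram metrics: OVL-3 as 3-gram overlap with the source,
# REP-4 as the fraction of repeated 4-grams in the output (lower is better).

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ovl_3(summary: str, source: str) -> float:
    summ = ngrams(summary.split(), 3)
    src = set(ngrams(source.split(), 3))
    return sum(g in src for g in summ) / max(len(summ), 1)


def rep_4(summary: str) -> float:
    grams = ngrams(summary.split(), 4)
    return 1 - len(set(grams)) / max(len(grams), 1)  # higher = more repetition
```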

The model seems to be OK but maybe not spectacular.

 

CONCLUSION

Overall, the LCM seems like an interesting model in some ways, and perhaps has the benefit of consuming context much more slowly than other models (one concept vector per sentence rather than one position per token), but at the moment it doesn't seem like much of an improvement over other models. It also loses some of the properties you get from tokenization that make training models easy.

 

  1. ^

    Note that they actually have two methods for LCM-QUANT, but as they decide not to pursue either approach, I won't go into much detail here. You can see the original paper for details.

  2. ^

    Note that the diffusion decoder actually has self-attention + cross-attention + MLP blocks, but they do not let each position attend to any positions other than itself, so the self-attention is effectively pointless. They state:

    The self-attention layers in the denoiser do only attend to the current position i.e., we do not attend to the preceding noised context. The self-attention layers were kept for consistency with a standard Transformer block and for the possible extension of denoising multiple vectors at once.
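
    A minimal sketch of what such a block might look like (my own PyTorch approximation, not Meta's code; layer norms and other details omitted):

```python
# Sketch of a denoiser block: self-attention restricted to the current position,
# cross-attention to the encoder (contextualizer) outputs, then an MLP.
import torch
import torch.nn as nn


class DenoiserBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Mask everything except the diagonal, so each position only attends to itself.
        L = x.size(1)
        mask = torch.full((L, L), float("-inf")).masked_fill(torch.eye(L, dtype=torch.bool), 0.0)
        x = x + self.self_attn(x, x, x, attn_mask=mask)[0]
        x = x + self.cross_attn(x, context, context)[0]  # attend to the preceding concepts
        return x + self.mlp(x)
```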

  3. ^

    It is less compute intensive since, for the one-tower model, you need to pass through all 32 layers for every one of the N diffusion steps, but for the two-tower model you only need to pass through the encoder once and through the decoder for the N diffusion steps.
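
    As a back-of-the-envelope illustration of that arithmetic (my own numbers: the diffusion step count is arbitrary, and the encoder/decoder split uses the layer counts mentioned in the summary at the top of this post):

```python
# Layer passes needed to generate one concept vector, under illustrative assumptions.
N = 40                                   # diffusion / denoising steps (example value)
one_tower_layers = 32                    # every step goes through the full 32-layer stack
encoder_layers, decoder_layers = 5, 13   # example two-tower split

one_tower_cost = N * one_tower_layers                 # 40 * 32 = 1280 layer passes
two_tower_cost = encoder_layers + N * decoder_layers  # 5 + 40 * 13 = 525 layer passes
print(one_tower_cost, two_tower_cost)
```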


