MarkTechPost@AI · August 1, 17:11
Falcon LLM Team Releases Falcon-H1 Technical Report: A Hybrid Attention–SSM Model That Rivals 70B LLMs

The Falcon-H1 series of large language models, developed by the Technology Innovation Institute (TII), combines Transformer attention with Mamba state space models (SSMs) in a hybrid parallel architecture, delivering strong performance, memory efficiency, and scalability. The family is released in multiple parameter scales and variants, and its parameter efficiency surpasses models such as Qwen2.5-72B and LLaMA3.3-70B. Falcon-H1 adopts a novel parallel hybrid design in which the attention and SSM modules operate simultaneously, and key choices around channel allocation, block configuration, and the RoPE base frequency markedly improve training and inference efficiency. The tokenizer strategy, pretraining corpus, data strategy, and training infrastructure were likewise carefully designed to support long-context processing and multilingual capability, and the models perform strongly across benchmarks.

🌟 Falcon-H1 uses an innovative parallel hybrid architecture that runs Transformer attention and Mamba state space models (SSMs) side by side, breaking with the sequential designs of earlier hybrid models. Attention and SSM modules work simultaneously, their outputs are concatenated and then projected, and the number of attention and SSM channels can be tuned independently; the default allocation uses a 2:1:5 ratio (SSM : attention : MLP), balancing efficiency and learning dynamics.

💡 On model optimization, Falcon-H1 reports extensive ablations. Adding attention channels hurts performance, while rebalancing the SSM and MLP channel counts brings clear gains. Among block configurations, SA_M (semi-parallel: attention and SSM running together, followed by the MLP) achieves the best training loss and compute efficiency. An unusually high RoPE base frequency (10^11) improves generalization in long-context training, and at a fixed parameter budget deeper models beat wider ones; for example, Falcon-H1-1.5B-Deep (66 layers) already rivals many 3B and 7B models.
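To make the RoPE choice concrete, the sketch below compares per-channel rotation angles at a far-away position under the standard base of 10^4 and the report's 10^11. The head dimension of 128 and the position used are illustrative assumptions, not Falcon-H1 hyperparameters.

```python
import numpy as np

def rope_angles(position, head_dim=128, base=10_000.0):
    """Rotation angles (radians) applied to each even/odd channel pair
    of a query/key vector at a given token position under RoPE."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return position * inv_freq

pos = 100_000  # a token deep inside a long context
standard  = rope_angles(pos, base=1e4)   # usual RoPE base
falcon_h1 = rope_angles(pos, base=1e11)  # unusually high base from the report

# With a 1e11 base the low-frequency channels rotate far more slowly, so
# distant positions remain distinguishable without the angles wrapping around
# many times -- the intuition behind better long-context generalization.
print(f"slowest channel, base 1e4 : {standard[-1]:.4f} rad")
print(f"slowest channel, base 1e11: {falcon_h1[-1]:.6f} rad")
```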

🌐 Falcon-H1's tokenizer strategy and pretraining corpus were also carefully engineered. Its customized BPE tokenizer suite supports vocabulary sizes from 32K to 261K, improves the handling of digits and punctuation, and injects LaTeX tokens to boost performance on code and math. The pretraining corpus reaches 18T tokens, spanning high-quality web data, Common Crawl in 18 languages, Wikipedia, and dedicated code and math datasets, augmented with synthetic data and long-context techniques such as Fill-in-the-Middle, which underpins strong multilingual and long-text capabilities.

🚀 Falcon-H1 delivers breakthrough performance per parameter. The 34B-Instruct version rivals or surpasses 70B-scale models such as Qwen2.5-72B and LLaMA3.3-70B on reasoning, math, instruction-following, and multilingual tasks. The 1.5B-Deep model is comparable to 7B-10B models, and the 0.5B model reaches the level of 2024-era 7B models. The models are strongly aligned via SFT and DPO and perform well on benchmarks such as MMLU, GSM8K, and HumanEval.

🛠️ Falcon-H1's results rest on advanced training infrastructure and methodology. The models use a customized Maximal Update Parametrization (µP) to support smooth scaling across model sizes. For parallelism, Mixer Parallelism (MP) and Context Parallelism (CP) raise throughput for long-context processing. bfloat16 and 4-bit quantized versions are also provided, greatly easing deployment on edge devices.
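Since bfloat16 and 4-bit variants are mentioned, a hypothetical loading sketch in the Hugging Face transformers ecosystem may help. The repository id, the need for a recent transformers release with Falcon-H1 support, and the bitsandbytes 4-bit path are assumptions; consult the official model cards for exact identifiers and requirements.

```python
# Sketch only: the repo id and quantization path are assumptions, not taken
# from the report; check the official Falcon-H1 model cards on Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/Falcon-H1-1.5B-Deep-Instruct"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)

# bfloat16 weights (the full-precision release mentioned in the report)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Alternatively, a 4-bit load for memory-constrained / edge-style deployment
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

inputs = tokenizer("Explain state space models in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```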

Introduction

The Falcon-H1 series, developed by the Technology Innovation Institute (TII), marks a significant advancement in the evolution of large language models (LLMs). By integrating Transformer-based attention with Mamba-based State Space Models (SSMs) in a hybrid parallel configuration, Falcon-H1 achieves exceptional performance, memory efficiency, and scalability. Released in multiple sizes (0.5B to 34B parameters) and versions (base, instruct-tuned, and quantized), Falcon-H1 models redefine the trade-off between compute budget and output quality, offering parameter efficiency superior to many contemporary models such as Qwen2.5-72B and LLaMA3.3-70B.

Key Architectural Innovations

The technical report explains how Falcon-H1 adopts a novel parallel hybrid architecture where both attention and SSM modules operate concurrently, and their outputs are concatenated before the projection. This design deviates from traditional sequential integration and provides the flexibility to tune the number of attention and SSM channels independently. The default configuration uses a 2:1:5 ratio for SSM, attention, and MLP channels respectively, optimizing both efficiency and learning dynamics.
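The sketch below illustrates this parallel pattern in PyTorch: attention and a heavily simplified stand-in for the SSM mixer read the same normalized input, their outputs are concatenated and projected back to the model width, and an MLP follows (mirroring the SA_M layout discussed later). The branch widths follow the 2:1:5 SSM : attention : MLP ratio, but all dimensions, the toy diagonal recurrence, and the absence of causal masking are illustrative simplifications, not the Falcon-H1 implementation.

```python
import torch
import torch.nn as nn

class ToySSM(nn.Module):
    """Stand-in for the Mamba mixer: a diagonal linear recurrence
    h_t = a * h_{t-1} + b * x_t applied channel-wise. Illustrative only."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.log_a = nn.Parameter(torch.full((dim,), -1.0))  # per-channel decay
        self.b = nn.Parameter(torch.ones(dim))

    def forward(self, x):                      # x: (batch, seq, dim)
        u = self.in_proj(x)
        a = torch.exp(self.log_a).clamp(max=0.999)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):             # naive scan; real SSMs use a fused kernel
            h = a * h + self.b * u[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

class ParallelHybridBlock(nn.Module):
    """Attention and SSM branches run on the same input; their outputs are
    concatenated and projected back to the model width, then an MLP follows
    (the semi-parallel 'SA_M' layout described in the report)."""
    def __init__(self, d_model=512, attn_dim=128, ssm_dim=256, mlp_dim=640, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn_in = nn.Linear(d_model, attn_dim)
        self.attn = nn.MultiheadAttention(attn_dim, n_heads, batch_first=True)
        self.ssm_in = nn.Linear(d_model, ssm_dim)
        self.ssm = ToySSM(ssm_dim)
        self.out_proj = nn.Linear(attn_dim + ssm_dim, d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, d_model))

    def forward(self, x):
        h = self.norm1(x)
        qkv = self.attn_in(h)
        a_out, _ = self.attn(qkv, qkv, qkv, need_weights=False)  # causal mask omitted for brevity
        s_out = self.ssm(self.ssm_in(h))
        x = x + self.out_proj(torch.cat([a_out, s_out], dim=-1))  # concatenate, then project
        return x + self.mlp(self.norm2(x))

x = torch.randn(2, 16, 512)
print(ParallelHybridBlock()(x).shape)  # torch.Size([2, 16, 512])
```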

To further refine the model, the report explores several design axes (a depth-versus-width sketch follows this list):

- Channel allocation: ablations show that adding attention channels hurts performance, while rebalancing channels between the SSM and MLP blocks yields clear gains.
- Block configuration: the SA_M variant (semi-parallel, with attention and SSM running together followed by the MLP) gives the best training loss and compute efficiency.
- RoPE base frequency: an unusually high base of 10^11 improves generalization in long-context training.
- Depth versus width: at a fixed parameter budget, deeper models outperform wider ones; Falcon-H1-1.5B-Deep (66 layers) rivals many 3B and 7B models.
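The depth-versus-width point can be illustrated with a back-of-the-envelope parameter count. The sketch below uses a generic dense-transformer estimate (roughly 12·d_model² parameters per block); the 66-layer count echoes Falcon-H1-1.5B-Deep, but the widths and vocabulary size are made-up values for illustration, not Falcon-H1 configurations.

```python
def approx_params(n_layers, d_model, vocab=65_536):
    """Very rough dense-block estimate: ~12 * d_model^2 per layer
    (attention + MLP) plus the embedding matrix. Illustration only."""
    return n_layers * 12 * d_model**2 + vocab * d_model

wide_shallow = approx_params(n_layers=24, d_model=2560)
deep_narrow  = approx_params(n_layers=66, d_model=1536)

# Both land near 2B parameters, so depth can be traded for width at a roughly
# fixed budget -- the regime where the report finds deeper models win.
print(f"24 layers x 2560 wide: {wide_shallow / 1e9:.2f}B")
print(f"66 layers x 1536 wide: {deep_narrow / 1e9:.2f}B")
```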

Tokenizer Strategy

Falcon-H1 uses a customized Byte Pair Encoding (BPE) tokenizer suite with vocabulary sizes ranging from 32K to 261K. Key design choices include (a pre-tokenization sketch follows this list):

- Splitting of digits and punctuation to improve handling of numbers and code.
- Injection of LaTeX tokens to strengthen performance on math and scientific text.
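As an illustration of digit and punctuation splitting, the snippet below trains a small byte-level BPE with the Hugging Face tokenizers library. It is a sketch of the general technique, not TII's tokenizer pipeline; the vocabulary size, the toy corpus, and the example LaTeX special tokens are assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE with digits split into individual tokens and punctuation
# isolated -- the kind of pre-tokenization choices the report describes.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.Punctuation(),
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                              # smallest size in the 32K-261K range
    special_tokens=["\\frac", "\\sum", "\\int"],    # illustrative LaTeX token injection
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

corpus = ["Solve 12345 + 6789.", "E = mc^2, i.e. \\frac{a}{b}."]  # toy corpus
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("Compute 2048 * 4 and \\frac{1}{2}.").tokens)
```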

Pretraining Corpus and Data Strategy

Falcon-H1 models are trained on up to 18T tokens from a carefully curated 20T token corpus, comprising:

- High-quality web data and Common Crawl covering 18 languages, alongside curated sources such as Wikipedia.
- Dedicated code and math datasets.
- Synthetic data and long-context material, including Fill-in-the-Middle (FIM) formatting (sketched below).
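The Fill-in-the-Middle idea can be sketched in a few lines: a document is split into prefix, middle, and suffix, and re-serialized so the model learns to generate the middle from its surroundings. The sentinel token names below follow a common FIM convention and are assumptions, not Falcon-H1's actual tokens.

```python
import random

# Common-convention sentinels; the actual Falcon-H1 tokens may differ.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_fim(document: str, rng: random.Random) -> str:
    """Split a document at two random points and re-order it so the model
    is trained to produce the middle span given the prefix and suffix."""
    i, j = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM ordering: prefix, then suffix, then the middle as the training target.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(0)
print(to_fim("def add(a, b):\n    return a + b\n", rng))
```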

Training Infrastructure and Methodology

Training utilized customized Maximal Update Parametrization (µP), supporting smooth scaling across model sizes (a generic µP sketch follows). The models also employ advanced parallelism strategies: Mixer Parallelism (MP) and Context Parallelism (CP), which raise throughput for long-context training and inference. In addition, bfloat16 and 4-bit quantized checkpoints are released to ease deployment on edge and memory-constrained devices.
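To give a flavour of what µP buys, the sketch below applies a generic µP-style rule for Adam-family optimizers: hidden weight matrices get a learning rate scaled down by the ratio of a small proxy width to the target width, so hyperparameters tuned on the proxy transfer to wider models. This is the textbook recipe, not TII's customized parametrization; the widths and base learning rate are made-up values.

```python
import torch
import torch.nn as nn

def mup_param_groups(model: nn.Module, base_width: int, width: int, base_lr: float = 3e-3):
    """Generic muP-style grouping for Adam: hidden matrices get lr scaled by
    base_width / width, while vectors (biases, norms) keep the base lr.
    A sketch of the idea, not Falcon-H1's customized parametrization."""
    mult = base_width / width
    matrix_params, vector_params = [], []
    for p in model.parameters():
        (matrix_params if p.ndim >= 2 else vector_params).append(p)
    return [
        {"params": matrix_params, "lr": base_lr * mult},
        {"params": vector_params, "lr": base_lr},
    ]

width = 2048
model = nn.Sequential(nn.Linear(width, 4 * width), nn.GELU(), nn.Linear(4 * width, width))
# Hyperparameters were "tuned" at a hypothetical proxy width of 256.
optimizer = torch.optim.AdamW(mup_param_groups(model, base_width=256, width=width))
print([group["lr"] for group in optimizer.param_groups])
```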

Evaluation and Performance

Falcon-H1 achieves unprecedented performance per parameter:

- Falcon-H1-34B-Instruct rivals or surpasses 70B-scale models such as Qwen2.5-72B and LLaMA3.3-70B on reasoning, math, instruction-following, and multilingual tasks.
- Falcon-H1-1.5B-Deep performs on par with leading 7B-10B models.
- Falcon-H1-0.5B delivers results comparable to typical 2024-era 7B models.

Benchmarks span MMLU, GSM8K, HumanEval, and long-context tasks. The models demonstrate strong alignment via SFT and Direct Preference Optimization (DPO).

Conclusion

Falcon-H1 sets a new standard for open-weight LLMs by integrating parallel hybrid architectures, flexible tokenization, efficient training dynamics, and robust multilingual capability. Its strategic combination of SSM and attention allows for unmatched performance within practical compute and memory budgets, making it ideal for both research and deployment across diverse environments.


Check out the Paper and Models on Hugging Face.

