MarkTechPost@AI, March 4
NeoBERT: Modernizing Encoder Models for Enhanced Language Understanding

NeoBERT revitalizes long-stagnant encoder design by incorporating advanced techniques from decoder models, addressing the limitations of traditional BERT models in architecture, data, and context length. NeoBERT adopts rotary position embeddings (RoPE), an optimized depth-to-width ratio, RMSNorm, and SwiGLU, and is trained on the much larger RefinedWeb dataset. Experimental results show that NeoBERT performs strongly on the GLUE and MTEB benchmarks while supporting a longer context window with higher efficiency. Its open-source release makes it a practical choice for retrieval, classification, and real-world applications, and lays a foundation for future research on efficient, scalable language understanding.

🔄 Architectural modernization: NeoBERT replaces absolute positional embeddings with rotary position embeddings (RoPE), which generalize better to longer sequences. RoPE integrates positional information directly into the attention mechanism, reducing degradation on out-of-distribution lengths.

📚 Data and training: NeoBERT is trained on the 600B-token RefinedWeb dataset (18× the data used for RoBERTa), exposing the model to diverse, real-world text. It also uses a two-stage context extension: pre-training on 1,024-token sequences first, then fine-tuning on 4,096-token batches with a mix of standard and long-context data.

🚀 Performance and evaluation: NeoBERT scores 89.0% on GLUE, matching RoBERTa-large with 100M fewer parameters. On MTEB, it outperforms GTE, CDE, and jina-embeddings by +4.5%, demonstrating superior embedding quality.

Encoder models like BERT and RoBERTa have long been cornerstones of natural language processing (NLP), powering tasks such as text classification, retrieval, and toxicity detection. However, while decoder-based large language models (LLMs) like GPT and LLaMA have evolved rapidly, incorporating architectural innovations, larger datasets, and extended context windows, encoders have stagnated. Despite their critical role in embedding-dependent applications, BERT-family models rely on outdated architectures, limited training data, and short context lengths, leading to suboptimal performance on modern benchmarks. In a recent paper, the researchers present NeoBERT, which revitalizes encoder design by integrating advancements from decoder models while addressing the inherent limitations of existing encoders.

Traditional encoders like BERT and RoBERTa use absolute positional embeddings, Gaussian Error Linear Unit (GELU) activations, and a fixed 512-token context window. While newer models like GTE and CDE improved fine-tuning strategies for tasks like retrieval, they rely on outdated backbone architectures inherited from BERT. These backbones suffer from inefficiencies:

    Architectural Rigidity: Fixed depth-to-width ratios and positional encoding methods limit adaptability to longer sequences.
    Data Scarcity: Pre-training on small datasets (e.g., Wikipedia + BookCorpus) restricts knowledge diversity.
    Context Constraints: Short sequence lengths (512–2,048 tokens) hinder applications requiring long-context understanding.

Recent fine-tuning advancements masked these issues but failed to modernize the core models. For example, GTE’s contrastive learning boosts retrieval performance but cannot compensate for BERT’s obsolete embeddings. NeoBERT addresses these gaps through architectural overhauls, data scaling, and optimized training:

    Architectural Modernization:
      Rotary Position Embeddings (RoPE): Replaces absolute positional embeddings with relative positioning, enabling better generalization to longer sequences. RoPE integrates positional information directly into attention mechanisms, reducing degradation on out-of-distribution lengths.
      Depth-to-Width Optimization: Adjusts layer depth (28 layers) and width (768 dimensions) to balance parameter efficiency and performance, avoiding the “width-inefficiency” of smaller models.
      RMSNorm and SwiGLU: Replaces LayerNorm with RMSNorm for faster computation and adopts SwiGLU activations, enhancing nonlinear modeling while maintaining parameter count (a minimal sketch of these components follows this list).
    Data and Training:
      RefinedWeb Dataset: Trains on 600B tokens (18× larger than RoBERTa’s data), exposing the model to diverse, real-world text.
      Two-Stage Context Extension: First pre-trains on 1,024-token sequences, then fine-tunes on 4,096-token batches using a mix of standard and long-context data. This phased approach mitigates distribution shifts while expanding usable context (see the training-schedule sketch below).
      Efficiency Optimizations:
        FlashAttention and xFormers: Reduce memory overhead for longer sequences.
        AdamW with Cosine Decay: Balances training stability and regularization.
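
To make the architectural changes concrete, here is a minimal PyTorch sketch of three components the list above names: rotary position embeddings applied to query/key tensors, RMSNorm in place of LayerNorm, and a SwiGLU feed-forward block. Shapes, dimensions, and function names are illustrative choices, not code taken from the NeoBERT release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embeddings: rotate channel pairs of a (batch, seq, heads, head_dim) tensor.

    Relative offsets then emerge from the query/key dot product itself,
    so no learned position table tied to a maximum length is needed.
    """
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # (seq, half)
    cos = angles.cos()[None, :, None, :]                                      # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square only; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated projection in place of a plain GELU MLP."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


# Example: rotate queries for a 4,096-token sequence with 12 heads of size 64.
q = apply_rope(torch.randn(1, 4096, 12, 64))
print(q.shape, RMSNorm(768)(torch.randn(2, 768)).shape, SwiGLU(768, 2048)(torch.randn(2, 768)).shape)
```

Because the rotation encodes position inside the attention dot product rather than in a learned table of fixed size, this style of embedding tends to degrade more gracefully when sequences grow past the pre-training length, which is the property the article attributes to RoPE.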
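The two-stage training schedule can be sketched in the same spirit. The outline below is not the authors' script: the loader names, the learning rate, and the stage-1 step count are placeholders (only the 50k-step continuation is mentioned in the article). It simply illustrates the shape of the recipe, namely 1,024-token pre-training followed by a 4,096-token continuation on mixed data, with AdamW and cosine learning-rate decay.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def train_stage(model, loader, max_len: int, steps: int, lr: float = 1e-4):
    """Run one training stage at a fixed maximum sequence length (placeholder hyperparameters)."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=steps)   # cosine decay over the stage
    model.train()
    for _, batch in zip(range(steps), loader):
        batch = {k: v[:, :max_len] for k, v in batch.items()}  # truncate to the stage's context length
        loss = model(**batch).loss                             # assumes a masked-LM head that returns a loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()


# Stage 1: bulk pre-training on 1,024-token sequences drawn from RefinedWeb (step count illustrative).
# train_stage(model, refinedweb_loader, max_len=1024, steps=1_000_000)
# Stage 2: shorter continuation on 4,096-token batches mixing standard and long-context data.
# train_stage(model, mixed_long_context_loader, max_len=4096, steps=50_000)
```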

Performance and Evaluation

NeoBERT’s improvements are validated across the following benchmarks:

    GLUE: Scores 89.0%, matching RoBERTa-large’s performance despite having 100M fewer parameters. Key drivers include the RefinedWeb dataset (+3.6% gain) and scaled model size (+2.9%).
    MTEB: Outperforms GTE, CDE, and jina-embeddings by +4.5% under standardized contrastive fine-tuning, demonstrating superior embedding quality. The evaluation isolates pre-training benefits by applying identical fine-tuning protocols to all models (an embedding-extraction sketch follows this list).
    Context Length: NeoBERT4096 achieves stable perplexity on 4,096-token sequences after 50k additional training steps, whereas BERT struggles beyond 512 tokens. Efficiency tests show NeoBERT processes 4,096-token batches 46.7% faster than ModernBERT, despite its larger size.
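
Because the model is released openly, a typical downstream use is extracting sentence embeddings for retrieval or clustering. The sketch below is a generic transformers recipe with mean pooling; the repository identifier, the need for trust_remote_code, and the availability of last_hidden_state are assumptions to verify against the official model card on Hugging Face.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# The model identifier is an assumption; confirm it against the official Hugging Face model card.
MODEL_ID = "chandar-lab/NeoBERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)  # remote-code flag needed only if the repo ships custom modeling code
model.eval()

sentences = [
    "NeoBERT extends the usable context window to 4,096 tokens.",
    "Encoder embeddings power retrieval and classification systems.",
]

with torch.no_grad():
    enc = tokenizer(sentences, padding=True, truncation=True, max_length=4096, return_tensors="pt")
    hidden = model(**enc).last_hidden_state                      # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean pooling over non-padding tokens

# Cosine similarity between the two sentence embeddings.
print(F.cosine_similarity(embeddings[0:1], embeddings[1:2]).item())
```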

In conclusion, NeoBERT represents a paradigm shift for encoder models, bridging the gap between stagnant architectures and modern LLM advancements. By rethinking depth-to-width ratios, positional encoding, and data scaling, it achieves state-of-the-art performance on GLUE and MTEB while supporting context windows eight times longer than BERT. Its efficiency and open-source availability make it a practical choice for retrieval, classification, and real-world applications requiring robust embeddings. However, reliance on web-scale data introduces biases, necessitating ongoing updates as cleaner datasets emerge. NeoBERT’s success underscores the untapped potential of encoder modernization, setting a roadmap for future research in efficient, scalable language understanding.


Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.


