MarkTechPost@AI, March 4
NeoBERT: Modernizing Encoder Models for Enhanced Language Understanding

NeoBERT revitalizes long-stagnant encoder design by incorporating advanced techniques from decoder models, addressing the limitations of traditional BERT models in architecture, data, and context length. NeoBERT adopts rotary position embeddings (RoPE), an optimized depth-to-width ratio, RMSNorm, and SwiGLU, and is trained on the much larger RefinedWeb dataset. Experimental results show that NeoBERT performs strongly on the GLUE and MTEB benchmarks while supporting a longer context window with higher efficiency. Its open-source release makes it a practical choice for retrieval, classification, and real-world applications, and lays a foundation for future research on efficient, scalable language understanding.

🔄 Architectural modernization: NeoBERT replaces absolute positional embeddings with rotary position embeddings (RoPE), which generalize better to longer sequences. RoPE integrates positional information directly into the attention mechanism, reducing degradation on out-of-distribution lengths.

📚 Data and training: NeoBERT is trained on the 600B-token RefinedWeb dataset (18× the data used for RoBERTa), exposing the model to diverse, real-world text. It also uses a two-stage context extension: pre-training on 1,024-token sequences first, then fine-tuning on 4,096-token batches with a mix of standard and long-context data.

🚀 Performance and evaluation: NeoBERT scores 89.0% on GLUE, matching RoBERTa-large with 100M fewer parameters. On MTEB, it outperforms GTE, CDE, and jina-embeddings by +4.5%, demonstrating superior embedding quality.

Encoder models like BERT and RoBERTa have long been cornerstones of natural language processing (NLP), powering tasks such as text classification, retrieval, and toxicity detection. However, while decoder-based large language models (LLMs) like GPT and LLaMA have evolved rapidly, incorporating architectural innovations, larger datasets, and extended context windows, encoders have stagnated. Despite their critical role in embedding-dependent applications, BERT-family models rely on outdated architectures, limited training data, and short context lengths, leading to suboptimal performance on modern benchmarks. In a recent paper, the researchers present NeoBERT, which revitalizes encoder design by integrating advancements from decoder models while addressing the inherent limitations of existing encoders.

Traditional encoders like BERT and RoBERTa use absolute positional embeddings, Gaussian Error Linear Unit (GELU) activations, and a fixed 512-token context window. While newer models like GTE and CDE improved fine-tuning strategies for tasks like retrieval, they rely on outdated backbone architectures inherited from BERT. These backbones suffer from inefficiencies:

    Architectural Rigidity: Fixed depth-to-width ratios and positional encoding methods limit adaptability to longer sequences.
    Data Scarcity: Pre-training on small datasets (e.g., Wikipedia + BookCorpus) restricts knowledge diversity.
    Context Constraints: Short sequence lengths (512–2,048 tokens) hinder applications requiring long-context understanding.

Recent fine-tuning advancements masked these issues but failed to modernize the core models. For example, GTE’s contrastive learning boosts retrieval performance but cannot compensate for BERT’s obsolete embeddings. NeoBERT addresses these gaps through architectural overhauls, data scaling, and optimized training:

    Architectural Modernization:
      Rotary Position Embeddings (RoPE): Replaces absolute positional embeddings with relative positioning, enabling better generalization to longer sequences. RoPE integrates positional information directly into attention mechanisms, reducing degradation on out-of-distribution lengths.
      Depth-to-Width Optimization: Adjusts layer depth (28 layers) and width (768 dimensions) to balance parameter efficiency and performance, avoiding the “width-inefficiency” of smaller models.
      RMSNorm and SwiGLU: Replaces LayerNorm with RMSNorm for faster computation and adopts SwiGLU activations, enhancing nonlinear modeling while maintaining parameter count (a minimal sketch of these components follows this list).
    Data and Training:
      RefinedWeb Dataset: Trains on 600B tokens (18× larger than RoBERTa’s data), exposing the model to diverse, real-world text.
      Two-Stage Context Extension: First pre-trains on 1,024-token sequences, then fine-tunes on 4,096-token batches using a mix of standard and long-context data. This phased approach mitigates distribution shifts while expanding usable context (see the training-schedule sketch below).
      Efficiency Optimizations:
        FlashAttention and xFormers: Reduce memory overhead for longer sequences.
        AdamW with Cosine Decay: Balances training stability and regularization.
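
To make the architectural changes concrete, here is a minimal PyTorch sketch of three components the list above names: rotary position embeddings applied to query/key tensors, RMSNorm in place of LayerNorm, and a SwiGLU feed-forward block. Shapes, dimensions, and function names are illustrative choices, not code taken from the NeoBERT release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embeddings: rotate channel pairs of a (batch, seq, heads, head_dim) tensor.

    Relative offsets then emerge from the query/key dot product itself,
    so no learned position table tied to a maximum length is needed.
    """
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # (seq, half)
    cos = angles.cos()[None, :, None, :]                                      # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square only; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated projection in place of a plain GELU MLP."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


# Example: rotate queries for a 4,096-token sequence with 12 heads of size 64.
q = apply_rope(torch.randn(1, 4096, 12, 64))
print(q.shape, RMSNorm(768)(torch.randn(2, 768)).shape, SwiGLU(768, 2048)(torch.randn(2, 768)).shape)
```

Because the rotation encodes position inside the attention dot product rather than in a learned table of fixed size, this style of embedding tends to degrade more gracefully when sequences grow past the pre-training length, which is the property the article attributes to RoPE.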
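The two-stage training schedule can be sketched in the same spirit. The outline below is not the authors' script: the loader names, the learning rate, and the stage-1 step count are placeholders (only the 50k-step continuation is mentioned in the article). It simply illustrates the shape of the recipe, namely 1,024-token pre-training followed by a 4,096-token continuation on mixed data, with AdamW and cosine learning-rate decay.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def train_stage(model, loader, max_len: int, steps: int, lr: float = 1e-4):
    """Run one training stage at a fixed maximum sequence length (placeholder hyperparameters)."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=steps)   # cosine decay over the stage
    model.train()
    for _, batch in zip(range(steps), loader):
        batch = {k: v[:, :max_len] for k, v in batch.items()}  # truncate to the stage's context length
        loss = model(**batch).loss                             # assumes a masked-LM head that returns a loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()


# Stage 1: bulk pre-training on 1,024-token sequences drawn from RefinedWeb (step count illustrative).
# train_stage(model, refinedweb_loader, max_len=1024, steps=1_000_000)
# Stage 2: shorter continuation on 4,096-token batches mixing standard and long-context data.
# train_stage(model, mixed_long_context_loader, max_len=4096, steps=50_000)
```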

Performance and Evaluation

NeoBERT’s improvements are validated across the following benchmarks:

    GLUE: Scores 89.0%, matching RoBERTa-large’s performance despite having 100M fewer parameters. Key drivers include the RefinedWeb dataset (+3.6% gain) and scaled model size (+2.9%).
    MTEB: Outperforms GTE, CDE, and jina-embeddings by +4.5% under standardized contrastive fine-tuning, demonstrating superior embedding quality. The evaluation isolates pre-training benefits by applying identical fine-tuning protocols to all models (an embedding-extraction sketch follows this list).
    Context Length: NeoBERT4096 achieves stable perplexity on 4,096-token sequences after 50k additional training steps, whereas BERT struggles beyond 512 tokens. Efficiency tests show NeoBERT processes 4,096-token batches 46.7% faster than ModernBERT, despite its larger size.
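
Because the model is released openly, a typical downstream use is extracting sentence embeddings for retrieval or clustering. The sketch below is a generic transformers recipe with mean pooling; the repository identifier, the need for trust_remote_code, and the availability of last_hidden_state are assumptions to verify against the official model card on Hugging Face.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# The model identifier is an assumption; confirm it against the official Hugging Face model card.
MODEL_ID = "chandar-lab/NeoBERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)  # remote-code flag needed only if the repo ships custom modeling code
model.eval()

sentences = [
    "NeoBERT extends the usable context window to 4,096 tokens.",
    "Encoder embeddings power retrieval and classification systems.",
]

with torch.no_grad():
    enc = tokenizer(sentences, padding=True, truncation=True, max_length=4096, return_tensors="pt")
    hidden = model(**enc).last_hidden_state                      # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean pooling over non-padding tokens

# Cosine similarity between the two sentence embeddings.
print(F.cosine_similarity(embeddings[0:1], embeddings[1:2]).item())
```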

In conclusion, NeoBERT represents a paradigm shift for encoder models, bridging the gap between stagnant architectures and modern LLM advancements. By rethinking depth-to-width ratios, positional encoding, and data scaling, it achieves state-of-the-art performance on GLUE and MTEB while supporting context windows eight times longer than BERT. Its efficiency and open-source availability make it a practical choice for retrieval, classification, and real-world applications requiring robust embeddings. However, reliance on web-scale data introduces biases, necessitating ongoing updates as cleaner datasets emerge. NeoBERT’s success underscores the untapped potential of encoder modernization, setting a roadmap for future research in efficient, scalable language understanding.


Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.


