MarkTechPost@AI · March 24, 14:50
SuperBPE: Advancing Language Models with Cross-Word Tokenization

SuperBPE is a novel tokenization algorithm proposed by researchers from the University of Washington, NVIDIA, and the Allen Institute for AI. By extending the standard byte-pair encoding (BPE) algorithm, it builds a vocabulary containing both traditional subword tokens and "superword" tokens that span multiple words. SuperBPE addresses the problem of traditional tokenization treating whitespace as a semantic boundary, allowing it to better capture the meaning of multi-word expressions. Experiments show that SuperBPE outperforms BPE across a range of benchmarks without requiring any changes to the underlying model architecture, offering a more efficient tokenization scheme for modern language model development.

💡 Limitations of traditional subword tokenization: Existing subword tokenizers split text into tokens that cannot cross whitespace, ignoring the fact that multi-word expressions such as "a lot of" act as single semantic units. Across languages, the same concept may be expressed with one word or with several, which poses a challenge for tokenization.

✨ The core idea of SuperBPE: SuperBPE addresses this by introducing "superword" tokens into the BPE algorithm, allowing tokens to span multiple words and thus better capture the semantics of multi-word expressions. It is trained in two stages: subword tokens are first learned while whitespace boundaries are preserved, and these constraints are then removed so that superword tokens can form.

🚀 Strengths and results: SuperBPE performs strongly across 30 benchmarks; all SuperBPE models outperform the BPE baseline, with especially large gains on multiple-choice tasks. It also improves inference efficiency, delivering better performance at lower compute cost. Because no change to the underlying model architecture is required, SuperBPE can serve as a drop-in replacement for traditional BPE.

Language models (LMs) face a fundamental challenge in how they perceive textual data through tokenization. Current subword tokenizers segment text into vocabulary tokens that cannot bridge whitespace, adhering to an artificial constraint that treats space as a semantic boundary. This practice ignores the reality that meaning often spans more than one word: multi-word expressions like "a lot of" function as single semantic units, and English speakers mentally store thousands of such phrases. Cross-linguistically, the same concept may be expressed as a single word or as several, depending on the language. Notably, some languages like Chinese and Japanese use no whitespace at all, allowing tokens to span multiple words or sentences without apparent performance degradation.
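To make the whitespace constraint concrete, here is a minimal Python sketch of the kind of whitespace pretokenization standard BPE tokenizers apply before learning merges. The regex is an illustrative stand-in, not the exact pretokenizer of any particular model.

```python
import re

def pretokenize(text: str) -> list[str]:
    """Simplified whitespace pretokenization: attach the leading space to each
    word so that later BPE merges can never cross from one word into the next."""
    return re.findall(r" ?\S+", text)

chunks = pretokenize("It rains a lot of the time.")
print(chunks)
# ['It', ' rains', ' a', ' lot', ' of', ' the', ' time.']
# Merges are learned *inside* each chunk, so "a lot of" can never become a
# single token, even though it behaves like one semantic unit.
```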

Previous research has explored several approaches beyond traditional subword tokenization. Some studies investigated processing text at multiple levels of granularity or creating multi-word tokens through frequency-based n-gram identification. Others have explored multi-token prediction (MTP), which lets a language model predict multiple tokens in a single step and confirms that models can process more than one subword at a time; however, these approaches require architectural modifications and fix the number of tokens predicted per step. Still other researchers have pursued tokenizer-free approaches that model text directly as byte sequences, but this significantly increases sequence lengths and computational requirements, leading to complex architectural solutions.

Researchers from the University of Washington, NVIDIA, and the Allen Institute for AI have proposed SuperBPE, a tokenization algorithm that creates a vocabulary containing both traditional subword tokens and innovative "superword" tokens that span multiple words. This approach enhances the popular byte-pair encoding (BPE) algorithm with a pretokenization curriculum: whitespace boundaries are initially maintained to learn subword tokens, then removed to allow superword tokens to form. While standard BPE quickly reaches diminishing returns and begins adopting increasingly rare subwords as the vocabulary size grows, SuperBPE keeps discovering common multi-word sequences to encode as single tokens, improving encoding efficiency.

SuperBPE operates through a two-stage training process that modifies the pretokenization step of traditional BPE, as described above. Intuitively, this first builds semantic units and then combines them into common sequences for greater efficiency. Setting t = T (where t is the transition point and T the target vocabulary size) recovers standard BPE, while t = 0 yields naive whitespace-free BPE. Training SuperBPE requires more computational resources than standard BPE because, without whitespace pretokenization, the training data consists of extremely long "words" with minimal deduplication. However, this added training cost amounts to a few hours on 100 CPUs and is incurred only once, which is negligible compared to the resources required for language model pretraining.
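The sketch below illustrates this two-stage procedure on a toy corpus. It is a simplified, self-contained illustration of the idea rather than the authors' implementation: the symbol representation, tie-breaking, and the tiny example text are assumptions made for clarity.

```python
from collections import Counter

def pair_counts(chunks):
    """Count adjacent symbol pairs; pairs never span two chunks."""
    counts = Counter()
    for chunk in chunks:
        counts.update(zip(chunk, chunk[1:]))
    return counts

def apply_merge(chunks, pair):
    """Replace every occurrence of `pair` inside each chunk with the fused symbol."""
    new_chunks = []
    for chunk in chunks:
        out, i = [], 0
        while i < len(chunk):
            if i + 1 < len(chunk) and (chunk[i], chunk[i + 1]) == pair:
                out.append(chunk[i] + chunk[i + 1])
                i += 2
            else:
                out.append(chunk[i])
                i += 1
        new_chunks.append(out)
    return new_chunks

def train_two_stage(text, t, T):
    """Learn T merges: the first t with whitespace pretokenization (subwords),
    the rest without it, so merges may cross word boundaries (superwords)."""
    # Stage 1: split on whitespace; a leading space marks each word's start.
    chunks = [list(" " + word) for word in text.split()]
    merges = []
    while len(merges) < T:
        if len(merges) == t:
            # Transition point: drop the whitespace constraint by fusing the
            # document into one long chunk, so later merges can span words.
            chunks = [[sym for chunk in chunks for sym in chunk]]
        counts = pair_counts(chunks)
        if not counts:
            break
        best = max(counts, key=counts.get)   # most frequent adjacent pair
        chunks = apply_merge(chunks, best)
        merges.append(best)
    return merges

text = "in a lot of cases a lot of data helps a lot of the time"
print(train_two_stage(text, t=8, T=16)[8:])
# After the transition, merges such as (' a', ' lot') appear, eventually
# yielding multi-word tokens like ' a lot of'.
```

Note that setting t=T means the flattening step never fires (plain BPE), while t=0 fires it immediately (whitespace-free BPE), matching the two extremes described above.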

SuperBPE shows impressive performance across 30 benchmarks spanning knowledge, reasoning, coding, reading comprehension, and more. All SuperBPE models outperform the BPE baseline, with the strongest 8B model achieving an average improvement of 4.0% and surpassing the baseline on 25 of 30 individual tasks. Multiple-choice tasks show substantial gains, with a +9.7% improvement. The only statistically significant underperformance occurs on the LAMBADA task, where SuperBPE's final accuracy drops from 75.8% to 70.6%. Moreover, all reasonable transition points yield stronger results than the baseline, and the most encoding-efficient transition point delivers a +3.1% performance improvement while reducing inference compute by 35%.
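Since the efficiency gain comes from encoding the same text in fewer tokens, the mechanism can be seen by comparing bytes encoded per token: fewer tokens for the same text means fewer transformer positions to process at inference time. The helper below is a hedged sketch; `bpe_encode` and `superbpe_encode` are hypothetical placeholders for the two tokenizers' encode functions, and the 35% figure is the paper's measurement, not something this snippet reproduces.

```python
def encoding_efficiency(encode, texts):
    """Average UTF-8 bytes encoded per token: higher means shorter token
    sequences, and hence fewer transformer positions at inference time."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_bytes / total_tokens

# `bpe_encode` and `superbpe_encode` are hypothetical stand-ins for the two
# tokenizers; with real tokenizers, relative sequence length (and roughly the
# per-position inference cost) scales with the inverse of this ratio.
# eff_bpe = encoding_efficiency(bpe_encode, corpus)
# eff_sup = encoding_efficiency(superbpe_encode, corpus)
# print(f"sequence-length reduction: {1 - eff_bpe / eff_sup:.1%}")
```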

In conclusion, researchers introduced SuperBPE, a more effective tokenization approach developed by enhancing the standard BPE algorithm to incorporate superword tokens. Despite tokenization serving as the fundamental interface between language models and text, tokenization algorithms have remained relatively static. SuperBPE challenges this status quo by recognizing that tokens can extend beyond traditional subword boundaries to include multi-word expressions. SuperBPE tokenizers enable language models to achieve superior performance across numerous downstream tasks while reducing inference computational costs. These advantages require no modifications to the underlying model architecture, making SuperBPE a seamless replacement for traditional BPE in modern language model development pipelines.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.


