ByteByteGo · 2 days ago, 23:39
How LLMs See the World

 

This article gives an accessible introduction to the token, the core unit with which large language models (LLMs) process text. It explains what tokens are and why they matter for LLM understanding, generation, and cost. Walking through common tokenization methods such as BPE, WordPiece, and SentencePiece, it shows how these approaches shape vocabulary management, the handling of unknown words, efficiency, and performance. It also examines how tokens relate to a model's context window, analyzes the challenges tokenization poses for cross-lingual text, numerical reasoning, and code generation along with directions for improvement, and underlines why understanding tokens is essential to using LLMs effectively.

💬 **Tokens are the basic units of LLM text processing**: Tokenization is the first step in converting text into a numerical form an LLM can understand. Tokens may be single characters, subwords, whole words, punctuation marks, or special symbols. For example, "Hello world!" might be split into ['Hello', ' world', '!'], which directly shapes how the model parses the input.

🔧 **Tokenization methods affect model performance and efficiency**: Methods such as BPE, WordPiece, SentencePiece, and Unigram each have distinct characteristics; for example, BPE builds its vocabulary by merging high-frequency character pairs, while WordPiece prioritizes data likelihood. These methods determine how well a model handles new and rare words and how efficiently text is represented, which in turn affects the model's ability to generalize.

💡 **Token counts drive cost and context limits**: Both LLM API calls and model processing capacity are constrained by token counts. Users pay per token, and models have a maximum context window, which directly determines how long inputs can be, how the model responds, and how coherent a conversation remains.

⚠️ **Tokenization quirks cause unexpected model behavior**: Non-English text, special characters, emojis, numbers, and code can be over-fragmented during tokenization, inflating token counts, blurring meaning, or causing calculation errors. In the well-known "3.11 vs 3.9" comparison, the model is not comparing numerical values but pattern-matching over token sequences.

🚀 **Understanding tokens is key to using LLMs well**: A solid grasp of tokenization helps users design better prompts, estimate API costs, troubleshoot model issues, and see the boundaries of what LLMs can do with different languages, arithmetic, and code, enabling more effective use of AI.

AI Agents Need Monitoring. Now They Have It. (Sponsored)

Your agent misfired. Then called a tool that never responded. Your user hit retry again.

Sentry caught it, and traced it through the full execution flow so you can see what broke, where, and why.

Forget scattered logs. Sentry's Agent Monitoring shows you what your AI is doing in production so you can debug it.

Your AI isn’t isolated. Your monitoring shouldn’t be either.

See agent behavior in context — in the same Sentry already tracking your frontend, backend, and everything in between.

Try Sentry for Free


For LLMs, tokens are the fundamental units of text that the model processes. When you type 'Hello world!' into ChatGPT, it doesn't see two words and a punctuation mark; it sees something like four distinct tokens: ['Hello', ' world', '!', '\n'].

Tokens are what rule the world of LLMs. You send tokens to models, you pay by the token, and models read, understand, and breathe tokens.

What Are Tokens?

Tokens are the fundamental units of text that an LLM processes.

However, tokens are not always equivalent to words. Depending on the tokenization approach used, a token could represent:

    A single character
    A subword (a piece of a word)
    A whole word
    A punctuation mark
    A special symbol

For example, the sentence "I love machine learning!" might be tokenized as: ["I", "love", "machine", "learning", "!"] or ["I", " love", " machine", " learn", "ing", "!"] depending on the tokenization method.

Why Tokenization Matters

Tokenization matters for several reasons:

    Vocabulary Management: LLMs have finite vocabularies (typically between 30K and 100K tokens). Tokenization allows these finite vocabularies to express an open-ended language space. By breaking rare or complex words into reusable subword units (e.g., "extraordinary" → "extra" + "ordinary"), the model avoids needing a separate token for every possible word in every language.

    Handling Unknown Words: Good tokenization strategies can break down unfamiliar words into familiar subword units, enabling models to handle words they've never seen before. For example, a model that has never seen the word “biocatalyst” might still recognize “bio” and “catalyst” as separate tokens and draw useful meaning from them.

    Efficiency: The length of text sequences directly impacts computational requirements. Efficient tokenization reduces the number of tokens needed to represent text.

    Model Performance: The quality of tokenization affects how well LLMs understand and generate text, especially for non-English languages or specialized domains. Poorly tokenized input can fragment meaning or distort structure.
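The subword splitting described above can be sketched with a greedy longest-match tokenizer over a toy, hand-picked vocabulary. The vocabulary and the matching strategy here are illustrative assumptions, not any real model's tokenizer:

```python
# Toy vocabulary: a few subwords plus single characters as a fallback.
# (Hand-picked for illustration; real vocabularies are learned from data.)
VOCAB = {"bio", "catalyst", "extra", "ordinary", "cat",
         "a", "b", "c", "d", "e", "i", "l", "n", "o", "r", "s", "t", "x", "y"}

def tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched at all: emit an unknown marker.
            tokens.append("<unk>")
            i += 1
    return tokens

print(tokenize("biocatalyst"))    # ['bio', 'catalyst']
print(tokenize("extraordinary"))  # ['extra', 'ordinary']
```

Even though "biocatalyst" is not in the vocabulary, the tokenizer never fails: it falls back to known subwords, which is exactly how real tokenizers cope with unseen words.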

How Tokens are Read By LLMs

Once text is tokenized, there is one more step that transforms these symbolic tokens into something the neural network can actually process: numerical representation. Each token in the vocabulary is assigned a unique integer ID (called a token ID). For example:

"Hello" → token ID 15496

" world" → token ID 995

These token IDs are then converted into high-dimensional numerical vectors called embeddings through an embedding layer. Each token ID maps to a dense vector of real numbers (typically 512, 1024, or more dimensions). For instance, the token "Hello" might become a vector like [0.23, -0.45, 0.78, ...].

This numerical transformation is necessary because neural networks can only perform mathematical operations on numbers, not on text symbols. The embedding vectors capture semantic relationships between tokens, where similar tokens have similar vector representations in this high-dimensional space. This is how models "understand" that "king" and "queen" are related, or that "run" and "running" share meaning.
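The token → ID → embedding pipeline can be sketched in a few lines. The vocabulary, IDs, and 4-dimensional vectors below are made up for illustration; real models use learned embeddings with hundreds or thousands of dimensions:

```python
import math

# Hypothetical token-to-ID table and embedding matrix (one row per ID).
TOKEN_TO_ID = {"Hello": 0, " world": 1, "king": 2, "queen": 3}
EMBEDDINGS = [
    [0.23, -0.45, 0.78, 0.10],   # "Hello"
    [0.11, 0.52, -0.33, 0.80],   # " world"
    [0.90, 0.10, 0.40, 0.05],    # "king"
    [0.85, 0.15, 0.38, 0.07],    # "queen"
]

def embed(token: str) -> list[float]:
    """Look up a token's dense vector via its integer ID."""
    return EMBEDDINGS[TOKEN_TO_ID[token]]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Related tokens sit closer together in the embedding space:
print(cosine(embed("king"), embed("queen")))   # close to 1.0
print(cosine(embed("king"), embed(" world")))  # much smaller
```

In a real model the embedding matrix is a trained layer, so the geometry (king near queen) emerges from data rather than being hand-placed as it is here.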

Common Tokenization Methods

1. Byte Pair Encoding (BPE)

BPE is one of the most widely used tokenization methods in modern LLMs, adopted by models such as GPT-2, GPT-3, and GPT-4.

How it works:

    Start with a base vocabulary of individual characters (or bytes).
    Count how often each adjacent pair of symbols appears in the training corpus.
    Merge the most frequent pair into a single new token.
    Repeat the count-and-merge step until the target vocabulary size is reached.

BPE creates a flexible subword vocabulary that efficiently represents common words while still being able to break down rare ones. This helps models handle misspellings, compound words, and unknown terms without resorting to an "unknown token."

A key variant is byte-level BPE, which works directly with UTF-8 bytes rather than Unicode characters. This makes sure that any possible character can be represented (even those not seen during training), avoiding the "unknown token" problem.
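The merge loop at the heart of BPE can be sketched on a tiny corpus. This is an illustrative toy, not a production tokenizer: real implementations train on large corpora and (for byte-level BPE) operate on UTF-8 bytes rather than characters:

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    corpus = [list(w) for w in words]   # start from individual characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for word in corpus:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol.
        for word in corpus:
            i = 0
            while i < len(word) - 1:
                if (word[i], word[i + 1]) == best:
                    word[i:i + 2] = [word[i] + word[i + 1]]
                else:
                    i += 1
    return merges

print(train_bpe(["low", "low", "lower", "lowest"], 2))
# [('l', 'o'), ('lo', 'w')] -- "low" becomes a single reusable unit
```

After two merges the common stem "low" is one token, while the rarer suffixes "er" and "est" remain as characters, which is the frequency-driven behavior the section describes.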

2. WordPiece

WordPiece was introduced by Google and is used in models like BERT, DistilBERT, and Electra.

How it works:

    Like BPE, it starts from characters and builds larger subword units.
    Instead of merging the most frequent pair, it chooses the merge that most increases the likelihood of the training data.
    Subword pieces that continue a word are marked with a "##" prefix.

For example, "unhappy" might be tokenized as ["un", "##happy"] in WordPiece.
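WordPiece segmentation at inference time uses greedy longest-match-first, with the "##" prefix marking continuation pieces. The tiny vocabulary below is an assumption for illustration:

```python
# Hypothetical WordPiece vocabulary; "##" marks a word-internal piece.
VOCAB = {"un", "happy", "##happy", "##happ", "##y", "[UNK]"}

def wordpiece(word: str) -> list[str]:
    """Greedy longest-match-first segmentation, BERT-style."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            # Pieces after the first carry the "##" continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                tokens.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]   # no piece matched: the whole word is unknown
    return tokens

print(wordpiece("unhappy"))  # ['un', '##happy']
print(wordpiece("happy"))    # ['happy']
```

Note that "##happy" beats the shorter candidates "##happ" + "##y" purely because longer matches are tried first.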

3. SentencePiece

SentencePiece is a tokenizer developed by Google that works directly on raw text without requiring language-specific pre-tokenization. It’s used in models like T5, XLNet, and ALBERT.

How it works:

    It treats the input as a raw character stream, so no language-specific pre-tokenization (such as splitting on spaces) is required.
    Spaces are encoded explicitly with the "▁" symbol, making tokenization fully reversible.
    Internally it applies either BPE or the Unigram algorithm to learn the subword vocabulary.

For example, the phrase "Hello world" might be tokenized as ["▁Hello", "▁world"], where the ▁ indicates a word boundary.
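The "▁" boundary marker makes SentencePiece reversible: spaces survive the round trip. A minimal sketch (the subword step is stubbed out as whole-word pieces; real SentencePiece would run BPE or Unigram here):

```python
BOUNDARY = "\u2581"  # the "▁" word-boundary symbol

def to_pieces(text: str) -> list[str]:
    """Mark word boundaries with ▁, then split into pieces."""
    marked = BOUNDARY + text.replace(" ", BOUNDARY)
    # Stub: each boundary-delimited chunk becomes one piece.
    return [BOUNDARY + chunk for chunk in marked.split(BOUNDARY) if chunk]

def detokenize(pieces: list[str]) -> str:
    """Invert tokenization exactly: ▁ turns back into a space."""
    return "".join(pieces).replace(BOUNDARY, " ").lstrip()

pieces = to_pieces("Hello world")
print(pieces)              # ['▁Hello', '▁world']
print(detokenize(pieces))  # 'Hello world'
```

Because the space is part of the token itself, no separate record of the original whitespace is needed to reconstruct the text.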

4. Unigram

Unigram is often used together with SentencePiece and takes a probabilistic approach rather than a merge-based one.

How it works:

    Start from a large candidate vocabulary of possible subwords.
    Score tokens with a unigram language model and iteratively prune those that contribute least to the likelihood of the training data.
    At inference time, choose the most probable segmentation of each word (multiple segmentations remain possible).

Unlike BPE or WordPiece, which build their vocabulary by merging, Unigram works more like sculpting, starting big and pruning down. This allows it to maintain a broader set of tokenization options and gives it more flexibility during inference.
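Unigram's probabilistic flavor can be sketched with a tiny Viterbi-style dynamic program: given per-token probabilities, pick the segmentation with the highest total log-probability. The probability table is an illustrative assumption, not a trained model:

```python
import math

# Hypothetical unigram probabilities for a handful of candidate tokens.
LOGP = {t: math.log(p) for t, p in {
    "un": 0.10, "happy": 0.08, "unhappy": 0.002,
    "u": 0.01, "n": 0.01, "h": 0.01, "a": 0.01, "p": 0.01, "y": 0.01,
}.items()}

def segment(word: str) -> list[str]:
    """Return the maximum-likelihood segmentation (Viterbi over prefixes)."""
    n = len(word)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in LOGP:
                score = best[start][0] + LOGP[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n][1]

print(segment("unhappy"))  # ['un', 'happy']
```

Here "un" + "happy" wins over the single rare token "unhappy" because its combined probability is higher, which is exactly the trade-off Unigram resolves probabilistically rather than by fixed merge rules.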

Tokens and Context Windows

LLMs have a limited "context window," which is the maximum number of tokens they can process at once. This limit directly affects:

    How much input text you can provide in a single prompt
    How long the model's responses can be
    How much conversation history is retained, and therefore how coherent a long dialogue stays

Older models like GPT-2 were limited to ~1,024 tokens. GPT-3 increased this to 2,048. Today, cutting-edge models such as Gemini 2.5 Pro support context windows of 1M+ tokens.
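One practical consequence of a fixed window is that old conversation turns must be dropped (or summarized) once the limit is hit. A minimal sketch, where the per-message token counts are hypothetical stand-ins for a real tokenizer's output:

```python
CONTEXT_LIMIT = 8  # tokens; deliberately tiny for the example

def fit_to_window(messages: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep the most recent messages whose token counts fit the limit."""
    kept, total = [], 0
    # Walk backwards from the newest message, keeping what still fits.
    for msg, n_tokens in reversed(messages):
        if total + n_tokens > CONTEXT_LIMIT:
            break
        kept.append((msg, n_tokens))
        total += n_tokens
    return list(reversed(kept))

history = [("hi", 2), ("long question", 5), ("answer", 3), ("follow-up", 4)]
print(fit_to_window(history))  # [('answer', 3), ('follow-up', 4)]
```

The earliest turns fall out of the window first, which is why long chats can "forget" how they started.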

What To Know About Tokenization

Token Counting

Understanding token counts is important for:

    Estimating API costs, since you pay per token
    Staying within a model's context window
    Designing prompts that use the available space efficiently

As a rough estimate for English text (this varies by tokenizer and content): 1 token ≈ 4 characters ≈ 0.75 words, so 100 tokens correspond to roughly 75 words.

Tokenization Quirks

Tokenization can lead to some unexpected behaviors:

    Non-English Languages: Many LLMs tokenize non-English text inefficiently, using more tokens per word than English.

    Special Characters: Unusual characters, emojis, or specific formatting might consume more tokens than expected. For instance, a single emoji like “🧠” might consume several tokens, depending on the tokenizer, which can distort meaning or unexpectedly bloat token usage.

    Numbers and Code: Some tokenizers handle numbers and programming code in counter-intuitive ways, breaking them into multiple tokens. This fragmentation makes it harder for models to reason numerically or generate accurate code, since the logical or mathematical unit is not preserved as a whole.
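Both quirks can be made concrete. A single emoji spans several UTF-8 bytes (so a byte-level tokenizer may need several tokens for it), and a simple digit-chunking rule (an illustrative assumption, not any specific model's scheme) fragments a decimal number:

```python
import re

# The brain emoji is one character but four UTF-8 bytes: a byte-level
# tokenizer may spend up to four tokens on it.
print(len("🧠".encode("utf-8")))  # 4

def split_number(s: str) -> list[str]:
    """Toy rule: digit runs and non-digits become separate tokens."""
    return re.findall(r"\d+|\D", s)

print(split_number("3.11"))  # ['3', '.', '11']
print(split_number("3.9"))   # ['3', '.', '9']
```

Once "3.11" has been shattered into three pieces, the model no longer holds a single numeric value it could compare directly.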

How Tokenization Impacts LLM Performance

Many challenges and quirks in large language models stem not from the model itself, but from how text is tokenized. Here’s how tokenization affects different areas of performance:

The Infamous 3.11 vs 3.9 Problem

Large language models often fail at seemingly simple numerical comparisons like “What is bigger: 3.11 or 3.9?”. Tokenization provides insight into how numbers are processed under the hood.

Let's look at these numbers: 3.11 and 3.9. When tokenized, these are broken into separate components. For simplicity, let's say that “3.11” is split into tokens like "3", ".", and "11", while “3.9” is split into "3", ".", and "9". To a language model, these aren’t numerical values but symbolic fragments. The model isn’t comparing 3.11 against 3.9 as floating-point values. It’s pattern-matching based on the statistical likelihood of what text should come next, given how it has seen these tokens appear in its training data.
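The fragment-level failure mode can be demonstrated directly: comparing the fractional fragments "11" and "9" as standalone numbers suggests the wrong answer, while proper floating-point comparison gets it right:

```python
# Fragments after the "." token, as a tokenizer might produce them.
frac_a, frac_b = "11", "9"

# Fragment-level view: 11 > 9, so "3.11 looks bigger" -- the common mistake.
print(int(frac_a) > int(frac_b))     # True (misleading)

# Numerical view: 3.11 < 3.9, the correct comparison.
print(float("3.11") > float("3.9"))  # False (correct)
```

The model's pattern-matching over fragments resembles the first comparison, not the second, which is why the question trips it up.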

There are multiple ways for models today to answer these correctly:

    Step-by-step (chain-of-thought) reasoning that forces the model to compare the numbers digit by digit
    Tool use, such as delegating the comparison to a calculator or code interpreter
    Improved number tokenization, for example splitting numbers into individual digits

Conclusion

Tokenization is how LLMs break down text into processable units before converting them to numbers. Text like "Hello world!" becomes tokens like ['Hello', ' world', '!'], then gets converted to numerical vectors that neural networks can understand. Common methods include BPE (used by GPT models), WordPiece (BERT), and SentencePiece (T5).

Tokenization directly impacts costs (you pay per token), context limits (models can only process so many tokens), and performance quirks. It explains why LLMs struggle with math (numbers get split up), why non-English text is less efficient (more tokens needed), and why models fail at "3.11 vs 3.9" comparisons (they see fragmented symbols, not numbers).

Understanding tokenization helps you write better prompts, estimate API costs, troubleshoot issues, and grasp both the capabilities and fundamental limitations of modern AI, because tokenization is the lens through which LLMs see everything.


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com.
