LLMs Can't See Pixels or Characters

The "mysterious" difficulties current large language models (LLMs) have with counting, arithmetic, and visual reasoning share one root cause: how the models ingest information, namely tokenization. LLMs typically process groups of characters or pixels as "tokens" rather than examining them one by one. For example, when asked to count the "r"s in "strawberry", the model may see the word as a single token and therefore cannot inspect its individual letters. The same mechanism was behind early LLMs' struggles with arithmetic, where models had to memorize enormous numbers of digit combinations. Modern LLMs have improved at math, but tokenization can still prevent precise perception of image detail, especially in tasks that demand pixel-level understanding. Fundamentally, the models lack direct, fine-grained "vision" of the underlying data (characters, pixels), which limits their performance on certain reasoning tasks.

💡 **Tokenization limits fine-grained perception**: When processing text, LLMs typically group several characters into a single "token". For a task like counting the "r"s in "strawberry", the model cannot inspect the characters one by one the way a human can, because it sees pre-built tokens rather than the raw character sequence. This mechanism improves efficiency but sacrifices precise access to low-level detail, forcing the model to rely on memorization rather than direct analysis.

🔢 **Early math failures were also a tokenization problem**: LLMs' past difficulty with arithmetic is closely tied to how numbers were tokenized. Models had to memorize vast numbers of digit combinations instead of performing digit-by-digit computation the way humans do. Modern LLMs have improved by fixing the tokenization rules (fixed-length digit tokens, split right-to-left), but the dependence on how the underlying data is chunked remains.

🖼️ **Visual reasoning is limited by how pixels are processed**: Multimodal models break images into "patches" of pixels. Although these patches contain enough pixel information in principle, slicing the image into fixed-size blocks can make it hard for the model to recognize objects that cross patch boundaries. Moreover, training focuses on macro-level semantics ("there is a tree in this image") rather than pixel-level differences, which limits performance on tasks requiring fine-grained visual reasoning.

🚀 **The fix is more flexible handling of the underlying data**: The article argues the key is to improve how LLMs process low-level data. For text, character-level processing is possible, though computationally more expensive, and Byte Latent Transformers dynamically select patches based on complexity instead of using hard-coded tokenization; a similar dynamic choice of patch size might help for images instead of fixed-size patches. Fundamentally, models need finer-grained, more flexible "vision" of the data to handle complex reasoning.

Published on July 20, 2025 8:00 PM GMT

This might be beating a dead horse, but there are several "mysterious" problems LLMs are bad at that all seem to have the same cause. I wanted an article I could reference when this comes up, so I wrote one.

What do these problems all have in common? The LLM we're asking to solve these problems can't see what we're asking it to do.

How many tokens are in 'strawberry'?

Current LLMs almost always process groups of characters, called tokens, instead of processing individual characters. They do this for performance reasons[1]: grouping 4 characters (on average) into a token cuts the sequence length the model has to process by 4x, so the same context window covers roughly 4x more text.

So, when you see the question "How many R's are in strawberry?", you can zoom in on [s, t, r, a, w, b, e, r, r, y], count the r's and answer 3. But when GPT-4o looks at the same question, it sees [5299 ("How"), 1991 (" many"), 460 (" R"), 885 ("'s"), 553 (" are"), 306 (" in"), 101830 (" strawberry"), 30 ("?")].

Good luck counting the R's in token 101830. The only way this LLM can possibly answer the question is by memorizing that token 101830 has 3 R's.
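You can see this for yourself with a minimal sketch using the open-source `tiktoken` library (assuming a reasonably recent version is installed); the exact IDs it prints depend on which encoding you load, so don't expect them to match the numbers above exactly.

```python
# Minimal sketch: inspect how a BPE tokenizer chunks the question.
# Assumes the open-source `tiktoken` package is installed.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # loads the o200k_base encoding
question = "How many R's are in strawberry?"

for token_id in enc.encode(question):
    # Decode each token individually to see the chunk of characters it covers.
    print(token_id, repr(enc.decode([token_id])))

# The model never sees [s, t, r, a, w, b, e, r, r, y]; it sees one ID for
# " strawberry", so counting R's means recalling a memorized fact about that ID.
```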

You thought New Math was confusing...

Ok, so why were LLMs initially so bad at math? Would you believe that this situation is even worse?

Say you wanted to add two numbers like 2020+1=?

You can zoom in on the digits, adding left-to-right[2] and just need to know how to add single-digit numbers and apply carries.

When an older LLM like GPT-3 looks at this problem...

It has to memorize that token 41655 ("2020") + token 16 ("1") = tokens [1238 ("20"), 2481 ("21")]. And it has to do that for every math problem because the number of digits in each number is essentially random[3].
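You can poke at this with tiktoken's GPT-3-era vocabulary (`r50k_base`); a rough sketch, and the IDs it prints are whatever that encoding produces rather than hand-checked values.

```python
# Rough sketch: how a GPT-3-era tokenizer chunks numbers and arithmetic.
# Assumes `tiktoken` is installed; r50k_base is the GPT-3-era vocabulary.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
for text in ["2020+1=2021", "1945", "945"]:
    ids = enc.encode(text)
    chunks = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {list(zip(ids, chunks))}")

# Whether a number gets one token or several depends on how common it is,
# so the model faces a slightly different memorization problem for almost every number.
```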

Digit tokenization has actually been fixed and modern LLMs are pretty good at math now that they can see the digits. The solution is that digit tokens are always fixed length (typically 1-digit tokens for small models and 3-digit tokens for large models), plus tokenizing right-to-left to make powers of ten line up. This lets smaller models do math the same way we do (easy), and lets large models handle longer numbers in exchange for needing to memorize the interactions between every number from 0 to 999 (still much easier than the semi-random rules before).
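As a concrete illustration of the right-to-left part, here's a toy chunker (a hypothetical helper for illustration, not any particular lab's tokenizer) that splits a digit string into fixed-width groups aligned from the right:

```python
# Toy illustration of fixed-width, right-to-left digit chunking.
# Hypothetical helper, not any specific model's actual tokenizer.
def chunk_digits_right_to_left(digits: str, width: int = 3) -> list[str]:
    """Split a digit string into fixed-width groups, aligned from the right
    so the same powers of ten always land in the same position of a chunk."""
    chunks = []
    for end in range(len(digits), 0, -width):
        chunks.append(digits[max(0, end - width):end])
    return list(reversed(chunks))

print(chunk_digits_right_to_left("2020"))     # ['2', '020']
print(chunk_digits_right_to_left("1234567"))  # ['1', '234', '567']
# Left-to-right chunking would give ['123', '456', '7'], so the ones digit
# could land anywhere in a chunk; right-to-left keeps the columns lined up.
```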

Why can Claude see the forest but not the cuttable trees?

Multimodal models are capable of taking images as inputs, not just text. How do they do that?

Naturally, you cut up an image and turn it into tokens! Ok, so not exactly tokens. Instead of grouping some characters into a token, you group some pixels into a patch (traditionally, around 16x16 pixels).

The original thesis for this post was going to be that images have the same problem text does: patches discard pixel-level information. But I actually don't think that's true anymore; LLMs might just be bad at understanding some images because of how they're trained, or because of some other downstream bottleneck.

Unfortunately, the way most frontier models process images is secret, but Llama 3.2 Vision seems to use 14x14 patches and processes them into embeddings with dimension 4096[4]. A 14x14 RGB image is only 4704 bits[5] of data. Even pessimistically assuming 1.58 bits per dimension, there should be space to represent the value of every pixel.
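To make that arithmetic concrete, here's a small numpy sketch that slices an image into 14x14 patches and compares the raw bits in a patch to the quoted embedding width (the 224x224 input size is my assumption, chosen so the reshape divides evenly):

```python
# Sketch: slice an RGB image into 14x14 patches and compare information budgets.
# The 224x224 input size is an assumption chosen to divide evenly into patches.
import numpy as np

H, W, P = 224, 224, 14
image = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)

# (H, W, 3) -> (num_patches, P, P, 3)
patches = (image.reshape(H // P, P, W // P, P, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, P, P, 3))
print(patches.shape)  # (256, 14, 14, 3)

raw_bits_per_patch = P * P * 3 * 8         # 4704 bits of raw pixel data
pessimistic_embedding_bits = 4096 * 1.58   # ~6472, per the figures above
print(raw_bits_per_patch, pessimistic_embedding_bits)
```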

It seems like the problem with vision is that the training is primarily on semantics ("Does this image contain a tree?") and there's very little training similar to "How exactly does this tree look different from this other tree?".

That said, cutting the image up on arbitrary boundaries does make things harder for the model. In an image from Pokemon Red, each sprite is usually 16x16 pixels, so with 14x14 patches the model constantly needs to look at multiple patches and figure out which objects cross patch boundaries.

Visual reasoning with blurry vision

LLMs have the same trouble with visual reasoning problems that they have playing Pokemon. If you can't see the image you're supposed to be reasoning from, it's hard to get the right answer.

For example, ARC Prize Puzzle 00dbd492 depends on visual reasoning of a grid pattern.

If I give Claude a series of screenshots, it fails completely because it can't actually see the pattern in the test input.

But if I give it ASCII art designed to ensure one token per pixel, it gets the answer right.
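The trick is just to format the grid so a BPE tokenizer has no good reason to merge neighboring cells; something like the sketch below (the grid values are made up, not the actual puzzle):

```python
# Sketch: render a grid so that each cell is very likely to be its own token.
# These grid values are made up; ARC puzzles use integers 0-9 for colors.
grid = [
    [0, 0, 3, 0],
    [0, 3, 3, 0],
    [0, 0, 3, 0],
]

# One character per cell, separated by spaces and newlines, so a BPE tokenizer
# has little opportunity to merge neighboring cells into a single token.
ascii_art = "\n".join(" ".join(str(cell) for cell in row) for row in grid)
print(ascii_art)
```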

Is this fixable?

As mentioned above, this has been fixed for math by hard-coding the tokenization rules in a way that makes sense to humans.

For text in general, you can just work off of the raw characters[6], but this requires significantly more compute and memory. There are a bunch of people looking into ways to improve this, but the most interesting one I've seen is Byte Latent Transformers[7], which dynamically selects patches to work with based on complexity instead of using hard-coded tokenization. As far as I know, no one is doing this in frontier models because of the compute cost, though.

I know less about images. You can run a transformer on the individual pixels of an image, but again, it's impractical: images are big, and a single frame of 1080p video contains over 2 million pixels. If those were 2 million individual tokens, a single frame of video would fill your entire context window.

I think vision transformers actually do theoretically have access to pixel-level data though, and there might just be an issue with training or model sizes preventing them from seeing pixel-level features accurately. It might also be possible to do dynamic selection of patch sizes, but unfortunately the big labs don't seem to talk about this, so I'm not sure what the state of the art is.

1. Tokenization also causes the model to generate the first level of embeddings on potentially more meaningful word-level chunks, but the model could learn how to group (or not group) characters in later layers if the first layer were character-level.

2. Adding numbers written left-to-right is also hard for transformers, but much easier when they don't have to memorize the whole thing!

3. Tokenization is usually based on how common tokens are, so a very common number like 1945 gets its own unique token while less common numbers like 945 are broken into separate tokens.

4. If you're a programmer, this means an array of 4096 numbers.

5. 14 x 14 x 3 (RGB channels) x 8 = 4704

6. Although this doesn't entirely solve the problem, since characters aren't the only layer of input with meaning. Try asking a character-level model to count the strokes in "罐".

7. I should mention that I work at the company that produced this research, but I found it on Twitter, not at work.


