I
A botanist sets out to study plants in the Amazon rainforest. For her first research project, she sets her sights on “red things”, so as not to stretch herself too far. She looks at red flowers and notices how hummingbirds drink their nectar; she studies red fruits and notices how parrots eat them.
She comes to a tentative hypothesis: perhaps red attracts birds. Then she notices the red undersides of the leaves on bushes in the undergrowth. Confusing! All she can say at the end of her project is that red tends to be caused by carotenoids, and also anthocyanins.
A researcher living in Canada reads her work. He looks out his window at the red maples of autumn. Ah, he says, carotenoids and/or anthocyanins. Makes sense. Science.
Of course, we can see that not much useful work has been done here. We don’t understand the function of fruit or nectar or the red undersides of leaves. We don’t understand why chlorophyll is drained from leaves in the autumn while the red pigments remain.
II
What went wrong? A few things, but the big problem here is that “red things” is not a good category to study in the context of plants. Sure, the fact that we can pick out the category means its members might have a few things in common, but we’re not speaking the native language of biology here. To make sense of biology, we must think in terms of evolution, ecology, and physiology.
We have to talk about “fruits” as a class of things-which-cause-animals-to-distribute-seeds; we have to ask which animals might distribute the seeds best, and how to attract them. Flowers are similar but not quite the same. The red leaf undersides are totally different: they reflect the long wavelengths of the light which reaches the jungle floor, sending it back up into the cells above and giving the chlorophyll a second chance to absorb it.
Red-thing-ism has both lumped together unrelated things (fruit, flowers, leaves) and split the true categories along a (somewhat) unnatural boundary (red flowers, other flowers).
This exact example has never quite happened, but evolutionary biologists see the pattern all the time! The order “Insectivora” was constructed to contain hedgehogs, shrews, golden moles, and the so-called flying lemur. But this group isn’t a clade descended from a single common ancestor; many animals simply converged on the same lifestyle and body plan.
III
Evolutionary biologists have tightly restricted themselves to speaking only the language of evolution itself, which is why they end up saying such deranged things as “whales are mammals” and “birds are dinosaurs” and “fish don’t exist”. By doing so, they’ve managed to study a process which has played out over billions of unobservable years: the method works.
Platt emphasized strong inference as the hallmark of a successful field. I’ll identify another: avoiding red-thing-ism.
IV
Red-thing-ism is at its most common when there is little knowledge of the structure of the field. It’s especially common in AI safety research. A particular example is “deception”. Most people who talk about deception lump together several different things:
1. Outputting text which does not represent the AI’s best model of the world, according to standard symbol-grounding rules.
2. Modelling a person as an agent with a belief system, and trying to shift that belief system away from your own best guess about the world.
3. Modelling a person as a set of behaviours, and taking actions which cause them to produce some output, like feeding ants a mixture of borax and sugar.
In the language of AI and thinking systems, these are not the same thing. But even if we restrict our analysis to an individual case, such as 3, we run into a problem. Red-thing-ism doesn’t just lump unnaturally; it splits unnaturally. Part of the boundary of 3 is “I don’t like it”. The same is true for things like goal misgeneralization: what distinguishes a misgeneralization from a good generalization is “I don’t like it”.
So far, “I don’t like it” has not been translated into the native language of AI and cognitive systems. Doing so is, in fact, a very large and very hard part of the alignment problem! There are a lot of topics being studied where the object of study is just assumed to be a meaningful category, but where the meaningfulness of that category requires a big chunk of alignment to have already been solved.
(Another common error sidesteps issues in symbol grounding, like the question of whether a given sequence of tokens “is false”.)
Once I started noticing this, I couldn’t stop seeing it. Whether an AI is “a schemer” is another example, but the problem applies to the gears of a lot of applied research. Many agendas in control, oversight, and probing look like “we want to reduce chain-of-thought obfuscation” or “we want to detect scheming”. These all appear a bit hollow and meaningless to me now. Oh dear!