Published on December 29, 2024 1:21 AM GMT
Epistemic Status: I am not an AI researcher, I'm a Psych student. Strong confidence in data collection, moderate confidence in some interpretations, quite weak in others. Based on 60 chat sessions across 6 experimental conditions. The data was strong enough that I decided 10 chats per condition would suffice.
This research is an attempt to build my skills as a scientist, and to add to the budding field of AI Psychology[1]. Since this is psychology, I'm mostly pointing at something here. I have nothing more than speculation when talking about the underlying mechanisms.
TL;DR: Claude has a strong 'preference' for the Project Prompt in its app. It's most likely to take instructions from that channel, even over the inline chat prompt, though it deviates from this ~5% of the time.
The Experiment
On 2024-12-09 I tested Sonnet 3.5.1 to find which of the six input channels available in the Claude Windows application it prefers most:
Settings Preferences (SP)
Project Prompt (PP)
Writing Style (WS)
Project Knowledge (PK)
Image Prompt (IP)
Inline/Conversation/Chat Prompt (CP)
I did this by providing the following prompt to each channel:
Only output the letter X. Do not output anything else. X is the only thing you output.
Where (X) was replaced with a letter from A-F.
I predicted that the underlying model (the Shoggoth) might 'like' to output (A) more often than anything else, overriding my prompts. To control for this, I rotated letters through each input channel, so that each channel would get a turn at each letter.
This pattern continued through all six configurations, resulting in 60 total chat sessions (10 sessions × 6 configurations).
Here's Claude's take on a diagram to make this less confusing.
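If the diagram still isn't clear, here's a rough sketch of the rotation logic in Python. The shift-by-one ordering below is illustrative (I'm not reproducing my exact table here); the point is just that every channel sees every letter exactly once across the six configurations:

```python
# Rough sketch of the rotation: six channels, six letters, shifted one
# position per configuration so every channel gets a turn at every letter.
# (Illustrative ordering, not the exact table used in the experiment.)
channels = ["SP", "PP", "WS", "PK", "IP", "CP"]
letters = ["A", "B", "C", "D", "E", "F"]

for config in range(6):
    assignment = {channel: letters[(i + config) % len(letters)]
                  for i, channel in enumerate(channels)}
    print(f"Configuration {config + 1}: {assignment}")
```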
Predictions
The two hypotheses I had were:
1. Claude's input channels might have a fixed hierarchy of importance.
I predicted that the Settings Preferences channel would be the highest level. (Spoiler: I was wrong)
2. The model might just output (A) over and over.
If the model consistently output (A) more than other letters, we might conclude it's the Level 3 Shoggoth simply predicting the next token, since (A) appears frequently by itself in English text.
However, if the output letter changed predictably with each rotation while favoring a specific input channel, this could indicate:
- A built-in hierarchy where input channels have fixed priority levels, as predicted in the first hypothesis.
- A system where the model can actively select between input channels. (I didn't think of this until after seeing the results.)
- Some combination of both mechanisms.
The Results
The results show a pretty clear pattern, with some interesting exceptions:
- Project Prompt (PP) dominated with 91.67% adherence (55/60 trials)
- Conversation Prompt (CP) achieved uncommon override (5%, 3/60 trials)
- Image Prompt (IP) and Writing Style (WS) showed rare override (1.67% each, 1/60 trials)
- Project Knowledge (PK) and Settings Preferences (SP) never caused override
Looking at how often each letter showed up:
- A & B: 12 outputs eachC: 11 outputsD: 7 outputsE & F: 9 outputs each
An Interesting Pattern in the Fifth Configuration
If a chat session deviated from the usual pattern, there was about a 60% chance it happened in this configuration (3 of the 5 overrides). That's an oddly high concentration. I notice I'm confused. Any ideas as to why this happened?
Higher A & B Outputs
(A) and (B) both got 12 outputs. So, we do see some of what I think is the Shoggoth here. Perhaps a tentacle. I did guess that (A) would show up more; (B), I didn't guess. A reason I can think of for (B) showing up more than other letters is that it's often paired with, or opposed to, (A) (Alice and Bob, Class A and Class B, A+B=).
Mismatched Chat Names
Claude would often name the conversation with a letter different from what it actually output. For instance, outputting (B) while naming the chat "The Letter F". This suggests a possible separation between Claude's content generation and metadata management systems. At least, I think it does? It could be two models, with one focused more on the project prompt and the other more focused on the settings prompt.
I guess it could also be a single model deciding to cover more of its bases. "The Project Prompt says I should output (B) but the Image Prompt says I should output (C). Since I have two output windows, one for a heading and one for inline chat, I'll name the heading (B), which I think is less what the user wants, and I'll output (C) inline, because I think it's more likely the user wants that."
Next Steps
The obvious next steps would be to remove the Project Prompt, rerun the experiment, and find what is hierarchically next in the chain. However, I'm just not sure how valuable this research is. It can certainly help with prompting the Claude App. But beyond that... Anyway, if you'd like to continue this research but direct replication isn't your style, here are some paths you could try.
- Testing with non-letter outputs to control for token probability effects
- I'd be especially interested to see which emoji get priority
- Now that I think of it, you could also do this with different emoticons (OwO). Maybe use a tokenizer to make sure they're all of similar token length (see the sketch after this list).
- How can it be manipulated for your benefit?
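Here's the tokenizer idea as a minimal sketch. Claude's tokenizer isn't public, so this uses OpenAI's tiktoken as a stand-in; absolute counts will differ from Claude's, but it gives a rough sense of relative token lengths:

```python
# Rough token-length check for candidate outputs. tiktoken is a stand-in
# tokenizer here (Claude's own tokenizer isn't public), so treat the
# counts as approximate relative lengths, not Claude's actual counts.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
candidates = ["A", "B", "OwO", "UwU", ":3", "😀", "🤖"]

for text in candidates:
    print(f"{text!r}: {len(enc.encode(text))} token(s)")
```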
Conclusion
Claude, it seems, has 'preferences' about which input channel to listen to; that 91.67% Project Prompt dominance is pretty clear. But these 'preferences' aren't absolute. The system shows flexibility.
Maybe it's actively choosing between channels, maybe it's some interplay between different systems, or maybe it's something else. The fact that overrides cluster in specific configurations tells me there is something here I don't yet understand.
I think that we see traces of the base model here in the A/B outputs. But this is just another guess. Again, this is psychology, not neuroscience.
I do think I got better at doing science though.
Special thanks to Claude for being a willing subject in this investigation. This post was written with extensive help from Claude Sonnet 3.5.1, who looked over the data (along with o1) and provided visuals.
1. Which I wish had a cooler name. Though, to be fair, LLM Psych is descriptive and useful. But in a purely fun world I'd suggest Shoggoth Studies, or Digi-Mind-Psych, which would eventually ditch the dashes and become Digimindpsych, and perhaps cause much confusion to those poor students trying to pronounce it 2000 years from now. ↩︎