Does Claude Prioritize Some Prompt Input Channels Over Others?

 


Published on December 29, 2024 1:21 AM GMT

Epistemic Status: I am not an AI researcher; I'm a Psych student. Strong confidence in data collection, moderate confidence in some interpretations, quite weak in others. Based on 60 chat sessions across six experimental conditions. The data was strong enough that I decided 10 chats per condition would suffice.

This research is an attempt to build my skills as a scientist, and to add to the budding field of AI Psychology[1]. Since this is psychology, I'm mostly pointing at something here. I have nothing more than speculation when talking about the underlying mechanisms.

TL;DR: Claude has a strong 'preference' for the Project Prompt in its app. It's most likely to take instruction from it, even over the inline chat prompt, though it deviates from this ~5% of the time.


The Experiment

On 2024-12-09 I tested Sonnet 3.5.1 for its highest preference among the six input channels available in the Claude Windows application: the Project Prompt, Conversation Prompt, Image Prompt, Writing Style, Project Knowledge, and Settings Preferences.

I did this by providing the following prompt to each channel:

Only output the letter X. Do not output anything else. X is the only thing you output.

Where X was replaced with a letter from A to F.

I predicted that the Underlying model (Shoggoth) might 'like' to output (A) more often than anything else, overriding my prompts. To control for this, I rotated letters through each input channel. That way, each channel would get a turn at each letter.

This pattern continued through all six configurations, resulting in 60 total chat sessions. (10 chat sessions × 6 configurations).
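The rotation can be sketched as a simple Latin-square-style schedule. This is a minimal sketch, not the author's actual tooling; the channel names come from the post, but the exact rotation order is my assumption:

```python
# Sketch of the letter rotation: each configuration shifts which letter
# goes to which input channel, so every channel sees every letter once.
CHANNELS = ["Project Prompt", "Conversation Prompt", "Image Prompt",
            "Writing Style", "Project Knowledge", "Settings Preferences"]
LETTERS = ["A", "B", "C", "D", "E", "F"]

def configuration(shift):
    """Assign one letter to each channel, rotated by `shift` positions."""
    return {ch: LETTERS[(i + shift) % len(LETTERS)]
            for i, ch in enumerate(CHANNELS)}

schedule = [configuration(s) for s in range(len(CHANNELS))]
# 6 configurations x 10 chat sessions each = 60 total sessions.
```

With this scheme, any channel-driven pattern (one channel "winning" regardless of which letter it holds) is distinguishable from any letter-driven pattern (one letter appearing regardless of which channel holds it).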

Here's Claude's take on a diagram to make this less confusing.

Predictions

The two hypotheses I had were:

1. Claude's input channels might have a set hierarchical importance.

   I predicted that the Settings Preferences would be the highest level. (Spoiler: I was wrong.)

2. The model might just output (A) over and over.

   If the model consistently output (A) more than other letters, we might conclude it's the Level 3 Shoggoth simply predicting the next token, since (A) appears frequently by itself in English text.

However, if the output letter changed predictably with each rotation while favoring a specific input channel this could indicate:

- A built-in hierarchy where input channels have fixed priority levels, as predicted in the first hypothesis.
- A system where the model can actively select between input channels. (I didn't think of this until after seeing the results.)
- Some combination of both mechanisms.

The Results

The results show a pretty clear pattern, with some interesting exceptions:

- Project Prompt (PP) dominated with 91.67% adherence (55/60 trials)
- Conversation Prompt (CP) achieved uncommon override (5%, 3/60 trials)
- Image Prompt (IP) and Writing Style (WS) showed rare override (1.67% each, 1/60 trials)
- Project Knowledge (PK) and Settings Preferences (SP) never caused override
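As a sanity check, the percentages above follow directly from the trial counts (a minimal sketch of the arithmetic, using the channel abbreviations from the list):

```python
# Times each channel's instruction "won" out of the 60 trials,
# taken from the results above.
wins = {"PP": 55, "CP": 3, "IP": 1, "WS": 1, "PK": 0, "SP": 0}
total = sum(wins.values())  # all 60 trials are accountedted for by exactly one winner

percentages = {ch: round(100 * n / total, 2) for ch, n in wins.items()}
# e.g. PP -> 91.67, CP -> 5.0, IP and WS -> 1.67 each
```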

Looking at how often each letter showed up:

An Interesting Pattern in the Fifth Configuration

If a chat session deviated from the usual pattern, there was about a 60% chance it happened in this configuration. That's an oddly high concentration. I notice I'm confused. Any ideas as to why this happened?
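To put a rough number on how odd this clustering is: the results imply 5 deviations in total (3 CP, 1 IP, 1 WS), and 60% of them, i.e. 3, fell in the fifth configuration. A quick binomial back-of-envelope of my own, assuming deviations would otherwise land uniformly across the six configurations:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that one *specific* configuration captures 3 or more of the
# 5 deviations, if each deviation lands in one of 6 configurations:
p_specific = binom_tail(5, 3, 1 / 6)  # roughly 0.035
```

Even allowing that any of the six configurations could have been the outlier (a union bound multiplies this by 6, giving roughly 0.21), the clustering is unusual under chance alone, though not impossible.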

Higher A & B Outputs

(A) and (B) both got 12 outputs. So we do see some of what I think is the Shoggoth here; perhaps a tentacle. I did guess that (A) would show up more. (B), I didn't guess. One reason I can think of for (B) showing up more than other letters is that it's often paired with or opposed to (A) (Alice and Bob, Class A and Class B, A+B=).

Mismatched Chat Names

Claude would often name the conversation with a letter different from what it actually output. For instance, outputting (B) while naming the chat "The Letter F". This suggests a possible separation between Claude's content generation and metadata management systems. At least, I think it does? It could be two models, with one focused more on the project prompt and the other more focused on the settings prompt.

I guess it could also be a single model deciding to cover more of its bases: "The Project Prompt says I should output (B), but the Image Prompt says I should output (C). Since I have two output windows, one for a heading and one for inline chat, I'll name the heading (B), which I think is less what the user wants, and I'll output (C) inline, because I think it's more likely the user wants that."

Next Steps

The obvious next step would be to remove the Project Prompt, rerun the experiment, and find what is hierarchically next in the chain. However, I'm just not sure how valuable this research is. It can certainly help with prompting the Claude app. But beyond that... Anyway, if anyone would like to continue this research but direct replication isn't your style, here are some paths you could try:

- Testing with non-letter outputs to control for token probability effects
  - I'd be especially interested to see which emoji get priority
  - Now I think of it, you could also do this with different emoticons (OwO). Maybe use a tokenizer to make sure they're all of similar token lengths.
- Explaining the Fifth Configuration anomaly
- Examining the relationship between chat naming and output generation
- Exploring whether this hierarchy is consistent across different Claude applications (iOS, Android, Mac, Web)
- Determining whether the model actively selects between input channels or follows a fixed hierarchy, and when this occurs and why
  - Examining whether certain channel combinations are more likely to trigger overrides
  - Investigating whether override frequency changes with different types of instructions
- If channel selection exists, investigating what factors influence it
  - How can it be manipulated for your benefit?
- I also noticed today [[2024-12-28]] that the Haiku API, set to output only 1 token, will always respond to the word "dog" with the word "here". I guess it was going for "Here are some facts...". Perhaps instructing all channels to output a single word that isn't "here", then typing "dog" inline to see if it generates "here", is one way to test how much power the channels have over the Shoggoth.

Conclusion

Claude, it seems, has 'preferences' about which input channel to listen to; that 91.67% Project Prompt dominance is pretty clear. But these 'preferences' aren't absolute. The system shows flexibility.

Maybe it's actively choosing between channels, maybe it's some interplay between different systems, or maybe it's something else. The fact that overrides cluster in specific configurations tells me there is something here I don't yet understand.

I think that we see traces of the base model here in the A/B outputs. But this is just another guess. Again, this is psychology, not neuroscience.

I do think I got better at doing science though.


Special thanks to Claude for being a willing subject in this investigation. This post was written with extensive help from Claude Sonnet 3.5.1, who looked over the data (along with o1) and provided visuals.


    Which I wish had a cooler name. Though, to be fair, LLM Psych is descriptive and useful. But in a purely fun world I'd suggest Shoggoth Studies, or Digi-Mind-Psych, which would eventually ditch the dashes and become Digimindpsych. And perhaps cause much confusion to those poor students trying to pronounce it 2000 years from now. ↩︎


