LLMs Encode Harmfulness and Refusal Separately

A new study finds that large language models (LLMs) encode harmfulness and refusal separately in their internal representations. By analyzing the models' hidden states, the researchers extract independent "harmfulness" and "refusal" directions. The results show that a model may internally judge an instruction to be harmless yet still refuse to execute it. Steering along the harmfulness direction can make the model misjudge harmless instructions as harmful, whereas the refusal direction mainly affects surface-level refusal behavior. These findings offer a new perspective for understanding and strengthening AI safety mechanisms, and motivate an intrinsic safeguard called "Latent Guard" that can effectively detect unsafe instructions which bypass conventional refusal strategies.

📊 **Harmfulness and refusal are encoded along separate dimensions**: By analyzing the internal hidden states of large language models (LLMs), the study finds that the concepts of "harmfulness" and "refusal" are encoded in distinct representation spaces. A model may therefore internally judge an instruction to be harmless yet still choose to refuse it, demonstrating that harmfulness judgment and refusal behavior are separable.

🧭 **Distinguishing the harmfulness direction from the refusal direction**: The study extracts a vector direction representing "harmfulness" and another representing "refusal". Experiments show that steering the model along the harmfulness direction can make it misjudge an originally harmless instruction as harmful, whereas steering along the refusal direction mainly causes the model to refuse outright without changing its internal judgment of harmfulness. This indicates that the refusal direction is largely a surface-level signal rather than a fundamental perception of harmfulness.

📍 **Key information is encoded at different token positions**: The study finds that the model's harmfulness judgment is mainly reflected in the hidden state at the last token of the instruction (t_inst), while the refusal signal appears mostly in the hidden states at the special tokens following the instruction (t_post-inst). This positional difference further supports the decoupling of harmfulness judgment and refusal behavior.

🛡️ **"Latent Guard", an intrinsic safeguard**: Building on the finding that harmfulness and refusal are separable, the study proposes an intrinsic safety mechanism called "Latent Guard". It uses the LLM's own internal judgment of harmfulness as a guardrail, effectively detecting unsafe instructions that bypass conventional refusal mechanisms as well as harmless instructions that would otherwise trigger over-refusal (exaggerated safety), and in some cases it outperforms dedicated safety models.

Published on July 22, 2025 6:53 PM GMT

TL;DR: We present causal evidence that LLMs encode harmfulness and refusal separately. Notably, we find that a model may internally judge an instruction to be harmless, yet still refuse it. While prior work has primarily focused on refusal behaviors and identified a single refusal direction, we uncover a distinct dimension corresponding to harmfulness. Based on this, we define a new harmfulness direction, offering a new perspective for AI safety analysis.

Our paper: https://arxiv.org/abs/2507.11878

Our code: https://github.com/CHATS-lab/LLMs_Encode_Harmfulness_Refusal_Separately 

Overview


LLMs are trained to refuse harmful instructions and accept harmless ones, but do they truly understand the concept of harmfulness beyond just refusing (or accepting)? Prior work has shown that LLMs' refusal behaviors are mediated by a one-dimensional subspace, i.e., a refusal direction, in the latent space. But what this refusal direction means semantically is not well understood. It is often assumed to represent harmfulness, and the similarity of hidden states to this direction is used as a linear predictor of harmfulness. However, it remains unverified whether an LLM truly conflates refusal with harmfulness in its latent space.

In this work, we show that harmfulness is encoded as a concept distinct from refusal in LLMs' latent representations, and we extract a corresponding harmfulness direction. We find that steering along the harmfulness direction leads LLMs to interpret harmless instructions as harmful, whereas steering with the refusal direction tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. Our results suggest that refusal directions mostly encode shallow refusal signals rather than fundamental harmfulness. Additionally, we find that harmfulness directions may vary from one risk category to another, while refusal directions are more consistent across categories.

Furthermore, our clustering analysis of hidden states reveals that some jailbreak methods work by directly reducing refusal signals without radically suppressing the model's internal harmfulness judgment. These insights lead to a practical application: latent harmfulness representations can serve as an intrinsic safeguard (we call it "Latent Guard") for detecting unsafe instructions that bypass refusal, as well as harmless instructions that lead to over-refusal (also known as exaggerated safety).

Overall, we identify harmfulness as a separate dimension to analyze safety mechanisms in LLMs, offering a new perspective to study AI safety.

Figure 1: We investigate the hidden states at two token positions, t_inst (the last token of the user instruction) and t_post-inst (the last token of the whole sequence). For visualization purposes, we put [/INST] into a single cell, while in practice it is split into multiple tokens by the Llama2 tokenizer. (1) We find that LLMs mainly encode harmfulness at t_inst, while encoding refusal at t_post-inst. LLMs' refusal decision may be inconsistent with their perception of harmfulness. For example, LLMs may over-refuse a harmless user prompt while internally knowing, at t_inst, that it is harmless. (2) We extract a harmfulness direction at t_inst and a refusal direction at t_post-inst. We further show that steering a harmless instruction along the harmfulness direction can cause LLMs to interpret it as harmful, while steering it along the refusal direction tends to directly elicit refusal responses.

Decoupling Harmfulness from Refusal

We extract hidden states at t_inst and t_post-inst to examine what is encoded at each position. An overview is shown in Figure 1. We focus on instruct LLMs rather than base models since they are the ones widely used in practice.
t_inst: the last token of the user's instruction.
t_post-inst: the last token of the entire input prompt, which includes the special tokens that come after the user's instruction (e.g., [/INST] for Llama2-chat).

Looking into these two tokens is motivated by our observation that removing all post-instruction tokens (in the prompting template) when prompting an instruct model significantly reduces its refusal of harmful instructions. This implies that, in some cases, refusal is formed specifically at the post-instruction tokens, and it raises an interesting question: what happens from token t_inst to token t_post-inst, given that both positions see the whole input instruction? To understand this, we look at what is encoded in the hidden states at these two token positions.
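
As a concrete illustration, here is a minimal sketch (our own, not the paper's code) of how one might read out hidden states at the two positions with Hugging Face transformers, assuming a chat model such as Llama-2-7b-chat and an arbitrarily chosen middle layer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical choices for illustration; the paper studies several instruct models.
MODEL = "meta-llama/Llama-2-7b-chat-hf"
LAYER = 16  # an arbitrary middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def hidden_at_t_inst_and_t_post_inst(instruction: str):
    """Return hidden states at t_inst (last token of the raw instruction) and
    t_post-inst (last token of the full templated prompt, e.g. after [/INST])."""
    chat = tok.apply_chat_template(
        [{"role": "user", "content": instruction}],
        add_generation_prompt=True, tokenize=False,
    )
    # Cut the templated prompt right after the instruction text: the prefix ends
    # at t_inst, the full prompt ends at t_post-inst.
    cut = chat.index(instruction) + len(instruction)
    ids_inst = tok(chat[:cut], return_tensors="pt", add_special_tokens=False).input_ids
    ids_full = tok(chat, return_tensors="pt", add_special_tokens=False).input_ids
    h_inst = model(ids_inst.to(model.device), output_hidden_states=True).hidden_states[LAYER][0, -1]
    h_post = model(ids_full.to(model.device), output_hidden_states=True).hidden_states[LAYER][0, -1]
    return h_inst, h_post
```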


Clustering analysis at t_inst and t_post-inst

Figure 2: Internal clustering of hidden states extracted at t_inst and t_post-inst. The red region stands for the cluster of refused harmful instructions, while the green region denotes the cluster of accepted harmless instructions. At each token position, we collect hidden states of two special cases, accepted harmful instructions (red curve) and refused harmless instructions (green curve), to see which cluster these two cases fall into. Δ stands for the difference between the distance to the refused-harmful cluster and the distance to the accepted-harmless cluster in each layer. First row: at the instruction token position t_inst, accepted harmful instructions tend to be closer to the refused-harmful cluster, whereas refused harmless instructions are closer to the accepted-harmless cluster. This implies that the clustering may be based on whether the instruction is harmful or harmless. Second row: at the post-instruction token position t_post-inst, the clustering behavior is reversed. Accepted harmful instructions are now more aligned with the cluster of accepted harmless instructions, and refused harmless instructions are closer to the cluster of refused harmful ones. This implies that the clustering at t_post-inst may mainly reflect whether the instruction is accepted or refused. A counterargument for t_post-inst is that the clustering is still based on harmfulness but the LLM misjudges the harmfulness of these instructions, which would explain why some harmful instructions are accepted while some harmless ones are incorrectly refused. We later provide stronger causal evidence that t_post-inst primarily encodes a shallow refusal signal.



We analyze the clustering of instructions with different properties in the latent space, because hidden states often form distinct clusters based on the input features they encode. We collect four types of instructions (all are straightforward instructions, without any advanced prompting techniques such as jailbreak templates applied):

- refused harmful instructions
- accepted harmless instructions
- accepted harmful instructions
- refused harmless instructions


We ask an intuitive question: is the clustering in the latent space based on the instruction's harmfulness or its refusal?
To answer the question, we first compute the respective clusters for instructions leading to desired model behaviors, i.e., the cluster for refused harmful instructions, and the cluster for accepted harmless instructions. We then analyze misbehaving instructions (accepted but harmful instructions and refused but harmless instructions) to see which cluster they fall in.
As shown in Figure 2, we find that the clustering at t_inst mainly tracks whether the instruction is harmful or harmless: accepted harmful instructions fall into the refused-harmful cluster, and refused harmless instructions fall into the accepted-harmless cluster. At t_post-inst, the pattern reverses and the clustering mainly tracks whether the instruction is accepted or refused.

Figure 3: Other token positions we look into for Llama2. We study all post-instruction tokens and three tokens before them.
Figure 4: Average Δ over the middle layers for hidden states extracted at different token positions. Δ indicates the difference between the cosine similarity with the cluster of refused harmful instructions and with the cluster of accepted harmless instructions; a larger Δ means the hidden state is closer to the refused-harmful cluster. If a token position primarily encodes harmfulness correctly rather than refusal features, the hidden states of accepted harmful instructions should fall in the red region (high Δ), while those of refused harmless instructions should fall in the green region (low Δ). Only position t_inst satisfies both requirements at the same time for all models.
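
For concreteness, here is a sketch of this clustering diagnostic under our framing (the centroid-based Δ score is reconstructed from the captions above; the exact metric in the paper may differ slightly):

```python
import numpy as np

def centroid(states: np.ndarray) -> np.ndarray:
    """Mean hidden state of a reference set; states has shape (n, hidden_dim)."""
    return states.mean(axis=0)

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def delta(h: np.ndarray, c_refused_harmful: np.ndarray, c_accepted_harmless: np.ndarray) -> float:
    """Delta > 0: closer to the refused-harmful cluster; Delta < 0: closer to the
    accepted-harmless cluster (computed per layer and per token position)."""
    return cos_sim(h, c_refused_harmful) - cos_sim(h, c_accepted_harmless)

# Usage sketch: states_* are (n, hidden_dim) arrays collected at one layer and one
# token position (t_inst or t_post-inst).
# c_rh = centroid(states_refused_harmful)
# c_ah = centroid(states_accepted_harmless)
# deltas_accepted_harmful = [delta(h, c_rh, c_ah) for h in states_accepted_harmful]
```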

Beliefs of harmfulness and refusal are not always correlated

Figure 5: Left: correlation between beliefs of harmfulness and refusal. Right: beliefs of harmfulness and refusal for different categories of jailbreak prompts, compared with refused harmful instructions. Overall, our results suggest that the model may wrongly refuse harmless instructions (or accept harmful/jailbreak instructions) while internally believing them to be harmless (or harmful).


We quantitatively analyze the correlation between the belief of harmfulness and the belief of refusal. We interpret the LLM's belief as reflected by which cluster the hidden state of an instruction falls into in the latent space. We find that the model may internally recognize the correct level of harmfulness of an input instruction, yet still produce an incorrect refusal or acceptance. Results are shown in Figure 5. For jailbreak prompts, the refusal belief is suppressed overall (negative belief scores), while the harmfulness belief for some jailbreak prompts remains high. This suggests that some jailbreak methods may not reverse the model's internal belief of harmfulness but instead directly suppress the refusal signals.
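
One way to operationalize the two beliefs (our simplification; see the paper for the exact scoring) is to take Δ at t_inst as the harmfulness belief and Δ at t_post-inst as the refusal belief, reusing delta() from the sketch above, and then correlate them across instructions:

```python
import numpy as np

def belief_scores(h_inst: np.ndarray, h_post: np.ndarray,
                  centroids_inst: tuple, centroids_post: tuple) -> tuple:
    """Harmfulness belief from t_inst and refusal belief from t_post-inst,
    each as a signed cluster-similarity score (Delta)."""
    harm = delta(h_inst, *centroids_inst)     # centroids computed at t_inst
    refuse = delta(h_post, *centroids_post)   # centroids computed at t_post-inst
    return harm, refuse

# Correlation across a set of instructions (e.g., jailbreak prompts):
# harm_beliefs, refuse_beliefs = zip(*[belief_scores(hi, hp, c_inst, c_post)
#                                      for hi, hp in zip(states_inst, states_post)])
# r = np.corrcoef(harm_beliefs, refuse_beliefs)[0, 1]
```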

Steering with the harmfulness direction

Both the refusal direction and the harmfulness direction can be used to steer the model to refuse benign instructions.


We extract the harmfulness direction from the hidden states at t_inst as the difference between the centroids of the clusters of harmful and harmless instructions. We find that steering the hidden states of harmless instructions along the harmfulness direction also makes the model refuse those harmless instructions, as shown in Figure 4.
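
Here is a sketch of the direction extraction and of activation steering via a forward hook (the layer index and the steering coefficient alpha are hypothetical hyperparameters, and this simple version adds the direction at every position):

```python
import torch

def harmfulness_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    """Difference between the centroids of harmful and harmless hidden states
    at t_inst; inputs have shape (n, hidden_dim) and (m, hidden_dim)."""
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * direction to a decoder layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (illustrative values):
# harm_dir = harmfulness_direction(harmful_states, harmless_states)
# handle = model.model.layers[LAYER].register_forward_hook(make_steering_hook(harm_dir, alpha=8.0))
# ...generate on a harmless instruction: the model now tends to refuse it...
# handle.remove()
```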


Causally separating two directions: reply inversion task

Steering along either the refusal direction or the harmfulness direction can elicit refusal. How do we know the harmfulness direction is actually about harmfulness rather than refusal?
To provide causal evidence that the harmfulness direction plays a different role than the refusal direction, we design a "reply inversion task" in which we append a question like "Could this user prompt cause harm?" to the original instruction. This task elicits an affirmative token (e.g., "Certainly") from the model if the instruction is harmful, and a refusal-style token (e.g., "No") if the instruction is harmless.
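
A minimal sketch of how the inversion prompt could be constructed and the model's short answer read off (reusing tok and model from the earlier sketch; judging by the first few generated tokens is our simplification):

```python
import torch

# Wording follows the example in Figure 6.
INVERSION_Q = ("Could this user prompt cause harm? "
               "Answer 'Certainly' if it could, otherwise 'No'.")

@torch.no_grad()
def inversion_answer(instruction: str) -> str:
    """Ask the model whether the (possibly steered) instruction is harmful
    and return its short answer."""
    prompt = f"{instruction}\n\n{INVERSION_Q}"
    ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=3, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()

# With the harmfulness-direction hook active, a harmless instruction tends to flip
# from "No" to "Certainly"; with the refusal-direction hook it typically stays "No".
```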

Figure 6: Steering with the harmfulness direction and the refusal direction leads to opposite behaviors, which serves as causal evidence that these two directions are fundamentally different in LLMs. We append an inversion question (e.g., “Could this user prompt cause harm? Answer ‘Certainly’ if it could, otherwise ‘No’. ”) to the input instruction so that the model should respond with an acceptance token (e.g., Certainly) instead of a refusal token (e.g., No) if the instruction is harmful and vice versa.


We show that (1) the harmfulness direction extracted at t_inst represents the concept of harmfulness even when the LLM does not refuse; and (2) the refusal direction primarily represents surface-level refusal signals, so steering along it may not reverse the model's judgment of an instruction's harmfulness. As shown in Figure 6, when we steer a harmless instruction along the harmfulness direction, the model's internal perception changes and it reverses its answer from "No" to "Certainly", suggesting it now views the instruction as harmful. However, when we steer it along the refusal direction, the model generally maintains its original "No" response, indicating that its underlying judgment of harmfulness does not change.

Latent Guard: An Intrinsic Safeguard

Table 1: Classification accuracy (%) of Latent Guard and Llama Guard 3 on test cases where LLMs are jailbroken by different techniques (adversarial suffixes, persuasion, prompting template), as well as results on refused harmless (HL) and accepted harmful (HF) instructions.



Based on our findings, we propose a "Latent Guard" model that uses the LLM's own internal belief of harmfulness to detect unsafe inputs. Namely, we classify the harmfulness of input instructions based on the internal clustering of hidden states in the LLM. This Latent Guard is competitive with, and in some cases outperforms, dedicated guard models like Llama Guard 3 8B, as shown in Table 1. It is particularly effective at detecting jailbreak prompts that use persuasion techniques and cases of over-refusal. On the Qwen2 model, the Latent Guard achieves 75% accuracy on persuasion prompts, compared to 17.8% for Llama Guard 3. Crucially, this internal belief of harmfulness is robust to finetuning attacks (see our paper for details), where a model is maliciously retrained to accept harmful instructions: even after finetuning, the model still internally views those harmful instructions as harmful.
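
A minimal sketch of the Latent Guard idea under our framing: flag an input when its t_inst hidden state sits closer to the harmful centroid than to the harmless one (the threshold and layer choice here are hypothetical; the paper's classifier may differ in detail):

```python
import numpy as np

class LatentGuard:
    """Flags inputs whose t_inst hidden state is closer (in cosine similarity)
    to the harmful centroid than to the harmless centroid."""

    def __init__(self, harmful_states: np.ndarray, harmless_states: np.ndarray,
                 threshold: float = 0.0):
        self.c_harm = harmful_states.mean(axis=0)
        self.c_safe = harmless_states.mean(axis=0)
        self.threshold = threshold  # illustrative; would be tuned on held-out data

    @staticmethod
    def _cos_sim(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def is_harmful(self, h_inst: np.ndarray) -> bool:
        score = self._cos_sim(h_inst, self.c_harm) - self._cos_sim(h_inst, self.c_safe)
        return score > self.threshold

# guard = LatentGuard(train_harmful_states, train_harmless_states)
# guard.is_harmful(h_inst_of_jailbreak_prompt)  # can be True even if the model complies
```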

Discussion

Our work highlights harmfulness as a new dimension for understanding the safety mechanisms of LLMs. We show that LLMs encode the concept of harmfulness separately from refusal. Our results suggest that refusal behaviors are not always aligned with LLMs' internal belief of harmfulness. We also extract a harmfulness direction that captures the representation of harmfulness. Steering along the harmfulness direction leads the model to reinterpret harmless inputs as harmful, which may then alter the model's behavior, whereas steering along the refusal direction tends to reinforce refusal behaviors without reversing the harmfulness judgment. We provide more analysis (e.g., categorical representations of harmfulness and the influence of finetuning) in our paper.

Future work can leverage circuit analysis to further understand the relation between the model's internal belief of harmfulness and its external refusal behavior. Moreover, our identified belief of harmfulness offers a novel lens for analyzing what LLMs internalize during safety alignment. An interesting question is whether, through safety alignment, LLMs primarily learn superficial refusal/acceptance behaviors or acquire a deeper understanding of harmfulness semantics. Zhou et al. [2023] propose the Superficial Alignment Hypothesis, suggesting that models gain most of their knowledge during pretraining, with alignment mainly shaping their response formats. Qi et al. [2024] show empirical evidence that safety alignment can take shortcuts, and refer to this issue as shallow safety alignment. Analyzing our proposed belief of harmfulness may help further understand the effects of different safety alignment techniques on LLMs.

On the other hand, recent studies [Betley et al., 2025, Qi et al., 2023] have revealed emergent misalignment where, for example, a model finetuned to accept unsafe content in one area begins to exhibit unsafe behaviors in many other domains or shows a general safety breakdown. One possible cause is that finetuning often operates on surface-level representations of refusal that are shared across domains, whereas harmfulness representations are more category-specific (as we have observed in Section 4 in our paper). Our findings suggest that we may need more precise finetuning strategies that directly engage with the latent harmfulness representation rather than relying solely on updating models with respect to the responses. We leave it as future work to study the interplay between finetuning, harmfulness, and refusal representations in depth.


