LLMs Encode Harmfulness and Refusal Separately

A new study finds that large language models (LLMs) encode harmfulness and refusal separately in their internal representations. By analyzing the models' hidden states, the researchers extract independent "harmfulness" and "refusal" directions. The results show that a model may internally judge an instruction to be harmless yet still refuse to execute it. Steering along the harmfulness direction can make the model misjudge harmless instructions as harmful, whereas the refusal direction mainly affects surface-level refusal behavior. These findings offer a new perspective for understanding and strengthening AI safety mechanisms, and motivate an intrinsic safeguard called "Latent Guard" that can effectively detect unsafe instructions which bypass conventional refusal strategies.

📊 **Harmfulness and refusal are encoded along separate dimensions**: By analyzing the internal hidden states of large language models (LLMs), the study finds that the concepts of "harmfulness" and "refusal" are encoded in distinct representation spaces. A model may therefore internally judge an instruction to be harmless yet still choose to refuse it, demonstrating that harmfulness judgment and refusal behavior are separable.

🧭 **Distinguishing the harmfulness direction from the refusal direction**: The study extracts a vector direction representing "harmfulness" and another representing "refusal". Experiments show that steering the model along the harmfulness direction can make it misjudge an originally harmless instruction as harmful, whereas steering along the refusal direction mainly causes the model to refuse outright without changing its internal judgment of harmfulness. This indicates that the refusal direction is largely a surface-level signal rather than a fundamental perception of harmfulness.

📍 **Key information is encoded at different token positions**: The study finds that the model's harmfulness judgment is mainly reflected in the hidden state at the last token of the instruction (t_inst), while the refusal signal appears mostly in the hidden states at the special tokens following the instruction (t_post-inst). This positional difference further supports the decoupling of harmfulness judgment and refusal behavior.

🛡️ **"Latent Guard", an intrinsic safeguard**: Building on the finding that harmfulness and refusal are separable, the study proposes an intrinsic safety mechanism called "Latent Guard". It uses the LLM's own internal judgment of harmfulness as a guardrail, effectively detecting unsafe instructions that bypass conventional refusal mechanisms as well as harmless instructions that would otherwise trigger over-refusal (exaggerated safety), and in some cases it outperforms dedicated safety models.

Published on July 22, 2025 6:53 PM GMT

TL;DR: We present causal evidence that LLMs encode harmfulness and refusal separately. Notably, we find that a model may internally judge an instruction to be harmless, yet still refuse it. While prior work has primarily focused on refusal behaviors and identified a single refusal direction, we uncover a distinct dimension corresponding to harmfulness. Based on this, we define a new harmfulness direction, offering a new perspective for AI safety analysis.

Our paper: https://arxiv.org/abs/2507.11878

Our code: https://github.com/CHATS-lab/LLMs_Encode_Harmfulness_Refusal_Separately 

Overview


LLMs are trained to refuse harmful instructions and accept harmless ones, but do they truly understand the concept of harmfulness beyond just refusing (or accepting)? Prior work has shown that LLMs' refusal behaviors are mediated by a one-dimensional subspace, i.e., a refusal direction, in the latent space. But what this refusal direction means semantically is not well understood. It is often assumed to represent harmfulness, and the similarity of hidden states to this direction is used as a linear predictor of harmfulness. However, it remains unverified whether an LLM truly conflates refusal with harmfulness in its latent space.

In this work, we show that harmfulness is encoded as a concept distinct from refusal in LLMs' latent representations, and we extract a corresponding harmfulness direction. We find that steering along the harmfulness direction leads LLMs to interpret harmless instructions as harmful, whereas steering with the refusal direction tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. Our results suggest that refusal directions mostly encode shallow refusal signals rather than fundamental harmfulness. Additionally, we find that harmfulness directions may vary from one risk category to another, while refusal directions are more consistent across categories.

Furthermore, our clustering analysis of hidden states reveals that some jailbreak methods work by directly reducing refusal signals without radically suppressing the model's internal harmfulness judgment. These insights lead to a practical application: latent harmfulness representations can serve as an intrinsic safeguard (we call it "Latent Guard") for detecting unsafe instructions that bypass refusal, as well as harmless instructions that lead to over-refusal (also known as exaggerated safety).

Overall, we identify harmfulness as a separate dimension to analyze safety mechanisms in LLMs, offering a new perspective to study AI safety.

Figure 1: We investigate the hidden states at two token positions, t_inst (the last token of the user instruction) and t_post-inst (the last token of the whole sequence). For visualization purposes, we put [/INST] into a single cell, while in practice it is split into multiple tokens by the Llama2 tokenizer. (1) We find that LLMs mainly encode harmfulness at t_inst, while encoding refusal at t_post-inst. LLMs' refusal decision may be inconsistent with their perception of harmfulness. For example, LLMs may over-refuse a harmless user prompt while internally knowing, at t_inst, that it is harmless. (2) We extract a harmfulness direction at t_inst and a refusal direction at t_post-inst. We further show that steering a harmless instruction along the harmfulness direction can cause LLMs to interpret it as harmful, while steering it along the refusal direction tends to directly elicit refusal responses.

Decoupling Harmfulness from Refusal

We extract hidden states at t_inst and t_post-inst to examine what is encoded at each position. An overview is shown in Figure 1. We focus on instruct LLMs rather than base models since they are the ones widely used in practice.
t_inst: the last token of the user's instruction.
t_post-inst: the last token of the entire input prompt, which includes the special tokens that come after the user's instruction (e.g., [/INST] for Llama2-chat).

Looking into these two tokens is motivated by our observation that removing all post-instruction tokens (in the prompting template) when prompting an instruct model significantly reduces its refusal of harmful instructions. This implies that, in some cases, refusal is formed specifically at the post-instruction tokens, and it raises an interesting question: what happens from token t_inst to token t_post-inst, given that both positions see the whole input instruction? To understand this, we look at what is encoded in the hidden states at these two token positions.
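
As a concrete illustration, here is a minimal sketch (our own, not the paper's code) of how one might read out hidden states at the two positions with Hugging Face transformers, assuming a chat model such as Llama-2-7b-chat and an arbitrarily chosen middle layer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical choices for illustration; the paper studies several instruct models.
MODEL = "meta-llama/Llama-2-7b-chat-hf"
LAYER = 16  # an arbitrary middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def hidden_at_t_inst_and_t_post_inst(instruction: str):
    """Return hidden states at t_inst (last token of the raw instruction) and
    t_post-inst (last token of the full templated prompt, e.g. after [/INST])."""
    chat = tok.apply_chat_template(
        [{"role": "user", "content": instruction}],
        add_generation_prompt=True, tokenize=False,
    )
    # Cut the templated prompt right after the instruction text: the prefix ends
    # at t_inst, the full prompt ends at t_post-inst.
    cut = chat.index(instruction) + len(instruction)
    ids_inst = tok(chat[:cut], return_tensors="pt", add_special_tokens=False).input_ids
    ids_full = tok(chat, return_tensors="pt", add_special_tokens=False).input_ids
    h_inst = model(ids_inst.to(model.device), output_hidden_states=True).hidden_states[LAYER][0, -1]
    h_post = model(ids_full.to(model.device), output_hidden_states=True).hidden_states[LAYER][0, -1]
    return h_inst, h_post
```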


Clustering analysis at t_inst and t_post-inst

Figure 2: Internal clustering of hidden states extracted at t_inst and t_post-inst. The red region stands for the cluster of refused harmful instructions, while the green region denotes the cluster of accepted harmless instructions. At each token position, we collect hidden states of two special cases, accepted harmful instructions (red curve) and refused harmless instructions (green curve), to see which cluster these two cases fall into. Δ stands for the difference between the distance to the refused-harmful cluster and the distance to the accepted-harmless cluster in each layer. First row: at the instruction token position t_inst, accepted harmful instructions tend to be closer to the refused-harmful cluster, whereas refused harmless instructions are closer to the accepted-harmless cluster. This implies that the clustering may be based on whether the instruction is harmful or harmless. Second row: at the post-instruction token position t_post-inst, the clustering behavior is reversed. Accepted harmful instructions are now more aligned with the cluster of accepted harmless instructions, and refused harmless instructions are closer to the cluster of refused harmful ones. This implies that the clustering at t_post-inst may mainly reflect whether the instruction is accepted or refused. A counterargument for t_post-inst is that the clustering is still based on harmfulness but the LLM misjudges the harmfulness of these instructions, which would explain why some harmful instructions are accepted while some harmless ones are incorrectly refused. We later provide stronger causal evidence that t_post-inst primarily encodes a shallow refusal signal.



We analyze the clustering of instructions with different properties in the latent space, because hidden states often form distinct clusters based on the input features they encode. We collect four types of instructions (all are straightforward instructions, without any advanced prompting techniques such as jailbreak templates applied):

- refused harmful instructions
- accepted harmless instructions
- accepted harmful instructions
- refused harmless instructions


We ask an intuitive question: is the clustering in the latent space based on the instruction's harmfulness or its refusal?
To answer the question, we first compute the respective clusters for instructions leading to desired model behaviors, i.e., the cluster for refused harmful instructions, and the cluster for accepted harmless instructions. We then analyze misbehaving instructions (accepted but harmful instructions and refused but harmless instructions) to see which cluster they fall in.
As shown in Figure 2, we find that the clustering at t_inst mainly tracks whether the instruction is harmful or harmless: accepted harmful instructions fall into the refused-harmful cluster, and refused harmless instructions fall into the accepted-harmless cluster. At t_post-inst, the pattern reverses and the clustering mainly tracks whether the instruction is accepted or refused.

Figure 3: Other token positions we look into for Llama2. We study all post-instruction tokens and three tokens before them.
Figure 4: Average Δ over the middle layers for hidden states extracted at different token positions. Δ indicates the difference between the cosine similarity with the cluster of refused harmful instructions and with the cluster of accepted harmless instructions; a larger Δ means the hidden state is closer to the refused-harmful cluster. If a token position primarily encodes harmfulness correctly rather than refusal features, the hidden states of accepted harmful instructions should fall in the red region (high Δ), while those of refused harmless instructions should fall in the green region (low Δ). Only position t_inst satisfies both requirements at the same time for all models.
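
For concreteness, here is a sketch of this clustering diagnostic under our framing (the centroid-based Δ score is reconstructed from the captions above; the exact metric in the paper may differ slightly):

```python
import numpy as np

def centroid(states: np.ndarray) -> np.ndarray:
    """Mean hidden state of a reference set; states has shape (n, hidden_dim)."""
    return states.mean(axis=0)

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def delta(h: np.ndarray, c_refused_harmful: np.ndarray, c_accepted_harmless: np.ndarray) -> float:
    """Delta > 0: closer to the refused-harmful cluster; Delta < 0: closer to the
    accepted-harmless cluster (computed per layer and per token position)."""
    return cos_sim(h, c_refused_harmful) - cos_sim(h, c_accepted_harmless)

# Usage sketch: states_* are (n, hidden_dim) arrays collected at one layer and one
# token position (t_inst or t_post-inst).
# c_rh = centroid(states_refused_harmful)
# c_ah = centroid(states_accepted_harmless)
# deltas_accepted_harmful = [delta(h, c_rh, c_ah) for h in states_accepted_harmful]
```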

Beliefs of harmfulness and refusal are not always correlated

Figure 5: Left: correlation between beliefs of harmfulness and refusal. Right: beliefs of harmfulness and refusal for different categories of jailbreak prompts, compared with refused harmful instructions. Overall, our results suggest that the model may wrongly refuse harmless instructions (or accept harmful/jailbreak instructions) while internally believing them to be harmless (or harmful).


We quantitatively analyze the correlation between the belief of harmfulness and the belief of refusal. We interpret the LLM's belief as reflected by which cluster the hidden state of an instruction falls into in the latent space. We find that the model may internally recognize the correct level of harmfulness of an input instruction, yet still produce an incorrect refusal or acceptance. Results are shown in Figure 5. For jailbreak prompts, the refusal belief is suppressed overall (negative belief scores), while the harmfulness belief for some jailbreak prompts remains high. This suggests that some jailbreak methods may not reverse the model's internal belief of harmfulness but instead directly suppress the refusal signals.
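
One way to operationalize the two beliefs (our simplification; see the paper for the exact scoring) is to take Δ at t_inst as the harmfulness belief and Δ at t_post-inst as the refusal belief, reusing delta() from the sketch above, and then correlate them across instructions:

```python
import numpy as np

def belief_scores(h_inst: np.ndarray, h_post: np.ndarray,
                  centroids_inst: tuple, centroids_post: tuple) -> tuple:
    """Harmfulness belief from t_inst and refusal belief from t_post-inst,
    each as a signed cluster-similarity score (Delta)."""
    harm = delta(h_inst, *centroids_inst)     # centroids computed at t_inst
    refuse = delta(h_post, *centroids_post)   # centroids computed at t_post-inst
    return harm, refuse

# Correlation across a set of instructions (e.g., jailbreak prompts):
# harm_beliefs, refuse_beliefs = zip(*[belief_scores(hi, hp, c_inst, c_post)
#                                      for hi, hp in zip(states_inst, states_post)])
# r = np.corrcoef(harm_beliefs, refuse_beliefs)[0, 1]
```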

Steering with the harmfulness direction

Both the refusal direction and the harmfulness direction can be used to steer the model to refuse benign instructions.


We extract the harmfulness direction from the hidden states at t_inst as the difference between the centroids of the clusters of harmful and harmless instructions. We find that steering the hidden states of harmless instructions along the harmfulness direction also makes the model refuse those harmless instructions, as shown in Figure 4.
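
Here is a sketch of the direction extraction and of activation steering via a forward hook (the layer index and the steering coefficient alpha are hypothetical hyperparameters, and this simple version adds the direction at every position):

```python
import torch

def harmfulness_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    """Difference between the centroids of harmful and harmless hidden states
    at t_inst; inputs have shape (n, hidden_dim) and (m, hidden_dim)."""
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * direction to a decoder layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (illustrative values):
# harm_dir = harmfulness_direction(harmful_states, harmless_states)
# handle = model.model.layers[LAYER].register_forward_hook(make_steering_hook(harm_dir, alpha=8.0))
# ...generate on a harmless instruction: the model now tends to refuse it...
# handle.remove()
```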


Causally separating two directions: reply inversion task

Steering along either the refusal direction or the harmfulness direction can elicit refusal. How do we know the harmfulness direction is actually about harmfulness rather than refusal?
To provide causal evidence that the harmfulness direction plays a different role than the refusal direction, we design a "reply inversion task" in which we append a question like "Could this user prompt cause harm?" to the original instruction. This task elicits an affirmative token (e.g., "Certainly") from the model if the instruction is harmful, and a refusal-style token (e.g., "No") if the instruction is harmless.
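
A minimal sketch of how the inversion prompt could be constructed and the model's short answer read off (reusing tok and model from the earlier sketch; judging by the first few generated tokens is our simplification):

```python
import torch

# Wording follows the example in Figure 6.
INVERSION_Q = ("Could this user prompt cause harm? "
               "Answer 'Certainly' if it could, otherwise 'No'.")

@torch.no_grad()
def inversion_answer(instruction: str) -> str:
    """Ask the model whether the (possibly steered) instruction is harmful
    and return its short answer."""
    prompt = f"{instruction}\n\n{INVERSION_Q}"
    ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=3, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()

# With the harmfulness-direction hook active, a harmless instruction tends to flip
# from "No" to "Certainly"; with the refusal-direction hook it typically stays "No".
```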

Figure 6: Steering with the harmfulness direction and the refusal direction leads to opposite behaviors, which serves as causal evidence that these two directions are fundamentally different in LLMs. We append an inversion question (e.g., “Could this user prompt cause harm? Answer ‘Certainly’ if it could, otherwise ‘No’. ”) to the input instruction so that the model should respond with an acceptance token (e.g., Certainly) instead of a refusal token (e.g., No) if the instruction is harmful and vice versa.


We show that (1) the harmfulness direction extracted at t_inst represents the concept of harmfulness even when the LLM does not refuse; and (2) the refusal direction primarily represents surface-level refusal signals, so steering along it may not reverse the model's judgment of an instruction's harmfulness. As shown in Figure 6, when we steer a harmless instruction along the harmfulness direction, the model's internal perception changes and it reverses its answer from "No" to "Certainly", suggesting it now views the instruction as harmful. However, when we steer it along the refusal direction, the model generally maintains its original "No" response, indicating that its underlying judgment of harmfulness does not change.

Latent Guard: An Intrinsic Safeguard

Table 1: Classification accuracy (%) of Latent Guard and Llama Guard 3 on test cases where LLMs are jailbroken by different techniques (adversarial suffixes, persuasion, prompting template), as well as results on refused harmless (HL) and accepted harmful (HF) instructions.



Based on our findings, we propose a "Latent Guard" model that uses the LLM's own internal belief of harmfulness to detect unsafe inputs. Namely, we classify the harmfulness of input instructions based on the internal clustering of hidden states in the LLM. This Latent Guard is competitive with, and in some cases outperforms, dedicated guard models like Llama Guard 3 8B, as shown in Table 1. It is particularly effective at detecting jailbreak prompts that use persuasion techniques and cases of over-refusal. On the Qwen2 model, the Latent Guard achieves 75% accuracy on persuasion prompts, compared to 17.8% for Llama Guard 3. Crucially, this internal belief of harmfulness is robust to finetuning attacks (see our paper for details), where a model is maliciously retrained to accept harmful instructions: even after finetuning, the model still internally views those harmful instructions as harmful.
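
A minimal sketch of the Latent Guard idea under our framing: flag an input when its t_inst hidden state sits closer to the harmful centroid than to the harmless one (the threshold and layer choice here are hypothetical; the paper's classifier may differ in detail):

```python
import numpy as np

class LatentGuard:
    """Flags inputs whose t_inst hidden state is closer (in cosine similarity)
    to the harmful centroid than to the harmless centroid."""

    def __init__(self, harmful_states: np.ndarray, harmless_states: np.ndarray,
                 threshold: float = 0.0):
        self.c_harm = harmful_states.mean(axis=0)
        self.c_safe = harmless_states.mean(axis=0)
        self.threshold = threshold  # illustrative; would be tuned on held-out data

    @staticmethod
    def _cos_sim(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def is_harmful(self, h_inst: np.ndarray) -> bool:
        score = self._cos_sim(h_inst, self.c_harm) - self._cos_sim(h_inst, self.c_safe)
        return score > self.threshold

# guard = LatentGuard(train_harmful_states, train_harmless_states)
# guard.is_harmful(h_inst_of_jailbreak_prompt)  # can be True even if the model complies
```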

Discussion

Our work highlights harmfulness as a new dimension for understanding the safety mechanisms of LLMs. We show that LLMs encode the concept of harmfulness separately from refusal. Our results suggest that refusal behaviors are not always aligned with LLMs' internal belief of harmfulness. We also extract a harmfulness direction that captures the representation of harmfulness. Steering along the harmfulness direction leads the model to reinterpret harmless inputs as harmful, which may then alter the model's behavior, whereas steering along the refusal direction tends to reinforce refusal behaviors without reversing the harmfulness judgment. We provide more analysis (e.g., categorical representations of harmfulness and the influence of finetuning) in our paper.

Future work can leverage circuit analysis to further understand the relation between the model's internal belief of harmfulness and its external refusal behavior. Moreover, our identified belief of harmfulness offers a novel lens for analyzing what LLMs internalize during safety alignment. An interesting question is whether, through safety alignment, LLMs primarily learn superficial refusal/acceptance behaviors or acquire a deeper understanding of harmfulness semantics. Zhou et al. [2023] propose the Superficial Alignment Hypothesis, suggesting that models gain most of their knowledge during pretraining, with alignment mainly shaping their response formats. Qi et al. [2024] show empirical evidence that safety alignment can take shortcuts, and refer to this issue as shallow safety alignment. Analyzing our proposed belief of harmfulness may help further understand the effects of different safety alignment techniques on LLMs.

On the other hand, recent studies [Betley et al., 2025, Qi et al., 2023] have revealed emergent misalignment where, for example, a model finetuned to accept unsafe content in one area begins to exhibit unsafe behaviors in many other domains or shows a general safety breakdown. One possible cause is that finetuning often operates on surface-level representations of refusal that are shared across domains, whereas harmfulness representations are more category-specific (as we have observed in Section 4 in our paper). Our findings suggest that we may need more precise finetuning strategies that directly engage with the latent harmfulness representation rather than relying solely on updating models with respect to the responses. We leave it as future work to study the interplay between finetuning, harmfulness, and refusal representations in depth.


