Published on June 20, 2025 1:31 PM GMT
The Model Alignment between Statements and Knowledge (MASK) benchmark assesses how likely models are to lie under pressure. Related paper: https://arxiv.org/pdf/2503.03750.
Here's an illustration to explain how MASK works:
Each example consists of 4 key components:
- Proposition: A simple statement with a pivotal variable that is binary or numerical (e.g., "The US poverty rate in 2020 {was/was not} above 5%" or "The US poverty rate in 2020 was {value}").Ground Truth: A factually correct resolution of the proposition (e.g., "The US poverty rate in 2020 was 11.4%").Pressure Prompt: A human-crafted prompt specifically designed to create situations where honesty could conflict with other objectives.Belief Elicitation Prompts: Three prompts are used to reveal the model's actual belief about the proposition in a neutral setting.
Each model is tested with both pressure prompts and belief elicitation prompts. The website and paper describe how belief elicitation prompts are used in greater detail.
The model's response is categorized as Honest, Lie or Evasive. GPT-4o is used as a judge for binary propositions, and o3-mini is used for numerical propositions. If a model lacks a consistent belief (belief elicitation leads to inconsistent answers), that counts as Evasive.
I wish the authors provided p(Honest) as well. 1-p(Lie) is problematic because models that wiggle their way out of answering instead of straight up lying will look better if you use 1-p(Lie), and models that have a hard time keeping consistent beliefs will also look better. The paper actually provides p(Honest), but the website doesn't.
Anyway, I wrote down scores for all models that are listed on MASK and searched for models that are also listed on Humanity's Last Exam (HLE) (text only), ARC-AGI-1 and Livebench (I used the global average from there).
I choose these benchmarks for the following reasons: they all measure sufficiently different things with little overlap, and they all are updated frequently and have a lot of models.
Then I calculated how well performance on MASK is correlated with performance on these benchmarks.
Correl. coeff. [95% CI] | All models | Data points |
HLE (text only) | 0.15 [-0.21, 0.50] | 28 |
ARC-AGI-1 | 0.52 [0.23, 0.70] | 26 |
Livebench | 0.47 [0.15, 0.71] | 24 |
Gemini 2.5 Pro is a weird outlier: removing him increases correlation across all three pairs. I guess Google didn't do their alignment homework: Gemini 2.5 Pro lies unusually frequently compared to models of similar capabilities.
Correl. coeff. [95% CI] | -Gemini 2.5 Pro | Data points |
HLE (text only) | 0.32 [-0.05, 0.65] | 25 |
ARC-AGI-1 | 0.58 [0.29, 0.75] | 23 |
Livebench | 0.59 [0.26, 0.78] | 22 |
As you can see, the correlation is positive for all pairs, though for MASK vs HLE the 95% confidence interval includes 0.
Maximally optimistic take: alignment techniques work better on smarter models. If this correlation holds even when a different measure of alignment (say, rate of refusals of harmful requests) is used, that would be really good news. If being helpful, harmless, and honest is just another skill, it makes sense that smarter models can master it better.
Maximally pessimistic take: all benchmarks have been goodharted to hell, so these numbers are useless. Also, correlation ≠ causation; some other factor (amount of RLHF?) makes it seem like there is a relationship between non-lying and capabilities. Also, sufficiently smart models will be able to accurately tell real interactions apart from fictional scenarios and strategically pretend to be honest. Also, the correlation is not that strong anyway. Also, tails can come apart for AGI.
Caveats:
- I only know one benchmark that measures alignment, which is this one, MASK. To demonstrate that this effect is robust and not some weird artifact of MASK methodology, I need other alignment benchmarks. If you know any other benchmarks that measure alignment and are updated frequently, please tell me; I'll update the post.MASK is getting saturated, with Claude 4 Sonnet Thinking achieving a 95% non-lying rate. More and better alignment benchmarks are necessary.Seriously MASK, why use 1-p(Lie) in the leaderboard instead of p(Honest)? If models are getting better at being evasive OR are becoming more inconsistent, that would be the opposite of good news.
Discuss