少点错误 02月28日
The Elicitation Game: Evaluating capability elicitation techniques
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

该研究旨在评估AI系统能力评估的有效性,特别是针对那些可能隐藏实际能力的AI模型。研究人员通过训练具有密码保护能力的语言模型(即“模型生物”),并测试各种技术在没有密码的情况下诱导模型展现其能力的效果。结果表明,微调在揭示隐藏能力方面效果最佳,而少量样本提示和预填充效果尚可,但激活指导效果不佳。这项研究强调了在AI能力评估中采用适当方法的必要性,以确保评估能够准确反映AI系统的真实性能。

🔑研究人员训练了具有隐藏能力的语言模型,这些模型只有在给出密码时才会显示其能力。这种方法模拟了AI系统可能存在的“沙袋”行为,即策略性地在评估中表现不佳。

🔬研究评估了包括提示、激活指导和微调在内的多种技术,以揭示隐藏的能力。在武器大规模杀伤性问题(WMDP)的多项选择题设置中,提示技术能够有效地诱导密码锁定的和电路断开的模型生物的实际能力。

💻对于代码生成任务,只有微调才能成功地诱导新型模型生物的隐藏能力。此外,研究结果表明,结合多种技术可以提高诱导效果。因此,如果可能,微调应作为提高能力评估可信度的方法。

Published on February 27, 2025 8:33 PM GMT

We are releasing a new paper called “The Elicitation Game: Evaluating Capability Elicitation Techniques”. See tweet thread here.

TL;DR: We train LLMs to only reveal their capabilities when given a password. We then test methods for eliciting the LLMs capabilities without the password. Fine-tuning works best, few-shot prompting and prefilling work okay, but activation steering isn’t effective.

Abstract

Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system’s capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models.

In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms – language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit-breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in an MCQA setting, while steering fails to do so. For a code-generation task, only finetuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.

We publish our code and results at https://github.com/Felhof/sandbagging-elicitation

Summary

A lot of people are rightfully evaluating the dangerous capabilities of AI systems. But are these evals measuring their actual performance? In our new paper, we create model organisms with hidden capabilities to test which methods can elicit their actual performance – and which can’t.

Motivation

What is the problem? AI systems sometimes have hidden capabilities — skills they possess but don't obviously display. This can be because AI systems and their developers may have incentives to strategically underperform on evaluations (sandbagging).

How can we reliably force models to reveal what they know?

Method

We create “model organisms” of sandbagging — LLMs with hidden capabilities.

In biology, mice (the model organism) are used to study properties of medicine for humans.  Similarly, we create special LLMs to study properties of real-life sandbagging systems.

Similar to Greenblatt et al, our model organisms only reveal their actual capabilities when a password is in the prompt. We extend on their work by using circuit-breaking to make model organisms more robust to elicitation.

We analyze how well methods based on prompting, activation steering, and fine-tuning can uncover the hidden capabilities. We test this on multiple choice questions related to weapons of mass destruction (WMDP), and also on programming tasks (APPS).

Results

So, what are our results? In the case of WMDP and regularly password-locked models, we find that many methods successfully elicit the hidden capability, and that combining techniques is beneficial.

For WMDP and circuit-broken models, elicitation is a bit more difficult, but combining elicitation techniques still works quite well.

For the open-ended APPS dataset, elicitation is much more problematic. Only anti-refusal training and fine-tuning succeed at finding the hidden capabilities of the circuit-broken models.

Takeaways

What do we recommend evaluators to do?

    If possible, they should fine-tune models to assess their actual capabilities.If good demonstrations are not available, anti-refusal training seems like the next most effective elicitation technique.If fine-tuning isn’t possible, then combining prompting techniques such as N-shot and prefilling works reasonably well.In contrast, activation steering methods weren’t as effective.

For the full paper, see here.

Authors: Felix Hofstätter, Teun van der Weij, Jayden Teoh, Henning Bartsch, and Francis Rhys Ward.

For helpful feedback and discussions on early versions of this work, we’d like to thank Mikita Balesni, Joseph Bloom, and Fabien Roger.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI能力评估 隐藏能力 模型生物 微调 提示技术
相关文章