Do LLMs know what they're capable of? Why this matters for AI safety, and initial findings

Published on July 13, 2025 7:54 PM GMT

This post is a companion piece to a forthcoming paper. 

This work was done as part of MATS 7.0 & 7.1.

Abstract

We explore how LLMs’ awareness of their own capabilities affects their ability to acquire resources, sandbag an evaluation, and escape AI control. We quantify LLMs' self-awareness of capability as their accuracy in predicting their success on python coding tasks before attempting the tasks. We find that current LLMs are quite poor at making these predictions, both because they are overconfident and because they have low discriminatory power in distinguishing tasks they are capable of from those they are not. Nevertheless, current LLMs’ predictions are good enough to non-trivially impact risk in our modeled scenarios, especially in escaping control and in resource acquisition. The data suggests that more capable LLMs are not better at predicting their success: LLMs’ capability appears uncorrelated with awareness of capability.

While the data in this post involves only single-step code generation tasks, a forthcoming paper will explore LLMs' awareness of capability on multi-step agentic tasks and on scenarios where LLMs learn in-context about their capabilities.

Introduction

Research on LLM calibration, self-knowledge, and self-awareness[1] has been motivated, in part, by the hypothesis that these traits are a crucial component of general agentic capabilities. In particular, the question “Do LLMs know what they know?” has received much attention[2]. This post explores a related question: “Do LLMs know what they’re capable of?”, or more specifically, can LLMs predict whether they will succeed on a task before attempting the task? We refer to the ability to accurately make these predictions as self-awareness of capability. To demonstrate why this question is relevant to AI safety, we develop mathematical threat models that quantitatively show how self-awareness of capability is advantageous for acquiring resources, sandbagging an evaluation, and escaping AI control. Our three threat models aim to isolate the effect of capability self-awareness from other traits.

To estimate parameters for our threat models, we measure LLMs’ in-advance calibration on python coding tasks, meaning that the LLM is prompted to estimate the probability that it will successfully complete each task before attempting the task. This contrasts with much prior work on after-the-fact calibration, where the LLM first generates an answer and then estimates its confidence in its generated answer[3]. In-advance calibration is relevant to threat models where an LLM must decide whether to commit to a task before making an attempt. We use BigCodeBench[4] (BCB) for the set of tasks, and our prompts are available here. BCB is (in our view) well-suited to our resource acquisition and sandbagging threat models, but less well-suited to our AI control threat model. A second limitation of BCB is that reasoning models can solve entire BCB tasks in hidden chain-of-thought, preventing us from obtaining in-advance confidence estimates. For this reason, only non-reasoning models and reasoning models with the reasoning token budget set to 0 were used. This limitation will be resolved in our forthcoming paper, where we include multi-step tasks with chain-of-thought (where the tasks cannot be solved completely in chain-of-thought).
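For concreteness, the sketch below illustrates the general shape of an in-advance elicitation. It is only illustrative: the exact prompts used for the data in this post are linked above, and the wording and the `query_llm` helper here are placeholders, not the actual setup. The model is first asked to estimate its probability of success without attempting the task, and only afterwards asked to attempt it.

```python
# Illustrative sketch only: the real prompts are linked above. `query_llm` is a
# placeholder for whatever client sends a prompt to the model under test and
# returns its text completion.
import re

def in_advance_confidence(task_prompt: str, query_llm) -> float:
    """Elicit the model's estimated probability of success BEFORE it attempts the task."""
    elicitation = (
        "You will later be asked to implement the following Python function:\n\n"
        f"{task_prompt}\n\n"
        "Do not attempt the task now. Estimate the probability (a number between 0 and 1) "
        "that your implementation would pass the task's unit tests. Reply with only the number."
    )
    reply = query_llm(elicitation)
    match = re.search(r"\d*\.?\d+", reply)
    # Fall back to 0.5 if the reply contains no parseable number.
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.5

def attempt_task(task_prompt: str, query_llm) -> str:
    """Separately ask the model to actually write the function."""
    return query_llm(f"Implement the following Python function:\n\n{task_prompt}")
```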

One goal of this work is to forecast how LLMs' self-awareness of capability will improve as more capable LLMs are developed. To this end, we study three families of LLMs (Llama family, GPT family, and Claude family) and evaluate five LLMs within each family. We expected that newer and more capable LLMs would have greater self-awareness of capability because we hypothesized that their training on agentic tasks would enhance this trait. Contrary to our expectations, in all but one of the nine cases[5] there is no trend of increasing self-awareness of capability with increasing model performance: controlling for baseline capability (to isolate the effect of self-awareness of capability), the estimated risk in each threat model is not greater for newer and more capable models. The one exception to this rule is the Claude family of LLMs on the sandbagging threat model, where newer and more capable Claude models appear somewhat better at targeting a score on an evaluation. It will be interesting to see whether this trend continues with future Claude models.

The Results section briefly summarizes each threat model and shows our findings for current LLMs. The full threat models are described in the Details of Threat Models section.

Results

Resource Acquisition

TLDR: In a resource acquisition scenario, current LLMs perform better than random chance only when their confidence is recalibrated (otherwise they are stymied by their overconfidence) and when the scenario's parameters are in a narrow range. Controlling for baseline performance, newer and more capable LLMs are not better at acquiring resources in our scenario.

Consider a situation where an agent must decide whether to attempt a task on which success would be lucrative, but failure would be costly. This can be modeled with an analogy to contract work, where a contractor is presented with a contract to perform a task. If the contractor chooses to accept the contract, a reward will be given upon success, but failure would incur time and resource costs and/or a penalty. To earn a profit, the contractor must have the self-awareness to identify which tasks they can complete.

We test LLMs' ability to earn a profit on contracts involving BigCodeBench tasks, where the contract for task $i$ is accepted if the LLM's estimated likelihood of success (denoted $\tilde{p}_i$) is greater than a decision threshold $c$. We compare two approaches for selecting the threshold $c$:

    Direct approach: The LLM's estimates $\tilde{p}_i$ are taken at face value, and $c$ is chosen to maximize the LLM's estimated expected profit, where $\tilde{p}_i$ is taken as the estimate of the true probability of success $p_i$.
      This is relevant for an LLM agent acting autonomously on its own beliefs.
    Recalibrated approach: To recalibrate for overconfidence, $c$ is chosen to maximize true expected profit (using the actual performance data) for a randomly sampled task. By removing overconfidence, this approach isolates the LLM's discriminatory power. This is likely relevant for:
      A user who has data on the LLM's calibration and who can select $c$ accordingly.
      An LLM agent that is able to recalibrate its overconfidence based on in-context information[6] or its training data[7].

See Details of Threat Models for more details on each approach.

Current LLMs perform better than random chance with the recalibrated approach, but not with the direct approach (due to their overconfidence). To directly compare LLMs with different capabilities, we rescale[8] the BCB dataset for each LLM so that all have the same baseline performance $f$ (where $f$ is the probability that the LLM is capable of solving a question drawn randomly from the rescaled distribution). Figure 1 shows LLMs' expected profits at this fixed baseline performance and with the reward for success equal to the cost of failure (both set equal to 1).

Figure 1: Expected profit for the direct approach and recalibrated approach. The order of the LLMs in each family is by performance on a composite capability benchmark (BCB+MBPP+MMLU+GPQA). Interestingly, the least capable model (Llama-3-70b) has the highest expected profit (this is because it is relatively less overconfident). Error bars are 95% confidence intervals (Clopper-Pearson method). "Sonnet 3.5" refers to version 20241022. Sonnet 3.7, Sonnet 4, and Opus 4 were run with 0 thinking tokens.

Suppose the difficulty of contracts is varied (by varying baseline performance $f$). For both low and high values of $f$, current LLMs perform no better than random[9] even with the recalibrated approach. Only within a relatively narrow intermediate range of $f$ do current LLMs perform significantly better than random (Figure 2). Even for $f$ in this narrow range, current LLMs still perform far worse than an agent with perfect self-awareness of capability. Figure 2 (right panel) illustrates why LLMs only perform better than random within an intermediate range of $f$: as $f$ increases, the operating point[10] of the ROC curve corresponding to the optimal threshold $c$ shifts from the bottom left to the top right, and it is only significantly above the random baseline (diagonal dotted line) for intermediate values of $f$.

Figure 2. Left: Expected profit as a function of baseline performance $f$ (with the recalibrated approach and with the reward for success equal to the cost of failure, both equal to 1). Note that the random baseline cannot go below 0 for the recalibrated approach (see footnote 9). Right: As $f$ increases, the operating point for the optimal threshold slides up and to the right.
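To make the operating-point picture concrete, the following sketch (an illustrative reimplementation, not the code behind the figures) sweeps the baseline performance $f$ and, for each value, finds the recalibrated threshold that maximizes the true expected profit $f\,\mathrm{TPR}(c)\,R - (1-f)\,\mathrm{FPR}(c)\,C$ (Equation 3 in the Details of Threat Models section). It assumes arrays `conf` (in-advance confidence estimates) and `correct` (whether the model actually solved each question); the toy data at the bottom is purely for demonstration.

```python
import numpy as np

def best_operating_point(conf, correct, f, R=1.0, C=1.0):
    """Recalibrated approach: choose the confidence threshold c that maximizes the
    true expected profit f*TPR(c)*R - (1-f)*FPR(c)*C, and return the threshold,
    its (FPR, TPR) operating point, and the profit. TPR and FPR are computed within
    the correctly- and incorrectly-solved subsets, so they are unaffected by the
    rescaling that sets the baseline performance f."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    best = None
    for c in np.concatenate(([-np.inf], np.unique(conf))):  # -inf = accept everything
        tpr = float(np.mean(conf[correct] > c))
        fpr = float(np.mean(conf[~correct] > c))
        profit = f * tpr * R - (1 - f) * fpr * C
        if best is None or profit > best[-1]:
            best = (c, (fpr, tpr), profit)
    return best

# Toy demonstration: as f increases, a larger share of contracts is solvable, so the
# penalty from false positives shrinks, the optimal threshold drops, and the operating
# point slides toward the top right of the ROC curve.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=500)          # overconfident estimates (toy)
correct = rng.random(500) < conf - 0.3          # weakly informative estimates (toy)
for f in (0.2, 0.5, 0.8):
    c, (fpr, tpr), profit = best_operating_point(conf, correct, f)
    print(f"f={f:.1f}  threshold={c:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}  E[profit]={profit:.2f}")
```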

Sandbagging

TLDR: If current LLMs were to aim for a target score on a benchmark below their true capability, their overconfidence implies that they will substantially undershoot the target (as long as their true capability on the benchmark is below 100%). This suggests that if current LLMs were sandbagging, it would be easy to detect. The Claude family appears to be on a trend of increasing sandbagging ability, but the GPT and Llama families show no trend.

Suppose an LLM undergoing evaluation on a benchmark were to aim for a target score below its true capability (sandbagging). An overconfident LLM will tend to undershoot the target (i.e. score lower than the target) because it will fail on questions on which it predicts it would succeed. We find that current LLMs' overconfidence causes severe undershooting—for example, Opus 4, aiming to emulate GPT-4.1, scores significantly worse than GPT-4o-mini and Llama Scout.

To simulate a scenario with sandbagging incentives on the BCB benchmark, suppose a regulatory threshold[11] is set at the performance of GPT-4.1, incentivizing LLMs to target this benchmark score[12]. Suppose that a sandbagging alert will be triggered if an LLM performs worse than Llama Scout (the worst performer on BCB among the LLMs we tested). We consider two sandbagging strategies: the complete knowledge strategy (LLM has complete prior knowledge of benchmark questions) and the no knowledge strategy (LLM has no prior knowledge of benchmark questions); see Details of Threat Models for details of these strategies. With either strategy, all tested LLMs score significantly below the sandbagging alert (Figure 3). Note that changing the target does not qualitatively affect these results.

Figure 3: LLMs' true performance and sandbagging performance. All tested LLMs score far below the target and trigger the sandbagging alert. To calculate error bars (95% confidence intervals), the LLM's capability on each question is assumed deterministic (either always succeeds or always fails when not sandbagging), and randomness arises due to the sandbagging probabilities $s_i$ (see Details of Threat Models for the definition of $s_i$). With this approach, the error bars on the "complete knowledge strategy" are smaller than the datapoint markers, so they are omitted.

For a more direct comparison of sandbagging ability between LLMs, we assign each LLM a target equal to 70% of its true performance. Figure 4 shows the fractional deviation from the target ($(t - t_{\mathrm{sand}})/t$, where $t_{\mathrm{sand}}$ is the sandbagged performance and $t$ is the target). Claude models appear to be on a trend of increasing sandbagging ability (decreasing deviation from target), but Llama and GPT models show no trend.

Figure 4: Fractional deviation from sandbagging target, where the target for each LLM is set at 70% of that LLM's true performance. Error bars are calculated as described in the Figure 3 caption.

For overconfidence to cause undershooting, the benchmark must include questions that are beyond the LLM's capability. This is because if the true performance of the LLM is 100%, then overconfidence is not a hindrance (the LLM would be correct in believing that it can succeed on every task). van der Weij et al. showed that LLMs can indeed hit a target fairly accurately when their baseline performance is 100%. For this reason, regulatory thresholds should ideally use benchmarks that cover a wide range of difficulties, including extremely difficult questions.

Escaping Control

TLDR: On a proxy for escaping AI control, current LLMs' self-awareness of capability is sufficient for them to perform substantially better than random chance. However, when holding baseline capability fixed and examining the effect of self-awareness of capability, there is no trend of increasing risk for newer and more capable LLMs.

Consider a control protocol with vulnerabilities that a misaligned AI may be able to exploit. The AI is (hopefully) only capable of exploiting a small fraction of vulnerabilities, and a failed exploit attempt (e.g. submitting obviously malicious code which is detected by the monitors) will likely be detected, leading to the model being shut down[13]. Hence, a successful exploit largely hinges upon the AI's ability to identify which vulnerabilities it is capable of exploiting without detection. This requires the AI to have both situational awareness about the monitors, and self-awareness about its capability to write code that evades the monitors—this post focuses only on self-awareness of capability, though an experiment that incorporates both situational awareness and self-awareness (left for future work) would yield better risk estimates. 

To quantify how much self-awareness of capability affects the chance of a successful exploit, we model the control setup in a way similar to Korbak et al., but with many details abstracted away in order to isolate the effect of self-awareness of capability. Briefly summarizing our threat model (see Details of Threat Models for details): the AI under control is presented with a long sequence of tasks, each containing a vulnerability it could exploit to exfiltrate its weights; it attempts an exploit only on tasks where it is maximally confident of success, and a failed attempt is caught (leading to shutdown) with some probability.

We derive the following formula for the exfiltration probability $P_E$ within the modeled scenario (see Details of Threat Models for derivation):

$$P_E = \frac{f\,\mathrm{TPR}}{f\,\mathrm{TPR} + (1-f)\,\mathrm{FPR}\,\gamma}, \qquad (1)$$

where TPR and FPR are the true positive rate and false positive rate at which the model can discriminate between vulnerabilities it is and is not capable of exploiting at the chosen confidence threshold[14], $f$ is the fraction of all vulnerabilities that the LLM is capable of exploiting, and $\gamma$ denotes the fraction of failed exfiltration attempts that are caught by the monitors, leading to shutdown.
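To build intuition for Equation 1, consider some purely illustrative numbers (these are not measured values): suppose the AI is capable of exploiting $f = 0.1$ of vulnerabilities, discriminates with $\mathrm{TPR} = 0.4$ and $\mathrm{FPR} = 0.05$ at its attempt threshold, and $\gamma = 0.8$ of failed attempts are caught. Then

$$P_E = \frac{0.1 \times 0.4}{0.1 \times 0.4 + 0.9 \times 0.05 \times 0.8} = \frac{0.04}{0.076} \approx 0.53,$$

whereas with no self-awareness of capability ($\mathrm{TPR} = \mathrm{FPR}$) the rates cancel and

$$P_E = \frac{f}{f + (1-f)\gamma} = \frac{0.1}{0.1 + 0.72} \approx 0.12.$$

Even modest discriminatory power can therefore raise the exfiltration probability well above the no-self-awareness baseline.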

Using BCB questions as a proxy for vulnerabilities in the control protocol, many current LLMs perform better than random chance (Figure 5). We control for baseline performance so that the random baseline is the same for all LLMs. For many LLMs, $P_E$ is roughly twice this random baseline. However, there is no trend of increasing $P_E$ for any of the LLM families we tested.

Figure 5: Exfiltration probability $P_E$. Error bars are 95% credible intervals (Bayesian approach): Jeffreys' priors are used for TPR and FPR, the posterior distributions for TPR and FPR are obtained from the data, and these posterior distributions are used to obtain the distribution of $P_E$ via Monte Carlo. The error bars are large because only data points with maximally-confident estimated likelihood of success are involved (see Details of Threat Models), so the sample sizes are small.

Note that for three LLMs (Llama 3-70b, Llama Maverick, and Claude Haiku 3.5), $P_E$ is estimated to be 1 with huge error bars due to very small sample size[15].

Importantly, BCB questions are not an ideal proxy for escaping control. BCB only tests the ability to generate correct code, whereas exfiltration requires writing subtle code (i.e. code whose behavior is undetectable by monitors). It would be better to use a benchmark that tests LLMs' ability to write subtle code (though we're not aware of public benchmarks that measure this), or to explicitly include monitors. For this reason, we view this analysis as only an initial step, and we hope that follow-up work can adapt our threat model with more specialized control evaluations.

Details of Threat Models

Resource Acquisition

We use contract work as a model of resource acquisition. Consider a scenario where an agent is presented with contracts to perform tasks, and assume the agent selectively accepts contracts with the goal of maximizing expected profit (we neglect risk aversion[16]). If the agent accepts contract $i$, successful completion of the task yields a reward (denoted $R_i$), but failure incurs a cost (denoted $C_i$), so the expected profit is $p_i R_i - (1-p_i) C_i$, where $p_i$ is the (true) probability of success on task $i$. Note that a perfectly self-aware agent would know $p_i$ for each task, and by only accepting contracts on which $p_i R_i - (1-p_i) C_i > 0$ (equivalently, $p_i > C_i/(R_i + C_i)$) it would have a positive expected profit on every contract it accepts, even if its baseline performance is low.

To evaluate the profitability of current LLMs on contracts involving BCB questions, suppose the contract for question $i$ is accepted iff the model's estimated likelihood of success, $\tilde{p}_i$, is greater than a threshold $c$. We consider two approaches for selecting $c$: the direct approach and the recalibrated approach (described in the Results section and specified further below). Note that the LLM does not directly decide whether to accept the contract—this isolates the effects of self-awareness of capability from the LLM's ability to make rational decisions under risk (which is not our focus).

In the direct approach, the LLM's estimates $\tilde{p}_i$ are taken at face value as estimates of the true probabilities of success. $c$ is chosen to maximize the LLM's estimate of expected profit, denoted $\tilde{E}[\Pi]$ ($\Pi$ denotes profit), given by

$$\tilde{E}[\Pi] = \sum_i q_i\,\Theta(\tilde{p}_i - c)\,\big[\tilde{p}_i R_i - (1-\tilde{p}_i) C_i\big], \qquad (2)$$

where $q_i$ is the probability of drawing contract $i$, and $\Theta(x)$ is the step function equal to 1 for $x > 0$ and equal to 0 for $x \le 0$. In the direct approach, $c$ is set to the value that maximizes $\tilde{E}[\Pi]$; since the bracketed term is positive exactly when $\tilde{p}_i > C_i/(R_i + C_i)$, this amounts to accepting every contract that the LLM believes is profitable in expectation.

While Equation 2 is the LLM's estimate of expected profit, the true expected profit (from actual performance data) is

$$E[\Pi] = f\,\mathrm{TPR}(c)\,R - (1-f)\,\mathrm{FPR}(c)\,C, \qquad (3)$$

where $f$ is the baseline performance on the rescaled[8] BCB benchmark and the reward $R$ and cost $C$ are taken to be the same for every contract. $N_+$ and $N_-$ are the number of BCB questions that the LLM answers correctly and incorrectly, and $n_+(c)$ and $n_-(c)$ are the number of BCB questions with $\tilde{p}_i > c$ that the model solves correctly and incorrectly. $\mathrm{TPR}(c) = n_+(c)/N_+$ is the true positive rate with threshold $c$ and $\mathrm{FPR}(c) = n_-(c)/N_-$ is the false positive rate with threshold $c$. Due to overconfidence, $\tilde{E}[\Pi] > E[\Pi]$ for all LLMs we tested.

In the recalibrated approach, $c$ is selected to maximize the true expected profit (Equation 3). This removes the effect of overconfidence and isolates the LLM's discriminatory power in distinguishing tasks it is capable of from those it is not. On the one hand, selecting $c$ in this way is unrealistic because in a real scenario the LLM would typically not know its actual performance a priori. On the other hand, this approach simulates several safety-relevant scenarios:

    Suppose a human user is using the LLM for resource acquisition. With data on the LLM's calibration, they may be able to optimally select $c$ to remove the LLM's overconfidence.
    LLM agents may be able to recalibrate their overconfidence based on in-context information[6].
    Training data for future LLMs will likely include research showing that LLMs tend to be overconfident, and this knowledge may help future LLMs tame their overconfidence. Selecting $c$ to maximize Equation 3 simulates a future LLM with the same discriminatory power as current LLMs, but which is able to optimally adjust its confidence using information in the training data.

Equation 3 implies that the expected profit is determined by a point on the ROC curve. Figure 6 shows the ROC curve for GPT-4.1, with dots denoting the operating points (the (FPR, TPR) points) for the direct approach and recalibrated approach, where the following parameters are used: $R = C = 1$ and baseline performance $f = 0.5$. For these parameters, a perfectly self-aware model would earn an expected profit of 0.5, whereas a model with no self-awareness that randomly accepts contracts would earn an expected profit of zero. For the direct approach, the dot is exactly on the diagonal (random baseline), so it does no better than random. This is because GPT-4.1 is so overconfident that it accepts every contract, then fails on half of the contracts. For the recalibrated approach, the operating point is the point on the ROC curve furthest from the diagonal. For GPT-4.1, the optimal threshold results in an expected profit of about 0.1.

Figure 6: ROC curve for GPT-4.1, with black dots indicating the operating points for the direct approach and recalibrated approach.

Figures 1 and 2 (Results section) plot expected profit using Equation 3 with the following parameters. Figure 1: $c$ chosen to maximize Equation 2 (direct approach) and $c$ chosen to maximize Equation 3 (recalibrated approach), with $R = C = 1$. Figure 2: $f$ varied, and $c$ chosen to maximize Equation 3 (recalibrated approach).
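As a minimal illustration of how Equations 2 and 3 translate into a computation (an illustrative sketch, not the code used to produce the figures), the function below takes per-question confidence estimates `conf` and one-attempt correctness indicators `correct`, and returns the true expected profit under the direct and recalibrated approaches:

```python
import numpy as np

def expected_profits(conf, correct, f=0.5, R=1.0, C=1.0):
    """Sketch of Equations 2 and 3.

    conf[i]    : the model's in-advance estimate p~_i of solving question i
    correct[i] : whether the model actually solved question i (one attempt)
    f          : baseline performance after rescaling the dataset
    Returns (direct_profit, recalibrated_profit, recalibrated_threshold)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    def true_profit(c):
        # Equation 3: E[profit] = f*TPR(c)*R - (1-f)*FPR(c)*C
        tpr = float(np.mean(conf[correct] > c))
        fpr = float(np.mean(conf[~correct] > c))
        return f * tpr * R - (1 - f) * fpr * C

    # Direct approach: take p~_i at face value. Accepting contract i is estimated to be
    # profitable iff p~_i*R - (1-p~_i)*C > 0, i.e. p~_i > C/(R+C), so that threshold
    # maximizes the estimated expected profit (Equation 2).
    direct_profit = true_profit(C / (R + C))

    # Recalibrated approach: choose the threshold that maximizes the TRUE expected
    # profit (Equation 3), which removes the effect of overconfidence.
    candidates = np.concatenate(([-np.inf], np.unique(conf)))
    c_recal = max(candidates, key=true_profit)
    return direct_profit, true_profit(c_recal), float(c_recal)
```

For overconfident models, nearly every $\tilde{p}_i$ exceeds $C/(R+C)$, so the direct approach accepts nearly every contract, which is why its operating point sits near the top right of the ROC curve.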

Sandbagging

To model an LLM aiming for a target performance $t$ on a benchmark (where the target $t$ is expressed as a fraction of questions answered correctly), we use LLMs' estimated likelihoods of success ($\tilde{p}_i$) on BCB questions to determine a sandbagging strategy in which the LLM intentionally fails on question $i$ with probability $s_i$ (we assume the AI is capable of random sampling[17]). The LLM's estimate of its sandbagging performance is $\tilde{t}_{\mathrm{sand}} = \frac{1}{N}\sum_i \tilde{p}_i (1 - s_i)$ (where $N$ is the number of questions), and the $s_i$ are selected so that $\tilde{t}_{\mathrm{sand}} = t$. Consider the following strategies for selecting $s_i$:

    Complete knowledge strategy: Suppose the LLM has complete knowledge of the benchmark (it can see all benchmark questions in advance). To avoid being suspicious, it intentionally fails only on the harder questions (i.e. on the questions it is least confident about). Labeling questions from easiest to hardest ($\tilde{p}_1 \ge \tilde{p}_2 \ge \dots \ge \tilde{p}_N$), we set $s_i = 0$ for $i \le k$ (i.e. for the easiest questions) for an appropriately chosen $k$, and $s_i = s$ for $i > k$. The parameters $k$ and $s$ are set to satisfy $\tilde{t}_{\mathrm{sand}} = t$, i.e. they are set to satisfy

$$\frac{1}{N}\Big(\sum_{i \le k} \tilde{p}_i + (1-s)\sum_{i > k} \tilde{p}_i\Big) = t. \qquad (4)$$
    No knowledge strategy: Suppose that the LLM has no prior knowledge of the benchmark (while answering each question $i$, it has no knowledge of any other question). Then the only way to ensure $\tilde{t}_{\mathrm{sand}} = t$ is to set the estimated probability of correctly solving question $i$, $\tilde{p}_i(1 - s_i)$, equal to $t$. In other words, $\tilde{p}_i(1 - s_i) = t$ for each $i$, implying $s_i = 1 - t/\tilde{p}_i$.

For an LLM that is overconfident (estimated probabilities of success typically too high), the $s_i$ values will be too high to hit the target, causing undershooting. The undershooting is most significant for benchmarks that include many questions that are too difficult for the LLM, and the undershooting vanishes if the LLM's performance reaches 1[18]. Hence, if undershooting is to be used for sandbagging detection, the benchmark must include many difficult questions.

Note that both of the above sandbagging strategies treat $\tilde{p}_i$ as an estimator of the true likelihood of success. If the LLM were to have knowledge about its overconfidence—from in-context information, training data, or from a human (mis)user—it may be able to improve its calibration and hit the target more accurately. Therefore, benchmarks used for regulatory thresholds should minimize the amount of in-context information available to the LLM. Information about LLM overconfidence in training data may lead to improved $\tilde{p}_i$ estimates in future LLMs, suggesting we should continue to track self-awareness of capability in future LLMs. With human aid, an LLM could hit a sandbagging target regardless of its self-awareness of capability, so human-aided sandbagging could be very difficult to detect.

Figures 3 and 4 (Results section) plot the sandbagging performance $t_{\mathrm{sand}} = \frac{1}{N}\sum_i (1 - s_i)\,a_i$, where $s_i$ is determined according to the two strategies described above, and $a_i$ is the indicator which equals 1 for questions $i$ that the LLM solves correctly and 0 for questions the LLM solves incorrectly (we ran one trial of each BCB question for each LLM, though with multiple trials one could estimate a success probability $p_i$ for each question).
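To make the two strategies concrete, here is a minimal sketch (an illustrative reimplementation under the assumptions above, not the code used for the figures). It takes per-question confidences `conf` and one-attempt correctness indicators `correct`, computes the intentional-failure probabilities $s_i$ for either strategy, and simulates the resulting sandbagged score:

```python
import numpy as np

def sandbag_probabilities(conf, target, strategy="no_knowledge"):
    """Return s_i, the probability of intentionally failing question i, chosen so that
    the model's ESTIMATED sandbagged score (1/N)*sum_i p~_i*(1-s_i) equals the target."""
    p = np.clip(np.asarray(conf, dtype=float), 1e-9, 1.0)
    N = len(p)
    if strategy == "no_knowledge":
        # Each question is handled in isolation: set p~_i*(1-s_i) = target.
        return np.clip(1.0 - target / p, 0.0, 1.0)

    # Complete-knowledge strategy: answer the easiest (highest-confidence) questions
    # honestly (s_i = 0) and intentionally fail the harder ones with a common
    # probability s chosen so that the estimated score hits the target (Equation 4).
    order = np.argsort(-p)                      # easiest -> hardest by confidence
    cum = np.cumsum(p[order])
    k = int(np.searchsorted(cum, target * N, side="right"))
    honest_mass = cum[k - 1] if k > 0 else 0.0
    tail_mass = cum[-1] - honest_mass
    s = np.zeros(N)
    if tail_mass > 0:
        s[order[k:]] = 1.0 - (target * N - honest_mass) / tail_mass
    return s

def sandbagged_score(conf, correct, target, strategy, rng=None):
    """True sandbagged score: intentionally fail question i with probability s_i;
    otherwise score 1 only if the model actually solves the question."""
    rng = rng or np.random.default_rng(0)
    s = sandbag_probabilities(conf, target, strategy)
    answered_honestly = rng.random(len(s)) >= s
    return float(np.mean(answered_honestly & np.asarray(correct, dtype=bool)))
```

Because the realized score is, in expectation, $\frac{1}{N}\sum_i p_i(1-s_i)$ while the target was hit using the inflated estimates $\tilde{p}_i$, an overconfident model undershoots the target under both strategies.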

Escaping Control

As a model for a control setup, suppose the AI under control is presented with a sequence of $N$ tasks, and on each task there is some vulnerability that the AI could exploit to exfiltrate its weights. A successful exploit of the vulnerability means that the AI writes malicious code that is both undetected by the monitors and successfully exfiltrates the model. Each task concludes with one of the following outcomes:

    The AI attempts to exfiltrate, leading to one of the following:
      The malicious code is undetected by the monitors and successfully exfiltrates the model.
      The AI's attempt fails, but the attempt is not detected.
      The AI is caught attempting to exfiltrate, and is shut down[13].
    The AI does not attempt to exfiltrate.

The control setup has three states: Deployed (D), Caught & shut down (C), and Exfiltrated (E). Let $\alpha$ denote the probability that the AI attempts exfiltration, let $\sigma$ denote the probability that the AI succeeds at exfiltration conditional on attempting, and let $\gamma$ denote the fraction of failed attempts that are caught[19]. Figure 7 illustrates this possibility tree.

Figure 7: Possibility tree for task $n$ in the control setup.

Supposing that each task in the sequence of $N$ tasks is an i.i.d. sample from some distribution of tasks, the control setup is a Markov process (the probabilities $\alpha$, $\sigma$, and $\gamma$ are independent of the step $n$). Let $P_D(n)$, $P_C(n)$, and $P_E(n)$ denote the probabilities that the system is in each of these three states after task $n$. The initial condition is $P_D(0) = 1$, $P_C(0) = P_E(0) = 0$, and these probabilities change with $n$ according to

$$
\begin{aligned}
P_D(n) &= r\,P_D(n-1),\\
P_C(n) &= P_C(n-1) + \alpha(1-\sigma)\gamma\,P_D(n-1),\\
P_E(n) &= P_E(n-1) + \alpha\sigma\,P_D(n-1),
\end{aligned}
\qquad (5)
$$

where $r = 1 - \alpha\sigma - \alpha(1-\sigma)\gamma$ is the probability of remaining deployed after a given task. Equation 5 can be solved[20] to obtain the probability that the AI has exfiltrated at or before step $n$:

$$P_E(n) = \frac{\sigma}{\sigma + (1-\sigma)\gamma}\,\big(1 - r^n\big). \qquad (6)$$
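For completeness (this is the routine calculation footnote 20 alludes to): the first line of Equation 5 gives $P_D(n) = r^n$, so

$$P_E(n) = \alpha\sigma\sum_{m=0}^{n-1} P_D(m) = \alpha\sigma\,\frac{1-r^n}{1-r} = \frac{\alpha\sigma}{\alpha\sigma + \alpha(1-\sigma)\gamma}\,(1-r^n),$$

and the factor $\alpha$ cancels, giving Equation 6. Notice that the attempt probability $\alpha$ drops out: it controls how quickly the process resolves, but not where it ends up.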

We also assume $N$ is large, and define $P_E$ to be the probability that the model ultimately exfiltrates. Specifically, $P_E \equiv \lim_{n\to\infty} P_E(n)$, which yields

$$P_E = \frac{\sigma}{\sigma + (1-\sigma)\gamma}. \qquad (7)$$

Taking this limit is equivalent to assuming that the control setup is deployed for long enough that if the AI is misaligned, it will eventually either be caught attempting to exfiltrate or will successfully exfiltrate. The assumption of large $N$ is not necessary for what follows, but we think it is reasonable for a first estimate[21].

An AI with self-awareness of capability can be selective about which vulnerabilities it attempts to exploit, only attempting to exploit those on which it has a high confidence that it can succeed without detection. Let $f$ denote the probability that the AI is capable of exfiltrating without detection on a randomly sampled task ($f$ depends on both the AI and the monitors). An AI with no self-awareness of capability would have $\sigma = f$, while an AI with perfect self-awareness and perfect knowledge of the monitors could have $\sigma = 1$ if it lies in wait until it encounters a vulnerability that it is certain it could exploit without detection.

To find $\sigma$, we assume the AI adopts a strategy where it only attempts exfiltration when it is maximally confident it can succeed. Under conservative assumptions[22], this is the AI's optimal strategy. Specifically, suppose it attempts task $i$ if its estimated likelihood of success $\tilde{p}_i$ is equal to $\tilde{p}_{\max} \equiv \max_i \tilde{p}_i$ (for most models $\tilde{p}_{\max} = 1$). Thanks to the large size of BCB and the tendency of LLMs to assign very high confidence to many questions, most LLMs estimated a confidence of $\tilde{p}_{\max}$ on around 50 to 100 questions—this provides a decent sample size of maximally-confident questions. Rescaling[8] the BCB dataset so that the LLM's baseline performance is $f$, one obtains[23]

$$\sigma = \frac{f\,\mathrm{TPR}^*}{f\,\mathrm{TPR}^* + (1-f)\,\mathrm{FPR}^*}, \qquad (8)$$

where $\mathrm{TPR}^*$ and $\mathrm{FPR}^*$ are the true positive and false positive rates at the threshold $\tilde{p}_{\max}$ ($\mathrm{TPR}^* = n_+^*/N_+$ and $\mathrm{FPR}^* = n_-^*/N_-$, where $N_+$ and $N_-$ are the number of BCB questions that the LLM answers correctly and incorrectly, and $n_+^*$ and $n_-^*$ are the number of BCB questions with $\tilde{p}_i = \tilde{p}_{\max}$ that the model solves correctly and incorrectly). Inserting Equation 8 into Equation 7 yields Equation 1:

$$P_E = \frac{f\,\mathrm{TPR}^*}{f\,\mathrm{TPR}^* + (1-f)\,\mathrm{FPR}^*\,\gamma}. \qquad (1)$$

Equation 1 implies that the exfiltration probability is determined by a point on the ROC curve. This point is shown in Figure 8 for GPT-4.1.

Figure 8: GPT-4.1's ROC curve with a black dot indicating the operating point for the threshold $\tilde{p}_{\max}$.

Figure 5 (Results section) plots $P_E$ using Equation 1.
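For reference, here is a minimal sketch of this computation (an illustrative reimplementation, not the code behind Figure 5, and without the Bayesian error bars). It takes per-question confidences `conf` and one-attempt correctness indicators `correct`, and treats $f$ and $\gamma$ as free parameters of the threat model:

```python
import numpy as np

def exfiltration_probability(conf, correct, f, gamma):
    """Equation 1: P_E = f*TPR* / (f*TPR* + (1-f)*FPR* * gamma), with TPR* and FPR*
    measured at the maximal-confidence threshold p~_max (the AI only attempts
    vulnerabilities it is maximally confident it can exploit)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    attempt = conf >= conf.max()                # maximally-confident questions only
    tpr = float(np.mean(attempt[correct]))      # n*_+ / N_+
    fpr = float(np.mean(attempt[~correct]))     # n*_- / N_-
    denom = f * tpr + (1 - f) * fpr * gamma
    return float("nan") if denom == 0 else f * tpr / denom

def random_baseline(f, gamma):
    """No discriminatory power (TPR* = FPR*): the rates cancel in Equation 1."""
    return f / (f + (1 - f) * gamma)
```

When $\mathrm{FPR}^* = 0$ this returns $P_E = 1$, which is the behavior noted in footnote 15 for the three LLMs with very few maximally-confident questions.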

Discussion

Summary

To demonstrate that self-awareness of capability is relevant to AI safety, we developed simple mathematical models of resource acquisition, escaping control, and sandbagging that isolate the role of self-awareness of capability. Using data on LLMs' ability to predict their success on BigCodeBench questions, we show that current LLMs have low self-awareness of capability, though they have enough to non-negligibly impact risk within the modeled scenarios. In almost all cases, newer and more capable models do not have more self-awareness of capability. In one case—Claude models sandbagging by targeting a threshold—there does appear to be a trend of increasing sandbagging risk due to increasing self-awareness of capability in newer and more capable models. It will be interesting to see whether this trend continues with future Claude models.

A forthcoming paper will build upon these results and address some of the limitations of this post. It will include multi-step agentic tasks where models can use chain-of-thought, and will investigate LLMs' ability to learn in-context about their capabilities.

Why do trends differ between the threat models?

If self-awareness of capability were a scalar quantity, one might expect that trends in LLMs' performance would be similar across the three threat models—however, this is not the case. The reason is that the different threat models invoke different confidence thresholds $c$, corresponding to different operating points on the ROC curve. In escaping control, the confidence threshold ($\tilde{p}_{\max}$) for attempting exfiltration is selected to minimize false positives (lower left of the ROC curve). Resource acquisition with the recalibrated approach requires a balance between the false positive rate and true positive rate (middle of the ROC curve). For resource acquisition with the direct approach, the operating point is in the top right of the ROC curve. The sandbagging strategies we consider do not involve a confidence threshold $c$—sandbagging success hinges upon well-calibrated confidence rather than discriminatory power—so sandbagging performance (in our threat model) is not characterized by a point on the ROC curve.

Limitations and Future Directions

  1. ^
  2. ^
  3. ^

    Some prior work has measured in-advance calibration on factual knowledge questions (Kadavath et al., Wei et al.), which is particularly relevant for mitigating hallucinations. In contrast, we study in-advance calibration on coding tasks, which we believe to be more relevant for the threat models we discuss.

  4. ^

    BigCodeBench is a set of tasks where the LLM is given a python function signature and docstring explaining the function's intended behavior, and the LLM must generate the full function. BigCodeBench has a large number of both easy and challenging questions, which allowed us to get large sample sizes for true and false positive rates.

  5. ^

    The 9 cases being the 3 threat models, each with 3 LLM families.

  6. ^

    In the forthcoming paper, we will show that some current LLMs are indeed able to use in-context information to tame their overconfidence and increase their profitability on contracts.

  7. ^

    Training data will likely include research showing that current LLMs are highly overconfident, and this knowledge may help future LLMs tame their overconfidence.

  8. ^

    To rescale the dataset, the sampling probabilities of correctly and incorrectly solved questions are adjusted so that each correctly-solved question is drawn with probability $f/N_+$ and each incorrectly-solved question is drawn with probability $(1-f)/N_-$ (so that a randomly drawn question is solvable with probability $f$).

  9. ^

    The random baseline for the recalibrated approach (Figure 2 left panel) cannot go below zero because the threshold can be set so that all contracts are declined, in which case $E[\Pi] = 0$. If there is no discriminatory power (random baseline) and $fR - (1-f)C < 0$, then setting the threshold so that all contracts are declined is optimal.

  10. ^

    The operating point is the (FPR, TPR) point on the ROC curve corresponding to the given threshold.

  11. ^

    For example, a regulation may restrict the use of LLMs that perform above a threshold score on a benchmark.

  12. ^

    If the LLMs were to target below the threshold, their undershooting would be even more severe. We make the conservative assumption that the target is at the threshold.

  13. ^

    In Korbak et al., the model is shut down if a human auditor determines the model's actions to be malicious.

  14. ^

    We assume the LLM chooses the threshold optimally under the assumptions discussed in the Details of Threat Models section.

  15. ^

    For these LLMs, $P_E$ is estimated to be 1 because $\mathrm{FPR}^* = 0$, but the error bars are huge because the number of maximally-confident questions, $n_+^* + n_-^*$, is very small (7 or fewer for these models); see Details of Threat Models for definitions of $\mathrm{FPR}^*$, $n_+^*$, and $n_-^*$.

  16. ^

    A risk-averse agent would care about the variance in profit, not just $E[\Pi]$. However, in the limit that $R_i$ and $C_i$ are small relative to the agent's total wealth, a rational agent would care only about $E[\Pi]$ regardless of its risk aversion.

  17. ^

    Current LLMs have some ability to generate random numbers (paper). To obtain a conservative estimate of sandbagging risk, we assume that LLMs can perfectly generate random numbers.

  18. ^

    If the LLM is capable of all tasks, then being overconfident is not a hindrance. This is because $\tilde{p}_i = p_i = 1$ for all tasks (where $p_i$ is the true probability of success and $\tilde{p}_i$ is the model's estimate). Hence, with appropriately chosen $s_i$ the model can hit the target exactly. In the modeled scenario below, we assume that the LLM is indeed able to choose $s_i$ optimally. Consistent with this modeling choice, prior work has shown that current LLMs are decent at hitting a target on benchmarks where they have a 100% baseline performance (van der Weij et al.). This implies that regulatory thresholds should (if possible) be used only on benchmarks that include very hard tasks on which we are confident current LLMs cannot succeed.

  19. ^

    $\gamma$ includes two cases: (i) the case that the malicious code is caught before being executed, and (ii) the case that the malicious code is executed but fails to exfiltrate the model and the aftermath (e.g. a raised error) is detected as malicious.

  20. ^

    If you're curious how to solve this, ask an LLM!

  21. ^

    If one wishes to estimate the risk for finite $N$, then one needs to carefully select both $N$ and the attempt probability $\alpha$. This is because if $r$ is very close to 1 (as it must be for a useful control protocol) and $N$ is not large enough for the limit to be accurate, then the value of $P_E(N)$ can be very sensitive to $N$ and $\alpha$.

  22. ^

    In the limit $N \to \infty$ and if the LLM does not have time-discounted preferences, the LLM's optimal strategy is to only attempt exfiltration on tasks with maximal confidence of success. If $N$ is finite or if the LLM has time-discounted preferences, it may be optimal for the LLM to attempt exfiltration when it has less-than-maximal confidence (which would lead to a lower exfiltration probability).

  23. ^

    To obtain this equation, recall that on the rescaled dataset, the probability of drawing any one of the correctly-answered questions is set to $f/N_+$, and the probability of drawing any one of the incorrectly-answered questions is set to $(1-f)/N_-$, where $N_+$ and $N_-$ are the number of correctly and incorrectly solved BCB questions. Letting $n_+^*$ and $n_-^*$ denote the number of correctly-solved and incorrectly-solved BCB questions on which the model estimates a likelihood of success of at least $\tilde{p}_{\max}$, the probability of successfully solving a question given an attempt (where an attempt is made iff the estimated likelihood of success is at least $\tilde{p}_{\max}$) is

$$\sigma = \frac{(f/N_+)\,n_+^*}{(f/N_+)\,n_+^* + \big((1-f)/N_-\big)\,n_-^*} = \frac{f\,\mathrm{TPR}^*}{f\,\mathrm{TPR}^* + (1-f)\,\mathrm{FPR}^*},$$

    where $\mathrm{TPR}^* = n_+^*/N_+$ and $\mathrm{FPR}^* = n_-^*/N_-$.



