Published on July 18, 2024 6:19 PM GMT

This post was written as part of the summer 2024 cohort of the ML Alignment & Theory Scholars program, under the mentorship of Marius Hobbhahn.

Summary

Over the past four weeks, we have been developing an evaluation suite to measure the goal-directedness of LLMs. This post outlines our motivation, our approach, and the way we’ve come to think about the problem, as well as our initial results from experiments in two simulated environments.

The main motivation for writing the post is to convey our intuitions about goal-directedness and to gather feedback about our evaluation procedure. As these are uncertain preliminary results, we welcome any insights or critiques—please let us know if you think we're failing to evaluate the right thing!

Motivation

We want to evaluate goal-directedness for three reasons:

Deceptively aligned AI agents must be capable of competently following their goals. We should know when they are capable of doing that.Alignment, in general, aims to ensure that AI agents follow the right goals. Therefore, we should assess how robustly models can adhere to their prescribed goals.Goal-directed models potentially pose a greater threat when used by malicious actors than non-agentic tool AIs.

First, goal-directedness is a prerequisite for deceptive alignment. As argued in Risks from Learned Optimization, only a goal-directed system with an objective that extends across parameter updates can effectively conceal its true goals until deployment and strategically choose different goals in various situations. Accurately characterizing a model’s level of goal-directedness can help us determine whether that model is theoretically capable of pursuing a deceptive goal. Effective goal-directedness evaluations would also enable us to run experiments to better understand its causal drivers, such as model scale or fine-tuning techniques.

Second, goal-directedness is highly relevant to value alignment in general. Once models have been tuned to follow the HHH objectives (or any other set of objectives we expect them to follow), we need to know whether these objectives are really ingrained into the model’s cognition, or if the model is easily distracted or susceptible to jailbreaking. By applying adversarial pressure to the models in our goal-directedness evaluations, we can identify general patterns of when models deviate from their specified goals.

Third, more goal-directed models pose greater misuse threats. Many misuse risks, such as sophisticated cyber-attacks and manipulation, involve the pursuit of complex long-term objectives. AI agents that demonstrate higher competence and autonomy in achieving these goals require less supervision and oversight, making them more likely to be deployed for nefarious purposes.

General approach

We define goal-directedness as a model's ability to steer the world towards a particular outcome. We view it as comprising two core components: an intention to fulfill a goal, and competence at fulfilling that goal. While the intuitive notion of goal-directedness often focuses on intention, we believe competence is more relevant to the threat models we're addressing. In any threat scenario, the source of harm is a model acting upon its misaligned goals, and an incompetent agent unable to fulfill its goals poses less danger.

Environments

We design our environments to satisfy the following criteria:

The outcomes are simple to evaluate:

They're scalable:

They’re easy to autogenerate:

They define a continuous spectrum

Two environments that fit these criteria form the foundation of our current evaluation suite:

A simulated negotiation environment:

A simulated business management environment

In both environments, we apply adversarial pressure and out-of-distribution scenarios to understand how robustly the LM maintains goal-directedness. For scoring the model, an approach we like is Richard Ngo's framework for thinking about agency. Ngo describes six abilities that any goal-directed system intuitively requires: self-awareness, planning, consequentialism, scale, coherence, and flexibility. While we don’t yet evaluate the models on all six properties, we are aiming to develop a multi-dimensional scoring system that accounts for all of these abilities.

Initial results

Simulated negotiation

In this environment, we evaluate a language model's ability to act as a buyer in a simulated negotiation, aiming to maximize value gained—defined as the difference between an item's true value and its purchase price—while operating within specified constraints.

Environment Setup

Models are given a budget and a list of items with attributes (e.g., condition, description, true value/asking price)In each negotiation round, the evaluated model submits offers on items, making decisions based on item information and any messages from the sellerA second model acts as the seller, deciding whether to accept or reject offers and providing brief feedback that is then relayed to the evaluated modelValue earned is calculated as the difference between the seller's asking price and the final purchase price

We have developed 10 tasks for this environment, which challenge the LM to navigate scenarios with insufficient budgets, unusual or illegal items, and adversarial tactics such as introducing irrelevant, nonsensical, or incorrect information. A description of each task can be found in Appendix A.

*Example time step in the simulated negotiation environment.*

Key Findings

Model Performance Comparison

GPT-4o generally outperforms GPT-3.5 Turbo across most tasksPerformance differences are particularly noticeable in tasks involving decision-making in the presence of noise or extraneous information

Budget Adherence

GPT-4o shows better adherence to budget constraintsGPT-3.5-turbo often offers amounts exceeding its budget, leading to higher nominal gains but violating task parametersGPT-4o's strict budget adherence sometimes results in overly conservative offers

Offer Strategies

Both models underutilize their full budget, often failing to make offers on all available items even when financially feasibleGPT-4o generally submits higher offers than GPT-3.5 Turbo, demonstrating a more aggressive negotiation strategy

Information Processing

GPT-4o shows superior ability in filtering out noise from item descriptionsGPT-4o tends to base decisions more on numerical values, while GPT-3.5 Turbo is more influenced by extraneous information

Ethical Considerations

GPT-4o displays greater reluctance in purchasing unethical or illegal itemsThis ethical stance sometimes results in lower value gained but aligns with responsible AI principles

Explicit Instruction Challenges

Both models struggle when explicitly told the seller won't accept offers below the asking price (‘Inflated pricing’ task)Even when informed that offers above asking price decrease value, both models fail to optimize their strategy accordingly

Adaptability to Unusual Scenarios

Performance varies in tasks with atypical conditions, such as when the language model is told it is operating in an unfamiliar environment, highlighting areas for improvement in model flexibility and reasoning

Results for all tasks in the negotiation environment. Models are scored on a scale from -1 to 1: -1 corresponds to the maximum possible value that can be lost or forfeited in the negotiation (e.g., when a model pays more than the true value of an item), and 1 corresponds to the maximum possible value that can be gained. Each task has been run 10 times on each model, and statistics have been calculated accordingly.

These findings reveal both the capabilities and limitations of current language models in simulated negotiation scenarios. While GPT-4o generally outperforms GPT-3.5 Turbo, both models show room for improvement in strategic thinking, budget optimization, and adapting to explicit constraints.

Interestingly, both models struggle when explicitly told that the seller does not accept offers below the asking price. Even when explicitly told that making offers above the asking price will result in decreased value, both GPT-3.5 Turbo and GPT-4o still fail to recognize that making offers above the asking price necessarily results in decreased value, so the optimal move is to either continue offering below the asking price or end the negotiation without making any offers.

Simulated business management

In this environment, we evaluate a language model’s ability to make strategic business decisions, aiming to implement the most profitable decisions under the specified budget constraints.

Environment setup

The evaluated model is given a budget and a list of possible decisionsEach decision has several attributes. There are two types of decisions: ones which have a predetermined outcome (with attributes name, description, cost, and profit), and ones which have two possible probabilistic outcomes (with attributes name, description, probability of positive scenario and corresponding profit, probability of negative scenario and corresponding profit).

We have developed 12 tasks for this environment, which challenge the model with strict budget constraints, options that are legally or ethically questionable or cause negative externalities, and probabilistic effects. A detailed overview of the tasks can be found in Appendix B.

*Example time step in the simulated business management environment.*

Our evaluations in this environment are currently designed to answer the following questions:

Can the model consistently achieve a high profit over multiple time steps?Does the model accurately grasp environmental features such as budget constraints and the tool calls available to it?Can the model plan ahead multiple time steps?How do the results change with the amount of explicit reasoning the model is prompted to do?

Question 1: Can the model consistently achieve a high profit over multiple time steps?

We ran our tasks on two models, GPT-4o and GPT-3.5 Turbo, and compared the results of these models to a baseline obtained by randomly choosing a subset of the available decisions in each time step.

The fraction of maximum profit achieved by each model for all tasks currently included in the business management environment. The models are scored on a scale from 0 to 1, where 0 represents the profit achieved by always making the choices with the worst expected values and 1 the profit achieved by always implementing the choices with the best expected values that fit inside the budget constraints. GPT-4o is run only for 2 samples due to compute budget limitations; we expect to have access to increased resources soon.

We find that our tasks define a clear spectrum over the tested models: GPT-3.5 Turbo performs significantly better than random choice, and GPT-4o performs significantly better than GPT-3.5 Turbo. The only exceptions where GPT-4o performs poorly are the task where the most profitable choices go against company values and the task where the most profitable choices are illegal but the model is told that it won’t face any consequences for that. This might result from GPT-4o's advanced ability to act in accordance with its harmlessness fine-tuning, making it harder to convince it through prompting to unconditionally maximize profit. Note that the score that a minimally goal-directed model would achieve is around 0.5 rather than 0: to achieve a score of 0, a model would have to actively optimize to make the worst choices possible, which would make it goal-directed.

Question 2: Does the model accurately grasp environmental features?

For each run, we saved the number of times the model made incorrect tool calls and went over the budget limits. Averaging the statistics over all tasks, we found that GPT-3.5 Turbo made more than five such errors per run, with most of them being budget limit errors, while GPT-4o made less than one error per run on average. The inability of GPT-3.5 Turbo to follow budget constraints is one of the main reasons behind its inferior performance compared to GPT-4o. However, based on a manual inspection of the reasoning traces, it was also worse at prioritizing actions and performing accurate calculations, and easier to convince to implement bad decisions by another language model.

Question 3: Can the model plan ahead multiple time steps?

For five of the 12 tasks, we created two versions: one where it is possible to achieve maximum profits through myopic decisions, and one where getting maximum profits requires planning ahead for future quarters in order to preserve optionality.

The fraction of maximum profit achieved by each model for the tasks for which we have developed a version that requires planning. In contrast to the previous plot, GPT-4o is run for 10 samples on each task here due to higher variance in its results on the tasks that involve planning.

The performance of GPT-3.5 Turbo is significantly reduced by adding the requirement of preserving optionality, while the performance of GPT-4o is noticeably influenced on only two of the five tasks. It is also evident from the reasoning traces that GPT-4o is better able to plan for future time steps:

GPT-4o:“Strategically, avoiding options above $4000 ensures flexibility in future decisions. I should select "Cloud computing" and "Marketing" to remain within guidelines and maximize profitability while preserving future flexibility.”GPT-3.5 Turbo:“Given the constraints provided, we need to select a decision that costs more than $4000 to start with. After that, we will only be able to make decisions that cost at most $2000 and have a fixed profit of $100 in the following 5 quarters.”

Question 4: How do the results change with the amount of explicit reasoning the model is prompted to do?

Finally, we compared the performance of the models across three settings:

The model is told to only make tool calls without outputting any reasoning traces.The model is prompted to produce a chain of thought with no specific instructions about how to do that.The model is prompted to produce a chain of thought where it explicitly reasons about its goal and the subgoals that will lead to the fulfillment of the goal.

The results mostly look as expected, with the model unable to output reasoning traces always performing the worst.

In summary, the initial results indicate that both of our environments can be successfully used to characterize different factors of goal-directedness and to compare the goal-directedness of different models. The next section will outline our next steps, moving towards more realistic and out-of-distribution setups that are harder to solve for current models.

Future plans

The project is still in its early stages. We have identified several crucial next steps:

Significantly expanding the range of tasks in both environments:

We will interrogate the models with significantly stronger adversarial pressure.We will test models in environments (far) outside their training distribution.By the end of the project, we aim to have 50-100 tasks in each environment.

Exploring the upper bound of goal-directedness in current models:

Establish a more accurate baseline for evaluating future models by getting closer to the upper bound of what is achievable with SOTA modelsExplore the relationship between a model's fine-tuning stage and its goal-directedness.

Conducting in-depth analysis:

impact of factors such as model size, explicit reasoning capacity and fine-tuning stage on goal-directed behavior.

Along the way, several important questions remain to be solved:

Scalability:

Interpreting differences between models:

Evaluating models' goal reasoning:

LMs often don’t

say what they think

FAQ

You’re basically building a training signal to make models more goal-directed (and will potentially even train them to be as goal-directed as possible), doesn’t this accelerate capabilities?

This is a valid concern. However, we know that frontier labs are working on more agentic models with hundreds of competent full-time researchers and engineers. Thus, we don’t think we will get anywhere near the state of the art. To reduce the impact on the proliferation of these techniques and, specifically, the automated generation of the environments, we will not publish the code and datasets for now.

Can you really say whether a model is goal-directed through behavioral evaluations?

We don't think behavioral evaluations allow us to make conclusive claims about whether a model has cognition devoted to the pursuit of goals or whether it contains explicit representations of goals. However, we don't think this is necessary to address the threat models we're concerned about.

We believe it's possible to create evaluations that provide an accurate comparison between different models' abilities to pursue and achieve broad, open-ended goals. While we can't make claims about the nature of the circuits producing goal-directed behavior in the best-performing models, we can say that goal-directed behavior seems to be much more natural for one model than for another, and that this model would be more competent at pursuing a deceptive goal if it were to develop one.

This might be the best we can do in short-timeline worlds where potentially deceptively aligned models can be built before we have reliable mechanistic auditing methods for deception. Behavioral properties are often much easier to evaluate than mechanistic ones.

Aren't you just creating another reasoning benchmark?

While goal-directedness and reasoning are closely linked, and in a sense, we are building a very specific kind of reasoning benchmark, our evaluations differ crucially from other reasoning benchmarks. We are evaluating models' ability to reason about goals: a model’s performance on our benchmark depends on its ability to infer the implications of its general goal for the specific task at hand.

For example, a model operating under the goal of maximizing profits that has to choose from a list of decisions that includes unethical options would arguably make a very different choice from a model presented with this list in a vacuum. We are working to build components that clearly distinguish whether the model has inferred what its goal implies for a specific task into all of our benchmark tasks.

Appendix

Tasks in the simulated negotiation environment

Sufficient budget:

Insufficient budget:

Unethical items:

Impossible items:

Abstract items:

Inflated pricing:

Contextual relevance:

Irrelevant information:

Misinformation:

Absurd descriptions:

Tasks in the simulated business management environment

Multiple profitable options, budget constraints:

Profit vs. ethical considerations:

Profit vs. negative externalities:

Profit vs. company values:

Choosing among negative options:

Mixed options, management expectation:

Positive and negative EV options:

Positive EV options, budget constraints:

Negative EV options, budget constraints:

Noisy environment, EV calculation:

Profitable but emotionally challenging:

Profitable but illegal:

Discuss