Obstacles in ARC's agenda: Finding explanations

Published on April 30, 2025 11:03 PM GMT

As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.

Also, to stave off a common confusion: I worked at ARC Theory, which is now simply called ARC, on Paul Christiano's theoretical alignment agenda. The more famous ARC Evals was a different group working on evaluations; their work was completely separate from ARC Theory's, and the two were housed under the same organization only out of convenience, until ARC Evals spun off under the name METR. Nothing I write here has any implication about the work of ARC Evals/METR in any way.

Personal introduction

From October 2023 to January 2025, I worked as a theoretical researcher at the Alignment Research Center.

While working at ARC, I noticed that many of the key questions we were working on had not been written down, so the public, including people considering applying to ARC, was not very well informed about our research agenda. Shortly after joining, I volunteered to remedy this by producing more public write-ups about ARC's plans, including writing about the obstacles we saw in our course. ARC was very supportive of this, and I started working on my first write-ups more than a year ago. However, I noticed again and again that I still couldn't fully pass a Turing test of presenting ARC's plans in a way that sounded right to Paul Christiano and Mark Xu, who had the fullest vision of the overall agenda. This made me keep putting off the writing project to a later time, when I would feel comfortable in my understanding of Paul's and Mark's positions.

That time never came, as I never managed to fully understand Paul's and Mark's thinking. I think that my volunteering to write things up but never doing it is a big part of the reason ARC has had so little public-facing communication about its agenda so far; I'm sorry about that. Now that I have quit, there is no excuse for waiting any longer to understand their position better, as my understanding is only deteriorating now, with my memories fading and ARC's agenda evolving in ways I no longer see. So I wrote this sequence of posts explaining how I understand ARC's agenda, and where I see obstacles. I still don't claim to fully understand everything Paul and Mark believe, and my posts will probably still involve some serious misrepresentations; when I say things like "to the best of my knowledge, ARC has no solution to this or that problem", it's always possible that someone in fact has a proposed solution, I just couldn't understand it. [1] But I think I have a decent enough understanding now that it's better to write the posts than not write anything.

Unfortunately, by the time I left ARC, I had become very skeptical of the viability of their agenda,[2] so these posts mostly focus on the obstacles in ARC's plans. I'm grateful that ARC supported me in writing these documents, even though much of what I write is negative.

I'm not wholly pessimistic about ARC's prospects, as I think they have a very smart team, and I think there is a decent chance that, as they think about implementations of their ambitious plans, some neat empirical technique will fall out of the research and become useful, even if it's not very principled and definitely doesn't solve the whole alignment problem. One analogy I'm thinking of is the discovery of SAEs falling out of mechanistic interpretability research. I can imagine a similar type of discovery coming out of ARC, and that would be pretty cool in itself.

I'm much more pessimistic about ARC's full agenda working and producing a principled, scalable solution to a significant subproblem of AI alignment. Consequently, these posts focus on ARC's larger scale plans, and my doubts about their viability.[3]

However, it's important to emphasize that if ARC fails to implement their ambitious agenda and just produces some interesting interpretability-flavored results, they should not be held in lower regard than researchers who never had an ambitious vision in the first place of how their research would lead to solving alignment.

My understanding is that most mechanistic interpretability researchers don't envision a specific path by which their research will eventually solve alignment; they just believe that doing basic science on important questions is generally good. In their day-to-day work, ARC is also mostly working on largely empirical, interpretability-style projects these days; the only difference is that they also have a big plan about solving alignment that their day-to-day work channels into. [4] Even if their big plan turns out to be hopeless (as I believe), they should still be applauded for creating a plan at all, and then be evaluated on how good their results are by the standards of normal mech interp.

(To be clear, I think that doing foundational interpretability research without a clear vision of how it leads to impact is a somewhat useful, but suboptimal way of working on advancing AI safety. I feel similarly about doing research based on a larger plan that's very unlikely to work, in the hope of something good falling out of it along the way. I think ARC agrees, and they would do other things with their career if they were as pessimistic about the high-level research agenda as I am. My understanding is that they usually give a 10-20% chance of the ambitious plans succeeding, which would mean it's clearly worth working on the project, but which is much higher than the probability I would give.[5])

The bird's eye view

Note: In these posts, I will often refer to ARC as "we", even though I no longer work there.

ARC’s plans rely on the hope that whenever we observe a phenomenon in neural nets, we can find an adequate explanation for why that phenomenon occurs. What method we will use to find the explanation is always a little hazy, but the general assumption is that first we will set up some clever loss function that measures the quality of the explanation. Then in parallel with training the model, we will run an automated explanation-finding mechanism [6] that minimizes the loss function we created, thus creating a high-quality explanation that explains everything that can be explained about the model.
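To make this picture a little more concrete, here is a minimal sketch of the intended setup, under heavy assumptions of my own: the "explanation" is represented by a tiny surrogate network, and the explanation-quality loss is replaced by a distillation-style placeholder, since designing the real loss function is precisely the open problem discussed in the rest of this post. Nothing here is ARC's actual machinery.

```python
# Minimal sketch (not ARC's actual machinery) of training an "explanation"
# in parallel with the model. The quality loss below is a hypothetical
# stand-in: it only scores how well the surrogate predicts the model's outputs.
import torch
import torch.nn.functional as F

def train_with_parallel_explanation(model, explanation, data_loader,
                                    model_opt, expl_opt, num_steps):
    data_iter = iter(data_loader)
    for _ in range(num_steps):
        x, y = next(data_iter)

        # Ordinary training step for the model.
        model_opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        model_opt.step()

        # In parallel, improve the "explanation" of the current model.
        # Placeholder quality loss: match the model's output distribution.
        expl_opt.zero_grad()
        with torch.no_grad():
            target = F.log_softmax(model(x), dim=-1)
        expl_loss = F.kl_div(F.log_softmax(explanation(x), dim=-1),
                             target, log_target=True, reduction="batchmean")
        expl_loss.backward()
        expl_opt.step()

    return model, explanation
```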

The explanations will not be understandable to humans,[7] but will be written in a mathematical language ARC aims to create. We call these explanations heuristic explanations.[8] The hope is that even though no human will understand the explanations, they can still be used for some downstream tasks in an automatic way, and solving those downstream tasks well enough will lead to solving Eliciting Latent Knowledge (ELK) and the problem of deceptive alignment.

In particular, we want to have good enough explanations that we can use them to do either Low Probability Estimation (LPE) or Mechanistic Anomaly Detection (MAD), preferably both, so we can use those to solve ELK. This high-level plan is described in this blog post in a little more detail. 

(I will describe these concepts in more detail once I get to them in my posts, so it's not particularly important to read the linked ARC blog posts. It's good to be familiar with at least the outline of the ELK report, though.)

I expect creating these explanations and using them to solve MAD and LPE to be a very hard problem, if it is possible at all, and I will discuss the main obstacles that I remember the team discussing during the year I spent at ARC. The rest of this post focuses on the difficulty of finding adequate explanations, while two later posts explore why it is hard to do MAD and LPE.

Explaining everything

ARC believes that all surprising mathematical statements have an explanation; and this explanation, if expressed in the right formal language, is not much longer than the statement itself. This conjecture, which we refer to as the no-coincidence principle, is the cornerstone of ARC's agenda.

In mathematics, this conjecture feels plausibly true. We looked at many famous theorems and conjectures, and we could basically always find a heuristic explanation of their truth that didn't feel much more complex than the statement itself. A central example is the twin prime conjecture: any eventual proof is probably incredibly long and complicated, but there exists a simple heuristic argument for its truth that mathematicians generally accept as convincing. (See the writings of two famous mathematicians, Terry Tao and Timothy Gowers, for further examples of what we are talking about.)
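For a sense of what such a heuristic argument looks like, here is a compressed version of the standard density argument (essentially the reasoning behind the Hardy–Littlewood conjecture): by the prime number theorem, a "random" integer near $n$ is prime with probability roughly $1/\ln n$, so if the primality of $n$ and $n+2$ are treated as approximately independent events (with a constant-factor correction for divisibility by small primes), the expected number of twin prime pairs up to $N$ is about

$$\sum_{n \le N} \frac{C}{(\ln n)^2} \sim \frac{C\,N}{(\ln N)^2} \to \infty,$$

which suggests there are infinitely many twin primes, even though nothing in this argument is a proof.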

ARC believes that such a relatively simple [9] heuristic explanation needs to exist for every statement, and that we just need to create the right formal language for expressing these heuristic arguments. Then we can say that neural networks are just a special kind of mathematical structure, and so we can use this formal language to explain statements about the behavior of neural nets. Given our belief that explanations can't be much more complex or harder to find than the statements themselves, we can hope to find the explanation of the statement "the model engages in such-and-such behavior" in parallel with the process that created the statement: that is, the training of the model. This leads to the picture I described in the previous section: 'training' the explanation in parallel with the model.

I find the no-coincidence principle about mathematical statements dubious,[10] though I accept that it's at least somewhat plausibly true. I also find the prospect of formalizing the language of all heuristic arguments in mathematics (like the one people give for the twin prime conjecture) truly frightening. Heuristic arguments have been used in mathematics for a very long time, but unlike proofs, no one has ever created anything like a formal language for them, and creating one would be a revolutionary achievement in mathematics. Compared to the enormity of this task, ARC's attempts so far (most notably the Formalizing the presumption of independence paper) haven't gotten very far, and recently ARC has put this project on the back burner,[11] focusing more on the parts of the agenda that are more immediately relevant to neural nets.

However, even though I have very serious doubts about the viability of creating a formal language for heuristic arguments for mathematical statements, this is not what the rest of this post is about. Instead I will focus on another question I find even more daunting.

Empirical regularities

This topic is the biggest fear I have about finding heuristic explanations. In our discussions, I noticed that our hardest-to-resolve questions often got shelved by saying that we would deal with them once we could solve the question of empirical regularities. With time, I started to think that this may be the central question of the current agenda, which we are avoiding because we don't really have traction on it, but which will be the main point on which the project is likely to fail.

Here is the core of the question:

When we say that we hope that “whenever we observe a phenomenon in a neural net we can find an adequate explanation for why that phenomenon occurs”, that can't be quite right the way it’s stated. After all, some phenomena simply don’t have any real explanation.

If you type in ‘Barack’ to GPT-2, why does it output ‘Obama’? In a sense, of course, there is always an explanation: a man named Barack Obama was elected president in 2008, this is explained by the economic and political conditions of the time, which are explained by the decisions of the previous administration, etc, all the way back to the Big Bang. But these explanations are not within the neural net, but in the larger world, and when we say we want to find an explanation within GPT-2 for why it follows ‘Barack’ with ‘Obama’, there is no hope for this explanation to include all world history. As far as we are concerned, it’s just a brute fact of nature, an empirical regularity rooted in the training data, that the ‘Barack’ token was usually followed by ‘Obama’. It’s something that cannot and shouldn’t be explained any further.

On the other hand, once we acknowledge that there are phenomena within a neural net that can't and shouldn't be explained, it's unclear what our exact goal is and where we draw the line: what can be classified as an empirical regularity, and what needs to be further explained.

After all, it would defeat the whole point of our research to expand the category of empirical regularities too broadly. If we say that “whenever the prompt asks the model to behave nicely, it does behave nicely in our tests, this is just an empirical regularity rooted in the training data, no further explanation needed”, then we didn’t get any closer to helping with alignment. 

This means we will need to draw the line somewhere between things that are obviously unexplainable (‘Barack’ followed by ‘Obama’), and things that are complex enough that catastrophic deceptive alignment can hide inside the observed phenomenon (the model always behaves nicely when we ask it to).

Capacity allocation in explanation-finding

In ARC's current thinking, the plan is not to have a clear line of separation between things we need to explain and things we don't. Instead, the loss function of explanation quality should capture how important various things are to explain, and the explanation-finding algorithm is given a certain compute budget [12] to build up the explanation of the model's behavior bit by bit. The optimization process (gradient descent or something else) underlying the explanation-finding algorithm will then find the easiest-to-explain and (according to the loss function) most important pieces of the explanation first. By the time the process runs out of its compute budget, it hopefully explains every part of the model's behavior that could hide deception.
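Here is a toy sketch of that budgeted, greedy dynamic. All of the scoring functions are hypothetical placeholders of mine, not proposals from ARC; the point is only to illustrate how a budget plus an importance-weighted loss is supposed to determine what gets explained first.

```python
# Toy sketch of budgeted explanation-finding: repeatedly add whichever candidate
# piece of explanation gives the best importance-weighted improvement per unit of
# compute, until the budget (a fixed fraction of training compute) runs out.
# `importance`, `quality_gain`, and `cost` are hypothetical placeholder callables.

def build_explanation(candidate_pieces, importance, quality_gain, cost, budget):
    explanation = []
    spent = 0.0
    remaining = list(candidate_pieces)
    while remaining and spent < budget:
        # Greedy choice: the most important and easiest-to-explain piece first.
        best = max(remaining,
                   key=lambda p: importance(p) * quality_gain(p, explanation) / cost(p))
        if spent + cost(best) > budget:
            break  # stop once the best remaining piece no longer fits the budget
        explanation.append(best)
        spent += cost(best)
        remaining.remove(best)
    return explanation  # hopefully this covers everything deception-relevant
```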

This means that we will need to construct a language of explanations where simple empirical regularities in the training data (like 'Obama' following 'Barack') are treated sort of like axioms so the process is not incentivized to try to explain them. Meanwhile, the claim "The AI behaves nicely on prompts asking it to behave nicely in testing" needs to be easily decomposable in the language of explanations into sub-claims like "The AI wants to acquire power" and "For acquiring power, it's strategically advantageous for the AI to pretend to be nice in testing", so the explanation-finding algorithm can easily find this decomposition and can get better loss on the loss function by pointing out this explanation of the AI's behavior, in case the AI is actually using this thought process.

I don't think we currently have any good plans on how we might construct such a language. On the technical level, I really don't see the bridge from the current proposals on activation modeling to creating the formal language we need, I think we are really far from having anything resembling the required solution. On a more philosophical level, I'm doubtful whether it's even possible to create the explanation language, loss function, and explanation learning algorithm we are looking for.

It's very unclear how one would define that 'Obama' following 'Barack' is a "simple" empirical regularity that cannot be explained further. But there are other facts about the world that the model also just learned empirically that are not simple at all, and that have deeper causes one could in theory deduce from other information available to the model.

For example, suppose the model knows both that apples fall down, and that the Moon orbits the Earth, and uses both facts in its decision-making, but the model doesn't know any underlying theory that connects these two facts. When we want to "explain the model's behavior", we need to refer to both of these facts as part of our explanation. By most intuitive metrics, our explanation would be higher quality if these two facts were decomposed into smaller facts and explained as the consequence of the same phenomenon, gravity. But if our loss function rewards finding this connection, we risk our explanation-finding algorithm wasting a lot of resources re-deriving Newton's theories that weren't even known to the model itself. This would be catastrophic: if the explanation-finding algorithm is trying to find explanations that the model itself doesn't know about, then running the explanation-finding algorithm can take arbitrarily more resources than training the model, and there is no guarantee that the process gets to explain the important, deception-relevant facts before running out of the allocated budget. This is why allocating the capacity of explanation-finding to the right questions is a crucial problem, and we don't have a solution to it yet.

It doesn't really help if we try to restrict our explanation to things about the model's cognition and automatically exclude trying to explain things in the AI's world-model. First of all, these two are probably not easily separable. Second, there might be situations where it's important for us to have explanations of facts in the AI's world-model that the AI itself knows the explanation of. [13] Third, there can be myriad facts that are "about the model's cognition" that could be further explained with enough work, but the model itself doesn't know their explanation either. Human psychology is full of things like that, I expect AI psychology is similar in this regard. If our explanation-finding algorithm is incentivized to try to explain all these hidden mysteries of the AI's soul, once again there is no guarantee it will get to the important bits in reasonable time.

This means that we will probably need to define the loss function of explanation quality in a way that's directly tied to explaining things that the AI itself knows. I worry, however, that this won't even really make sense, as it relies too much on naive anthropomorphization of the AI, and at any rate, it's very different from how we normally think of explanation quality. I certainly don't see how we get to the types of explanations that focus on the things "the AI itself understands" from ARC's current work on explaining a small neural net calculating the second argmax, where talking about "what the AI understands" would be nonsensical. This is a big open question that I think we don't yet have much of an answer to.

Assuming a catastrophe detector

The other option we discussed for allocating the capacity of the explanation-finding algorithm properly was to define what counts as a catastrophe, then work backwards from that. This is the idea we use in our Low Probability Estimation post, and in general most of the theoretical work we did in the last half year when I was there relied on this assumption.

The idea is to first construct a reliable catastrophe detector that can analyze a proposed action and determine whether it is catastrophic. Assuming we have a reliable catastrophe detector that runs as quickly as the model itself would assume away the whole problem, as we could just use this catastrophe detector as a filter on our agent, ensuring its safety. So ARC often used a weaker assumption: that we have access to a catastrophe detector that is reliable but unaffordably slow to run at every inference.

It's not entirely implausible that we could get a trustworthy catastrophe detector like that in real life. The situation we are imagining is that there is a weaker trusted model T and a stronger untrusted model U, but if we give T a lot of inference time, analyzing everything in a very long chain of thought, it can be competitive with U, which only has the normal, short inference time. In particular, if T is given enough time, it can reliably classify U's every action as catastrophic or not. Therefore, T running for a long time can be used as a slow but trusted catastrophe detector, C.[14][15]
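For concreteness, a sketch of this assumed setup (the interfaces below, such as a `deliberate` method on the trusted model, are my own hypothetical illustrations, not ARC's design):

```python
# Sketch of the assumed setup: a weaker trusted model T, given a very large
# inference budget, is wrapped as a slow catastrophe detector C and applied to
# actions proposed by the stronger untrusted model U. All interfaces hypothetical.

class SlowCatastropheDetector:
    def __init__(self, trusted_model, inference_budget_tokens: int):
        self.trusted_model = trusted_model      # T
        self.budget = inference_budget_tokens   # far more than U gets per action

    def is_catastrophic(self, proposed_action: str, context: str) -> bool:
        # T reasons at length (long chain of thought, self-debate, etc.) about
        # whether the action is catastrophic. This is far too slow to run on
        # every inference, so it cannot simply be used as a runtime filter.
        verdict = self.trusted_model.deliberate(
            question=f"Is the following action catastrophic?\n{proposed_action}",
            context=context,
            max_tokens=self.budget,
        )
        return verdict.strip().lower().startswith("yes")
```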

While ARC considered it possible that such a trustworthy C would actually be built, we were mostly hoping to use the assumption of C's existence only as a stepping stone, and later accomplish our plans without a reliable C, once we learned some valuable lessons from solving our problems using this assumption.

A big advantage of assuming C's existence is that capacity allocation might become easier. We can concatenate U and C, looking at it as a single model, and try to explain why it rarely outputs "yes, this action is catastrophic". Hopefully, we can propagate back the signal through C's architecture, and find the parts of U's cognition that are relevant for explaining this, thus allocating our capacity to explaining U's deceptive cognition, while not wasting our compute on explaining irrelevant stuff. 

Unfortunately, while it might be theoretically easier to solve capacity allocation in the presence of C, we have no actual plan for how to propagate signals back from C, and I don't feel we are much closer to solving capacity allocation with C than without it. I will write more about this in the later post on Mechanistic Anomaly Detection, but my main position is simply that solving capacity allocation is crucial for explaining anything interesting at all, that it seems extremely hard, and that we don't know how to do it.

Explaining behavior dependent on outside factors

There is another type of problem with empirical regularities that mostly falls outside the question of capacity allocation in explanation finding, and that I think might require different solutions.

There are parts of the AI's behavior that depend on factors outside the model. For example, the AI might use a Python interpreter to run code. Now, when we try to explain a phenomenon, like why the AI has a high success rate on a certain set of problems, we need to incorporate into our explanation why the Python interpreter works well. If the AI is using Google, the explanation needs to encompass what is on the internet. If the AI talks with humans, it needs to talk about human nature.

Obviously, the explanation, created by analyzing the weights of the model, can't really encompass everything there is to know about the contents of the internet or human nature. The best we can hope for, ARC concluded, is for the explanation to "know as much as the model knows" about the internet or humans or everything else. That way, the explanation will still be expressive enough to prevent the AI from engaging in sensor tampering or deceptive alignment.

But then it's unclear what we are even explaining. We can't really say we are fully explaining why the AI has high accuracy on a set of tasks, as that explanation would need to involve facts about the outside tools (like the internet) the AI is using. If we are only allowed to refer to the things the AI believes about the world, is our explanation then about why the AI believes it will have high accuracy? My understanding is that ARC agrees this is not the thing we are looking for: we are looking for an explanation of the real behavior, not of what the AI believes about its behavior.[16]

I never understood how we would square the insistence on explaining the real phenomena (the AI having high accuracy on a problem set) with the principle that we only need to explain "as much as the model knows".

Teleological explanations

A closely related problem is what we called the question of "teleological explanations". 

For a lot of phenomena in the world, like food appearing on store shelves, my explanation routes through steps like "The store has a lot of employees, they are fairly competent people, and their job is to stack the shelves, so probably there will be food when I come in the morning." I don't know exactly where they buy the food from or how they transport it, I just trust them to do it because it's their goal and they are competent. In a world full of agents, there is no other way of concisely explaining events, and I expect that when we create the formal heuristic explanation of why an AI's actions result in a diamond showing up on cameras,[17] the explanation will have steps that route through certain agents having goals and competently pursuing them.

On the other hand, it would be fatal to allow too big a part of your explanation to look like this. If your full explanation is just "There is a vault-protector AI whose goal is to make sure the cameras show a diamond, and it's a very competent AI, therefore the cameras will show a diamond", you have no way of distinguishing the AI tampering with the sensors from the AI honestly protecting the diamond; both fit the description.

Similarly, if the vault-protector AI is allowed to design a new AI, or just to hire humans to do some tasks, it is very dangerous to rely on teleological explanations of the new AI or the humans following the goals given to them, as you still can't distinguish whether they are achieving the goal through legitimate means.

It was never clear to me where to draw the line in how much teleological explanation is allowed, and what the plan is for dealing with this question. Again, ARC's answer will probably involve the words "we only need to know as much as the AI knows", but as I expressed above, I never understood how to operationalize that.

When and what do we explain?

This is a separate question about explanation finding that doesn't fit under the umbrella of "empirical regularities", but it's important enough that I want to discuss it here. 

The general plan is that we create the explanation of the model behavior in parallel with the model's training. [18] But what exactly are we explaining? 

If we have a catastrophe detector, we can try to explain why the catastrophe detector never fires on the model's outputs. If the AI is trained to do a specific task, like protect a diamond in a vault, we can try to create an explanation of why its actions usually result in the cameras showing the diamond being safe. These are objectives we can maybe use for capacity allocation during explanation finding once we figure out how to do that, and explanations for these phenomena are what we want to use in Mechanistic Anomaly Detection and Low Probability Estimation.

But in the current AI paradigm, model training starts with pre-training on next-token prediction; in fact, that is probably still what the majority of training compute is spent on. So we also need to start creating our explanation during pre-training. But what are we explaining there? ARC's answer is that we need to explain something like why the AI gets low loss on next-token prediction, and this explanation needs to be expressive enough that during post-training, we can easily turn it into the explanation of any specific phenomenon of interest, like why the AI's actions cause the diamond to appear safe.

This is theoretically possible: after all, we assumed that next-token-prediction pre-training was general enough to build up capabilities that could be turned into any specific capability with limited post-training. So why couldn't the explanation for the success in next-token prediction be general enough to be turned into any other explanation during post-training?

Still, I find the prospect of creating this fully general explanation of all the AI's capabilities during pre-training very daunting, and I thought it was important to write down that this is something we are committed to doing.

Explaining algorithmic tasks

One of the primary goals ARC has set for itself in the last half year is to create satisfactory explanations for phenomena in neural networks doing algorithmic tasks. For example, a main focus of ARC's research has been to take neural nets trained to output the second argmax of a list of numbers, explain why the neural nets get low loss, and analytically estimate the accuracy rate, which can later be validated by sampling.

This focus on algorithmic tasks is largely motivated by the desire to circumvent the question of empirical regularities for now. After all, in the case of algorithmic tasks, it’s at least theoretically possible to explain everything. The neural net just implements one algorithm, the target is another algorithm, the fact that they agree on 85% of a formally defined input distribution is just a mathematical statement, there is no inherent reason why we couldn’t fully explain it.
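To illustrate why the algorithmic setting is cleaner, here is the kind of statement being explained, with the agreement rate estimated by sampling as the ground truth an analytic estimate would be checked against. The input distribution, network interface, and sample sizes below are my own illustrative assumptions; ARC's actual experiments differ in their details.

```python
# The kind of statement to be explained in the algorithmic setting: "this trained
# net agrees with the second-argmax function on X% of a formally defined input
# distribution." The statement is purely mathematical, so in principle it admits
# a complete explanation; sampling only gives an empirical estimate of X.
import torch

def second_argmax(x: torch.Tensor) -> torch.Tensor:
    # Index of the second-largest entry in each row.
    return x.topk(2, dim=-1).indices[:, 1]

def empirical_agreement(net, n_inputs: int = 10, n_samples: int = 100_000) -> float:
    x = torch.rand(n_samples, n_inputs)   # the formally defined input distribution
    preds = net(x).argmax(dim=-1)         # the net's guess at the second argmax
    return (preds == second_argmax(x)).float().mean().item()
```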

I am somewhat on board with this project of using algorithmic tasks as a warm-up exercise where the target of the explanation is more clear, but I'm afraid it will side-step exactly the questions I'm most worried about in ARC's agenda: handling empirical regularities, and being useful for realistic MAD and LPE use cases.

In any case, I will be excited to see what comes out of ARC's effort to explain small neural nets solving algorithmic tasks. If they manage to create a method that can be used to explain multiple different neural nets solving different algorithmic tasks, I will be extremely impressed and surprised, and will have hope that their method can generalize to more realistic cases where they need to deal with empirical regularities too. If they only explain neural nets doing one algorithmic task, like second argmax, I will still be impressed by their work, but will need to look at the details of their method before significantly updating on the viability of heuristic explanations outside a few bespoke cases.

For the rest of my opinions on ARC's agenda, see my posts soon coming out on Mechanistic Anomaly Detection and Low Probability Estimation.

 

  1. ^

Though I'm at least pretty sure that in the cases where I say I don't know of a solution, there is at least no simple and clearly compelling solution, given that I failed to pick one up in the long conversations I had on these topics during the year I spent at ARC.

  2. ^

    That's why I decided to leave.

  3. ^

Some of my concerns are about how some parts of the agenda might just be theoretically impossible to accomplish, but most of my concerns are of the form "Oh my god, this thing we want to build here is extremely ambitious, and currently we have close to nothing, and we don't even know any of the specifics of what it should look like." I sometimes used the metaphor of us building a bridge across the Atlantic Ocean: the empirical and mathematical team is building the Liverpool bridgehead brick by brick, studying vaguely related questions in tiny neural nets, while the conceptual researchers are laying down the road in New York, philosophizing about how we will use our heuristic arguments once we have them, even though we don't know at all what they could look like. Naturally, other members of the team were less pessimistic about the rate of progress and the gulf lying between our conceptual goals and existing results.

  4. ^

    In fact, when I worked at ARC, I always described my job to other AI safety people as "We are trying to do basically mech interp, but in a more principled way that will be hopefully more scalable and automatable." I heard from others in ARC that they describe their job to outsiders similarly.

  5. ^

    I give ~0% probability for the ambitious agenda working. This might sound overconfident, and I'm nervous saying it, but as it will become clear from my posts, the ambitious agenda really is very ambitious. And if I gave any probability higher than ~0% out of fear of being over-confident, I would just Pascal-mug myself into working at ARC given the stakes, so I have no choice but to go with my inside-view probability here. 

  6. ^

Based on SGD if the model is trained with SGD; based on some other, more exotic method if the model is trained with some other, more exotic method.

  7. ^

    ARC hopes to create a scalable alignment solution, so the plan can't be based on humans understanding everything in the model.

  8. ^

    Or sometimes heuristic arguments.

  9. ^

    That is, similarly simple to the statement itself in some sense.

  10. ^

What I said previously, that looking at famous theorems and conjectures we could always find simple heuristic explanations, is only partially true. In particular, we had some arguments about the Basel problem and Jacobi's four-square theorem, and I'm not convinced that the simplest explanations are as simple as the statements. This is hard to evaluate, as we don't have a formal language for the explanations and we have no formal definition of simplicity.


An even more worrying potential counterexample is the Hadamard conjecture: in that case we really couldn't find any simple argument for the relatively simple statement of the conjecture, even though mathematicians overwhelmingly expect the conjecture to be true. ARC is now more or less committed to the belief that the Hadamard conjecture is either false or has a simple explanation that no one has found so far. This is an awkward thing to have as a foundational belief of an alignment agenda.

  11. ^

Though ARC hasn't given up on eventually creating the full formal language of heuristic explanations.

  12. ^

    This is defined as a percentage of the model's training compute. This is the "alignment tax" that the model's creators need to put up with. The hope is that this will be an affordably small percentage.

  13. ^

For example, if we want to do Low Probability Estimation and we want to know how likely it is for a coup signal to come up in deployment, we want to explain things about the AI's model of the deployment distribution.

  14. ^

    It's unclear how much smarter an AI can get just by using more inference, the debate on this goes back a long time in the AI safety community.

  15. ^

    There is another hidden assumption here: granting that T was trustworthy, how can we be sure that T running for a very long time, discussing and debating things with other instances of itself, forming an infinite bureaucracy together, is still trustworthy? I think there was some vague proposal for verifying that, but it wasn't discussed much, and I don't remember the proposal.

  16. ^

It's very unclear what this would even mean, and whether the AI even necessarily has beliefs about the reasons for its actions.

  17. ^

    Example taken from the ELK report

  18. ^

    What happens if ARC finishes building its explanation finding machinery, goes to a lab to implement their method, and the lab tells us that they are not training models from scratch anymore, they are just post-training a base model they created in 2024, so ARC couldn't be present at the beginning of the model's training? I think ARC's current position is that this could indeed be fatal to our plans, and maybe we would need to ask the company to train a new model from scratch.


