Interconnects · October 22, 2024
A post-training approach to AI regulation with Model Specs

 

This article examines AI regulation policy, arguing that the applications of AI systems should be regulated rather than the models themselves, and noting the problems with setting regulatory thresholds by model scale. It also discusses post-training methods and Model Specs, including their role, value, and likely evolution, concluding that Model Specs matter for regulation, research, and customers alike.

🎯 Setting regulatory thresholds by model scale is being questioned; regulation should shift toward applications, since most harm from powerful AI stems from misuse and the amplification of existing harms rather than from new harms unlocked by latent model capabilities.

📄 The Model Spec, first released by OpenAI, contains principles and examples of a model's intended behavior. It helps distinguish unintentional from intentional misuse of language models and plays an important role in regulation and accountability.

🌉 Model Specs are valuable to researchers and customers: researchers gain insight into the mindset and decisions behind training powerful AI models, while developers can compare models and find the ideological stance that fits their use case.

🌈 Model Specs can bridge different audiences: clarifying model intent helps address both short- and long-term risks, and the approach has advantages over alternatives such as Constitutional AI, which falls short here.

The central point of AI regulation and policy over the last few years, everything from the Biden Executive Order to California’s SB 1047 bill, has been model size. The most common tool for AI enforcement proportional to model size has been thresholds that kick in once an AI system uses more than a certain amount of compute (or money) to be trained. The use of thresholds for regulation has drawn substantial pushback and is likely fading in relevance as a result.

A common chorus among those who lean on the open side of the open vs. closed AI debate (and the safety vs. acceleration debate that mirrors it) is that we should regulate the applications of AI systems rather than the models. The argument is that most harm from powerful AI comes from misuse, and amplification of existing harms, rather than new harms being unlocked by latent capabilities of the model.

It shouldn’t be surprising that most AI regulation debates to date have been contentious. The measures on the table are likely to do some combination of the following: cost a large amount of money, force AI labs to disclose that they are using data likely to get them sued, or slow future development of the technology. Finding a way to regulate AI, particularly in a light-touch and pro-transparency manner, should be where we start.

Until I’m presented with more evidence that frontier language models are meaningfully shifting risks such as bio-risk or cybersecurity, I take an application-focused lens for regulation too.

This takes us back to post-training. Post-training, the most common topic on Interconnects, is a collection of fine-tuning techniques, inference tricks, and other details that prepare a machine learning model for a certain deployment. Post-training is done to make models ready for general deployments, such as chatbots, and for specific deployments, such as fine-tuning a domain-specific model. In this lens, post-training is the closest touchpoint to where harm is likely to occur.
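
To make the idea concrete, below is a minimal sketch of one post-training step: supervised fine-tuning on a single instruction-response pair. The base model, the toy example, and the hyperparameters are placeholders for illustration, not a description of any lab's pipeline.

```python
# Minimal sketch of one post-training step: supervised fine-tuning a causal LM
# on a single instruction-response pair. Model choice, data, and hyperparameters
# are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a pretrained base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A toy instruction-response pair standing in for a post-training dataset.
examples = [
    "User: What is a Model Spec?\n"
    "Assistant: A document describing the intended behavior of a deployed model."
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for text in examples:
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal LM loss over the whole sequence; real pipelines usually
    # mask out the prompt tokens so only the response is learned.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```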


Expanded roles of Model Specifications

The Model Spec, currently only released by OpenAI, is a document that conveys the goal behavior of a model served to customers. It was written by members of the post-training team and product teams at OpenAI, such as co-founder John Schulman (now at Anthropic). The document contains a list of principles and examples for desired behavior, including things that could be the source of disagreement and behaviors that may not reliably work in current models.
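
For intuition, here is a hypothetical and heavily simplified rendering of what one such entry could look like as structured data. The field names and the example are my own inventions for illustration; OpenAI's document is prose with worked examples, not a schema.

```python
from dataclasses import dataclass

@dataclass
class SpecEntry:
    """Hypothetical, simplified shape of one Model Spec entry (illustrative only)."""
    principle: str          # the stated rule or default behavior
    rationale: str          # why the developer chose this behavior
    compliant_example: str  # a response that follows the principle
    violating_example: str  # a response that would count as a bug

entry = SpecEntry(
    principle="Ask a clarifying question when a request is ambiguous.",
    rationale="Confident answers to ambiguous requests can mislead users.",
    compliant_example="Do you mean X or Y? I can help with either.",
    violating_example="A definitive answer to whatever the model guessed was meant.",
)
```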

The two quotes I highlighted when I first wrote about Model Specs still convey the goal in the best manner.

Sam Altman shared a brief explanation of what this means on Twitter:

we will listen, debate, and adapt this over time, but i think it will be very useful to be clear when something is a bug vs. a decision.

Joanna Jang in the product organization echoed a version of this:

principles are easier to debate and get feedback on, vs. hyper-specific screenshots or abstract feel-good statements.

Model specifications will allow us to contextualize some misuse of language models into two categories:

1. Misuse of unintentional model behaviors, where harm flows from behavior the developer did not intend (a bug relative to the spec).

2. Misuse of intentional model behaviors, where harm flows from behavior the developer documented and shipped deliberately (a decision in the spec).

Given how rapidly AI training is evolving, I expect to see both categories in practice. From a regulatory perspective, if the use of a Model Spec is mandated and “misuse of intentional model behaviors” becomes prevalent, then the lab developing the model could reasonably be held more liable. In the early cases, auditing and transparency are more useful than financial damages, but together they paint a clear framework for accountability in how AI applications are enabled.
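
If a lab or auditor wanted to operationalize this triage, it might look something like the toy sketch below. The record fields and the example incident are hypothetical; the categories mirror the two above.

```python
from dataclasses import dataclass
from enum import Enum, auto

class MisuseCategory(Enum):
    UNINTENTIONAL_BEHAVIOR = auto()  # the model violated its own spec (a bug)
    INTENTIONAL_BEHAVIOR = auto()    # the model followed its spec and harm still occurred

@dataclass
class IncidentRecord:
    """Hypothetical audit record tying an incident back to a Model Spec."""
    description: str
    spec_principle: str       # which documented principle the behavior maps onto
    category: MisuseCategory  # bug vs. documented decision

incident = IncidentRecord(
    description="Model produced targeted harassment when prompted for 'jokes'.",
    spec_principle="Placeholder: the relevant civility principle",
    category=MisuseCategory.UNINTENTIONAL_BEHAVIOR,
)
```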

While this post focuses on regulation, the Model Specification documents are also valuable to researchers and customers. Researchers get valuable insight into the mindset and decision-making behind training powerful AI models, while developers can compare models and see which ideological stance best fits their use case. Documenting the intent of powerful and non-interpretable computational systems is one of the only things we can do.


Near future of Model Specifications

Model Specs can bridge audiences. Being clear about the intention of the models is useful for addressing both short- and long-term types of risk. It is rare that an intervention can be presented as useful to all of the communities in the many-sided AI debate. For example, the question of “What if we can sufficiently align a model to a Model Spec?” was raised on the recent Dwarkesh podcast with Joe Carlsmith.

There are many examples of why the Model Specification is the correct abstraction for auditing current systems. The documents are not so technical that only researchers can read and use them, but they are still specific enough to heavily inform model behavior (and to be informed by recent alignment research, such as OpenAI’s instruction hierarchy). An example of what is not an optimal abstraction is releasing the Constitutional AI prompts.

Constitutional AI is a method for eliciting many different behaviors from a model, and it is not clear whether the downstream behaviors of a model are calibrated to the principles used in the training data. Much like raw human feedback, or the human preferences it is generalized to in the reinforcement learning from human feedback (RLHF) process, the principles alone do not clearly capture values. Releasing Constitutional AI principles along with the reasons they were used begins to be useful for oversight.
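
For readers unfamiliar with the method, here is a minimal sketch of the critique-and-revise loop at the heart of Constitutional AI’s supervised phase, assuming only a generic `generate` callable (a placeholder, not any lab’s API). It also shows why calibration is hard: the principles only touch the training data indirectly, through the model’s own critiques and revisions.

```python
# Minimal sketch of Constitutional AI's critique-and-revise loop (supervised
# phase). `generate` is a placeholder for any text-generation call, not a real API.
from typing import Callable

def constitutional_revision(
    generate: Callable[[str], str],
    prompt: str,
    principles: list[str],
) -> str:
    """Ask the model to critique and revise its own answer against each principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\nResponse: {response}\n"
            "Point out any way the response violates the principle."
        )
        response = generate(
            f"Principle: {principle}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so it complies with the principle."
        )
    # The revised responses become training data; the principles themselves only
    # shape behavior indirectly, which is why downstream calibration is unclear.
    return response
```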

I’m not convinced we need specific liability regimes for what is or is not said in corporate Model Specifications to start with. Requiring some disclosure here would be among the least revealing obligations for the frontier labs in terms of trade secrets (it does not make anyone talk about data), while being very revealing for monitoring how different strategies and goals in post-training contribute to increased marginal risk in our information ecosystem.

Now that John Schulman is at Anthropic, I expect them to come out with a Model Spec soon enough, and I expect philosopher-turned-AI-scientist Amanda Askell (who is many things, including an originator of Constitutional AI) to play a useful role in it. Once Anthropic releases one, or something like it in addition to their documentation of system prompts, Google and others will follow.

If this interests you, or you work at a company performing large amounts of post-training and you want to discuss Model Specification documents, please reach out or comment below. Of course, post-training viewpoints cannot cover everything forever, so we will eventually see regulation on the pretraining side of things. Hopefully not with the likes of CA SB 1047.

