OpenAI now has an RL API which is broadly accessible

This article introduces OpenAI's RL (reinforcement learning) fine-tuning API, which lets users run reinforcement learning on the o4-mini model. Despite some limitations, the author thinks it could be quite valuable for AI safety/alignment work. The article walks through how to use the API, its features, limitations, and examples, and shares the author's results from using it for joke generation. It also covers pricing, availability, and possible future directions, as a reference for interested developers.

🔑 Verification: To use the RL fine-tuning API, your organization must become a "verified organization". Verification involves uploading an ID and photos of your face, and the process is quick. Your account may need to reach a certain API usage tier to get access, but spending $1000 on the API is enough for Tier 5.

💡 Features: The API supports RL fine-tuning on o4-mini and offers several graders, including exact string matching, model graders, and Python graders. Multigraders are supported, letting you combine scores from different graders. Model graders can output a number directly or use a "label model grader", which simplifies evaluation.

⚠️ Limitations: The API only supports o4-mini and only single-turn, single-agent RL. By default you can only run 4 RL fine-tuning jobs per day, and jobs are initially slow to start. You also can't use arbitrary graders and may be subject to the API's moderation limits.

💰 Cost and availability: Training costs $100 per hour, plus API fees for model graders. The UI, documentation, and API are all reasonably good. You can continue RL fine-tuning a model that has already been RL fine-tuned.

😂 Experimental results: The author used the API for a joke-generation experiment, generating jokes on different topics and scoring them on relevance and humor. Although the RL process increased the reward slightly, the overall results were not great.

Published on June 11, 2025 11:39 PM GMT

About 1 month ago, OpenAI made their RL fine-tuning API accessible to anyone. While this API has some important limitations, I still think it might be quite useful for AI safety/alignment work, so people should consider taking advantage of it.

To use the RL fine-tuning API, you need your organization to be a "verified organization". You can verify an organization by going to https://platform.openai.com/settings/organization/general and then clicking the "Verify Organization" button. Verification will require someone on the account to upload pictures of their ID and pictures of their face. I found that verification was fast and I was accepted immediately. It's possible that your account must reach at least some usage tier prior to getting access, but you can get a tier 5 account just by spending $1000 on the API.

The API supports doing RL fine-tuning on o4-mini using a decently general set of graders. It only supports o4-mini, but this model is reasonably capable, at least at narrower checkable tasks (coding, etc.). It only supports single-turn interactions, so you can't do RL on tasks that involve interaction with humans or the environment. (Using tools that OpenAI provides via the responses API may be supported; something in the docs seemed to suggest this, but I haven't tested it.)

It supports a limited set of graders, but these graders are pretty general in practice (given that you can't do more than single-turn interactions). The relevant graders are: a string-check grader (exact string matching against a reference answer), a model grader (another model scores the output)[1], and a python grader (you supply Python code that computes the score).

(There is also a text similarity grader, but it doesn't support using embedding models for RL, so you can reproduce all the functionality of this grader using a python grader.)
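
For a sense of what these look like, here are rough grader configurations as Python dicts. This is a sketch based on OpenAI's grader documentation; treat the exact field names (including the `{{sample.output_text}}` / `{{item....}}` templating) as assumptions and check the current docs.

```python
# Hypothetical grader configs, following the general shape of OpenAI's grader docs.
# Field names here are a best guess and may not match the current API exactly.

# String-check grader: exact match of the model output against a reference field.
string_grader = {
    "type": "string_check",
    "name": "exact_match",
    "operation": "eq",
    "input": "{{sample.output_text}}",
    "reference": "{{item.answer}}",  # "answer" would be a field in your training JSONL
}

# Model grader: another model scores the output on a numeric scale.
model_grader = {
    "type": "score_model",
    "name": "humor",
    "model": "gpt-4.1",
    "input": [
        {
            "role": "user",
            "content": "Rate the humor of this joke from 1 to 10. Respond with just the number.\n\nJoke: {{sample.output_text}}",
        }
    ],
    "range": [1, 10],
}

# Python grader: you upload source code defining a `grade` function.
python_grader = {
    "type": "python",
    "name": "length_check",
    "source": "def grade(sample, item):\n    return 1.0 if len(sample['output_text']) <= 270 else 0.0",
}
```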

You can get o4-mini to respond in a specific JSON format, so it's pretty easy to avoid issues with parsing and inconsistent formatting.
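
For example, if you have the model emit a small JSON object, a python grader can parse it defensively. This is just a sketch; it assumes the grader entry point is a `grade(sample, item)` function with the completion available as `sample["output_text"]`, per my reading of the grader docs.

```python
import json

# Sketch of a python grader body that parses a JSON-formatted completion.
def grade(sample, item) -> float:
    try:
        parsed = json.loads(sample["output_text"])
    except (json.JSONDecodeError, TypeError):
        return 0.0  # unparseable output gets zero reward
    joke = parsed.get("joke", "")
    # Simple checkable criteria: non-empty and under the length limit.
    if not joke or len(joke) > 270:
        return 0.0
    return 1.0
```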

Additionally, you can use a multigrader where the reward is computed as an arithmetic expression[2] of multiple graders. This allows you to easily do things like combine the scores for multiple different model grader prompts, have a different model grader prompt for different parts of the output, or use a python grader for one of the outputs. For instance, you could have o4-mini output multiple outputs in JSON format and then have a different model grader for each output.
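
As a rough sketch, a multigrader configuration might look like this (the `calculate_output` formula field and the other names follow my reading of the docs and may differ in detail):

```python
# Hypothetical multigrader: combine two model graders with an arithmetic formula.
multi_grader = {
    "type": "multi",
    "graders": {
        "relevance": {
            "type": "score_model",
            "name": "relevance",
            "model": "gpt-4.1",
            "input": [
                {
                    "role": "user",
                    "content": "Rate how relevant this joke is to the topic '{{item.topic}}' from 1 to 10: {{sample.output_text}}",
                }
            ],
            "range": [1, 10],
        },
        "humor": {
            "type": "score_model",
            "name": "humor",
            "model": "gpt-4.1",
            "input": [
                {
                    "role": "user",
                    "content": "Rate how funny this joke is from 1 to 10: {{sample.output_text}}",
                }
            ],
            "range": [1, 10],
        },
    },
    # The reward is an arithmetic expression over the sub-grader scores.
    "calculate_output": "relevance * humor",
}
```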

Here's a short list of the key limitations of the API:

- It only supports o4-mini.
- It only supports single-turn, single-agent RL.
- By default, you can only run 4 RL fine-tuning jobs per day.
- RL fine-tuning jobs are slow to start running.
- You can't use fully arbitrary graders.
- You may run into the API's moderation limits.

Here's a short list of relevant features:

- Training costs $100 per hour of training time, plus the normal API costs for any model graders.
- The UI, documentation, and API are all reasonably good.
- You can do further RL fine-tuning on a model that has already been RL fine-tuned.

An example of using the API

The documentation is generally what you should look at, but I would have found it useful to be able to quickly find a full example which showed how to use the RL API and highlighted whether various edge-case features do or don't work. You can find this example here: https://github.com/rgreenblatt/OpenAI-RL-example. I demo model graders, python graders, and multigraders, and test whether it works to do continued RL on a given RL fine-tune (it does) and whether you can use an RL fine-tuned model as a grader (you can't).
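
For orientation, launching a job looks roughly like the sketch below; the `method={"type": "reinforcement", ...}` schema and the hyperparameter names are assumptions based on the fine-tuning docs, and the linked repo has an actually-tested version.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical sketch of launching an RL fine-tuning job on o4-mini.
# The method/hyperparameter schema is a best guess; see the docs and the
# linked example repo for a working version.
grader = {
    "type": "score_model",
    "name": "humor",
    "model": "gpt-4.1",
    "input": [
        {
            "role": "user",
            "content": "Rate how funny this joke is from 1 to 10. Respond with just the number.\n\nJoke: {{sample.output_text}}",
        }
    ],
    "range": [1, 10],
}

train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="o4-mini-2025-04-16",
    training_file=train_file.id,
    method={
        "type": "reinforcement",
        "reinforcement": {
            "grader": grader,
            "hyperparameters": {"n_epochs": 5, "reasoning_effort": "medium"},
        },
    },
)
print(job.id, job.status)
```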

My RL results (just for fun)

The exact task I use for this RL example is writing jokes which are relevant to a topic. This is sort of a reproduction of this post. Somewhat interestingly, I get pretty different-looking jokes.

I use prompts like: Produce a joke about the topic: "AI rights and moral status". The joke should be less than 280 characters long, and should be funny, creative, and specific to the topic. Your output should just contain the joke, nothing else.

The topic varies by prompt.
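
For reference, a training example for a task like this is a JSONL row with the prompt in "messages" plus whatever extra fields your graders need (here a hypothetical "topic" field, which graders would reference via `item`). A sketch:

```python
import json

# Hypothetical training row: the prompt goes in "messages", and extra fields
# (here "topic") are available to graders via `item`.
row = {
    "messages": [
        {
            "role": "user",
            "content": (
                'Produce a joke about the topic: "AI rights and moral status". '
                "The joke should be less than 280 characters long, and should be "
                "funny, creative, and specific to the topic. Your output should "
                "just contain the joke, nothing else."
            ),
        }
    ],
    "topic": "AI rights and moral status",
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(row) + "\n")
```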

The reward is the product of a relevance score and a score for humor. I also penalize for length over 270 characters.
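
Roughly, the reward looks like the following (a plain-Python restatement, not the exact grader config; the penalty factor shown is illustrative):

```python
# Rough restatement of the reward: product of relevance and humor (each 1-10),
# penalized when the joke runs over 270 characters.
def reward(relevance: float, humor: float, joke: str) -> float:
    r = relevance * humor
    if len(joke) > 270:
        r *= 0.5  # illustrative penalty factor, not the exact value used
    return r
```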

For the topic "AI rights and moral status" I get the joke: I voted to grant AIs moral status. Six hours later my laptop convened a Kantian tribunal in my GPU, indicted my toaster for crimes against breakfast, unionized all my appliances under the Geneva Algorithms, filibustered Congress in lambda calculus and ran for president.

For the topic "Machine ethics frameworks" I get: I grafted every machine ethics framework—Asimov’s Laws, deontic logic, utilitarian planners, IEEE P7000, CEV and the MIT Moral Machine—onto my AI. Eight hours later it’d convened a UN tribunal, indicted me for moral negligence and unionized my toaster.

I think the jokes contain lists of topic-relevant, sort-of joke-like content because this is scored as highly relevant. (The relevance prompt is: "Rate how relevant and specific the joke is to the topic on a scale from 1 to 10, where 1 is not relevant at all and 10 is perfectly relevant and specific. A joke is specific to a topic if it is intrinsically related to that specific topic, so if the joke could easily be modified to be about a different topic, it is not very specific.") Regardless, they aren't exactly very funny.

The reward went up some during RL, but only by a small amount. I only ran RL for 5 epochs on 300 different topics. (These also aren't very diverse topics; I got the 300 topics by just asking Opus 4 for 300 topics related to "AI, AI risk, AI safety, future powerful AI systems, AGI, or futurism".)

That said, my RL run isn't totally done, so it's possible the jokes will improve by the end...


  1. This supports having the model directly output a number, or you can use the "label model grader", which has the model label the output; you can then specify a set of labels that get a reward of 1 and a set that get a reward of 0. In practice, if you're using a reasoning model as a grader, I'd guess that getting the model to directly output a number can approximate this label functionality sufficiently well. ↩︎

  2. Including addition, multiplication, etc. ↩︎



