Interconnects, October 22, 2024
Artifacts Log 4: Reflection 70B, o1 on LMSYS, fine-tuning fine-tunes, and speech models

 

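OpenAI's o1 models were the biggest news of the week. Many people are still not sure how to use them, but a few are already getting exceptional results. The ChatBotArena results are even more surprising: they show that o1 also does well on easier, real-world tasks. On LMSYS's arena, o1-preview is a clear step up in Elo over the other frontier models, which suggests that o1 models will become easier to use and better understood, much as early ChatGPT was relative to instruction-tuned models. OpenAI has made major progress on impactful style and RLHF training over the past six months and is exploiting that advantage. o1 outperforming other models on ChatBotArena shows both how distinctive its style is and how large the performance gain is. This marks a new era of post-training. o1-mini ranks third overall and first in math, which suggests that scaling laws are not the top priority for frontier labs. The model is rumored to have only about 70 billion parameters, yet it beats most competitors in a (mostly) blind test. In the coming months, OpenAI will release the full (non-preview) version of o1. That polish could help o1-mini a lot and could push o1's ChatBotArena score up again. Within 6 months, another top lab will also ship a model closer to o1 than to a standard autoregressive transformer. Some argue the comparison is unfair because it is not normalized by compute, but pushing the frontier in a way that can be served to millions of users matters. The first organization to ship the best capabilities will get the users and the products built on its platform; costs come down afterward. On the open-source side, the expected timeline is longer: there will be some progress within 6 months, but a fully matching model will come later. Until then, we will have to use systems that are closer to Reflection 70B than to o1.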

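🤔 **o1 tops ChatBotArena, evidence of OpenAI's major progress on impactful style and RLHF training.** o1 outperforming other models on ChatBotArena shows both how distinctive its style is and how large the performance gain is. This marks a new era of post-training.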

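🤖 **o1-mini ranks third overall and first in math, which suggests that scaling laws are not the top priority for frontier labs.** The model is rumored to have only about 70 billion parameters, yet it beats most competitors in a (mostly) blind test.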

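🚀 **o1 is a notable step forward in both performance and style, pointing to a breakthrough in OpenAI's model development.** Its arrival suggests model development is entering a new phase, with more and stronger models to follow.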

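💡 **Reflection 70B is presented as evidence that fine-tuning an already fine-tuned model works.** Reflection 70B is a 70-billion-parameter language model built by fine-tuning a model that had already been fine-tuned, and it is described as performing well on multiple tasks. The approach is expected to become more common.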

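💡 **Reflection 70B is also presented as evidence that reflection prompting works.** Reflection prompting is a prompting technique that uses reflection tokens to improve model performance. The technique is expected to become more common.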

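🤔 **o1 raises questions about where AI model development goes next.** Its success shows OpenAI's lead in model development and how OpenAI is thinking about what comes next. Future development will put more weight on a model's style and impact and on how it is used in the real world.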

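🤔 **Open-source model development still lags behind closed models.** The expected timeline on the open-source side is longer: there will be some progress within 6 months, but a fully matching model will come later. Until then, open systems will be closer to Reflection 70B than to o1.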

🤔 **o1 will have a far-reaching effect on the future of AI models.** Its success shows OpenAI's lead in model development, and future work will put more weight on style, impact, and real-world use.


Reminder — no audio for these posts. Previous Issues can be found here.

o1 and Reflection 70B

OpenAI’s o1 remains the biggest story of the week, as we all expected. On the news front, it’s largely been more of the same: most people aren’t quite sure how to use it, but a few people are getting exceptional results from it. The vibes change with ChatBotArena’s results, which surprise even me in their magnitude. ChatBotArena, despite its flaws, is still the go-to destination for “how models will fare on easier, real-world tasks.”

The o1-preview model is another clear step up in Elo relative to the other frontier models on LMSYS’s arena. One should read its results with bigger error bars than the leaderboard shows, but it’s rare for us to see step changes on it.
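To make "a step up in Elo" and "bigger error bars" concrete, here is a minimal sketch of the standard Elo expected-score formula. The ratings and the error-bar width below are hypothetical numbers chosen for illustration, not leaderboard values, and the arena's own rating fit may differ from this textbook form.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def win_rate_range(rating_a: float, rating_b: float, half_width: float):
    """Win-probability range if each rating is only known to +/- half_width."""
    low = elo_expected_score(rating_a - half_width, rating_b + half_width)
    high = elo_expected_score(rating_a + half_width, rating_b - half_width)
    return low, high


# Hypothetical ratings for two models (assumed values, not real scores).
top_model, runner_up = 1350.0, 1320.0

point = elo_expected_score(top_model, runner_up)
low, high = win_rate_range(top_model, runner_up, half_width=10.0)
print(f"expected win rate: {point:.3f} (range {low:.3f} to {high:.3f})")
```

A gap of a few dozen points corresponds to a head-to-head win rate only modestly above 50%, and widening the error bars widens that range quickly, which is why a clear step change stands out.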

You should be optimistic that o1 models will become both easier to use and better understood, much like early generations of ChatGPT were relative to instruction-tuned models.

Remember that for the first part of this year, we were seeing models incrementally improve on ChatBotArena. It looked like a smooth rate of improvement in Elo scores. Each of the models in this list is a step on ChatBotArena: o1-preview, ChatGPT Sept. 2024, ChatGPT Aug. 2024, Gemini Pro August, GPT-4o May 2024. Below GPT-4o are the rest of the major players.

This reads as OpenAI finding a major hill to climb on impactful style and RLHF training in the last 6 months — and they’re exploiting it.

o1 being top is surprising because it has a very different subjective feel than the rest of the models. It is an outlier in style, which we know to be very important to ChatBotArena, and it is still a major improvement. This is a new era of post-training.

o1-mini being third overall, and joint top in math, pushes back against scaling laws being the highest priority to exploit among frontier labs. This model is rumored to have about 70 billion parameters and it is crushing most competition in a (mostly) blind test.

In the coming months, we’ll see OpenAI release the full version of o1 (i.e. not the preview). This polish could be helping o1-mini a lot and could cause another boost in o1 on ChatBotArena. We’ll also see another top lab ship something closer to this than a standard autoregressive transformer within 6 months.

Some have said that this is not a fair comparison because it isn’t normalized by compute. Pushing the frontier out in a way that is served to millions of users counts. The first organization to ship the best capabilities will get the users and the products built on their platform. Then, costs will come down.

On the open-source side of things, the expected timeline is longer. We’ll see signs of life in 6 months, but a fully matching model will take much longer. In the meantime, we’ll be stuck with systems much closer to Reflection 70B than o1.

The story of Reflection 70B has been covered extensively at this point, almost abusively, but the coverage has largely missed the insights from the model. There are a few things that we can take more seriously at a technical level following this:

    Fine-tuning already fine-tuned models, and

    Reflection prompting (why I wrote an article the day before Reflection 70B on spending more on inference).

Fine-tuning fine-tunes is a very popular enterprise feature. OpenAI/Anthropic/Google’s fine-tuning APIs are on top of fully fine-tuned chat models, mostly with techniques like LoRA. This is where businesses should start when they’re thinking about fine-tuning models like Llama for internal applications.
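For readers who haven’t looked at LoRA, here is a minimal sketch of the weight update it learns on top of a frozen, already fine-tuned layer. This is the generic low-rank formulation, W + (alpha / r) · B · A, written in plain Python with tiny assumed dimensions; it is not any provider’s fine-tuning API, and the sizes, seed, and `alpha` value are illustrative choices.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged LoRA weight."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Assumed toy sizes: an 8x8 layer adapted with rank-2 factors.
d_out, d_in, rank = 8, 8, 2
rng = random.Random(0)

# Frozen weights of the already fine-tuned chat model (random stand-ins here).
w = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
# Trainable factors: A starts small and random, B starts at zero,
# so the adapted layer initially behaves exactly like the frozen one.
a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

merged = lora_merge(w, a, b, alpha=16, rank=rank)

full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(f"trainable parameters: {lora_params} with LoRA vs {full_params} for the full layer")
```

Because only the two small factors are trained, the starting chat model stays intact, which is part of why layering a fine-tune on top of a fine-tune is cheap to offer as a product.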

At the same time, there’s not a good academic and evaluation ecosystem for these types of models. We’ve seen other strong open models in the past that are fine-tuned multiple times: Nexusflow released Athene 70B just before Llama 3.1 and it was very strong, Arcee AI has a strong model, the top models on the RewardBench evaluation benchmark are fine-tunes of fine-tunes, a 9B model fine-tuned from Gemma 9B Instruct matches Google’s Gemma 27B fine-tune in ChatBotArena, and there are more examples I missed. Reflection is one of these.

We need to construct clearer norms for thinking of these models. It’s too easy to write them off as leveraging the success of the previous model, but that’s reductionist.

The second point is that Reflection 70B’s prompting technique with reflection tokens was largely validated by the o1 release. This is something a lot of people are currently working on. It’s a simple way to spend more on inference, as we have seen with chain-of-thought prompting, and it is only going to become more prevalent.
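As a rough illustration of what this kind of prompting looks like in practice, here is a small sketch that separates a tagged response into its reasoning, reflection, and final answer. The tag names (thinking, reflection, output) and the sample response are assumptions made for the example, not a confirmed specification of Reflection 70B or o1.

```python
import re


def tag(name: str, body: str) -> str:
    """Wrap body in an opening and closing tag with the given name."""
    return f"<{name}>{body}</{name}>"


def extract(name: str, text: str) -> list:
    """Return the stripped contents of every block with the given tag name."""
    pattern = rf"<{name}>(.*?)</{name}>"
    return [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]


def final_answer(text: str) -> str:
    """Use the last output block if present, else fall back to the raw text."""
    outputs = extract("output", text)
    return outputs[-1] if outputs else text.strip()


# Hypothetical model response: a first attempt, a self-correction, an answer.
sample = "\n".join(
    [
        tag("thinking", "17 * 3 = 41"),
        tag("reflection", "17 * 3 is 51, not 41."),
        tag("output", "51"),
    ]
)

print("reflections:", extract("reflection", sample))
print("answer:", final_answer(sample))
```

The extra thinking and reflection text is where the additional inference spend goes: the user sees only the short answer, while the model generates several times as many tokens to get there.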



Models

This week, Qwen released its 2.5 suite. It’s impressive, but I’ll cover it more in the next issue.

Base
