Interconnects 2024年10月24日
Artifacts 5: Deepseek's Janus, I'm writing a Mini RLHF book, Qwen 2.5, video datasets, audio models, and more

Reminder — no audio for these posts. Previous Issues can be found here.

To start, I wanted to make sure folks are aware of one of my more notable side-projects this year -- a short textbook on RLHF. Eventually, I want to publish this as a physical book, but it is a digital-first book with a nice website and PDF available. It's clearly not complete if you start poking around, but I'm contributing most weeks and want to have a beta version by the end of the year. As I get more text written, folks reading, telling me what to add, and fixing typos or formatting problems directly on GitHub will go a long way. I’ll send major updates here.


Fun links

I got a few fun comments in response to my post on How Scaling Changes Model Behavior. The core argument was that all that changes with scaling is that the log probabilities shift slightly, and we don't know how that translates into value creation. Andrew Carr pointed out this awesome sentence in the recent Meta VideoGen paper that showed a clear relationship between scaling and human preferences.

We observe that the validation loss is well correlated with human evaluation results as the later checkpoints with lower validation loss perform better in the evaluations.

Maybe scaling is just messy for text models?
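The relationship the Meta paper describes is easy to check numerically: correlate validation loss with human-eval win rate across checkpoints. A minimal sketch follows; the checkpoint losses and win rates below are made-up illustrative numbers, not figures from the paper.

```python
# Hypothetical checkpoint data (illustrative only, not from the paper).
val_loss = [2.10, 1.95, 1.83, 1.76, 1.71]   # later checkpoints -> lower loss
win_rate = [0.41, 0.46, 0.52, 0.55, 0.58]   # human preference vs. a fixed baseline

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(val_loss, win_rate)
print(f"loss/preference correlation: {r:.3f}")  # strongly negative: lower loss, higher preference
```

A strongly negative correlation is what "validation loss is well correlated with human evaluation results" cashes out to when loss falls as win rate rises.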

Models

Qwen 2.5

The Qwen 2.5 models were launched about a month ago and are very competitive with the Llama 3.1 models. My bet is that most of the reason Llama is adopted so much more is the usual mix of a) license terms, b) Meta being better at getting the word out, and c) Meta being better at supporting developers.

Regardless, these models are extremely strong. A month later, when Mistral announced their two small models, Qwen 2.5 was the model the community was most frustrated they didn't compare with.

The Qwen 2.5 72B instruct model is above the original Gemini-1.5-Pro model from Google and below the Llama 3.1 405B Instruct model. Another eval analysis is available from Artificial Analysis.

The two models of interest are the Qwen2.5-72B-Instruct and Qwen2.5-Math-RM-72B by Qwen. Good reward models are few and far between these days, especially for math. The Instruct model has been scoring extremely high on evals and some vibe checks.
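One common use of a math reward model like Qwen2.5-Math-RM-72B is best-of-n reranking: sample several candidate solutions and keep the one the reward model scores highest. Here is a minimal sketch of that selection loop; `toy_scorer` is a stand-in for a real reward-model forward pass, and the prompt and candidates are hypothetical.

```python
def best_of_n(prompt, candidates, score_fn):
    """Return the candidate that the reward model scores highest for this prompt."""
    scored = [(score_fn(prompt, c), c) for c in candidates]
    return max(scored)[1]

# Stub scorer: a real setup would run the RM over each (prompt, completion) pair.
def toy_scorer(prompt, completion):
    # Pretend the RM prefers solutions that show their work.
    return len(completion.split())

prompt = "What is 12 * 13?"
candidates = ["156", "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156"]
best = best_of_n(prompt, candidates, toy_scorer)
print(best)
```

The point of a strong math RM is that the real `score_fn` separates correct derivations from plausible-looking wrong ones far better than a length heuristic like the stub above.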

Onto the normal programming.

(Evaluation tables for the Base and Instruct models appeared here in the original post.)

