Reminder — no audio for these posts. Previous Issues can be found here.
To start, I wanted to make sure folks are aware of one of my more notable side-projects this year -- a short textbook on RLHF. Eventually, I want to publish it as a physical book, but for now it is digital-first, with a nice website and PDF available. It's clearly not complete if you start poking around, but I'm adding to it most weeks and want to have a beta version by the end of the year. As I get more text written, having folks read it, tell me what to add, and fix typos or formatting problems directly on GitHub will go a long way. I'll send major updates here.
Fun links
There were two great lectures I watched in the last month. First is Andrew Barto's talk at the RL Conference (which he came out of retirement to give), an overview of the history of reinforcement learning. The other was a talk from Sasha Rush on long-context LLM extension -- right to the point.
Ai2's pretraining team released a blog post investigating training stability with OLMo models, and Soumith at PyTorch wrote a very long X post on how to approach pretraining models given a big cluster.
My former colleague at HuggingFace, Clementine, whom I think very highly of, wrote an entire Evaluation Guidebook.
Sergey Levine wrote a long post on his view of generalist robotics policies.
There was a very good post on the state of open data by Luis Villa.
I got a few fun comments in response to my post on How Scaling Changes Model Behavior. The core argument was that all that is changing is the log probs shifting slightly, and we don't know how that translates into value creation. Andrew Carr pointed out this awesome sentence in the recent Meta VideoGen paper, which showed a clear relationship between scaling and human preferences.
We observe that the validation loss is well correlated with human evaluation results as the later checkpoints with lower validation loss perform better in the evaluations.
Maybe scaling is just messy for text models?
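For intuition, that claim boils down to a rank correlation between per-checkpoint validation loss and human eval results. A minimal sketch of that check, with made-up numbers (not data from the paper):

```python
from scipy.stats import spearmanr

# Hypothetical per-checkpoint numbers, purely illustrative.
validation_loss = [2.31, 2.24, 2.19, 2.15, 2.12]   # lower is better
human_win_rate = [0.41, 0.46, 0.49, 0.53, 0.55]    # pairwise human eval win rate

rho, p = spearmanr(validation_loss, human_win_rate)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
# A strongly negative rho means lower-loss (later) checkpoints win more often.
```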
Models
Qwen 2.5
The Qwen 2.5 models were launched about a month ago and are very competitive with the Llama 3.1 models. My bet is that most of the reason Llama is adopted so much more is the usual mix of a) license terms, b) Meta being better at getting the word out, and c) Meta being better at supporting developers.
Regardless, these models are extremely strong. A month later, when Mistral announced their two small models, Qwen 2.5 was the model the community was most frustrated Mistral didn't compare against.
The Qwen 2.5 72B instruct model is above the original Gemini-1.5-Pro model from Google and below the Llama 3.1 405B Instruct model. Another eval analysis is available from Artificial Analysis.
The two models of interest are Qwen2.5-72B-Instruct and Qwen2.5-Math-RM-72B by Qwen. Good reward models are few and far between these days, especially for math. The Instruct model has been scoring extremely high on evals and some vibe checks.
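If you want to run your own vibe check on the instruct model, here is a minimal sketch with Hugging Face transformers (assuming the standard hub IDs; the 72B weights need multiple GPUs, so the 7B variant is an easier starting point):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"  # swap to "Qwen/Qwen2.5-7B-Instruct" for one GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is the integral of x * e^x?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```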
On to the normal programming.
Base
Zamba2-7B by Zyphra: This seems to be the strongest 7B hybrid architecture language model we have to date (with open weights). Mostly, these are important to keep an eye on, but I don't know of much use for them yet outside of researchers and very niche domains.
EuroLLM-1.7B by utter-project: A multilingual base model (instruct versions available too) trained on all the European languages. Eventually, I expect Llama to be the default here, ironically.