Published on May 5, 2025 6:56 PM GMT
arXiv | project page | Authors: Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
This paper from Tsinghua finds that RL on verifiable rewards (RLVR) mainly increases the frequency with which existing capabilities are sampled, rather than giving the base model new capabilities. To show this, they compare pass@k scores between a base model and an RLed model. Recall that pass@k is the percentage of questions a model solves at least once given k attempts per question.
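For concreteness, here is a minimal sketch of the standard unbiased pass@k estimator (from the Codex paper, Chen et al. 2021), which evaluations like this typically use; the function name and example numbers are mine, not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Returns the probability that at least one of k samples drawn
    without replacement from the n generated samples is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a question solved in only 4 of 256 samples is nearly invisible
# at pass@1 but counts fully toward pass@256.
print(pass_at_k(256, 4, 1))    # ~0.016
print(pass_at_k(256, 4, 256))  # 1.0
```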
Main result: On a math benchmark, the RLed model (yellow) has a much better raw score / pass@1 than the base model (black), but lower pass@256! The authors argue that RL prunes away reasoning pathways from the base model, and some rarely-sampled pathways turn out to be the ones that solve a problem. So RL “narrows the reasoning boundary”, i.e. the region of problems the model is capable of solving at least occasionally.
Further results
- Across multiple math benchmarks, base models have higher pass@k than RLed models for sufficiently large k, and sometimes the crossover point is as low as k=4. (Figure 2)
- Out of domain (i.e. on a different math benchmark), the RLed model does better at pass@256, especially when trained with algorithms like RLOO and Reinforce++. If there is a crossover point, it is in the thousands. (Figure 7)
- To see if the pass@1024 results are just lucky guesses, they sanity-check reasoning traces and find that for most questions, base models generate at least one correct reasoning trace. (But it's unclear whether the majority of correct answers are lucky guesses.) Also, for long coding tasks, it's nearly impossible to make a lucky guess.
- They test the perplexity (Below: Figure 6 left) of the RLed model's generations (pink bar) under the base model, conditioned on the task prompt. They find it is lower than the base model's own perplexity (turquoise bar), which “suggests that the responses from RL-trained models are highly likely to be generated by the base model”. (Perplexity higher than the base model's would imply the RLed model had either new capabilities or higher diversity than the base model.) See the sketch after this list for how such a check can be implemented.
- Distillation from a stronger teacher model, unlike RL, “expands the reasoning boundary”, so performance improves at all pass@k values. (Below: Figure 6 right)
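Here is a minimal sketch of how such a perplexity check could be implemented with Hugging Face transformers; the model name is a placeholder and the exact scoring procedure is my assumption, not the paper's released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; the paper's exact base model may differ.
base = AutoModelForCausalLM.from_pretrained("base-model")
tok = AutoTokenizer.from_pretrained("base-model")

def response_perplexity(prompt: str, response: str) -> float:
    """Perplexity of `response` under the base model, conditioned on `prompt`.

    Low perplexity means the base model could plausibly have generated the
    response itself, which is the paper's argument that RL mostly reweights
    existing behavior rather than creating new capabilities.
    Assumes the prompt/response boundary falls on a clean token boundary.
    """
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = base(full_ids).logits
    # Logits at position t predict token t+1, so score only response tokens.
    start = prompt_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    response_lp = token_lp[start - 1:]  # log-probs of the response tokens only
    return torch.exp(-response_lp.mean()).item()
```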
Limitations
- RL can enable emergent capabilities, especially on long-horizon tasks: suppose a capability requires 20 correct steps in a row, and the base model has an independent 50% success rate per step. Then the base model has roughly a 0.0001% success rate at the overall task, and it would be completely impractical to sample it a million times, but the RLed model may be capable of doing the task reliably. (See the back-of-the-envelope calculation after this list.)
- Domains: they only test on math, code generation, and mathematical visual reasoning. It's plausible that results would be different in domains requiring more creativity.
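A quick back-of-the-envelope calculation for the first point (the numbers follow the hypothetical in the bullet, not anything measured in the paper):

```python
import math

p_step, n_steps = 0.5, 20
p_task = p_step ** n_steps            # ~9.5e-7, i.e. ~0.0001%

# Independent samples needed for even a 50% chance of one full success:
# 1 - (1 - p_task)**k >= 0.5  =>  k >= log(0.5) / log(1 - p_task)
k_half = math.log(0.5) / math.log(1 - p_task)
print(f"{p_task:.1e}")   # 9.5e-07
print(f"{k_half:,.0f}")  # ~727,000 samples
```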
Takeaways
- This is reminiscent of past work on mode collapse from RLHF.
- Their methods seem reasonable, so I mostly believe this paper modulo the limitations above. The main reason I would disbelieve it is if models frequently have lucky guesses despite their checks (see point 3 of "Further results").
- Is the recent acceleration in AI software time horizons (doubling time 7 months → 4 months) mainly due to RL, and if so, is pass@8 improving more slowly than pass@1? If so, RL could hit diminishing returns. METR already has the data to test how quickly pass@8 is improving, so I could quickly run this and might do so in the comments.
Thanks to @Vladimir_Nesov for mentioning this paper here.