MarkTechPost@AI · March 9, 14:47
Microsoft and Ubiquant Researchers Introduce Logic-RL: A Rule-based Reinforcement Learning Framework that Acquires R1-like Reasoning Patterns through Training on Logic Puzzles

This article covers recent progress in LLM reasoning, including the techniques involved and related research results. Logic-RL, for example, acquires reasoning patterns by training on logic puzzles; the study also faces some challenges and points to directions for future research.

🎯 LLM reasoning has been improved through a range of techniques, such as CoT and MCTS

💡 Logic-RL is a rule-based RL framework; during training, the model naturally allocates more computational steps to reasoning

🚧 The study faces some challenges, such as Qwen2.5-Math-7B's tendency to generate code blocks

🎉 Logic-RL shows advantages in several respects, but the results are based on a small-scale logic dataset

Large language models (LLMs) such as DeepSeek-R1, Kimi-K1.5, and OpenAI-o1 have made significant strides in their post-training phase, showing impressive reasoning capabilities. While DeepSeek-R1 provides open-source model weights, it withholds training code and dataset details, raising questions about scaling reasoning abilities to smaller models, optimal training data structures, and reliable replication methodologies. Traditional mathematics datasets like GSM8K and Omni-MATH present inconsistent difficulty levels with varying logical depths, complicating controlled experimentation. The need for targeted datasets with controllable complexity has become critical for isolating variables and studying the emergence of reasoning capabilities in LLMs.

LLMs’ reasoning capabilities have been advanced through various techniques, with Chain-of-Thought (CoT) reasoning playing a crucial role in breaking down complex problems into manageable steps. Monte Carlo Tree Search (MCTS), initially successful in AlphaGo, has been adapted to guide model-based planning by balancing exploration and exploitation through tree-based search and random sampling. Further, post-training strategies to enhance reasoning include additional fine-tuning or reinforcement learning (RL) on specialized datasets. Methods such as Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and REINFORCE++ are showing promise, forming a frontier for advancing model reasoning alongside test-time scaling methods.
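
To make the policy-gradient side of this toolbox concrete, here is a minimal sketch of the PPO clipped surrogate loss, the kind of objective used in RL post-training of LLMs. It is an illustrative sketch, not code from any of the works mentioned above; the function name, tensor shapes, and clipping value are placeholder assumptions.

```python
# Minimal sketch (illustrative, not from the cited papers): the PPO clipped
# surrogate loss over a batch of per-sequence log-probabilities.
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """All inputs are 1-D tensors of shape (batch,)."""
    ratio = torch.exp(logp_new - logp_old)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (clipped) surrogate and negate it for minimization.
    return -torch.min(unclipped, clipped).mean()
```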

Researchers from Microsoft Research Asia and Ubiquant, together with independent researchers, have proposed Logic-RL, a rule-based RL framework that acquires reasoning patterns similar to DeepSeek-R1 through training on logic puzzles. It adopts the REINFORCE++ algorithm and the reward designs from DeepSeek-R1 for post-training. As training progresses, the model naturally allocates more computational steps to reasoning, expanding from generating hundreds of tokens to thousands, which enables deeper exploration and refinement of thought processes. Using only 5K generated logic puzzles, their 7B model shows cross-domain generalization, improving by 125% on AIME and 38% on AMC over the base model. This suggests that RL-trained reasoning develops abstract problem-solving patterns rather than domain-specific pattern matching.
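
As an illustration of what a rule-based reward can look like, the sketch below combines a format check (the response must wrap its reasoning and final answer in tags) with an exact-match answer check against the puzzle's ground truth. The tag names, reward values, and matching rule are simplifying assumptions in the spirit of Logic-RL and DeepSeek-R1, not the paper's exact specification.

```python
# Hypothetical sketch of a rule-based reward: format compliance plus answer
# correctness. Tag names and reward values are illustrative assumptions.
import re

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(response: str, gold_answer: str) -> float:
    match = THINK_ANSWER.search(response)
    if match is None:
        return -1.0                                   # malformed output
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else -0.5

# Example on a Knights-and-Knaves-style puzzle:
resp = ("<think>Alice accuses Bob of lying, so ...</think>"
        "<answer>Alice is a knight, Bob is a knave</answer>")
print(rule_based_reward(resp, "Alice is a knight, Bob is a knave"))  # 1.0
```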

The researchers face challenges with Qwen2.5-Math-7B’s tendency to generate Python code blocks that conflict with the formatting requirements. Testing both Qwen2.5-7B-Base and Qwen2.5-7B-Instruct reveals nearly identical training metrics during RL training, including validation accuracy, response-length growth curves, and reward curves. The implementation shows dramatic improvements in reasoning capabilities, with output length increasing from an initial average of 500 tokens to approximately 2000 tokens after just 1000 RL training steps. This enables the emergence of more complex behaviors, such as reflection and exploration of alternative solutions; these behaviors significantly enhance the model’s ability to handle complex tasks and closely align with the results reported for DeepSeek-R1.

The results demonstrate that while PPO achieves significant advantages in accuracy and reward, it is 138% slower than REINFORCE++ in training speed. REINFORCE++ shows superior stability, performance gains, and training efficiency compared to GRPO, outperforming it across nearly all metrics, while GRPO exhibits the weakest performance among the three RL algorithms evaluated. The model’s super-OOD (out-of-distribution) generalization capability proves exceptionally strong, achieving an overall improvement of 125% on the AIME dataset and 38% on the AMC dataset. This synchronized improvement indicates that the RL process both enhances in-distribution performance and facilitates the emergence of robust, transferable reasoning strategies.
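
For intuition on how the compared algorithms differ, the sketch below contrasts a GRPO-style group-relative advantage (each reward normalized within its group of sampled responses per prompt) with a simple batch-mean baseline in the REINFORCE family. It is a simplified illustration under stated assumptions and omits KL penalties, token-level credit assignment, and other details of the actual algorithms.

```python
# Illustrative sketch of two baseline/advantage choices; not the authors' code.
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: (num_prompts, group_size) -- normalize within each prompt's group."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    return (rewards - mean) / std

def batch_mean_advantages(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE-style: subtract a single batch-wide mean reward as baseline."""
    return rewards - rewards.mean()

rng = np.random.default_rng(0)
r = rng.random((4, 8))                  # 4 prompts, 8 sampled responses each
print(grpo_advantages(r).shape, batch_mean_advantages(r).shape)  # (4, 8) (4, 8)
```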

This study shows the significant potential of Logic-RL in developing complex reasoning skills in language models through a rule-based RL framework. However, it’s important to acknowledge that the findings are based on a relatively small-scale logic dataset, which may limit their applicability. The generalizability of these results to large-scale real-world mathematical or coding scenarios remains an open question that requires further investigation. Future research should focus on extending this approach to more diverse and complex datasets to thoroughly validate its effectiveness and robustness across different domains and problem types. By maintaining this work as an open research project, the researchers aim to benefit the broader scientific community.


Check out the Paper. All credit for this research goes to the researchers of this project.

