MarkTechPost@AI · March 9, 14:47
Microsoft and Ubiquant Researchers Introduce Logic-RL: A Rule-based Reinforcement Learning Framework that Acquires R1-like Reasoning Patterns through Training on Logic Puzzles

This article covers recent progress in LLM reasoning, including the techniques involved and related research results. Logic-RL, for example, acquires reasoning patterns by training on logic puzzles; the study also faces some challenges and points to directions for future research.

🎯 LLM reasoning has been improved through a range of techniques, such as CoT and MCTS

💡 Logic-RL is a rule-based RL framework; during training, the model naturally allocates more computational steps to reasoning

🚧 The study faces some challenges, such as Qwen2.5-Math-7B's tendency to generate code blocks

🎉 Logic-RL shows advantages in several respects, but the results are based on a small-scale logic dataset

Large language models (LLMs) such as DeepSeek-R1, Kimi-K1.5, and OpenAI-o1 have made significant strides in their post-training phase, showing impressive reasoning capabilities. While DeepSeek-R1 provides open-source model weights, it withholds training code and dataset details, raising questions about scaling reasoning abilities to smaller models, optimal training data structures, and reliable replication methodologies. Traditional mathematics datasets like GSM8K and Omni-MATH present inconsistent difficulty levels with varying logical depths, complicating controlled experimentation. The need for targeted datasets with controllable complexity has become critical for isolating variables and studying the emergence of reasoning capabilities in LLMs.

LLMs’ reasoning capabilities have been advanced through various techniques, with Chain-of-Thought (CoT) reasoning playing a crucial role in breaking down complex problems into manageable steps. Monte Carlo Tree Search (MCTS), initially successful in AlphaGo, has been adapted to guide model-based planning by balancing exploration and exploitation through tree-based search and random sampling. Further, post-training strategies to enhance reasoning include additional fine-tuning or reinforcement learning (RL) on specialized datasets. Methods such as Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and REINFORCE++ are showing promise, forming a frontier for advancing model reasoning alongside test-time scaling methods.
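
To make the policy-gradient side of this toolbox concrete, here is a minimal sketch of the PPO clipped surrogate loss, the kind of objective used in RL post-training of LLMs. It is an illustrative sketch, not code from any of the works mentioned above; the function name, tensor shapes, and clipping value are placeholder assumptions.

```python
# Minimal sketch (illustrative, not from the cited papers): the PPO clipped
# surrogate loss over a batch of per-sequence log-probabilities.
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """All inputs are 1-D tensors of shape (batch,)."""
    ratio = torch.exp(logp_new - logp_old)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (clipped) surrogate and negate it for minimization.
    return -torch.min(unclipped, clipped).mean()
```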

Researchers from Microsoft Research Asia and Ubiquant, together with independent researchers, have proposed Logic-RL, a rule-based RL framework that acquires reasoning patterns similar to DeepSeek-R1 through training on logic puzzles. It adopts the REINFORCE++ algorithm and the reward designs from DeepSeek-R1 for post-training. As training progresses, the model naturally allocates more computational steps to reasoning, expanding from generating hundreds of tokens to thousands, which enables deeper exploration and refinement of thought processes. Using only 5K generated logic puzzles, their 7B model shows cross-domain generalization, improving by 125% on AIME and 38% on AMC over the base model. This suggests that RL-trained reasoning develops abstract problem-solving patterns rather than domain-specific pattern matching.
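
As an illustration of what a rule-based reward can look like, the sketch below combines a format check (the response must wrap its reasoning and final answer in tags) with an exact-match answer check against the puzzle's ground truth. The tag names, reward values, and matching rule are simplifying assumptions in the spirit of Logic-RL and DeepSeek-R1, not the paper's exact specification.

```python
# Hypothetical sketch of a rule-based reward: format compliance plus answer
# correctness. Tag names and reward values are illustrative assumptions.
import re

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(response: str, gold_answer: str) -> float:
    match = THINK_ANSWER.search(response)
    if match is None:
        return -1.0                                   # malformed output
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else -0.5

# Example on a Knights-and-Knaves-style puzzle:
resp = ("<think>Alice accuses Bob of lying, so ...</think>"
        "<answer>Alice is a knight, Bob is a knave</answer>")
print(rule_based_reward(resp, "Alice is a knight, Bob is a knave"))  # 1.0
```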

The researchers face challenges with Qwen2.5-Math-7B’s tendency to generate Python code blocks that conflict with the formatting requirements. Testing both Qwen2.5-7B-Base and Qwen2.5-7B-Instruct reveals nearly identical training metrics during RL training, including validation accuracy, response-length growth curves, and reward curves. The implementation shows dramatic improvements in reasoning capabilities, with output length increasing from an initial average of 500 tokens to approximately 2000 tokens after just 1000 RL training steps. This enables the emergence of more complex behaviors, such as reflection and exploration of alternative solutions; these behaviors significantly enhance the model’s ability to handle complex tasks and closely align with the results reported for DeepSeek-R1.

The results demonstrate that while PPO achieves significant advantages in accuracy and reward, it is 138% slower than REINFORCE++ in training speed. REINFORCE++ shows superior stability, performance gains, and training efficiency compared to GRPO, outperforming it across nearly all metrics, while GRPO exhibits the weakest performance among the three RL algorithms evaluated. The model’s super-OOD (out-of-distribution) generalization capability proves exceptionally strong, achieving an overall improvement of 125% on the AIME dataset and 38% on the AMC dataset. This synchronized improvement indicates that the RL process both enhances in-distribution performance and facilitates the emergence of robust, transferable reasoning strategies.
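
For intuition on how the compared algorithms differ, the sketch below contrasts a GRPO-style group-relative advantage (each reward normalized within its group of sampled responses per prompt) with a simple batch-mean baseline in the REINFORCE family. It is a simplified illustration under stated assumptions and omits KL penalties, token-level credit assignment, and other details of the actual algorithms.

```python
# Illustrative sketch of two baseline/advantage choices; not the authors' code.
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: (num_prompts, group_size) -- normalize within each prompt's group."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-8
    return (rewards - mean) / std

def batch_mean_advantages(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE-style: subtract a single batch-wide mean reward as baseline."""
    return rewards - rewards.mean()

rng = np.random.default_rng(0)
r = rng.random((4, 8))                  # 4 prompts, 8 sampled responses each
print(grpo_advantages(r).shape, batch_mean_advantages(r).shape)  # (4, 8) (4, 8)
```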

This study shows the significant potential of Logic-RL in developing complex reasoning skills in language models through a rule-based RL framework. However, it’s important to acknowledge that the findings are based on a relatively small-scale logic dataset, which may limit their applicability. The generalizability of these results to large-scale real-world mathematical or coding scenarios remains an open question that requires further investigation. Future research should focus on extending this approach to more diverse and complex datasets to thoroughly validate its effectiveness and robustness across different domains and problem types. By maintaining this work as an open research project, the researchers aim to benefit the broader scientific community.


Check out the Paper. All credit for this research goes to the researchers of this project.

