MarkTechPost@AI · 2 days ago, 14:55
Microsoft AI Released Phi-4-Reasoning: A 14B Parameter Open-Weight Reasoning Model that Achieves Strong Performance on Complex Reasoning Tasks

Microsoft has released the Phi-4 reasoning model family, comprising three models: Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning. Built on the Phi-4 base model (14B parameters), they are designed for complex reasoning tasks in mathematics, science, and software-related domains. Through structured supervised fine-tuning and reinforcement learning, the Phi-4 reasoning models achieve results on multiple reasoning benchmarks comparable to much larger models while maintaining computational efficiency and interpretability. Their open-weight release and transparent benchmarking set an example for the future development of small LLMs.

💡 The Phi-4 reasoning family includes three models: Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning. All are built on the Phi-4 base model (14B parameters) and focus on complex reasoning tasks.

📚 Phi-4-reasoning is optimized through structured supervised fine-tuning (SFT) on more than 1.4 million prompts, emphasizing boundary cases at the edge of the base Phi-4 model's capabilities. Training uses a Chain-of-Thought format that encourages explicit tags in the output, separating the reasoning process from the final answer.

🚀 Phi-4-reasoning-plus is further optimized with reinforcement learning via Group Relative Policy Optimization (GRPO) on a small, curated set of math problems, using a reward function designed to encourage correct, concise, and well-structured outputs while penalizing verbosity, repetition, and format violations.

🧪 Evaluation shows that Phi-4-reasoning and Phi-4-reasoning-plus achieve competitive results across a broad range of reasoning benchmarks relative to larger open-source models, and generalize well to planning and combinatorial problems such as TSP and 3SAT.

Despite notable advancements in large language models (LLMs), effective performance on reasoning-intensive tasks—such as mathematical problem solving, algorithmic planning, or coding—remains constrained by model size, training methodology, and inference-time capabilities. Models that perform well on general NLP benchmarks often lack the ability to construct multi-step reasoning chains or reflect on intermediate problem-solving states. Furthermore, while scaling up model size can improve reasoning capacity, it introduces prohibitive computational and deployment costs, especially for applied use in education, engineering, and decision-support systems.

Microsoft Releases Phi-4 Reasoning Model Suite

Microsoft recently introduced the Phi-4 reasoning family, consisting of three models—Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning. These models are derived from the Phi-4 base (14B parameters) and are specifically trained to handle complex reasoning tasks in mathematics, scientific domains, and software-related problem solving. Each variant addresses different trade-offs between computational efficiency and output precision. Phi-4-reasoning is optimized via supervised fine-tuning, while Phi-4-reasoning-plus extends this with outcome-based reinforcement learning, particularly targeting improved performance in high-variance tasks such as competition-level mathematics.

The models were released with transparent training details and evaluation logs, including contamination-aware benchmark design, and are hosted openly on Hugging Face for reproducibility and public access.
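Because the models separate their chain-of-thought from the final answer with explicit tags, downstream code typically needs to strip the reasoning block before using the answer. The sketch below assumes a `<think>...</think>` wrapper; the exact tag name and output layout are assumptions for illustration, not confirmed details of the released models:

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split a model completion into (reasoning, answer).

    Assumes the model wraps its chain-of-thought in <think>...</think>
    tags and places the final answer after the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        # No explicit reasoning block: treat the whole output as the answer.
        return "", output.strip()
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()
    return reasoning, answer

# Example completion in the assumed format:
completion = "<think>2 + 2 equals 4 because ...</think>\nThe answer is 4."
reasoning, answer = split_reasoning(completion)
print(answer)  # -> The answer is 4.
```

Separating the two spans also makes it straightforward to score only the final answer while logging the reasoning trace for inspection.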

Technical Composition and Methodological Advances

The Phi-4-reasoning models build upon the Phi-4 architecture with targeted improvements to model behavior and training regime. Key methodological decisions include:

- Structured supervised fine-tuning (SFT) on more than 1.4 million prompts, focused on boundary cases at the edge of the base model's capabilities.
- A Chain-of-Thought output format with explicit tags that separate the intermediate reasoning from the final answer.
- For Phi-4-reasoning-plus, outcome-based reinforcement learning via Group Relative Policy Optimization (GRPO) on a small curated set of math problems, with a reward function that favors correct, concise, well-structured outputs and penalizes verbosity, repetition, and format violations.

This data-centric and format-aware training regime supports better inference-time utilization and model generalization across domains, including unseen symbolic reasoning problems.
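The reward shaping used in the GRPO stage rewards correct, concise, well-structured outputs and penalizes verbosity and format violations. A toy sketch of such a reward follows; the weights, length budget, and tag name are all invented for illustration and are not Microsoft's actual reward function:

```python
def reward(output: str, is_correct: bool,
           target_len: int = 2000, max_len: int = 8000) -> float:
    """Illustrative scalar reward: bonus for correctness, penalty for
    missing reasoning tags, and a capped penalty for verbosity.
    All constants are hypothetical.
    """
    r = 1.0 if is_correct else -1.0
    # Format check: reasoning should be wrapped in explicit tags (assumed <think>).
    if not ("<think>" in output and "</think>" in output):
        r -= 0.5
    # Length penalty: discourage outputs that run far past the target budget.
    overflow = max(0, len(output) - target_len)
    r -= min(overflow / max_len, 0.5)
    return r
```

Under a scheme like this, a correct, well-tagged, short completion scores 1.0, while an untagged wrong answer scores -1.5, giving the policy a gradient toward both accuracy and discipline in formatting.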

Evaluation and Comparative Performance

Across a broad range of reasoning benchmarks, Phi-4-reasoning and Phi-4-reasoning-plus deliver competitive results relative to significantly larger open-weight models.

Phi-4-reasoning-plus shows strong performance not only on domain-specific evaluations but also generalizes well to planning and combinatorial problems like TSP and 3SAT, despite no explicit training in these areas. Performance gains were also observed in instruction-following (IFEval) and long-context QA (FlenQA), suggesting the chain-of-thought formulation improves broader model utility.

Importantly, Microsoft reports full variance distributions across 50+ generation runs for sensitive datasets like AIME 2025, revealing that Phi-4-reasoning-plus matches or exceeds the performance consistency of models like o3-mini, while remaining disjoint from smaller baseline distributions like DeepSeek-R1-Distill.
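Reporting variance across many generation runs, as Microsoft does for AIME 2025, amounts to computing per-run accuracy statistics rather than a single point score. A minimal sketch with hypothetical data (the runs below are made up for illustration):

```python
from statistics import mean, stdev

def run_accuracy_stats(runs: list[list[bool]]) -> tuple[float, float]:
    """Given per-run correctness flags (one inner list per independent
    generation run over the same problem set), return the mean and
    sample standard deviation of per-run accuracy."""
    accs = [sum(run) / len(run) for run in runs]
    return mean(accs), stdev(accs)

# Three hypothetical runs over a 4-problem set:
runs = [[True, True, False, True],
        [True, False, False, True],
        [True, True, True, True]]
mu, sigma = run_accuracy_stats(runs)  # mean 0.75, stdev 0.25
```

Comparing full distributions like this, rather than single best scores, is what allows the claim that two models' performance ranges are disjoint.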

Conclusion and Implications

The Phi-4 reasoning models represent a methodologically rigorous effort to advance small model capabilities in structured reasoning. By combining data-centric training, architectural tuning, and minimal but well-targeted reinforcement learning, Microsoft demonstrates that 14B-scale models can match or outperform much larger systems in tasks requiring multi-step inference and generalization.

The models’ open-weight availability and transparent benchmarking set a precedent for future development in small LLMs, particularly for applied domains where interpretability, cost, and reliability are paramount. Future work is expected to extend the reasoning capabilities into additional STEM fields, improve decoding strategies, and explore scalable reinforcement learning on longer horizons.


Check out the Paper, Hugging Face page, and Microsoft Blog.
