MarkTechPost@AI · July 11, 11:27
Microsoft Releases Phi-4-mini-Flash-Reasoning: Efficient Long-Context Reasoning with Compact Architecture

Microsoft has introduced Phi-4-mini-Flash-Reasoning, an open, lightweight language model designed for long-context reasoning while maintaining efficient inference. Built on Phi-4-mini with 3.8 billion parameters and fine-tuned for dense reasoning tasks such as math problem solving and multi-hop question answering, it uses Microsoft's new SambaY decoder-hybrid-decoder architecture, delivers strong performance among compact models, and runs up to 10× faster than its predecessor on long-generation tasks. The release of Phi-4-mini-Flash-Reasoning marks a new direction for efficient long-context language models.

🧠 Phi-4-mini-Flash-Reasoning is the newest member of Microsoft's Phi-4 model family: a lightweight, 3.8-billion-parameter language model designed primarily for long-context reasoning while maintaining efficient inference.

⚙️ At the model's core is the SambaY architecture, a novel decoder-hybrid-decoder design that integrates State Space Models (SSMs) with attention layers through a lightweight mechanism called the Gated Memory Unit (GMU). This structure enables efficient memory sharing between layers and significantly reduces inference latency in long-context and long-generation scenarios.

🚀 Phi-4-mini-Flash-Reasoning excels at reasoning tasks such as math problem solving and multi-hop question answering. On the Math500 benchmark it reaches 92.45% pass@1 accuracy, outperforming Phi-4-mini-Reasoning and other open models. This is enabled by an architecture built for long Chain-of-Thought (CoT) generation, 64K-token context support, and optimized inference under the vLLM framework.

💡 Thanks to the decoder-hybrid-decoder design, the model achieves efficiency gains on long-context benchmarks. For example, with a sliding window attention (SWA) size as small as 256, it maintains high retrieval accuracy, indicating that long-range token dependencies are well captured through SSMs and GMU-based memory sharing.

🔓 Microsoft has open-sourced the model's weights and configuration on Hugging Face, giving the community full access. The model supports a 64K context length, runs under standard Hugging Face and vLLM runtimes, and is optimized for fast token throughput on A100 GPUs.

Phi-4-mini-Flash-Reasoning, the latest addition to Microsoft’s Phi-4 model family, is an open, lightweight language model designed to excel at long-context reasoning while maintaining high inference efficiency. Released on Hugging Face, this 3.8B parameter model is a distilled version of Phi-4-mini, fine-tuned for dense reasoning tasks like math problem solving and multi-hop question answering. Built using Microsoft’s new SambaY decoder-hybrid-decoder architecture, it achieves state-of-the-art performance among compact models and operates up to 10× faster than its predecessor on long-generation tasks.

Architecture: Gated Memory Meets Hybrid Decoding

At the core of Phi-4-mini-Flash-Reasoning is the SambaY architecture, a novel decoder-hybrid-decoder model that integrates State Space Models (SSMs) with attention layers using a lightweight mechanism called the Gated Memory Unit (GMU). This structure enables efficient memory sharing between layers, significantly reducing inference latency in long-context and long-generation scenarios.

Unlike Transformer-based architectures that rely heavily on memory-intensive attention computations, SambaY leverages Samba (a hybrid SSM architecture) in the self-decoder and replaces roughly half of the cross-attention layers in the cross-decoder with GMUs. GMUs serve as cheap, element-wise gating functions that reuse the hidden state from the final SSM layer, thereby avoiding redundant computation. This results in a linear-time prefill complexity and lower decoding I/O, yielding substantial speedups during inference.
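To make the gating idea concrete, here is a minimal, illustrative sketch of an element-wise gated memory unit in PyTorch. The class name, projection, and shapes are assumptions for illustration only and do not reproduce the exact SambaY implementation.

```python
import torch
import torch.nn as nn

class GatedMemoryUnit(nn.Module):
    """Illustrative element-wise gate that reuses a cached memory state from an
    earlier SSM layer instead of recomputing cross-attention (not the official code)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, hidden: torch.Tensor, ssm_memory: torch.Tensor) -> torch.Tensor:
        # hidden:     current decoder hidden states, shape (batch, seq, d_model)
        # ssm_memory: hidden state cached from the final SSM layer, same shape
        gate = torch.sigmoid(self.proj(hidden))
        # Element-wise reuse of the cached state: per-token cost is independent
        # of sequence length, since no KV cache is scanned.
        return gate * ssm_memory
```

The key point the sketch captures is that the gate only touches the current token's hidden vector and one cached memory state, which is why roughly half of the cross-attention layers can be swapped out without re-reading the full context at every decoding step.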

Training Pipeline and Reasoning Capabilities

The Phi-4-mini-Flash model is pre-trained on 5T tokens from high-quality synthetic and filtered real data, consistent with the rest of the Phi-4-mini family. Post pretraining, it undergoes multi-stage supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) using reasoning-focused instruction datasets. Notably, unlike Phi-4-mini-Reasoning, it excludes reinforcement learning (RLHF) entirely.
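For readers unfamiliar with DPO, the snippet below shows the standard objective as it is commonly implemented. It is a generic illustration, not Microsoft's training code, and it assumes summed per-sequence log-probabilities for the preferred and dispreferred completions are already available.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over summed per-sequence log-probabilities."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```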

Despite this, Phi-4-mini-Flash-Reasoning outperforms Phi-4-mini-Reasoning on a suite of complex reasoning tasks. On the Math500 benchmark, it achieves a pass@1 accuracy of 92.45%, outperforming Phi-4-mini-Reasoning (91.2%) and surpassing other open models like Qwen-1.5B and Bespoke-Stratos-7B. On AIME24/25, it shows strong gains as well, with over 52% accuracy on AIME24.
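As a point of reference, pass@1 here follows the usual pass@k convention. A minimal version of the standard unbiased estimator (Chen et al., 2021) is sketched below; the numbers in the example are illustrative and not from the Phi-4 evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n sampled solutions per problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem, pass@1 is simply the fraction of problems solved.
print(pass_at_k(n=1, c=1, k=1), pass_at_k(n=1, c=0, k=1))  # 1.0 0.0
```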

This performance leap is attributed to the architecture’s capacity for long Chain-of-Thought (CoT) generation. With 64K context length support and optimized inference under the vLLM framework, the model can generate and reason across multi-thousand-token contexts without bottlenecks. In latency benchmarks with 2K-token prompts and 32K-token generations, Phi-4-mini-Flash-Reasoning delivers up to 10× higher throughput than its predecessor.
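As an illustration of this long-generation setting, here is a minimal vLLM sketch. The model id, context limit, and sampling settings are assumptions based on the release description, not a reproduction of the benchmark harness.

```python
# Minimal vLLM sketch for long chain-of-thought generation (illustrative settings only).
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-4-mini-flash-reasoning",  # assumed Hugging Face model id
    max_model_len=65536,                           # 64K context window
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)
prompts = ["Prove that the sum of the first n odd numbers equals n^2. Think step by step."]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```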

Efficient Long-Context Processing

Efficiency gains in Phi-4-mini-Flash-Reasoning aren’t just theoretical. Through the decoder-hybrid-decoder design, the model achieves competitive performance on long-context benchmarks like Phonebook and RULER. For instance, with a sliding window attention (SWA) size as small as 256, it maintains high retrieval accuracy, indicating that long-range token dependencies are well captured via SSMs and GMU-based memory sharing.
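To clarify what an SWA size of 256 means in practice, the snippet below builds a causal local-attention mask in which each token can only attend to the previous 256 positions. It is a generic illustration, not the model's actual attention kernel.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 256) -> torch.Tensor:
    """True where attention is allowed: token i attends to tokens in [i - window + 1, i]."""
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]   # column index minus row index (j - i)
    return (rel <= 0) & (rel > -window)

mask = sliding_window_causal_mask(1024, window=256)
print(mask.shape, mask[512].sum().item())  # each row attends to at most 256 positions
```

The benchmark result cited above says that even with attention restricted this tightly, retrieval accuracy stays high because the SSM and GMU pathways carry the long-range information.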

These architectural innovations lead to reduced compute and memory overhead. For example, during decoding, GMU layers replace attention operations that would otherwise cost O(N·d) time per token, cutting that down to O(d), where N is sequence length and d is hidden dimension. The result is real-time inference capability even in multi-turn or document-level scenarios.
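A quick back-of-envelope calculation illustrates the asymptotic claim; the sequence length and hidden size below are assumed round numbers, not the model's exact configuration.

```python
# Illustrative per-token memory-read cost: full attention vs. a GMU-style gate.
N, d = 32_768, 3_072        # assumed generation length and hidden dimension
attention_reads = N * d     # each new token scans the whole KV cache: O(N*d)
gmu_reads = d               # element-wise gate over one cached state: O(d)
print(f"attention ~{attention_reads:,} reads/token vs GMU ~{gmu_reads:,} reads/token "
      f"({attention_reads // gmu_reads}x fewer)")
```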

Open Weights and Use Cases

Microsoft has open-sourced the model weights and configuration through Hugging Face, providing full access to the community. The model supports 64K context length, operates under standard Hugging Face and vLLM runtimes, and is optimized for fast token throughput on A100 GPUs.
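For readers who want to try the released checkpoint, a minimal Hugging Face transformers sketch is shown below. The model id, chat-template usage, and generation settings are assumptions based on the release; trust_remote_code may or may not be required depending on how the hybrid architecture is packaged.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-flash-reasoning"  # assumed model id on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # hybrid architectures often ship custom modeling code
)

messages = [{"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```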

Phi-4-mini-Flash-Reasoning targets a range of potential use cases: its combination of open access, reasoning ability, and efficient inference makes it a strong candidate for deployment in environments where compute resources are constrained but task complexity is high.

Conclusion

Phi-4-mini-Flash-Reasoning exemplifies how architectural innovation—particularly hybrid models leveraging SSMs and efficient gating—can bring transformative gains in reasoning performance without ballooning model size or cost. It marks a new direction in efficient long-context language modeling, paving the way for real-time, on-device reasoning agents and scalable open-source alternatives to commercial LLMs.


Check out the Paper, Codes, Model on Hugging Face and Technical details. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, Youtube and Spotify and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post Microsoft Releases Phi-4-mini-Flash-Reasoning: Efficient Long-Context Reasoning with Compact Architecture appeared first on MarkTechPost.
