Microsoft Azure Blog Announcements, February 27
Empowering innovation: The next generation of the Phi family

Microsoft has introduced two new small language models (SLMs), Phi-4-multimodal and Phi-4-mini, designed to give developers advanced AI capabilities. Phi-4-multimodal can process speech, vision, and text simultaneously, opening new possibilities for innovative applications, while Phi-4-mini excels at text tasks, delivering high accuracy and scalability in a compact form. Phi-4-multimodal integrates speech, vision, and language through a mixture-of-LoRAs technique, processing all of these modalities within the same representation space, with no need for complex pipelines or separate models. Phi-4-mini is a 3.8B-parameter dense, decoder-only transformer designed for speed and efficiency; it performs strongly on text tasks, supports sequences of up to 128,000 tokens, and offers strong function calling, instruction following, long-text processing, and reasoning capabilities.

🗣️ Phi-4-multimodal is Microsoft's first multimodal language model, with 5.6B parameters. It integrates speech, vision, and text processing, and uses advanced cross-modal learning techniques to enable more natural, context-aware interactions while optimizing for on-device execution and reduced computational overhead.

🖼️ Phi-4-multimodal demonstrates excellent speech and vision capabilities across multiple benchmarks. On speech-related tasks it outperforms specialized models such as WhisperV3 and SeamlessM4T-v2-Large in automatic speech recognition (ASR) and speech translation (ST), and it ranks at the top of the Hugging Face OpenASR leaderboard. On vision, despite its smaller model size, it remains competitive in math and science reasoning, document and chart understanding, optical character recognition (OCR), and visual science reasoning.

⚙️ Phi-4-mini is a 3.8B-parameter model designed for speed and efficiency, featuring grouped-query attention, a 200,000-token vocabulary, and shared input-output embeddings. It performs strongly on text tasks, including reasoning, math, coding, instruction following, and function calling, supports sequences of up to 128,000 tokens, and integrates seamlessly with structured programming interfaces.

🏢 Because of their small size, Phi-4-mini and Phi-4-multimodal can be used in compute-constrained inference environments. They can run on-device, especially when further optimized with ONNX Runtime for cross-platform availability. The long context window supports processing and reasoning over large amounts of text, and the small size also makes fine-tuning and customization easier and more affordable.

We are excited to announce Phi-4-multimodal and Phi-4-mini, the newest models in Microsoft’s Phi family of small language models (SLMs). These models are designed to empower developers with advanced AI capabilities. Phi-4-multimodal, with its ability to process speech, vision, and text simultaneously, opens new possibilities for creating innovative and context-aware applications. Phi-4-mini, on the other hand, excels in text-based tasks, providing high accuracy and scalability in a compact form. Both models are now available in Azure AI Foundry, Hugging Face, and the NVIDIA API Catalog, where developers can explore their full potential and experiment and innovate with ease.

What is Phi-4-multimodal?

Phi-4-multimodal marks a new milestone in Microsoft’s AI development as our first multimodal language model. At the core of innovation lies continuous improvement, and that starts with listening to our customers. In direct response to customer feedback, we’ve developed Phi-4-multimodal, a 5.6B parameter model that seamlessly integrates speech, vision, and text processing into a single, unified architecture.

By leveraging advanced cross-modal learning techniques, this model enables more natural and context-aware interactions, allowing devices to understand and reason across multiple input modalities simultaneously. Whether interpreting spoken language, analyzing images, or processing textual information, it delivers highly efficient, low-latency inference—all while optimizing for on-device execution and reduced computational overhead.

Natively built for multimodal experiences

Phi-4-multimodal is a single model with mixture-of-LoRAs that includes speech, vision, and language, all processed simultaneously within the same representation space. The result is a single, unified model capable of handling text, audio, and visual inputs—no need for complex pipelines or separate models for different modalities.
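As a rough illustration of the idea, the sketch below shows a single shared projection with one small low-rank adapter per modality, so every input is handled in the same representation space. This is a minimal conceptual example in PyTorch, not Microsoft's actual Phi-4-multimodal implementation; the class names, dimensions, and adapter rank are all made up for illustration.

```python
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    """A low-rank update up(down(x)), with the up projection initialized to zero."""

    def __init__(self, d_model: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # starts as a no-op on top of the base layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class MixtureOfLoRAsLinear(nn.Module):
    """One frozen base projection shared by all modalities, plus a per-modality adapter."""

    def __init__(self, d_model: int, modalities=("text", "vision", "speech")):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        for p in self.base.parameters():
            p.requires_grad_(False)  # the shared backbone weights stay frozen
        self.adapters = nn.ModuleDict({m: LoRAAdapter(d_model) for m in modalities})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Every modality passes through the same base projection (same representation
        # space); only the small low-rank adapter differs per modality.
        return self.base(x) + self.adapters[modality](x)


layer = MixtureOfLoRAsLinear(d_model=512)
speech_tokens = torch.randn(1, 20, 512)  # e.g. 20 audio-derived embeddings
print(layer(speech_tokens, modality="speech").shape)  # torch.Size([1, 20, 512])
```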

Phi-4-multimodal is built on a new architecture that enhances efficiency and scalability. It incorporates a larger vocabulary for improved processing, supports multilingual capabilities, and integrates language reasoning with multimodal inputs. All of this is achieved within a powerful, compact, highly efficient model that’s suited for deployment on devices and edge computing platforms.

This model represents a step forward for the Phi family of models, offering enhanced performance in a small package. Whether you’re looking for advanced AI capabilities on mobile devices or edge systems, Phi-4-multimodal provides a high-capability option that’s both efficient and versatile.

Unlocking new capabilities

With its increased range of capabilities and flexibility, Phi-4-multimodal opens exciting new possibilities for app developers, businesses, and industries looking to harness the power of AI in innovative ways. The future of multimodal AI is here, and it’s ready to transform your applications.

Phi-4-multimodal is capable of processing visual and audio inputs together. The following table shows the model quality when the input query for vision content is synthetic speech, on chart/table understanding and document reasoning tasks. Compared to other existing state-of-the-art omni models that accept both audio and visual signals as input, Phi-4-multimodal achieves much stronger performance on multiple benchmarks.

Phi-4-multimodal has demonstrated remarkable capabilities in speech-related tasks, emerging as a leading open model in multiple areas. It outperforms specialized models like WhisperV3 and SeamlessM4T-v2-Large in both automatic speech recognition (ASR) and speech translation (ST). The model has claimed the top position on the Hugging Face OpenASR leaderboard with an impressive word error rate of 6.14%, surpassing the previous best of 6.5% as of February 2025. Additionally, it is one of the few open models to successfully implement speech summarization and achieve performance levels comparable to the GPT-4o model. The model does have a gap with closed models, such as Gemini-2.0-Flash and GPT-4o-realtime-preview, on speech question answering (QA) tasks, as the smaller model size leaves less capacity to retain factual QA knowledge. Work is underway to improve this capability in the next iterations.

Phi-4-multimodal, with only 5.6B parameters, demonstrates remarkable vision capabilities across various benchmarks, most notably achieving strong performance on mathematical and science reasoning. Despite its smaller size, the model maintains competitive performance on general multimodal capabilities, such as document and chart understanding, Optical Character Recognition (OCR), and visual science reasoning, matching or exceeding closed models like Gemini-2-Flash-lite-preview/Claude-3.5-Sonnet.

What is Phi-4-mini?

Phi-4-mini is a 3.8B parameter model: a dense, decoder-only transformer featuring grouped-query attention, a 200,000-token vocabulary, and shared input-output embeddings, designed for speed and efficiency. Despite its compact size, it outperforms larger models in text-based tasks, including reasoning, math, coding, instruction-following, and function-calling. Supporting sequences up to 128,000 tokens, it delivers high accuracy and scalability, making it a powerful solution for advanced AI applications.
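For a sense of how this looks in practice, here is a minimal sketch of running Phi-4-mini on a text task with Hugging Face transformers. The model id below is an assumption; check the Phi-4-mini model card on Hugging Face for the exact id and any loading arguments it recommends.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"  # assumed Hugging Face id; see the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize why small language models suit edge deployment."},
]
# Build the chat prompt with the model's own template, then generate.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```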

To understand the model quality, we compare Phi-4-mini with a set of models over a variety of benchmarks as shown in Figure 4.

Function calling, instruction following, long context, and reasoning are powerful capabilities that enable small language models like Phi-4-mini to access external knowledge and functionality despite their limited capacity. Through a standardized protocol, function calling allows the model to integrate seamlessly with structured programming interfaces. When a user makes a request, Phi-4-mini can reason through the query, identify and call relevant functions with appropriate parameters, receive the function outputs, and incorporate those results into its responses. This creates an extensible, agent-based system in which the model’s capabilities can be enhanced by connecting it to external tools, application programming interfaces (APIs), and data sources through well-defined function interfaces. The following example simulates a smart home control agent with Phi-4-mini.
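Since the original figure for that example is not reproduced here, the sketch below illustrates the same loop in Python under stated assumptions: the tool schema, the JSON-in/JSON-out convention for the model's reply, and the `generate` helper are all made up for illustration, and the exact tool-definition format Phi-4-mini expects is described in its model card.

```python
import json

# A single illustrative tool the model is allowed to call.
TOOLS = [{
    "name": "set_thermostat",
    "description": "Set the target temperature of a room thermostat.",
    "parameters": {"room": "string", "temperature_celsius": "number"},
}]

def set_thermostat(room: str, temperature_celsius: float) -> dict:
    # Stand-in for a real smart home API call.
    return {"room": room, "status": "ok", "target": temperature_celsius}

def handle_request(user_request: str, generate) -> str:
    # `generate` is any callable that sends a prompt to the model and returns its text.
    system = (
        "You can call these functions by replying with JSON of the form "
        '{"name": ..., "arguments": {...}}:\n' + json.dumps(TOOLS)
    )
    reply = generate(f"{system}\n\nUser: {user_request}")
    call = json.loads(reply)                       # model proposes a function call
    result = set_thermostat(**call["arguments"])   # execute it with the given arguments
    # Feed the tool output back so the model can phrase the final answer.
    return generate(f"Function result: {json.dumps(result)}\nAnswer the user briefly.")
```

In a real deployment, `generate` would wrap a call into Phi-4-mini (for example through the chat template shown earlier), and the ad-hoc JSON parsing would be replaced by whatever tool-call format the model card specifies.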

At Headwaters, we are leveraging fine-tuned SLM like Phi-4-mini on the edge to enhance operational efficiency and provide innovative solutions. Edge AI demonstrates outstanding performance even in environments with unstable network connections or in fields where confidentiality is paramount. This makes it highly promising for driving innovation across various industries, including anomaly detection in manufacturing, rapid diagnostic support in healthcare, and enhancing customer experiences in retail. We are looking forward to delivering new solutions in the AI agent era with Phi-4 mini.
 
—Masaya Nishimaki, Company Director, Headwaters Co., Ltd. 

Customization and cross-platform

Thanks to their smaller sizes, the Phi-4-mini and Phi-4-multimodal models can be used in compute-constrained inference environments. These models can be used on-device, especially when further optimized with ONNX Runtime for cross-platform availability. Their lower computational needs make them a lower-cost option with much better latency. The longer context window enables taking in and reasoning over large text content such as documents, web pages, and code. Phi-4-mini and Phi-4-multimodal demonstrate strong reasoning and logic capabilities, making them good candidates for analytical tasks. Their small size also makes fine-tuning or customization easier and more affordable. The table below shows examples of fine-tuning scenarios with Phi-4-multimodal.

Task                                          | Base model | Fine-tuned model | Compute
Speech translation from English to Indonesian | 17.4       | 35.5             | 3 hours, 16 A100
Medical visual question answering             | 47.6       | 56.7             | 5 hours, 8 A100
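For the fine-tuning side, a common starting point is a parameter-efficient LoRA setup such as the sketch below, using Hugging Face peft with transformers. The model id and the `target_modules` names are assumptions for illustration, not the exact recipe behind the numbers above.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model_id = "microsoft/Phi-4-mini-instruct"  # assumed Hugging Face id; check the model card
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Which projection layers to adapt is model-specific; these names are assumptions,
# so inspect the loaded model to pick the right module names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["qkv_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices will train
```

Because only the adapter weights are updated, training with a standard trainer over task data stays cheap, which is what keeps runs like those in the table to a few hours on a handful of GPUs.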

For more information about customization, or to learn more about the models, take a look at the Phi Cookbook on GitHub. 

How can these models be used in action?

These models are designed to handle complex tasks efficiently, making them ideal for edge case scenarios and compute-constrained environments. Given the new capabilities Phi-4-multimodal and Phi-4-mini bring, the uses of Phi are only expanding. Phi models are being embedded into AI ecosystems and used to explore various use cases across industries.

Language models are powerful reasoning engines, and integrating small language models like Phi into Windows allows us to maintain efficient compute capabilities and opens the door to a future of continuous intelligence baked in across all your apps and experiences. Copilot+ PCs will build upon Phi-4-multimodal’s capabilities, delivering the power of Microsoft’s advanced SLMs without the energy drain. This integration will enhance productivity, creativity, and education-focused experiences, becoming a standard part of our developer platform.

—Vivek Pradeep, Vice President Distinguished Engineer of Windows Applied Sciences.
  1. Embedded directly into your smart device: Phone manufacturers integrating Phi-4-multimodal directly into a smartphone could enable the device to process and understand voice commands, recognize images, and interpret text seamlessly. Users could benefit from advanced features like real-time language translation, enhanced photo and video analysis, and intelligent personal assistants that understand and respond to complex queries. This would elevate the user experience by providing powerful AI capabilities directly on the device, ensuring low latency and high efficiency.
  2. On the road: Imagine an automotive company integrating Phi-4-multimodal into their in-car assistant systems. The model could enable vehicles to understand and respond to voice commands, recognize driver gestures, and analyze visual inputs from cameras. For instance, it could enhance driver safety by detecting drowsiness through facial recognition and providing real-time alerts. Additionally, it could offer seamless navigation assistance, interpret road signs, and provide contextual information, creating a more intuitive and safer driving experience while connected to the cloud and offline when connectivity isn’t available.
  3. Multilingual financial services: Imagine a financial services company integrating Phi-4-mini to automate complex financial calculations, generate detailed reports, and translate financial documents into multiple languages. For instance, the model can assist analysts by performing intricate mathematical computations required for risk assessments, portfolio management, and financial forecasting. Additionally, it can translate financial statements, regulatory documents, and client communications into various languages and could improve client relations globally.

Microsoft’s commitment to security and safety

Azure AI Foundry provides users with a robust set of capabilities to help organizations measure, mitigate, and manage AI risks across the AI development lifecycle for traditional machine learning and generative AI applications. Azure AI evaluations in AI Foundry enable developers to iteratively assess the quality and safety of models and applications using built-in and custom metrics to inform mitigations.

Both models underwent security and safety testing by our internal and external security experts using strategies crafted by Microsoft AI Red Team (AIRT). These methods, developed over previous Phi models, incorporate global perspectives and native speakers of all supported languages. They span areas such as cybersecurity, national security, fairness, and violence, addressing current trends through multilingual probing. Using AIRT’s open-source Python Risk Identification Toolkit (PyRIT) and manual probing, red teamers conducted single-turn and multi-turn attacks. Operating independently from the development teams, AIRT continuously shared insights with the model team. This approach assessed the new AI security and safety landscape introduced by our latest Phi models, ensuring the delivery of high-quality capabilities.

Take a look at the model cards for Phi-4-multimodal and Phi-4-mini, and the technical paper to see an outline of recommended uses and limitations for these models.

Learn more about Phi-4

We invite you to come explore the possibilities with Phi-4-multimodal and Phi-4-mini in Azure AI Foundry, Hugging Face, and NVIDIA API Catalog with a full multimodal experience. We can’t wait to hear your feedback and see the incredible things you will accomplish with our new models. 

