Microsoft Azure Blog Announcements, two days ago at 18:16
Project Flash update: Advancing Azure Virtual Machine availability monitoring

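Microsoft Azure has released the latest update to Project Flash, aimed at improving the precision and speed of virtual machine (VM) availability monitoring. By providing real-time detection and diagnosis of platform-level and user-level issues, the project helps customers quickly identify and resolve VM availability problems and maintain business continuity. Highlights of the update include a user vs. platform dimension for the VM availability metric and the integration of health resources events with Azure Monitor alerts for low-latency notifications. These new capabilities let users run workloads on Azure with greater confidence, and Flash has already been widely adopted by a number of industry-leading companies.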

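🚀 **Project Flash improves the precision and speed of Azure VM availability monitoring**: Project Flash is a cross-division Microsoft initiative that aims to deliver precise telemetry, real-time alerts, and scalable monitoring in a unified user experience that meets the diverse needs of VM availability monitoring. It rapidly detects platform-level issues and gives users the insights to diagnose and resolve problems in their own environments, supporting high availability and helping business Service-Level Agreements (SLAs) to be met.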

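💡 **Key capabilities for customer insight and response**: Flash provides detailed information on events such as VM reboots, application freezes, and host OS updates, including the cause and whether the event was planned. It supports trend analysis and alerting to speed up debugging, lets users build custom dashboards, and provides automated root cause analyses (RCAs) that spell out which VMs were affected, what caused the issue, how long it lasted, and what was done to fix it. It also delivers real-time notifications for critical events, such as node degradation that requires VM redeployment, platform-initiated service healing, or in-place reboots triggered by hardware issues.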

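📊 **A range of solutions for different monitoring needs**: Flash has grown into a robust, scalable monitoring framework offering several solutions. Azure Resource Graph suits investigations at scale and historical lookups; the Event Grid system topic (public preview) is for triggering time-sensitive mitigations such as VM restarts; Azure Monitor – Metrics (public preview) can be used to track trends and configure alerts; and Resource Health (general availability) offers convenient per-resource health checks and a 30-day historical view.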

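✨ **Latest updates improve the user experience and alerting**: This update introduces a user vs. platform dimension for the VM availability metric (public preview), letting users tell whether VM availability was affected by the Azure platform or by user activity. In addition, sending health resources events to Azure Monitor alerts (public preview) is enabled through Event Grid, so users can receive low-latency notifications via SMS, email, and other channels, combining Event Grid’s real-time delivery with Azure Monitor’s alerting capabilities.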

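🔮 **Outlook and best practices**: Going forward, Project Flash will expand to scenarios including switch failures, accelerated networking failures, and prediction of new classes of hardware failure, while continuing to improve data quality and consistency. For comprehensive VM availability monitoring, the recommendation is to combine Flash Health events (real-time insight into availability disruptions) with Scheduled Events (advance notice of planned maintenance) for effective downtime management and proactive decision-making.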

Previously, we shared an update on Project Flash as part of our Advancing Reliability blog series, reaffirming our commitment to helping Azure customers detect and diagnose virtual machine (VM) availability issues with speed and precision. This year, we’re excited to unveil the latest innovations that take VM availability monitoring to the next level—enabling customers to operate their workloads on Azure with even greater confidence. I’ve asked Yingqi (Halley) Ding, Technical Program Manager from the Azure Core Compute team, to walk us through the newest investments powering the next phase of Project Flash.

— Mark Russinovich, CTO, Deputy CISO, and Technical Fellow, Microsoft Azure.

Project Flash is a cross-division initiative at Microsoft. Its vision is to deliver precise telemetry, real-time alerts, and scalable monitoring—all within a unified, user-friendly experience designed to meet the diverse observability needs of virtual machine (VM) availability.

Flash addresses both platform-level and user-level challenges. It enables rapid detection of issues originating from the Azure platform, helping teams respond quickly to infrastructure-related disruptions. At the same time, it equips you with actionable insights to diagnose and resolve problems within your own environment. This dual capability supports high availability and helps ensure your business Service-Level Agreements are consistently met. It’s our mission to ensure you can:

During our team’s journey with Flash, it has been widely adopted by some of the world’s leading companies across e-commerce, gaming, finance, hedge funds, and many other sectors. Their extensive use of Flash underscores its effectiveness and value in meeting the diverse needs of high-profile organizations.

At BlackRock, VM reliability is critical to our operations. If a VM is running on degraded hardware, we want to be alerted quickly so we have the maximum opportunity to mitigate the issue before it impacts users. With Project Flash, we receive a resource health event integrated into our alerting processes the moment an underlying node in Azure infrastructure is marked unallocatable, typically due to health degradation. Our infrastructure team then schedules a migration of the affected resource to healthy hardware at an optimal time. This ability to predictively avoid abrupt VM failures has reduced our VM interruption rate and improved the overall reliability of our investment platform.

— Eli Hamburger, Head of Infrastructure Hosting, BlackRock.

Suite of solutions available today

The Flash initiative has evolved into a robust, scalable monitoring framework designed to meet the diverse needs of modern infrastructure—whether you’re managing a handful of VMs or operating at massive scale. Built with reliability at its core, Flash empowers you to monitor what matters most, using the tools and telemetry that align with your architecture and operational model.

Flash publishes VM availability states and resource health annotations for detailed failure attribution and downtime analysis. The guide below outlines your options so you can choose the right Flash monitoring solution for your scenario.

| Solution | Description |
| --- | --- |
| Azure Resource Graph (general availability) | For investigations at scale, centralized resource repositories, and historical lookups, you can periodically consume resource availability telemetry across all workloads at once using Azure Resource Graph (ARG). |
| Event Grid system topic (public preview) | To trigger time-sensitive and critical mitigations, such as redeploying or restarting VMs to prevent end-user impact, you can receive alerts within seconds of critical changes in resource availability via Event Handlers in Event Grid. |
| Azure Monitor – Metrics (public preview) | To track trends, aggregate platform metrics (e.g., CPU, disk), and configure precise threshold-based alerts, you can consume an out-of-the-box VM availability metric via Azure Monitor. |
| Resource Health (general availability) | To perform instantaneous and convenient per-resource health checks in the Portal UI, you can quickly view the RHC blade. You can also access a 30-day historical view of health checks for that resource to support fast and effective troubleshooting. |
Figure 1: Flash endpoints
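
As a rough illustration of the Azure Resource Graph option, the sketch below pairs a Kusto query with a small helper that tallies availability states from the returned rows. The table name `healthresources`, the type `microsoft.resourcehealth/availabilitystatuses`, and the property names are assumptions about the ARG schema that this post does not state. The sample rows and the helper names are invented for illustration only.

```python
from collections import Counter

# Assumed Kusto query for Azure Resource Graph. The table, type, and property
# names are not given in this post and should be checked against the ARG docs.
AVAILABILITY_QUERY = """
healthresources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| project targetId = tostring(properties.targetResourceId),
          state = tostring(properties.availabilityState)
"""


def count_states(rows):
    """Tally resources per availability state from query result rows."""
    return Counter(row.get("state", "Unknown") for row in rows)


def unavailable_vms(rows):
    """Return the sorted resource IDs whose state is not 'Available'."""
    return sorted(row["targetId"] for row in rows if row.get("state") != "Available")


# Hypothetical rows shaped like the projection above (not real data).
sample_rows = [
    {"targetId": "/subscriptions/000/vm-a", "state": "Available"},
    {"targetId": "/subscriptions/000/vm-b", "state": "Unavailable"},
    {"targetId": "/subscriptions/000/vm-c", "state": "Available"},
]

if __name__ == "__main__":
    print(count_states(sample_rows))
    print(unavailable_vms(sample_rows))
```

In practice the rows would come from running the query through an ARG client or the portal's query explorer on a schedule, which fits the periodic, at-scale consumption pattern described for this endpoint.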

What’s new?

Public preview: User vs platform dimension introduced for VM availability metric

Many customers have emphasized the need for user-friendly monitoring solutions that provide real-time, scalable access to compute resource availability data. This information is essential for triggering timely mitigation actions in response to availability changes.

Designed to satisfy this critical need, the VM availability metric is well-suited for tracking trends, aggregating platform metrics (such as CPU and disk usage), and configuring precise threshold-based alerts. You can utilize this out-of-the-box VM availability metric in Azure Monitor.

Figure 2: VM availability metric

Now you can use the Context dimension to identify whether VM availability was influenced by Azure or user-orchestrated activity. This dimension indicates, during any disruption or when the metric drops to zero, whether the cause was platform-triggered or user-driven. It can assume values of Platform, Customer, or Unknown.

Figure 3: Context dimension
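
To make the attribution idea concrete, here is a minimal sketch that sums downtime per Context value from a series of metric samples. It assumes the metric reads 1 when the VM is available and 0 when it is not, that samples arrive at a fixed interval, and that each sample carries its Context value. The function name and the sample data are hypothetical.

```python
from collections import defaultdict

# The three values the Context dimension can take, per the post.
VALID_CONTEXTS = {"Platform", "Customer", "Unknown"}


def downtime_by_context(samples, interval_minutes=1):
    """Sum minutes where availability is 0, grouped by Context.

    samples: iterable of (availability_value, context) pairs, one per interval.
    Unrecognized context strings are counted under "Unknown".
    """
    totals = defaultdict(int)
    for value, context in samples:
        if context not in VALID_CONTEXTS:
            context = "Unknown"
        if value == 0:
            totals[context] += interval_minutes
    return dict(totals)


# Hypothetical one-minute samples (not real telemetry).
sample_series = [
    (1, "Unknown"),
    (0, "Platform"),
    (0, "Platform"),
    (1, "Unknown"),
    (0, "Customer"),
]

if __name__ == "__main__":
    print(downtime_by_context(sample_series))
```

Splitting the totals this way separates downtime to raise with Azure (Platform) from downtime caused by your own operations (Customer).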

The new dimension is also supported in Azure Monitor alert rules as part of the filtering process.

Figure 4: Azure Monitor alert rule

Public preview: Enable sending health resources events to Azure Monitor alerts in Event Grid

Azure Event Grid is a highly scalable, fully managed Pub/Sub message distribution service that offers flexible message consumption patterns. Event Grid enables you to publish and subscribe to messages to support Internet of Things (IoT) solutions. Through HTTP, Event Grid enables you to build event-driven solutions, where a publisher service (such as Project Flash) announces its system state changes (events) to subscriber applications.

Figure 5: Event Grid system topics

With the integration of Azure Monitor alerts as a new event handler, you can now receive low-latency notifications—such as VM availability changes and detailed annotations—via SMS, email, push notifications, and more. This combines Event Grid’s near real-time delivery with Azure Monitor’s direct alerting capabilities.

Figure 6: Event Grid subscription
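
For readers who route these events to their own handler rather than directly to Azure Monitor alerts, the sketch below shows one way a subscriber might decide whether an incoming event warrants action. The event type string and the payload layout (`data.resourceInfo.properties.availabilityState`) are assumptions not confirmed by this post, and the sample event is fabricated for illustration. Check the Event Grid schema documentation before relying on either.

```python
import json

# Assumed event type for availability changes; not stated in this post.
AVAILABILITY_CHANGED = (
    "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusChanged"
)


def event_type(event):
    """Read the type from either a CloudEvents ('type') or Event Grid ('eventType') envelope."""
    return event.get("type") or event.get("eventType") or ""


def needs_mitigation(event):
    """Return True when the event reports a resource that is no longer fully available."""
    if event_type(event) != AVAILABILITY_CHANGED:
        return False
    # Hypothetical payload layout; missing keys simply yield False.
    props = event.get("data", {}).get("resourceInfo", {}).get("properties", {})
    return props.get("availabilityState") in {"Unavailable", "Degraded"}


# Fabricated example payload, for illustration only.
SAMPLE_EVENT_JSON = """
{
  "type": "Microsoft.ResourceNotifications.HealthResources.AvailabilityStatusChanged",
  "subject": "/subscriptions/000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/vm-a",
  "data": {"resourceInfo": {"properties": {"availabilityState": "Unavailable"}}}
}
"""
sample_event = json.loads(SAMPLE_EVENT_JSON)

if __name__ == "__main__":
    print(needs_mitigation(sample_event))
```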

To get started, simply follow the step-by-step instructions and begin receiving real-time alerts with Flash’s new offering.

What’s next?

Looking ahead, we plan to broaden our focus to include scenarios such as inoperable top-of-rack switches, failures in accelerated networking, and new classes of hardware failure prediction. In addition, we aim to continue enhancing data quality and consistency across all Flash endpoints—enabling more accurate downtime attribution and deeper visibility into VM availability.

For comprehensive monitoring of VM availability—including scenarios such as routine maintenance, live migration, service healing, and degradation—we recommend leveraging both Flash Health events and Scheduled Events (SE).
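
The Scheduled Events half of this recommendation can be sketched as a filter over the document returned by the Azure Instance Metadata Service inside a VM. The endpoint URL, API version, and field names below reflect the Scheduled Events service as commonly documented, but treat them as assumptions to verify. The sample document and the helper name are hypothetical, and the sketch makes no network call.

```python
# Assumed endpoint, reachable only from inside an Azure VM and queried with
# the HTTP header "Metadata: true". Shown for reference; not called here.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)

# Event types assumed to affect availability.
IMPACTFUL = {"Reboot", "Redeploy", "Freeze", "Preempt", "Terminate"}


def pending_impact(document, vm_name):
    """List (event_id, event_type, not_before) for events that affect vm_name."""
    return [
        (event["EventId"], event["EventType"], event.get("NotBefore", ""))
        for event in document.get("Events", [])
        if event.get("EventType") in IMPACTFUL
        and vm_name in event.get("Resources", [])
    ]


# Fabricated Scheduled Events document, for illustration only.
sample_document = {
    "DocumentIncarnation": 1,
    "Events": [
        {
            "EventId": "evt-1",
            "EventType": "Freeze",
            "Resources": ["vm-a"],
            "EventStatus": "Scheduled",
            "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT",
        },
        {
            "EventId": "evt-2",
            "EventType": "Reboot",
            "Resources": ["vm-b"],
            "EventStatus": "Scheduled",
            "NotBefore": "",
        },
    ],
}

if __name__ == "__main__":
    print(pending_impact(sample_document, "vm-a"))
```

Scheduled Events give advance notice of planned work, while Flash health events report disruptions as they occur, so a workload that consumes both can drain ahead of planned maintenance and still react to unplanned failures.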

For upcoming updates on the Flash initiative, we encourage you to follow the advancing reliability series!

The post Project Flash update: Advancing Azure Virtual Machine availability monitoring appeared first on Microsoft Azure Blog.
