MarkTechPost@AI, October 25, 2024
Microsoft AI Releases OmniParser Model on HuggingFace: A Compact Screen Parsing Module that can Convert UI Screenshots into Structured Elements

Microsoft has introduced OmniParser, a pure vision-based tool designed to close the gaps in current screen parsing techniques and enable deeper GUI understanding without relying on additional contextual data. It works across multiple platforms, advances intelligent GUI automation, and performs strongly on several benchmarks.

📱 OmniParser is a pure vision-based tool that addresses the shortcomings of current screen parsing techniques. It can understand GUIs without relying on additional data such as HTML tags or view hierarchies.

🎯 The model combines several specialized components, including an interactable region detection model, an icon description model, and an OCR module. Together, these components output a structured representation similar to a Document Object Model.

💪 OmniParser performs strongly on multiple benchmarks; for example, on the ScreenSpot dataset it achieved an accuracy improvement of up to 73%, significantly increasing the accuracy of predicted user actions.

🚀 The release of OmniParser is a significant step forward: it enhances cross-platform usability, paves the way for more general-purpose AI agents, and gives developers a powerful tool that broadens the application possibilities of multimodal AI.

Graphical User Interfaces (GUIs) are ubiquitous, whether on desktop computers, mobile devices, or embedded systems, providing an intuitive bridge between users and digital functions. However, automated interaction with these GUIs presents a significant challenge. The challenge becomes particularly evident when building intelligent agents that can comprehend and execute tasks based on visual information alone. Traditional methods rely on parsing underlying HTML or view hierarchies, which limits their applicability to web-based environments or those with accessible metadata. Moreover, existing Vision-Language Models (VLMs) like GPT-4V struggle to accurately interpret complex GUI elements, often resulting in inaccurate action grounding.

To overcome these hurdles, Microsoft introduces OmniParser, a pure vision-based tool aimed at bridging the gaps in current screen parsing techniques, allowing for more sophisticated GUI understanding without relying on additional contextual data. The model, now available on Hugging Face, represents an exciting development in intelligent GUI automation. Built to improve the accuracy of parsing user interfaces, OmniParser is designed to work across desktop, mobile, and web platforms without requiring explicit underlying data such as HTML tags or view hierarchies. With OmniParser, Microsoft has made significant strides in enabling automated agents to identify actionable elements like buttons and icons purely based on screenshots, broadening the possibilities for developers working with multimodal AI systems.
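
For developers who want to experiment, the released weights can be pulled down with the standard huggingface_hub client. The snippet below is a minimal sketch, assuming the repo id "microsoft/OmniParser" referenced in the announcement; check the model card for the exact repository name and file layout.

```python
# Minimal sketch: fetch the OmniParser artifacts from Hugging Face.
# Assumption: the repo id "microsoft/OmniParser" from the announcement;
# verify the exact id and file layout on the model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="microsoft/OmniParser")
print(f"OmniParser files downloaded to: {local_dir}")
```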

OmniParser combines several specialized components to achieve robust GUI parsing. Its architecture integrates a fine-tuned interactable region detection model, an icon description model, and an OCR module. The region detection model is responsible for identifying actionable elements on the UI, such as buttons and icons, while the icon description model captures the functional semantics of these elements. Additionally, the OCR module extracts any text elements from the screen. Together, these models output a structured representation akin to a Document Object Model (DOM), but directly from visual input. One key advantage is the overlaying of bounding boxes and functional labels on the screen, which effectively guides the language model in making more accurate predictions about user actions. This design alleviates the need for additional data sources, which is particularly beneficial in environments without accessible metadata, thus extending the range of applications.
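
To make the design concrete, the sketch below (not the official OmniParser API) shows how the three components could be composed: a region detector proposes interactable bounding boxes, an icon-caption model describes each one, the OCR module contributes text regions, and everything is merged into a flat, DOM-like element list. The UIElement structure and the model callables are hypothetical placeholders.

```python
# Illustrative sketch of the OmniParser-style pipeline described above.
# The three model callables are placeholders standing in for the fine-tuned
# interactable-region detector, the icon description model, and the OCR module.
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class UIElement:
    bbox: BBox
    kind: str          # "interactable" or "text"
    description: str   # icon caption or OCR string


def parse_screenshot(
    image,
    detect_regions: Callable[[object], List[BBox]],       # region detection model
    caption_icon: Callable[[object, BBox], str],          # icon description model
    run_ocr: Callable[[object], List[Tuple[BBox, str]]],  # OCR module
) -> List[UIElement]:
    """Merge detection, captioning, and OCR into a structured, DOM-like list."""
    elements = [
        UIElement(bbox=b, kind="interactable", description=caption_icon(image, b))
        for b in detect_regions(image)
    ]
    elements += [
        UIElement(bbox=b, kind="text", description=text)
        for b, text in run_ocr(image)
    ]
    return elements
```

A downstream language model such as GPT-4V would then receive this element list, along with a screenshot annotated with the corresponding bounding boxes, rather than raw pixels alone.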

OmniParser is a vital advancement for several reasons. It addresses the limitations of prior multimodal systems by offering an adaptable, vision-only solution that can parse any type of UI, regardless of the underlying architecture. This approach results in enhanced cross-platform usability, making it valuable for both desktop and mobile applications. Furthermore, OmniParser’s performance benchmarks speak to its strength and effectiveness. In the ScreenSpot, Mind2Web, and AITW benchmarks, OmniParser demonstrated significant improvements over baseline GPT-4V setups. For example, on the ScreenSpot dataset, OmniParser achieved an accuracy improvement of up to 73%, surpassing models that rely on underlying HTML parsing. Notably, incorporating local semantics of UI elements led to an impressive boost in predictive accuracy: GPT-4V’s correct labeling of icons improved from 70.5% to 93.8% when using OmniParser’s outputs. Such improvements highlight how better parsing can lead to more accurate action grounding, addressing a fundamental shortcoming in current GUI interaction models.

Microsoft’s OmniParser is a significant step forward in the development of intelligent agents that interact with GUIs. By focusing purely on vision-based parsing, OmniParser eliminates the need for additional metadata, making it a versatile tool for any digital environment. This enhancement not only broadens the usability of models like GPT-4V but also paves the way for the creation of more general-purpose AI agents that can reliably navigate across a multitude of digital interfaces. By releasing OmniParser on Hugging Face, Microsoft has democratized access to cutting-edge technology, providing developers with a powerful tool to create smarter and more efficient UI-driven agents. This move opens up new possibilities for applications in accessibility, automation, and intelligent user assistance, ensuring that the promise of multimodal AI reaches new heights.


Check out the Paper, Details, and Try the model here. All credit for this research goes to the researchers of this project.
