MarkTechPost@AI February 19
Microsoft AI Releases OmniParser V2: An AI Tool that Turns Any LLM into a Computer Use Agent

Microsoft has released OmniParser V2, a tool aimed at improving large language models' (LLMs') understanding of graphical user interfaces (GUIs). It converts UI screenshots into structured, machine-readable data, allowing LLMs to understand and interact with a wide range of software interfaces more effectively. OmniParser V2 consists of two main modules, detection and captioning: a YOLOv8 model identifies interactive elements, and a Florence-2 model generates descriptive labels for them. With refined training, OmniParser V2 improves markedly in both detection accuracy and processing speed, and when paired with GPT-4o it achieves a large accuracy gain on the ScreenSpot Pro benchmark. Microsoft has also released OmniTool, which simplifies building vision-based GUI agents.

🖼️ OmniParser V2's core function is converting UI screenshots into structured data, enabling LLMs to understand GUI elements such as icons, buttons, and menus, and thus to interact with them more effectively.

🚀 OmniParser V2 comprises two modules: detection and captioning. The detection module uses a YOLOv8 model to identify interactive elements in a screenshot, while the captioning module uses a Florence-2 model to generate descriptive labels for those elements, supplying contextual information.

⏱️ By optimizing image size, OmniParser V2 significantly reduces latency: processing is 60% faster than the previous version, averaging 0.6 seconds per frame on an A100 GPU and 0.8 seconds on an RTX 4090.

🎯 OmniParser V2 performs strongly on the ScreenSpot Pro benchmark: combined with GPT-4o it reaches an average accuracy of 39.6%, far above GPT-4o's 0.8% baseline, demonstrating its effectiveness for complex GUI interaction.

In the realm of artificial intelligence, enabling Large Language Models (LLMs) to navigate and interact with graphical user interfaces (GUIs) has been a notable challenge. While LLMs are adept at processing textual data, they often encounter difficulties when interpreting visual elements like icons, buttons, and menus. This limitation restricts their effectiveness in tasks that require seamless interaction with software interfaces, which are predominantly visual.

To address this issue, Microsoft has introduced OmniParser V2, a tool designed to enhance the GUI comprehension capabilities of LLMs. OmniParser V2 converts UI screenshots into structured, machine-readable data, enabling LLMs to understand and interact with various software interfaces more effectively. This development aims to bridge the gap between textual and visual data processing, facilitating more comprehensive AI applications.
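To make "structured, machine-readable data" concrete, the sketch below shows one plausible shape such output could take: each detected region becomes a labeled record that can be serialized as JSON and fed to an LLM as plain text. The field names and schema here are illustrative assumptions, not OmniParser's actual output format.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class UIElement:
    # Hypothetical fields illustrating the idea of structured UI data;
    # OmniParser's real output schema may differ.
    element_id: int
    bbox: tuple          # (x1, y1, x2, y2) in pixel coordinates
    interactable: bool   # whether the element responds to clicks/typing
    caption: str         # functional description from the caption model

# A parsed screenshot becomes a list of labeled regions that an LLM
# can reason over as ordinary text rather than raw pixels.
elements = [
    UIElement(0, (10, 10, 90, 40), True, "Submit button for the login form"),
    UIElement(1, (10, 60, 200, 90), True, "Text field for entering a username"),
]

structured = json.dumps([asdict(e) for e in elements], indent=2)
print(structured)
```

Once the interface is in this form, a text-only model can refer to elements by id or caption instead of pixel coordinates.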

OmniParser V2 operates through two main components: detection and captioning. The detection module employs a fine-tuned version of the YOLOv8 model to identify interactive elements within a screenshot, such as buttons and icons. Simultaneously, the captioning module uses a fine-tuned Florence-2 base model to generate descriptive labels for these elements, providing context about their functions within the interface. This combined approach allows LLMs to construct a detailed understanding of the GUI, which is essential for accurate interaction and task execution.
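The two-stage flow described above can be sketched as follows. The fine-tuned YOLOv8 detector and Florence-2 captioner are stubbed out with placeholder functions here so the control flow is runnable without model weights; the stub outputs and function names are assumptions for illustration only.

```python
def detect_elements(screenshot):
    """Stand-in for the fine-tuned YOLOv8 detector: returns bounding
    boxes of interactive regions. In the real pipeline these come from
    the model's forward pass over the screenshot."""
    return [(10, 10, 90, 40), (10, 60, 200, 90)]

def caption_element(screenshot, bbox):
    """Stand-in for the fine-tuned Florence-2 captioner: describes the
    function of the cropped region."""
    x1, y1, x2, y2 = bbox
    return f"interactive element at ({x1},{y1})-({x2},{y2})"

def parse_screenshot(screenshot):
    # Stage 1: locate interactive elements.
    boxes = detect_elements(screenshot)
    # Stage 2: caption each detected element for context.
    return [{"bbox": b, "caption": caption_element(screenshot, b)}
            for b in boxes]

parsed = parse_screenshot("screenshot.png")
for item in parsed:
    print(item["caption"])
```

The key design point is that detection and captioning stay separate: localization quality and description quality can be trained and improved independently.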

A significant improvement in OmniParser V2 is the enhancement of its training datasets. The tool has been trained on a more extensive and refined set of icon captioning and grounding data, sourced from widely used web pages and applications. This enriched dataset enhances the model’s accuracy in detecting and describing smaller interactive elements, which are crucial for effective GUI interaction. Additionally, by optimizing the image size processed by the icon caption model, OmniParser V2 achieves a 60% reduction in latency compared to its previous version, with an average processing time of 0.6 seconds per frame on an A100 GPU and 0.8 seconds on a single RTX 4090 GPU.

The effectiveness of OmniParser V2 is demonstrated through its performance on the ScreenSpot Pro benchmark, an evaluation framework for GUI grounding capabilities. When combined with GPT-4o, OmniParser V2 achieved an average accuracy of 39.6%, a notable increase from GPT-4o’s baseline score of 0.8%. This improvement highlights the tool’s ability to enable LLMs to accurately interpret and interact with complex GUIs, even those with high-resolution displays and small target icons.

To support integration and experimentation, Microsoft has developed OmniTool, a dockerized Windows system that incorporates OmniParser V2 along with essential tools for agent development. OmniTool is compatible with various state-of-the-art LLMs, including OpenAI’s 4o/o1/o3-mini, DeepSeek’s R1, Qwen’s 2.5VL, and Anthropic’s Sonnet. This flexibility allows developers to utilize OmniParser V2 across different models and applications, simplifying the creation of vision-based GUI agents.
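One way parsed elements plug into an arbitrary chat LLM is by flattening them into a text prompt the model can act on. The prompt format and action vocabulary below are illustrative assumptions, not OmniTool's actual agent protocol.

```python
def build_agent_prompt(task, elements):
    """Assemble a plain-text prompt from parsed UI elements so any
    chat LLM can choose an action. Format is a hypothetical sketch."""
    lines = [f"Task: {task}", "Screen elements:"]
    for i, el in enumerate(elements):
        lines.append(f"  [{i}] {el['caption']} at {el['bbox']}")
    lines.append("Reply with the index of the element to click.")
    return "\n".join(lines)

elements = [
    {"bbox": (10, 10, 90, 40), "caption": "Submit button"},
    {"bbox": (10, 60, 200, 90), "caption": "Username text field"},
]
prompt = build_agent_prompt("Log in with the saved username", elements)
print(prompt)
```

Because the prompt is model-agnostic text, the same parsed screen can be handed to GPT-4o, DeepSeek R1, Qwen 2.5VL, or Claude Sonnet without changes to the parsing stage.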

In summary, OmniParser V2 represents a meaningful advancement in integrating LLMs with graphical user interfaces. By converting UI screenshots into structured data, it enables LLMs to comprehend and interact with software interfaces more effectively. The technical enhancements in detection accuracy, latency reduction, and benchmark performance make OmniParser V2 a valuable tool for developers aiming to create intelligent agents capable of navigating and manipulating GUIs autonomously. As AI continues to evolve, tools like OmniParser V2 are essential in bridging the gap between textual and visual data processing, leading to more intuitive and capable AI systems.


Check out the Technical Details, Model on HF, and GitHub Page. All credit for this research goes to the researchers of this project.

