MarkTechPost@AI · October 12, 2024
UGround: A Universal GUI Visual Grounding Model Developed with Large-Scale Web-based Synthetic Data


UGround is a new GUI visual grounding model that eliminates the dependence on text-based inputs entirely, using only visual perception to understand and interact with GUIs. This allows GUI agents to perform pixel-level operations directly on the GUI without relying on any textual data such as HTML. Built by combining large-scale web-based synthetic data with slight adaptations of the LLaVA architecture, UGround significantly outperforms existing models across a range of benchmarks and shows stronger adaptability and generalization, laying a foundation for future work on autonomous GUI navigation and interaction.

👨‍💻 UGround is a new GUI visual grounding model that eliminates the dependence on text-based inputs entirely, using only visual perception to understand and interact with GUIs. This allows GUI agents to perform pixel-level operations directly on the GUI without relying on any textual data such as HTML.

📊 UGround is built by combining large-scale web-based synthetic data with slight adaptations of the LLaVA architecture, and it significantly outperforms existing models across a range of benchmarks. On the ScreenSpot benchmark, for example, it reaches 82.8% accuracy in mobile environments, 63.6% on desktop, and 80.4% on the web.

🚀 UGround's visual perception capability lets it perform well across diverse GUI tasks: on the AndroidControl and OmniACT benchmarks it achieves a step accuracy of 52.8% and an action score of 32.8, respectively, demonstrating its efficiency and robustness.

💡 UGround lays a foundation for future work on autonomous GUI navigation and interaction: it models human interaction with GUIs more faithfully and offers more efficient and reliable solutions for a wide range of application scenarios.

Graphical User Interface (GUI) agents are crucial in automating interactions within digital environments, similar to how humans operate software using keyboards, mice, or touchscreens. GUI agents can simplify complex processes such as software testing, web automation, and digital assistance by autonomously navigating and manipulating GUI elements. These agents are designed to perceive their surroundings through visual inputs, enabling them to interpret the structure and content of digital interfaces. With advancements in artificial intelligence, researchers aim to make GUI agents more efficient by reducing their dependency on traditional input methods, making them more human-like.

The fundamental problem with existing GUI agents lies in their reliance on text-based representations such as HTML or accessibility trees, which often introduce noise and unnecessary complexity. While effective, these approaches are limited by their dependency on the completeness and accuracy of the textual data. For instance, accessibility trees may lack essential elements or annotations, and HTML code may contain irrelevant or redundant information. As a result, these agents suffer from latency and computational overhead when navigating different types of GUIs across platforms such as mobile applications, desktop software, and web interfaces.

Multimodal large language models (MLLMs) have been proposed that combine visual and text-based representations to interpret and interact with GUIs. Despite recent improvements, these models still require significant text-based information, which constrains their generalization ability and hinders performance. Existing models such as SeeClick and CogAgent have shown moderate success, but they remain insufficiently robust for practical use in diverse environments because of their dependence on predefined text-based inputs.

Researchers from Ohio State University and Orby AI introduced a new model called UGround, which eliminates the need for text-based inputs entirely. UGround uses a visual-only grounding approach that operates directly on the visual renderings of the GUI. By solely using visual perception, this model can more accurately replicate human interaction with GUIs, enabling agents to perform pixel-level operations directly on the GUI without relying on any text-based data such as HTML. This advancement significantly enhances the efficiency and robustness of the GUI agents, making them more adaptable and capable of being used in real-world applications.
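To make the visual-only setup concrete, the sketch below shows how an agent could call such a grounding model: the only inputs are a screenshot and a natural-language description of the target element, and the output is a pixel coordinate the agent can act on directly. The `GroundingModel` class and its `locate` method are illustrative placeholders, not UGround's actual interface.

```python
# Hypothetical sketch of a visual-only grounding call: screenshot in,
# pixel coordinates out, then a pixel-level action. No HTML or
# accessibility tree is involved anywhere in the loop.

from dataclasses import dataclass
from PIL import Image


@dataclass
class GroundingResult:
    x: int  # horizontal pixel coordinate of the predicted element
    y: int  # vertical pixel coordinate of the predicted element


class GroundingModel:
    """Placeholder for a LLaVA-style model fine-tuned for GUI grounding."""

    def locate(self, screenshot: Image.Image, expression: str) -> GroundingResult:
        # A real model would encode the screenshot, condition on the
        # referring expression, and decode coordinates as output tokens.
        raise NotImplementedError


def click_element(model: GroundingModel, screenshot_path: str, expression: str) -> None:
    """Ground a referring expression on a screenshot and click the result."""
    import pyautogui  # assumes screen resolution matches the screenshot

    screenshot = Image.open(screenshot_path)
    result = model.locate(screenshot, expression)
    pyautogui.click(result.x, result.y)  # pixel-level action on the GUI
```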

The research team developed UGround by leveraging a simple yet effective methodology, combining web-based synthetic data and slightly adapting the LLaVA architecture. They constructed the largest GUI visual grounding dataset, comprising 10 million GUI elements over 1.3 million screenshots, spanning different GUI layouts and types. The researchers incorporated a data synthesis strategy that allows the model to learn from varied visual representations, making UGround applicable to different platforms, including web, desktop, and mobile environments. This vast dataset helps the model accurately map diverse referring expressions of GUI elements to their coordinates on the screen, facilitating precise visual grounding in real-world applications.
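As a rough illustration of this kind of web-based data synthesis (not the authors' exact pipeline), the following sketch renders a page with Playwright, reads each interactive element's text and bounding box, and emits (screenshot, referring expression, center coordinate) training triples. The selector list and the expression template are simplifying assumptions; the paper's dataset uses far richer and more varied referring expressions.

```python
# Simplified illustration: turn a rendered web page into grounding
# training examples by pairing element descriptions with the pixel
# coordinates of the elements' centers.

import json
from playwright.sync_api import sync_playwright


def synthesize_examples(url: str, screenshot_path: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(url)
        page.screenshot(path=screenshot_path)

        examples = []
        for el in page.query_selector_all("a, button, input, [role=button]"):
            box = el.bounding_box()
            text = (el.inner_text() or "").strip()
            if not box or not text:
                continue
            examples.append({
                "image": screenshot_path,
                # A naive expression template; a real pipeline would mix
                # visual, positional, and functional descriptions.
                "expression": f'the element labeled "{text}"',
                # Target is the element's center point in pixels.
                "x": int(box["x"] + box["width"] / 2),
                "y": int(box["y"] + box["height"] / 2),
            })
        browser.close()

    with open(out_path, "w") as f:
        json.dump(examples, f, indent=2)
```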

Empirical results showed that UGround significantly outperforms existing models in various benchmark tests. It achieved up to 20% higher accuracy in visual grounding tasks across six benchmarks, covering three categories: grounding, offline agent evaluation, and online agent evaluation. For example, on the ScreenSpot benchmark, which assesses GUI visual grounding across different platforms, UGround achieved an accuracy of 82.8% in mobile environments, 63.6% in desktop environments, and 80.4% in web environments. These results indicate that UGround’s visual-only perception allows it to perform comparably to or better than models using both visual and text-based inputs.

In addition, GUI agents equipped with UGround demonstrated superior performance compared to state-of-the-art agents that rely on multimodal inputs. For instance, in the agent setting of ScreenSpot, UGround achieved an average performance increase of 29% over the previous models. The model also showed impressive results in AndroidControl and OmniACT benchmarks, which test the agent’s ability to handle mobile and desktop environments, respectively. In AndroidControl, UGround achieved a step accuracy of 52.8% in high-level tasks, surpassing previous models by a considerable margin. Similarly, on the OmniACT benchmark, UGround attained an action score of 32.8, highlighting its efficiency and robustness in diverse GUI tasks.
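These agent results come from pairing a grounding model with a higher-level decision maker; the hedged pseudocode below sketches one plausible division of labor, where a multimodal planner decides what to do and describes the target element in language, and the grounder resolves that description to pixel coordinates. All names here are illustrative rather than drawn from the paper's implementation.

```python
# Hedged sketch of a grounding-based GUI agent loop:
# the planner reasons over the task and screenshot, the grounder
# converts its element descriptions into clickable coordinates.

from typing import Protocol
from PIL import Image


class Planner(Protocol):
    def next_step(self, screenshot: Image.Image, task: str, history: list[str]) -> dict:
        """Return e.g. {"action": "click", "target": "the blue Submit button"}."""


class Grounder(Protocol):
    def locate(self, screenshot: Image.Image, expression: str) -> tuple[int, int]:
        """Return pixel coordinates (x, y) for a referring expression."""


def run_episode(planner: Planner, grounder: Grounder, take_screenshot, execute,
                task: str, max_steps: int = 20) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        step = planner.next_step(screenshot, task, history)
        if step["action"] == "stop":
            break
        x, y = grounder.locate(screenshot, step["target"])
        execute(step["action"], x, y)  # e.g. click or type at pixel (x, y)
        history.append(f'{step["action"]} -> {step["target"]}')
```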

In conclusion, UGround addresses the primary limitations of existing GUI agents by adopting a human-like visual perception and grounding methodology. Its ability to generalize across multiple platforms and perform pixel-level operations without needing text-based inputs marks a significant advancement in human-computer interaction. This model improves the efficiency and accuracy of GUI agents and sets the foundation for future developments in autonomous GUI navigation and interaction.


Check out the Paper, Code, and Model on Hugging Face. All credit for this research goes to the researchers of this project.


