MarkTechPost@AI, December 1, 2024
ShowUI: A Vision-Language-Action Model for GUI Visual Agents that Addresses Key Challenges in UI Visual and Action Modeling

ShowUI is a novel vision-language-action model designed to address key challenges in graphical user interface (GUI) automation. It improves the efficiency and effectiveness of GUI visual agents through three innovations: UI-guided visual token selection, interleaved vision-language-action streaming, and GUI instruction tuning. ShowUI can process high-resolution UI screenshots, effectively manage visual-action histories and multi-turn query-action sequences, and strengthens performance through carefully curated datasets. Experimental evaluations on mobile and desktop platforms show that ShowUI performs strongly on navigation tasks and exhibits promising zero-shot navigation capability, pointing to a new direction for GUI automation.

🤔 **UI-Guided Visual Token Selection**: By converting UI screenshots into connected graphs and identifying redundant relationships, this technique cuts computational cost and makes processing high-resolution UI screenshots more efficient. For example, in sparse regions such as a Google search page, the token sequence is reduced from 1296 to 291.

🔄 **Interleaved Vision-Language-Action Streaming**: By structuring actions in a standardized JSON format, the model can handle action variations across device platforms and novel interaction scenarios, and gains a flexible framework for predicting actions on different platforms.

📚 **GUI Instruction Tuning**: A carefully curated dataset addresses key challenges in data collection and representation. For example, 22K screenshots were collected from web interfaces, and GPT-4o was used to convert the limited original desktop annotations into richer queries, expanding the dataset's complexity and utility.

🚀 **ShowUI's performance on mobile platforms**: Experiments on mobile benchmarks such as AITW show that incorporating visual history significantly improves navigation accuracy, with ShowUI achieving a 1.7% accuracy gain.

💡 **ShowUI's zero-shot navigation capability**: The zero-shot navigation ability learned from GUIAct also shows promising potential.

Large Language Models (LLMs) have demonstrated remarkable potential in performing complex tasks by building intelligent agents. As individuals increasingly engage with the digital world, these models serve as virtual embodied interfaces for a wide range of daily activities. The emerging field of GUI automation aims to develop intelligent agents that can significantly streamline human workflows based on user intentions. This technological advancement represents a pivotal moment in human-computer interaction, where sophisticated language models can interpret and execute complex digital tasks with increasing precision and efficiency.

Early attempts at GUI automation focused on language-based agents that relied on closed-source, API-based Large Language Models like GPT-4. These initial approaches primarily utilized text-rich metadata such as HTML inputs and accessibility trees to perform navigation and related tasks. However, this text-only methodology reveals significant limitations in real-world applications, where users predominantly interact with interfaces visually through screenshots, often without access to underlying structural information. The fundamental challenge lies in bridging the gap between computational perception and human-like interaction with graphical user interfaces, necessitating a more nuanced approach to digital navigation and task execution.

Training multi-modal models for GUI visual agents encounters significant challenges across multiple dimensions of computational design. Visual modeling presents substantial obstacles, particularly with high-resolution UI screenshots that generate lengthy token sequences and create long-context processing complications. Most existing models struggle to handle such high-resolution data efficiently, resulting in considerable computational inefficiency. In addition, managing interleaved vision-language-action interactions adds another layer of complexity, with actions varying dramatically across device platforms and requiring sophisticated modeling techniques to interpret and execute navigation processes accurately.

Researchers from Show Lab at the National University of Singapore and Microsoft introduce ShowUI, a vision-language-action model designed to address key challenges in GUI automation. The model incorporates three innovative techniques: UI-Guided Visual Token Selection, which reduces computational costs by transforming screenshots into connected graphs and identifying redundant relationships; Interleaved Vision-Language-Action Streaming, which enables flexible management of visual-action histories and multi-turn query-action sequences; and the creation of small-scale, high-quality GUI instruction-following datasets through meticulous data curation and strategic resampling to mitigate data-type imbalances. Together, these techniques aim to significantly enhance the efficiency and effectiveness of GUI visual agents.

The UI-Guided Visual Token Selection strategy addresses computational challenges inherent in processing high-resolution screenshots. Recognizing the fundamental differences between natural images and user interfaces, the method takes an innovative approach to token reduction. Working in the RGB color space, the researchers construct a UI connected graph that identifies and groups visually redundant patches while preserving functionally critical elements such as icons and text. The technique adaptively manages visual token complexity, reducing token sequences from 1296 to as few as 291 in sparse areas such as a Google search page, while maintaining a more granular representation in text-rich environments such as Overleaf screenshots.
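To make the grouping idea concrete, here is a minimal sketch under simplifying assumptions: the screenshot is split into a fixed patch grid, adjacent patches whose mean RGB values are nearly identical are merged with a union-find, and one token is kept per connected component. The patch size, tolerance, and the function name `select_ui_tokens` are illustrative rather than taken from the paper, and this only approximates the patch-grouping step, not how ShowUI consumes the components during training.

```python
# Minimal sketch of UI-guided visual token selection (not the authors' exact code).
# Idea: split a screenshot into patches, link adjacent patches whose RGB content is
# nearly identical, and keep one representative token per connected component.
import numpy as np
from PIL import Image


def select_ui_tokens(image_path: str, patch: int = 28, tol: float = 4.0):
    """Return (kept_patch_indices, total_patches) for a screenshot."""
    img = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32)
    h, w, _ = img.shape
    rows, cols = h // patch, w // patch
    # Mean RGB per patch as a cheap redundancy signal.
    means = np.stack([
        img[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch].mean(axis=(0, 1))
        for r in range(rows) for c in range(cols)
    ])

    # Union-find over the patch grid.
    parent = list(range(rows * cols))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    j = rr * cols + cc
                    if np.abs(means[i] - means[j]).max() < tol:
                        union(i, j)  # visually redundant neighbours share a component

    # Keep one representative patch per connected component.
    kept = sorted({find(i) for i in range(rows * cols)})
    return kept, rows * cols
```

On a visually sparse page most of the grid collapses into a few large components, which is where the reported 1296-to-291 style reductions come from; on text-dense screenshots far fewer patches merge, so the representation stays fine-grained.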

The Interleaved Vision-Language-Action (VLA) Streaming approach addresses complex GUI navigation challenges. By structuring actions in a standardized JSON format, the model can manage diverse device-specific action variations and novel interaction scenarios. The method introduces a flexible framework that enables action prediction across different platforms by providing a comprehensive 'README' system prompt that guides the model's understanding of the action space. This approach allows for dynamic action execution through a function-calling mechanism, effectively standardizing interactions across web and mobile interfaces while retaining the ability to handle unique device-specific requirements.
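The snippet below sketches what such a standardized action format and 'README'-style system prompt could look like, with a small function-calling dispatcher that routes parsed actions to platform-specific handlers. The action names, JSON fields, and prompt wording are assumptions for illustration, not ShowUI's exact schema.

```python
import json

# Illustrative 'README'-style system prompt and JSON action records; the field names
# and action vocabulary here are assumptions, not the schema released with ShowUI.
SYSTEM_PROMPT = """You are a GUI agent. Available actions:
- CLICK:  {"action": "CLICK", "value": null, "position": [x, y]}
- INPUT:  {"action": "INPUT", "value": "<text>", "position": [x, y]}
- SCROLL: {"action": "SCROLL", "value": "up|down", "position": null}
Coordinates are normalized to [0, 1]. Respond with one JSON action per step."""


def execute(raw_action: str, handlers: dict):
    """Parse a JSON action emitted by the model and dispatch it function-call style."""
    act = json.loads(raw_action)
    handler = handlers[act["action"]]  # a platform-specific backend (web, mobile, ...)
    return handler(value=act.get("value"), position=act.get("position"))


# Usage with stub handlers; a real agent would bind these to web or mobile drivers.
handlers = {
    "CLICK": lambda value, position: print(f"click at {position}"),
    "INPUT": lambda value, position: print(f"type {value!r} at {position}"),
    "SCROLL": lambda value, position: print(f"scroll {value}"),
}
execute('{"action": "CLICK", "value": null, "position": [0.42, 0.17]}', handlers)
```

Keeping the action vocabulary in the system prompt rather than in the model weights is what makes it easy to swap in a new platform: only the 'README' and the handler table change, not the prediction format.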

The GUI Instructional Tuning approach carefully curates training data from diverse sources, addressing critical challenges in dataset collection and representation. By analyzing various GUI datasets, the team developed a nuanced methodology for data selection and augmentation. For web-based interfaces, they collected 22K screenshots, focusing exclusively on visually rich elements such as buttons and checkboxes and strategically filtering out static text. For desktop environments, the researchers employed reverse-engineering techniques, using GPT-4o to transform limited original annotations into rich, multi-dimensional queries spanning appearance, spatial relationships, and user intentions, effectively expanding the dataset's complexity and utility.
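As a rough illustration of this curation step, the sketch below filters element annotations down to interactive, visually rich types and expands each surviving annotation into appearance-, spatial-, and intent-style queries. The element fields, helper names, and query templates are hypothetical stand-ins for the GPT-4o-based pipeline described above.

```python
# Sketch of the kind of filtering and query expansion described above; element fields
# and query templates are illustrative assumptions, not the released pipeline.
INTERACTIVE_TYPES = {"button", "checkbox", "input", "link", "icon"}


def filter_elements(elements: list[dict]) -> list[dict]:
    """Keep visually rich, interactive elements and drop static-text annotations."""
    return [e for e in elements if e.get("type") in INTERACTIVE_TYPES]


def expand_queries(element: dict) -> list[str]:
    """Turn one sparse annotation into appearance / spatial / intent style queries,
    mimicking the GPT-4o-based reverse engineering described in the article."""
    name, bbox = element["name"], element["bbox"]  # bbox = [x0, y0, x1, y1], normalized
    return [
        f"Click the {name} button",                              # appearance
        f"Select the element whose top-left corner is near {bbox[:2]}",  # spatial
        f"I want to {element.get('intent', 'open ' + name)}",    # user intention
    ]


elements = [
    {"type": "button", "name": "Sign in", "bbox": [0.80, 0.05, 0.95, 0.10]},
    {"type": "static_text", "name": "Welcome back", "bbox": [0.10, 0.05, 0.40, 0.10]},
]
for el in filter_elements(elements):
    print(expand_queries(el))
```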

The experimental evaluation of ShowUI across diverse navigation tasks reveals critical insights into the model’s performance and potential improvements. Experiments conducted on mobile platforms like AITW demonstrated that incorporating visual history significantly enhances navigation accuracy, with ShowUI achieving a 1.7% accuracy gain. The zero-shot navigation capabilities learned from GUIAct showed promising transferability, outperforming methods relying on closed-source APIs or HTML information. Notably, the performance varied across different domains, with web navigation tasks presenting unique challenges that highlighted the importance of visual perception and domain diversity in training data.

ShowUI represents a significant advancement in vision-language-action models for GUI interactions. The researchers developed innovative solutions to address critical challenges in UI visual modeling and action processing. By introducing UI-Guided Visual Token Selection, the model efficiently processes high-resolution screenshots, dramatically reducing computational overhead. The Interleaved Vision-Language-Action Streaming framework enables sophisticated management of complex cross-modal interactions, allowing for more nuanced and context-aware navigation. Through meticulous data curation and a high-quality instruction-following dataset, ShowUI demonstrates remarkable performance, particularly impressive given its lightweight model size. These achievements signal a promising path toward developing GUI visual agents that can interact with digital interfaces in ways more closely resembling human perception and decision-making.


Check out the Paper. All credit for this research goes to the researchers of this project.


