MarkTechPost@AI · January 29
ByteDance Introduces UI-TARS: A Native GUI Agent Model that Integrates Perception, Action, Reasoning, and Memory into a Scalable and Adaptive Framework

UI-TARS is a native GUI agent framework proposed by researchers from ByteDance and Tsinghua University, designed to advance GUI automation. By integrating enhanced perception, unified action modeling, System-2 reasoning, and iterative training, it reduces human intervention and improves generalization. The framework uses a large dataset of GUI screenshots to achieve precise understanding and captioning of interface elements, standardizes platform interactions, and leverages extensive action traces to strengthen multi-step execution. It also incorporates System-2 reasoning for deliberate decision-making and continuously refines its own capabilities through online interaction. Experimental results show that UI-TARS outperforms existing models on perception, grounding, and agent tasks, laying a solid foundation for future research in GUI automation.

🖼️ Enhanced perception: UI-TARS uses curated datasets, such as element descriptions and dense captions, to ensure that GUI elements are recognized accurately.

🔗 Unified action modeling: the framework links element descriptions with spatial coordinates to achieve precise grounding, standardizing platform interactions and improving the accuracy of action execution.

🤔 System-2 reasoning: UI-TARS integrates System-2 reasoning, incorporating diverse logical patterns and explicit thought processes to guide deliberate action decisions and improve handling of complex scenarios.

🔄 Iterative training: iterative training with dynamic data collection and interaction refinement identifies errors and adapts through reflection tuning, enabling robust, scalable learning with less human intervention.

GUI agents seek to perform real tasks in digital environments by understanding and interacting with graphical interfaces such as buttons and text boxes. The biggest open challenges lie in enabling agents to process complex, evolving interfaces, plan effective actions, and execute precision tasks such as finding clickable areas or filling in text boxes. These agents also need memory systems to recall past actions and adapt to new scenarios. A significant problem for modern, unified end-to-end models is the lack of integrated perception, reasoning, and action within a seamless workflow, backed by high-quality data that covers this breadth of behavior. Without such data, these systems struggle to adapt to diverse, dynamic environments and to scale.
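To make the perceive-reason-act-remember loop concrete, here is a minimal sketch of how such an agent could be structured. The class names, the `perceive`/`decide`/`execute` methods, and the screenshot-based observation are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Observation:
    screenshot: bytes          # raw pixels of the current GUI state
    description: str = ""      # optional textual summary of visible elements

@dataclass
class Action:
    kind: str                  # e.g. "click", "type", "scroll"
    target: Tuple[int, int] = (0, 0)  # screen coordinates for pointer actions
    text: str = ""             # payload for typing actions

@dataclass
class GUIAgent:
    memory: List[tuple] = field(default_factory=list)  # past (observation, action) pairs

    def perceive(self) -> Observation:
        """Capture the current screen; a real agent would grab a screenshot here."""
        raise NotImplementedError

    def decide(self, obs: Observation) -> Action:
        """Choose the next action from the observation and the interaction history."""
        raise NotImplementedError

    def execute(self, action: Action) -> None:
        """Dispatch the action to the OS / browser automation layer."""
        raise NotImplementedError

    def step(self) -> None:
        obs = self.perceive()
        action = self.decide(obs)
        self.execute(action)
        self.memory.append((obs, action))  # memory lets the agent recall past actions
```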

Current approaches to GUI agents are mostly rule-based, heavily dependent on predefined rules, frameworks, and human involvement, and therefore neither flexible nor scalable. Rule-based agents, such as Robotic Process Automation (RPA), operate in structured environments using human-defined heuristics and require direct access to the underlying systems, making them unsuitable for dynamic or restricted interfaces. Framework-based agents use foundation models like GPT-4 for multi-step reasoning but still depend on manually crafted workflows, prompts, and external scripts. These methods are fragile, need constant updates as tasks evolve, and lack seamless integration of learning from real-world interactions. Native agent models try to bring perception, reasoning, memory, and action under one roof, reducing human engineering through end-to-end learning. Still, these models rely on curated data and training guidance, which limits their adaptability. None of these approaches lets agents learn autonomously, adapt efficiently, or handle unpredictable scenarios without manual intervention.

To address these challenges in GUI agent development, researchers from ByteDance Seed and Tsinghua University proposed the UI-TARS framework to advance native GUI agent models. It integrates enhanced perception, unified action modeling, advanced reasoning, and iterative training, which reduces human intervention while improving generalization. It enables detailed understanding and precise captioning of interface elements using a large dataset of GUI screenshots, introduces a unified action space to standardize platform interactions, and utilizes extensive action traces to strengthen multi-step execution. The framework also incorporates System-2 reasoning for deliberate decision-making and iteratively refines its capabilities through online interaction traces.
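A unified action space can be pictured as a small, platform-agnostic vocabulary of primitives that the model emits and a thin adapter translates to each OS or browser. The sketch below is an assumption about what such a vocabulary might look like; the concrete action names and adapter used by UI-TARS may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class ActionType(Enum):
    CLICK = "click"
    TYPE = "type"
    SCROLL = "scroll"
    DRAG = "drag"
    HOTKEY = "hotkey"
    FINISHED = "finished"

@dataclass
class UnifiedAction:
    type: ActionType
    # Coordinates are normalized to [0, 1] so the same action works on any resolution.
    start: Optional[Tuple[float, float]] = None
    end: Optional[Tuple[float, float]] = None
    text: Optional[str] = None

def to_desktop(action: UnifiedAction, width: int, height: int) -> str:
    """Translate a platform-agnostic action into a concrete desktop command string."""
    if action.type is ActionType.CLICK and action.start is not None:
        x, y = int(action.start[0] * width), int(action.start[1] * height)
        return f"mouse_click({x}, {y})"
    if action.type is ActionType.TYPE and action.text is not None:
        return f"keyboard_type({action.text!r})"
    return f"noop({action.type.value})"
```

The same `UnifiedAction` could be routed through a different adapter (for example, one emitting ADB taps on Android), which is what makes the action space "unified" across platforms.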

The researchers designed the framework around several key principles. Enhanced perception ensures that GUI elements are recognized accurately, using curated datasets for tasks such as element description and dense captioning. Unified action modeling links element descriptions with spatial coordinates to achieve precise grounding. System-2 reasoning incorporates diverse logical patterns and explicit thought processes that guide deliberate actions. Finally, iterative training supports dynamic data gathering and interaction refinement, error identification, and adaptation through reflection tuning, yielding robust and scalable learning with less human involvement.
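Grounding data of the kind described above pairs a natural-language element description with the region the agent should act on. The record below is a hypothetical illustration of such a sample; the actual schema used to train UI-TARS is not specified in the article.

```python
# A hypothetical grounding record: the model learns to map the instruction and
# screenshot to the normalized coordinates of the referenced element.
grounding_sample = {
    "screenshot": "screens/settings_page.png",
    "instruction": "Click the 'Sign out' button in the account menu",
    "element_description": "red text button labelled 'Sign out' at the bottom of the dropdown",
    "bbox_normalized": [0.71, 0.62, 0.86, 0.66],          # [x_min, y_min, x_max, y_max]
    "action": {"type": "click", "start": [0.785, 0.64]},  # center of the box
}
```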

The researchers tested UI-TARS, trained on a corpus of about 50B tokens, along several axes, including perception, grounding, and agent capabilities. The model was developed in three variants, UI-TARS-2B, UI-TARS-7B, and UI-TARS-72B, with extensive experiments validating their advantages. Compared with baselines such as GPT-4o and Claude-3.5, UI-TARS performed better on perception benchmarks such as VisualWebBench and WebSRC. It outperformed models like UGround-V1-7B in grounding across multiple datasets, demonstrating robust capabilities in high-complexity scenarios. On agent tasks, UI-TARS excelled in Multimodal Mind2Web and Android Control, as well as in environments such as OSWorld and AndroidWorld. The results highlighted the importance of both System-1 and System-2 reasoning, with System-2 reasoning proving beneficial in diverse, real-world scenarios, although it required multiple candidate outputs for optimal performance. Scaling the model size further improved reasoning and decision-making, particularly in online tasks.
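The observation that System-2 reasoning "required multiple candidate outputs for optimal performance" suggests a best-of-N style procedure: sample several reasoning-plus-action candidates and keep the highest-scoring one. The sketch below assumes hypothetical `generate_candidate` and `score` functions; the article does not describe the exact selection mechanism.

```python
import random
from typing import Callable, List, Tuple

def best_of_n(
    generate_candidate: Callable[[], Tuple[str, str]],  # returns (reasoning_trace, action)
    score: Callable[[str, str], float],                 # rates how plausible a candidate is
    n: int = 8,
) -> Tuple[str, str]:
    """Sample n reasoning/action candidates and return the best-scoring one."""
    candidates: List[Tuple[str, str]] = [generate_candidate() for _ in range(n)]
    return max(candidates, key=lambda c: score(*c))

# Toy usage with stubbed components; a real system would query the model here.
if __name__ == "__main__":
    actions = ["click(search_box)", "type('weather today')", "press(enter)"]

    def gen() -> Tuple[str, str]:
        act = random.choice(actions)
        return (f"I should {act}", act)

    scorer = lambda thought, act: 1.0 if "click" in act else 0.5
    print(best_of_n(gen, scorer, n=4))
```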

In conclusion, UI-TARS advances GUI automation by integrating enhanced perception, unified action modeling, System-2 reasoning, and iterative training. It achieves state-of-the-art performance, surpassing previous systems such as Claude and GPT-4o, and handles complex GUI tasks effectively with minimal human oversight. This work establishes a strong baseline for future research, particularly in active and lifelong learning, where agents can autonomously improve through continuous real-world interaction, paving the way for further advances in GUI automation.


Check out the Paper. All credit for this research goes to the researchers of this project.




Related tags

UI-TARS · GUI agent · Automation · System-2 reasoning · Iterative training