MarkTechPost@AI, the day before yesterday at 03:45
Salesforce AI Released GTA1: A Test-Time Scaled GUI Agent That Outperforms OpenAI’s CUA

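Salesforce AI Research has released GTA1, a new graphical user interface (GUI) agent that marks significant progress in agentic human-computer interaction. GTA1 is designed to operate autonomously in real operating system environments such as Linux, and it addresses two key bottlenecks in GUI agent development: ambiguity in task planning and accuracy of action grounding. On the OSWorld benchmark, GTA1 reaches a 45.2% task success rate, surpassing OpenAI's CUA and setting a new record among open-source models. Through techniques such as test-time scaling and reinforcement learning, GTA1 improves planning efficiency and grounding precision, and it performs strongly across several benchmarks.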

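💡 Resolving planning ambiguity: GTA1 introduces test-time scaling, sampling multiple candidate actions concurrently at each step and using a multimodal judge model (typically a large language model) to evaluate them and select the most appropriate one. This avoids committing prematurely to a suboptimal plan and lets the agent explore execution paths more effectively.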

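🎯 Improving grounding precision: GTA1 uses a reinforcement learning (RL) framework based on Group Relative Policy Optimization (GRPO) and learns directly from click rewards rather than relying on intermediate reasoning or predicted bounding boxes. The model is rewarded only when the predicted coordinate falls inside the correct UI element, which yields state-of-the-art accuracy without chain-of-thought style supervision.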

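🏆 Strong benchmark results: GTA1 sets new records on several benchmarks, including OSWorld (45.2% task success rate), ScreenSpot-Pro (50.1% grounding accuracy), and ScreenSpot-V2 (94.8% cross-platform grounding accuracy). These results validate GTA1's planning and grounding innovations.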

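✨ Other design highlights: GTA1 uses OmniParser to filter erroneous annotations from datasets such as Aria-UI and OS-Atlas, which improves the fidelity of the training signal. The model scales from 7B to 72B parameters, with GTA1-7B offering the best balance between performance and compute. The multimodal judge used in test-time scaling can be the same LLM that is used for planning, which reduces overhead.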

Salesforce AI Research has introduced GTA1, a new graphical user interface (GUI) agent that redefines the state-of-the-art in agentic human-computer interaction. Designed to autonomously operate in real operating system environments such as Linux, GTA1 addresses two critical bottlenecks in GUI agent development: ambiguous task planning and inaccurate grounding of actions. With a 45.2% task success rate on the OSWorld benchmark, GTA1 surpasses OpenAI’s CUA (Computer-Using Agent), establishing a new record among open-source models.

Core Challenges in GUI Agents

GUI agents typically translate high-level user instructions into action sequences—clicks, keystrokes, or UI interactions—while observing UI updates after each action to plan subsequent steps. However, two issues persist:

    Planning Ambiguity: Multiple valid action sequences can fulfill a task, leading to execution paths with varying efficiency and reliability.
    Grounding Precision: Translating abstract action proposals into accurate, coordinate-level GUI interactions is especially challenging in high-resolution, dynamic interfaces.

GTA1 introduces novel mechanisms to resolve both.
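
The observe, plan, ground, and act cycle described above can be pictured with a small sketch. It is a generic illustration of a two-stage GUI agent loop, not GTA1's actual code: the `observe`, `plan`, `ground`, and `execute` callables, the `Step` record, and the `"DONE"` stop signal are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One executed step: the planner's action text and the grounded click point."""
    action: str
    point: Tuple[int, int]


def run_episode(
    instruction: str,
    observe: Callable[[], str],
    plan: Callable[[str, str, List[Step]], str],
    ground: Callable[[str, str], Tuple[int, int]],
    execute: Callable[[str, Tuple[int, int]], None],
    max_steps: int = 15,
) -> List[Step]:
    """Run a generic two-stage loop: plan an action, ground it, act, re-observe."""
    history: List[Step] = []
    for _ in range(max_steps):
        screenshot = observe()                          # current UI state
        action = plan(instruction, screenshot, history)  # planning stage
        if action == "DONE":                            # hypothetical stop signal
            break
        point = ground(action, screenshot)              # grounding stage
        execute(action, point)                          # changes the UI
        history.append(Step(action, point))
    return history
```

Planning ambiguity shows up in the `plan` call, where several different actions may be valid, and grounding precision shows up in the `ground` call, where the chosen action has to become an exact screen coordinate.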

Smarter Planning via Test-Time Scaling

Traditional planners commit to a single action proposal at each decision point, limiting robustness. GTA1’s test-time scaling introduces a simple yet effective solution: concurrently sample multiple candidate actions at each step, and employ a multimodal judge model—typically a large language model—to evaluate and select the most appropriate one.

This technique avoids premature commitment to suboptimal plans and enables the agent to better explore execution paths without requiring future rollout, which is infeasible in GUI environments due to irreversible actions. Importantly, this method can work with any planner and scales well with increasing task complexity and action space size.
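
A minimal sketch of this sample-then-judge step follows. It assumes the planner and the judge are available as plain callables; `propose`, `judge`, and `shortest_candidate_judge` are hypothetical stand-ins rather than GTA1's interfaces, and candidates are sampled one after another here for simplicity instead of concurrently.

```python
from typing import Callable, List, Sequence


def select_action(
    propose: Callable[[str, str], str],
    judge: Callable[[str, str, Sequence[str]], int],
    instruction: str,
    observation: str,
    num_candidates: int = 4,
) -> str:
    """Sample several candidate actions for one step and let a judge pick one."""
    # Draw candidates from the planner (sequentially here; the article
    # describes concurrent sampling).
    candidates: List[str] = [
        propose(instruction, observation) for _ in range(num_candidates)
    ]
    # Drop exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(candidates))
    # The judge returns the index of the candidate it prefers.
    choice = judge(instruction, observation, unique)
    if not 0 <= choice < len(unique):
        raise ValueError("judge returned an index outside the candidate list")
    return unique[choice]


def shortest_candidate_judge(
    instruction: str, observation: str, candidates: Sequence[str]
) -> int:
    """Toy judge used only for illustration: prefers the shortest action text."""
    return min(range(len(candidates)), key=lambda i: len(candidates[i]))
```

Only the selected action is executed, so no rollout of future states is needed. In a real system the toy judge would be replaced by a call to a multimodal model that sees the screenshot and the candidate list.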

Reinforcement Learning for Grounding Accuracy

For GUI grounding, most prior models rely on supervised fine-tuning to predict the center of target UI elements, which limits generalization. GTA1 adopts a reinforcement learning (RL) framework based on Group Relative Policy Optimization (GRPO). Rather than relying on intermediate reasoning (“thinking”) or predicting bounding boxes, the model learns directly from click-based rewards: it is rewarded only when the predicted coordinate falls within the correct UI element.

Through this reward structure, GTA1 achieves state-of-the-art accuracy without the complexity or overhead of chain-of-thought style supervision. Notably, an ablation study shows that removing auxiliary signals such as “thinking” or IoU-based box rewards actually improves grounding performance—particularly in static environments.
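
The click-based reward and the group-relative scoring that GRPO is named for can be sketched as below. The binary reward follows the description above. The advantage computation is the commonly used GRPO form (each reward minus the group mean, divided by the group standard deviation), and whether GTA1 uses exactly this normalization is an assumption. The function names and the box layout `(x_min, y_min, x_max, y_max)` are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout
Point = Tuple[float, float]              # (x, y) in the same pixel coordinates


def click_reward(point: Point, target: Box) -> float:
    """Return 1.0 if the predicted click lands inside the target element, else 0.0."""
    x, y = point
    x_min, y_min, x_max, y_max = target
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled click predictions for one instruction and one target element.
target_box: Box = (100.0, 200.0, 180.0, 240.0)
predictions: List[Point] = [
    (120.0, 210.0), (90.0, 210.0), (150.0, 239.0), (300.0, 50.0),
]
rewards = [click_reward(p, target_box) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Clicks inside the element receive positive advantages and misses receive negative ones, so the policy update favors accurate clicks without any reward term for box overlap or for intermediate reasoning text.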

Performance Across Benchmarks

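GTA1 sets a new standard in several evaluations:

    OSWorld: 45.2% task success rate.
    ScreenSpot-Pro: 50.1% grounding accuracy.
    ScreenSpot-V2: 94.8% cross-platform grounding accuracy.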

These results validate the effectiveness of both the planning and grounding innovations introduced in GTA1.

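Additional Design Highlights

    Data Cleaning: Erroneous annotations in datasets such as Aria-UI and OS-Atlas are filtered with OmniParser, improving the fidelity of the training signal.
    Model Scaling: The approach scales from 7B to 72B parameters, with GTA1-7B offering the best trade-off between performance and compute.
    Judge Reuse: The multimodal judge used in test-time scaling can be the same LLM that is used for planning, which reduces overhead.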

Conclusion

GTA1 demonstrates that robust and accurate GUI agents can be built using a modular two-stage framework enhanced by test-time planning diversity and precise RL-based grounding. By forgoing unnecessary complexity—such as chain-of-thought reasoning in static tasks—Salesforce AI has introduced a lean, effective agent architecture that pushes the frontier in open-ended digital interaction.


Check out the Paper, Codes, 7B Model, 32B Model and 72B Model. All credit for this research goes to the researchers of this project.

The post Salesforce AI Released GTA1: A Test-Time Scaled GUI Agent That Outperforms OpenAI’s CUA appeared first on MarkTechPost.
