Salesforce AI Released GTA1: A Test-Time Scaled GUI Agent That Outperforms OpenAI’s CUA

Salesforce AI Research has introduced GTA1, a graphical user interface (GUI) agent designed to operate autonomously in real operating system environments such as Linux. GTA1 addresses two bottlenecks in GUI agent development: ambiguous task planning and inaccurate grounding of actions. With a 45.2% task success rate on the OSWorld benchmark, GTA1 surpasses OpenAI’s CUA (Computer-Using Agent), establishing a new record among open-source models.

Core Challenges in GUI Agents

GUI agents typically translate high-level user instructions into action sequences—clicks, keystrokes, or UI interactions—while observing UI updates after each action to plan subsequent steps. However, two issues persist:

Planning Ambiguity

Grounding Precision

GTA1 introduces novel mechanisms to resolve both.

Smarter Planning via Test-Time Scaling

Traditional planners commit to a single action proposal at each decision point, limiting robustness. GTA1’s test-time scaling introduces a simple yet effective solution: concurrently sample multiple candidate actions at each step, and employ a multimodal judge model—typically a large language model—to evaluate and select the most appropriate one.

This technique avoids premature commitment to suboptimal plans and enables the agent to better explore execution paths without requiring future rollout, which is infeasible in GUI environments due to irreversible actions. Importantly, this method can work with any planner and scales well with increasing task complexity and action space size.
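The per-step selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GTA1 implementation: `Action`, `select_action`, `propose`, `judge`, and `toy_judge` are hypothetical names, the planner and judge are passed in as plain callables, and candidates are sampled sequentially here for simplicity, whereas the article describes concurrent sampling.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Action:
    """A hypothetical candidate action proposed by the planner."""
    description: str


def select_action(
    propose: Callable[[str, str], Action],
    judge: Callable[[str, str, Sequence[Action]], int],
    instruction: str,
    observation: str,
    num_candidates: int = 8,
) -> Action:
    """Sample several candidates for the current step and let a judge pick one.

    No action is executed while choosing, so nothing irreversible happens
    before the judge has compared the proposals.
    """
    candidates: List[Action] = [
        propose(instruction, observation) for _ in range(num_candidates)
    ]
    # Drop duplicates but keep order, so the judge compares distinct proposals.
    unique = list(dict.fromkeys(candidates))
    best_index = judge(instruction, observation, unique)
    return unique[best_index]


def toy_judge(instruction: str, observation: str, candidates: Sequence[Action]) -> int:
    """Stand-in for a multimodal judge model (assumption for illustration only).

    It simply prefers the candidate that shares the most words with the
    instruction; a real judge would also look at the screenshot.
    """
    words = set(instruction.lower().split())
    scores = [len(words & set(c.description.lower().split())) for c in candidates]
    return scores.index(max(scores))
```

Because the planner is only ever called through `propose`, the same selection step could wrap any planner, which matches the article's point that the method is planner-agnostic.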

Reinforcement Learning for Grounding Accuracy

For GUI grounding, most prior models rely on supervised fine-tuning to predict the center of target UI elements, which limits generalization. GTA1 adopts a reinforcement learning (RL) framework based on Group Relative Policy Optimization (GRPO). Rather than relying on intermediate reasoning (“thinking”) or predicting bounding boxes, the model learns directly from click-based rewards: it is rewarded only when the predicted coordinate falls within the correct UI element.
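To make the reward concrete, here is a minimal sketch of a binary click reward together with the group-relative normalization that gives GRPO its name. The function names, the `(left, top, right, bottom)` box layout, and the `eps` constant are assumptions for illustration; the exact reward shaping and normalization used to train GTA1 may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (left, top, right, bottom) in pixels
Point = Tuple[float, float]              # (x, y) in pixels


def click_reward(click: Point, target: Box) -> float:
    """Return 1.0 if the predicted click lands inside the target element, else 0.0."""
    x, y = click
    left, top, right, bottom = target
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Score each sampled response relative to its own group.

    Each reward is centered on the group mean and divided by the group
    standard deviation, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group of sampled clicks on the same screenshot, the hits receive positive advantages and the misses negative ones; when every sample hits or every sample misses, all advantages are zero and that group contributes no learning signal.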

Through this reward structure, GTA1 achieves state-of-the-art accuracy without the complexity or overhead of chain-of-thought style supervision. Notably, an ablation study shows that removing auxiliary signals such as “thinking” or IoU-based box rewards actually improves grounding performance—particularly in static environments.
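For readers unfamiliar with the IoU-based box reward mentioned in the ablation, the sketch below shows the standard intersection-over-union computation such a reward would typically build on. The `box_iou` name and the `(left, top, right, bottom)` box layout are assumptions; the article does not specify how the ablated reward was defined.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (left, top, right, bottom)


def box_iou(pred: Box, target: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in the range [0, 1]."""
    # Width and height of the overlap region; zero when the boxes do not overlap.
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0
```

Unlike a binary click reward, this signal gives graded credit for how closely a predicted box matches the element's extent, which is the extra objective the ablation found unnecessary for clicking accurately.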

Performance Across Benchmarks

GTA1 sets a new standard in several evaluations:

OSWorld (Task Success Rate): 45.2%

ScreenSpot-Pro (Grounding Accuracy): 50.1%

ScreenSpot-V2 (Cross-platform Grounding): 94.8%

OSWorld-G (Linux GUI Grounding): 67.7%

These results validate the effectiveness of both the planning and grounding innovations introduced in GTA1.

Additional Design Highlights

Data Cleaning

Model Scaling

Judge Reusability

Conclusion

GTA1 demonstrates that robust and accurate GUI agents can be built using a modular two-stage framework enhanced by test-time planning diversity and precise RL-based grounding. By forgoing unnecessary complexity—such as chain-of-thought reasoning in static tasks—Salesforce AI has introduced a lean, effective agent architecture that pushes the frontier in open-ended digital interaction.

Check out the Paper, Codes, 7B Model, 32B Model and 72B Model. All credit for this research goes to the researchers of this project.

The post Salesforce AI Released GTA1: A Test-Time Scaled GUI Agent That Outperforms OpenAI’s CUA appeared first on MarkTechPost.
