MarkTechPost@AI, January 4
OS-Genesis: A Novel GUI Data Synthesis Pipeline that Reverses the Conventional Trajectory Collection Process

 

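OS-Genesis proposes an interaction-driven reverse task synthesis method to address the difficulty of collecting high-quality trajectory data for training GUI agents. Traditional approaches rely on time-consuming human annotation or on synthetic data that struggles to reflect real-world diversity. OS-Genesis autonomously explores GUI elements, converts the resulting interactions into low-level instructions, and integrates them into high-level tasks. A Trajectory Reward Model (TRM) evaluates the coherence, logical flow, and completeness of the synthesized trajectories, yielding high-quality and diverse training data. The method performs well on benchmarks such as AndroidWorld and WebArena and markedly improves GUI agents' ability to learn and adapt autonomously.

🖱️ Through interaction-driven exploration, OS-Genesis records state transitions between GUI elements to collect the foundational data for task synthesis, removing the dependence of traditional methods on human annotation or predefined tasks.

📝 Using models such as GPT-4o, it converts GUI interactions into low-level instructions and combines them with user intent to build high-level goals, achieving deeper semantic understanding and improving task completeness.

🏆 A Trajectory Reward Model (TRM) evaluates the coherence, logical flow, and completeness of synthesized trajectories, ensuring the quality and diversity of the training data and providing strong data support for training GUI agents.

🚀 On benchmarks such as AndroidWorld and WebArena, OS-Genesis significantly improves GUI agents' task planning and execution, validating its effectiveness in complex, dynamic environments and demonstrating strong robustness.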

Designing GUI agents that perform human-like tasks on graphical user interfaces faces a critical obstacle: collecting high-quality trajectory data for training. Existing methods depend on expensive and time-consuming human supervision or on generating synthetic data, which can hardly reflect the diversity and dynamics in the real world. Those constraints significantly limit the GUI agents’ scalability and effectiveness and prevent them from acting autonomously and adapting to diverse and dynamic environments.

Traditional data acquisition for GUI agents is generally task-driven. Human annotation is labor-intensive: people must design tasks and then annotate trajectories for them. Synthetic data reduces the dependency on humans, but it still relies on pre-defined high-level tasks, which limit the scope and scale of the data. Errors in intermediate steps or conflicting objectives within a task produce incoherent trajectories and lower the quality of the training data. Together, these restrictions limit agents' ability to generalize to dynamic or unfamiliar environments.

Researchers from Shanghai AI Laboratory, The University of Hong Kong, Johns Hopkins University, Shanghai Jiao Tong University, the University of Oxford, and Hong Kong University of Science and Technology propose OS-Genesis, a strategy that addresses these challenges through interaction-driven reverse task synthesis. Instead of starting from predetermined tasks, GUI agents first explore the environment by clicking, scrolling, and typing on GUI elements. These interactions are then analyzed retrospectively: each one is transformed into a low-level instruction and contextualized as a high-level task. Data quality is maintained by a Trajectory Reward Model (TRM), which scores synthesized trajectories on coherence, logical flow, and completeness. Under this approach, even incomplete but meaningful trajectories can be used for training. By bridging the gap between abstract instructions and the dynamic nature of GUIs, the framework significantly enhances the quality and diversity of training data while eliminating the need for human supervision.
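The reverse direction described above (explore first, name the task afterwards) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy GUI is a hand-written dictionary, and `low_level_instruction` and `high_level_task` are simple string templates standing in for the model-generated descriptions the article mentions. It is not the authors' implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical toy GUI: each screen maps a clickable element to the next screen.
TOY_GUI = {
    "home": {"Settings": "settings", "Search": "search"},
    "settings": {"Wi-Fi": "wifi", "Back": "home"},
    "search": {"Back": "home"},
    "wifi": {"Back": "settings"},
}


@dataclass(frozen=True)
class Transition:
    pre_state: str    # screen before the action
    action_type: str  # e.g. "click"
    element: str      # GUI element acted on
    post_state: str   # screen after the action


def explore(gui, start, steps, rng):
    """Task-free exploration: act on random elements and record state changes."""
    state, triples = start, []
    for _ in range(steps):
        element = rng.choice(sorted(gui[state]))
        next_state = gui[state][element]
        triples.append(Transition(state, "click", element, next_state))
        state = next_state
    return triples


def low_level_instruction(t):
    """Template stand-in for a model that describes one observed transition."""
    return f"On the '{t.pre_state}' screen, {t.action_type} '{t.element}'."


def high_level_task(t):
    """Template stand-in for a model that infers a plausible user goal."""
    return f"Go from the '{t.pre_state}' screen to the '{t.post_state}' screen."


if __name__ == "__main__":
    for t in explore(TOY_GUI, "home", 4, random.Random(0)):
        print(low_level_instruction(t), "->", high_level_task(t))
```

The point of the sketch is the ordering: the transitions are recorded before any task exists, and the instructions are derived from what was actually observed, so each synthesized task is grounded in an interaction that is known to be executable.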

The OS-Genesis process consists of several integral components. First, the system autonomously explores dynamic GUI elements, recording transitions between pre- and post-action states to collect foundational data for task synthesis. These transitions are then transformed into detailed low-level instructions with the help of models like GPT-4o. Those instructions are in turn incorporated into comprehensive high-level objectives that reflect the user's overall intention, which gives the tasks semantic depth. The synthesized trajectories are then evaluated by the Trajectory Reward Model, which uses a graded scoring framework that emphasizes logical coherence and effective task completion. This supports the diversity and quality of the data and provides a strong basis for training.
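To make the graded evaluation step concrete, here is a small Python sketch of how a graded reward can be used to weight trajectories instead of filtering them out. The 1-to-5 scale, the heuristic `toy_reward` function, and the score-proportional sampling are all assumptions made for this illustration; the article only states that a reward model scores trajectories on coherence and completion, and that role is played here by a hand-written rule rather than a model.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    task: str        # synthesized high-level task
    steps: list      # low-level instructions that were executed
    completed: bool  # whether the final state satisfies the task


def toy_reward(traj):
    """Hypothetical stand-in for a reward model: graded score from 1 to 5."""
    score = 1
    if traj.steps:
        score += 1  # something meaningful was done
        if len(set(traj.steps)) == len(traj.steps):
            score += 1  # no repeated steps, a crude proxy for coherence
    if traj.completed:
        score += 2  # task completion is weighted most heavily
    return score


def sample_for_training(trajs, k, rng):
    """Sample with probability proportional to reward instead of hard filtering."""
    weights = [toy_reward(t) for t in trajs]
    return rng.choices(trajs, weights=weights, k=k)


if __name__ == "__main__":
    pool = [
        Trajectory("Open Wi-Fi settings", ["click Settings", "click Wi-Fi"], True),
        Trajectory("Open Wi-Fi settings", ["click Settings"], False),
        Trajectory("Open Wi-Fi settings", [], False),
    ]
    for traj in pool:
        print(toy_reward(traj), traj.steps)
    batch = sample_for_training(pool, 5, random.Random(0))
    print(len(batch), "trajectories sampled")
```

With weighting of this kind, an incomplete but coherent trajectory still has a chance of being selected, only less often than a complete one, which is one way partial data can contribute to training.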

Extensive experiments were conducted using benchmarks like AndroidWorld and WebArena, which mimic complex and dynamic environments. Vision-language models, namely Qwen2-VL and InternVL2, were used as the base frameworks for the training process. The training focused on improving both sophisticated task planning and precise low-level action execution to enable deep skill learning for GUI agents.

OS-Genesis was validated on several benchmarks. On AndroidWorld, success rates were nearly double those of task-driven methods, reflecting improved task planning and execution. On AndroidControl, the method performed well both at the high level of autonomous planning and at the low level of step-by-step execution, including on out-of-distribution examples, which indicates robustness. On WebArena, the approach consistently outperformed traditional baselines in complex, interactive environments. In summary, these results show that OS-Genesis can generate high-quality, diverse trajectories that improve the overall effectiveness of GUI agents.

OS-Genesis is a revolutionary step in the training of GUI agents, as it overcomes the limitations of current data collection methods. Its interaction-driven methodology and reward-based evaluation ensure high-quality and diverse training data that bridge the gap between abstract task instructions and dynamic GUI environments. This approach opens the way for significant progress in digital automation and AI research by enabling GUI agents to learn and adapt autonomously.


Check out the Paper, GitHub and Project Page. All credit for this research goes to the researchers of this project.


The post OS-Genesis: A Novel GUI Data Synthesis Pipeline that Reverses the Conventional Trajectory Collection Process appeared first on MarkTechPost.
