[Paper Walkthrough] OS-Genesis: Constructing GUI Data via Automatic Exploration

OS-Genesis is a novel GUI data-construction method: it automatically explores applications and uses the states and descriptions before and after each action to derive single-step and multi-step task instructions. GPT-4o is used for both data evaluation and reward modeling, safeguarding training-data quality. Experiments show that OS-Genesis clearly outperforms the Task-Driven, Zero-Shot, and Self-Instructions baselines in task completion rate, and agents trained on its data reach over 80% similarity to human-operated traces. Through careful data construction and quality control, the framework offers a new path toward stronger GUI agents.

✨ **Reverse-derived task data construction**: The core of OS-Genesis is its ability to explore a GUI automatically and reverse-derive task data from that exploration. It systematically traverses an application and uses the pre- and post-action states and their descriptions to generate low-level (single-step) and high-level (multi-step) instructions, producing a large corpus for model training. This approach captures realistic user-interaction flows.

🚀 **GPT-4o for data generation and evaluation**: GPT-4o plays a key role throughout the OS-Genesis pipeline. It generates context-appropriate text for input fields, and during the "reverse task synthesis" stage it writes detailed single-step and multi-step instructions. In addition, a GPT-4o-based Trajectory Reward Model (TRM) scores each trajectory's completion and coherence on a 1-5 scale, which lets incomplete but valuable data still contribute to training.

📊 **Clear experimental gains**: Compared against the Task-Driven, Zero-Shot, and Self-Instructions baselines, OS-Genesis shows a distinct advantage in task completion rate. Agents trained on OS-Genesis data exceed 80% similarity to real human GUI operations, demonstrating the method's effectiveness at improving agent performance.

💡 **A reward model to optimize data quality**: OS-Genesis replaces traditional labeler-based filtering with a Trajectory Reward Model (TRM). The TRM evaluates each trajectory's completion and logical coherence in fine detail and assigns scores even to incomplete but valuable trajectories, improving training-data quality without discarding useful information and ultimately boosting training results.

🎯 **Training strategy and data volume**: The article also examines how data volume affects model performance. Tests on AndroidWorld show that different amounts of OS-Genesis-generated data raise the task completion rate to different degrees, indicating that data scale and method effectiveness jointly determine how strong a GUI agent can be built.

OS-Genesis implements the "automatically explore the GUI, then reverse-derive task data" recipe for GUI data construction, and trains a model based on Qwen2-VL-7B. The data has not been open-sourced so far.

Main training pipeline:

Conclusions from the experiments:

Paper: arxiv.org/pdf/2412.19…

Training data:

App traversal: for TYPE actions, GPT-4o supplies context-matched input content

The system traverses interactive elements with actions a ∈ A = {CLICK, TYPE, SCROLL}. The automatic exploration follows specific rules (the paper does not elaborate on them); whenever input is required (TYPE), GPT-4o is called to generate content that fits the context. Exploration collects a large number of actions together with the states before and after each action, forming operation sequences.
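A minimal sketch of what such a rule-based exploration loop could look like, in Python. The environment driver (`env`), its methods, and the `gpt4o_fill_text` helper are all assumptions for illustration; the paper does not publish its traversal rules:

```python
import random

ACTIONS = {"CLICK", "TYPE", "SCROLL"}  # the action space A from the paper

def explore(env, gpt4o_fill_text, max_steps=50):
    """Rule-based traversal: act on interactive elements and record
    (pre-state, action, post-state) triples for later task synthesis."""
    triples = []
    for _ in range(max_steps):
        pre_state = env.screenshot()                 # state before acting
        element = random.choice(env.interactive_elements())
        action = random.choice(sorted(ACTIONS & element.supported_actions()))
        if action == "TYPE":
            # GPT-4o supplies input text that fits the current screen context
            env.type(element, gpt4o_fill_text(pre_state, element))
        elif action == "CLICK":
            env.click(element)
        else:
            env.scroll(element)
        triples.append((pre_state, action, env.screenshot()))
    return triples
```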


Reverse Task Synthesis: GPT-4o generates single-step and multi-step instructions

GPT-4o takes the collected operation sequences and generates low-level instructions and high-level instructions from them.

A low-level instruction describes a single-step task, derived from one transition, e.g., when on the WeChat home screen: open the contacts list.

A high-level instruction describes a more complete, multi-step task, derived from several consecutive transitions.
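Conceptually, each exploration transition yields one low-level instruction, and a run of consecutive transitions yields one high-level instruction. A plausible record shape (the field names are illustrative, not from the paper):

```python
# One low-level record per (pre-state, action, post-state) triple
low_level = {
    "instruction": "On the WeChat home screen, open the contacts list",
    "pre_screenshot": "step_003_before.png",
    "action": {"type": "CLICK", "element": "Contacts tab"},
    "post_screenshot": "step_003_after.png",
}

# One high-level record spanning several consecutive low-level steps
high_level = {
    "instruction": "Create a new blank project named 'OS-Genesis'",
    "steps": ["step_001", "step_002", "step_003"],  # ordered sub-steps
}
```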

The paper does not give the prompt for single-step task generation; the prompt for multi-step task generation is as follows:

Prompt for Associating High-Level Tasks

You are an expert at envisioning specific tasks corresponding to changes in mobile screenshots. I will provide you with the following:

1. The type of action currently being executed, which can be one of five types: CLICK, SCROLL, TYPE, PRESS_BACK, and LONG_PRESS. If the action is TYPE, an additional value representing the input will be provided. If the action is SCROLL, an additional scroll direction will be provided.
2. Screenshots of the interface before and after the current action is performed. If the action is CLICK, the pre-action screenshot will include a red bbox highlighting the element being interacted with (if applicable). Pay particular attention to the content of the element corresponding to the red bbox.
3. The name of the app where the current screenshot is located.

Your task is to envision a specific task based on the current action and the corresponding changes in screenshots. The output should include three parts:

1. Sub-Instruction: Based on the interface change caused by the current action, generate a corresponding natural language instruction for the current action. The instruction should be concise, clear, and executable. It must include specific details critical to the operation, such as file names, times, or other content as they appear in the screenshots. For example: "Scroll left to open the app drawer, displaying all installed applications on the device", "Click the chat interface, allowing the user to view and participate in conversation", "Type the username 'Agent', preparing for the next step in logging into the account".
2. Analysis: Based on the interface changes and the current action instructions, analyze the possible subsequent operations. This analysis should involve step-by-step reasoning, considering the potential changes on the screen and the actions that can be taken after these changes. For example: "After clicking the plus button, a dropdown menu appears with an option to create a document. I can select this option to create a new document. First, I need to name the document, then enter any content into the document, and finally save the document and exit".
3. High-Level-Instruction: Based on the analysis results, envision a high-level task that can be completed within the current interface. There are two types of High-Level-Instruction: Task-Oriented: Completing a series of operations to achieve a specific goal. Question-Oriented: Performing a series of operations and deriving an answer to a specific question. For example: {examples}. Ensure that the High-Level-Instruction is executable by including all critical specifics, such as file names, relevant timings, or required details.

You ONLY need to return a dictionary formatted as follows:
{
"Sub-Instruction": "xxx",
"Analysis": "xxx",
"High-Level-Instruction": "xxx"
}

Current Action: {current_action}
App Name: {app_name}
RETURN ME THE DICTIONARY I ASKED FOR.
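A sketch of how this template might be filled and sent to GPT-4o together with the before/after screenshots, using the OpenAI Python SDK. The helper names and the plain-string substitution are my choices; only the prompt text above comes from the paper:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def b64(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def synthesize_high_level(template, current_action, app_name, before_png, after_png):
    # str.replace instead of str.format: the template's literal JSON braces
    # would otherwise be misread as format fields
    prompt = (template
              .replace("{current_action}", current_action)
              .replace("{app_name}", app_name))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64(before_png)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64(after_png)}"}},
            ],
        }],
    )
    # Expected to be the dictionary the prompt asks for
    return resp.choices[0].message.content
```

Note that the template also contains an {examples} slot for few-shot examples, which the paper does not publish.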

Exploration and Reward Modeling: re-evaluating the completion of the generated data

As described, task data reverse-derived from the GUI is not guaranteed to be 100% accurate; for example, some tasks may contain redundant intermediate steps.

To ensure the quality and usefulness of these trajectories, OS-Genesis employs a Trajectory Reward Model (TRM). Built on GPT-4o, the TRM evaluates each trajectory for completion (how fully the task is achieved) and coherence (the logical order of actions), and assigns a graded reward score from 1 to 5. Unlike traditional binary filtering, the TRM allows incomplete but valuable trajectories to contribute to training.

Moreover, the strategy chosen for this reward model affects the final results; figures are given at the end.

The corresponding prompt:

Trajectory Reward Model Prompt

You are an expert in evaluating GUI agent task trajectories. Your task is to assess the quality and effectiveness of task trajectories for GUI manipulation tasks.

A trajectory consists of the following components:
1. High-level Instruction: Describes the user's intended task (e.g., "Create a new blank project name 'OS-Genesis'").
2. Action History: Includes two key parts:
- Reasoning and Action for Each Step: A sequence of actions performed by the agent, including the reasoning thought and final executed action.
- GUI Screenshots: Screenshots of the last state (if there are at least three states; otherwise, include all states).

When evaluating a trajectory, consider these key aspects:

Evaluation Criteria:
1. Trajectory Coherence:
- Do the low-level steps and corresponding actions follow a logical sequence toward the goal?
- Are the actions clearly described and specific?
- Are there redundant or unnecessary actions?
2. Task Completion:
- Does the trajectory successfully achieve the instructed task?
- Are all necessary interactions completed?
- Are error cases handled appropriately?

Scoring Guidelines:
Rate the trajectory on a scale of 1 to 5 based on the evaluation criteria:
- 5: The task is perfectly completed, successfully executing multiple actions to achieve the goal. The sequence is logically clear with no noticeable redundancies.
- 4: The task is mostly completed, successfully executing multiple actions. However, due to challenges or ambiguities in the instructions, the completion is not perfect, or there are inefficiencies in the process.
- 3: The task is partially completed, with some successful actions executed. However, due to task or environmental constraints, the goal is not fully achieved, or the sequence ends in a loop or error.
- 2: Only a few actions are executed. Although there is an attempt to complete the task, the trajectory deviates from the goal early on or demonstrates significant inefficiencies in execution and logic.
- 1: The task fails completely, with no meaningful actions executed at the start. The sequence either falls into an immediate deadlock, a repetitive loop, or demonstrates no value in completing the task. Or the tasks are completely inaccessible.

Note: If the task is relatively complex, but the trajectory demonstrates valuable attempts, even if the task is not fully completed, consider adjusting the score upward. However, if the task is complex but the trajectory fails to perform actions that contribute meaningfully to task completion, no extra points should be awarded. You need to judge the score based on the agent's actions and screenshots combined.

Response Format:
Format your response into two lines as shown below:
Reason: <your thoughts and reasoning process for the score>
Score: <your score from 1-5>
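Since the prompt pins the response to two lines, the score can be extracted with a simple parse. A minimal sketch (the regex and error handling are mine, not the paper's):

```python
import re

def parse_trm_response(text):
    """Extract the reason and the 1-5 score from the TRM's two-line
    'Reason: ... / Score: <n>' response."""
    score = re.search(r"Score:\s*([1-5])", text)
    if score is None:
        raise ValueError(f"no score found in TRM response: {text!r}")
    reason = re.search(r"Reason:\s*(.+)", text)
    return (reason.group(1).strip() if reason else ""), int(score.group(1))
```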

Training objective:

Plan + Action, similar in spirit to related work; however, grounding does not appear to be purely image-based and seems to involve the ViewTree. This part of the data has not been released, so it is hard to say for certain.
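Given that, one SFT sample presumably pairs the instruction and current observation with a planning thought plus the executable action. A hypothetical sample layout, since the exact format is unpublished:

```python
# Hypothetical "Plan + Action" SFT sample; every field name here is a guess,
# as the paper has not released this data or its schema.
sample = {
    "instruction": "Create a new blank project named 'OS-Genesis'",
    "observation": {
        "screenshot": "state_007.png",
        "view_tree": "<hierarchy>...</hierarchy>",  # grounding may use the ViewTree
    },
    "target": {
        "plan": "A dropdown appeared with a 'New project' option; select it, then type the name.",
        "action": {"type": "CLICK", "element_id": 42},
    },
}
```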


Experimental results:

Task completion

AndroidControl and AndroidWorld serve as the mobile benchmarks, and WebArena as the web benchmark;

GPT-4o is mainly used for reverse task synthesis and reward modeling; the decision models are InternVL2-4B/8B and Qwen2-VL-7B-Instruct, fully fine-tuned (SFT) on 8×A100 80GB GPUs. Results are as follows:

The four comparison settings are as follows:


Data quality

Agents trained on equal amounts of trajectory data are compared; models trained on OS-Genesis-generated trajectories reach over 80% similarity to human-operated GUI traces.


Impact of the reward model on the metrics; quoting the paper:

We introduce a Trajectory Reward Model (TRM) for data quality control and exploitation, substituting traditional labeler filtering methods (He et al., 2024; Murty et al., 2024a). To analyze its impact and for ablation purposes, we include additional settings for comparison: (1) training without an RM, where all synthesized data is treated equally during training, and (2) using a labeler, similar to previous approaches where only complete trajectories are retained for training.
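These three settings amount to three data-selection policies. A schematic comparison in Python (the function names and the score threshold are mine; whether TRM scores gate samples or instead weight them during training is not spelled out here):

```python
def no_rm(trajectories, scores):
    """(1) No reward model: all synthesized data is treated equally."""
    return trajectories

def labeler(trajectories, scores):
    """(2) Labeler-style binary filtering: keep only trajectories judged
    fully complete, discarding partial but still useful ones."""
    return [t for t, s in zip(trajectories, scores) if s == 5]

def trm(trajectories, scores, threshold=2):
    """(3) TRM: graded 1-5 scores let incomplete-but-valuable trajectories
    contribute; the cutoff value here is an assumption."""
    return [t for t, s in zip(trajectories, scores) if s >= threshold]
```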


Gains in AndroidWorld completion rate with training-data volume

Task performance on the AndroidWorld test set with different amounts of training data (success rates on AndroidControl are considerably higher than these).
