MarkTechPost@AI 03月16日
Meet PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

PC-Agent框架通过三个创新设计,解决了PC环境中复杂任务自动化的问题。该框架包含一个主动感知模块,通过可访问性树提取交互元素的位置和含义,并使用MLLM驱动的意图理解和OCR进行精确的文本定位。此外,分层多智能体协作实现了三级决策过程,通过管理智能体将指令分解为参数化子任务并管理依赖关系,进度智能体跟踪操作历史,决策智能体执行步骤,并结合感知和进度信息。基于反射的动态决策引入了一个反射智能体,可以评估执行的正确性并提供反馈。实验结果表明,PC-Agent的性能优于现有的单智能体和多智能体方法。

🔍 **主动感知模块**:通过可访问性树提取交互元素的位置和含义,并结合MLLM驱动的意图理解和OCR技术,实现对屏幕元素和文本的精确定位,提升了智能体在复杂PC环境中的感知能力。

🤝 **分层多智能体协作**:采用三级决策过程(指令-子任务-动作),由管理智能体分解指令并管理依赖关系,进度智能体跟踪操作历史,决策智能体根据感知和进度信息执行动作,有效降低了决策复杂性,提高了任务完成的效率。

🔄 **基于反射的动态决策**:引入反射智能体,评估执行的正确性并提供反馈,实现自顶向下的任务分解和自底向上的精确反馈,增强了智能体的鲁棒性和适应性。

📊 **实验验证**:在PC-Eval基准测试中,PC-Agent的性能显著优于其他单智能体和多智能体方法,证明了其在处理复杂PC任务方面的有效性。

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities across various domains, propelling their evolution into multi-modal agents for human assistance. GUI automation agents for PCs face particularly daunting challenges compared to smartphone counterparts. PC environments present significantly more complex interactive elements with dense, diverse icons and widgets often lacking textual labels, leading to perception difficulties. Even advanced models like Claude-3.5 achieve only 24.0% accuracy in GUI grounding tasks. Also, PC productivity tasks involve intricate workflows spanning multiple applications with lengthy operation sequences and inter-subtask dependencies, causing dramatic performance declines where GPT-4o’s success rate drops from 41.8% at subtask level to just 8% for complete instructions.

Previous approaches have developed frameworks to address PC task complexity with varying strategies. UFO implements a dual-agent architecture separating application selection from specific control interactions. Meanwhile, AgentS augments planning capabilities by combining online search with local memory. However, these methods demonstrate significant limitations in fine-grained perception and operation of on-screen text—a critical requirement for productivity scenarios like document editing. In addition, they generally fail to address the complex dependencies between subtasks, resulting in poor performance when handling realistic intra- and inter-app workflows that characterize everyday PC usage.

Researchers from MAIS, Institute of Automation, Chinese Academy of Sciences, China, School of Artificial Intelligence, University of Chinese Academy of Sciences, Alibaba Group, Beijing Jiaotong University, and School of Information Science and Technology, ShanghaiTech University introduce PC-Agent framework to address complex PC scenarios through three innovative designs. First, the Active Perception Module enhances fine-grained interaction by extracting locations and meanings of interactive elements via accessibility trees, while using MLLM-driven intention understanding and OCR for precise text localization. Second, Hierarchical Multi-agent Collaboration implements a three-level decision process (Instruction-Subtask-Action) where a Manager Agent decomposes instructions into parameterized subtasks and manages dependencies, a Progress Agent tracks operation history, and a Decision Agent executes steps with perception and progress information. Third, Reflection-based Dynamic Decision-making introduces a Reflection Agent that assesses execution correctness and provides feedback, enabling top-down task decomposition with bottom-up precision feedback across all four collaborating agents.

PC-Agent’s architecture addresses GUI interaction through a formalized approach where an agent ρ processes user instructions I, observations O, and history H to determine actions A. The Active Perception Module enhances element recognition using pywinauto to extract accessibility trees for interactive elements while employing MLLM-driven intention understanding with OCR for precise text localization. For complex workflows, PC-Agent implements Hierarchical Multi-agent Collaboration across three levels: the Manager Agent decomposes instructions into parameterized subtasks and manages dependencies; the Progress Agent tracks operation progress within subtasks; and the Decision Agent executes step-by-step actions based on environmental perception and progress information. This hierarchical division effectively reduces decision-making complexity by breaking complex tasks into manageable components with clear interdependencies.

Experimental results demonstrate PC-Agent’s superior performance compared to both single and multi-agent alternatives. Single MLLM-based agents (GPT-4o, Gemini-2.0, Claude3.5, Qwen2.5-VL) consistently fail on complex instructions, with even the best performer achieving only 12% success rate, confirming that single-agent approaches struggle with lengthy operational sequences and complex dependencies. Multi-agent frameworks like UFO and AgentS show modest improvements but remain limited by perception deficiencies and dependency management issues. They struggle with fine-grained operations such as text editing in Word or proper data entry in Excel, and often fail to utilize information from previous subtasks. In contrast, PC-Agent significantly outperforms all previous methods, surpassing UFO by 44% and AgentS by 32% in success rate through its Active Perception Module and hierarchical multi-agent collaboration.

This study introduces PC-Agent framework, a significant advancement in handling complex PC-based tasks through three key innovations. The Active Perception Module provides refined perception and operation capabilities, enabling precise interaction with GUI elements and text. The hierarchical multi-agent collaboration architecture effectively decomposes decision-making across instruction, subtask, and action levels, while reflection-based dynamic decision-making allows for real-time error detection and correction. Validation through the newly created PC-Eval benchmark with realistic, complex instructions confirms PC-Agent’s superior performance compared to previous methods, demonstrating its effectiveness in navigating the intricate workflows and interactive environments characteristic of PC productivity scenarios.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post Meet PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC appeared first on MarkTechPost.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

PC-Agent 多智能体 自动化 GUI MLLM
相关文章