MarkTechPost@AI · January 2
AutoDroid-V2: Leveraging Small Language Models for Automated Mobile GUI Control

AutoDroid-V2 is a new approach that uses Small Language Models (SLMs) for automated GUI control of mobile devices. Instead of the conventional step-wise GUI agent, it takes a script-based approach, generating multi-step scripts to carry out user instructions, which sharply reduces query frequency and resource consumption. AutoDroid-V2 builds app documentation offline, then generates and executes scripts online, completing tasks efficiently. Experiments show that, compared with existing methods, AutoDroid-V2 delivers significant gains in task completion rate, computational cost, and latency, while preserving device-level privacy and security. It remains limited on apps whose GUIs lack structured text representations, but integrating Vision Language Models (VLMs) may resolve this.

📱 AutoDroid-V2 takes a script-based approach, generating multi-step scripts to carry out user instructions; compared with conventional step-wise GUI agents, this sharply reduces query frequency and resource consumption.

⚙️ The system works in two stages, offline and online. The offline stage builds an app document, using AI-guided GUI state compression, automatic element XPath generation, and GUI dependency analysis to lay the groundwork for script generation; in the online stage, a local LLM generates a multi-step script that a domain-specific interpreter executes.

🚀 AutoDroid-V2 performs strongly across multiple benchmarks: task completion rates improve by 10.5%-51.7%, input and output token consumption drop by 43.5x and 5.8x respectively, and LLM inference latency falls by 5.7-13.4x.

🔒 AutoDroid-V2 runs an on-device SLM, preserving user privacy and security while maintaining task completion rates and avoiding the data-leak risks of cloud-based models.

Large Language Models (LLMs) and Vision Language Models (VLMs) have revolutionized the automation of mobile device control through natural language commands, offering solutions for complex user tasks. The conventional approach, “Step-wise GUI agents,” operates by querying the LLM at each GUI state for dynamic decision-making and reflection, continuously processing the user’s task, and observing the GUI state until completion. However, this method faces significant challenges as it relies heavily on powerful cloud-based models like GPT-4 and Claude. This raises critical concerns about privacy and security risks when sharing personal GUI pages, substantial user-side traffic consumption, and high server-side centralized serving costs, making large-scale deployment of GUI agents problematic.
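
For contrast, here is a minimal sketch of the step-wise pattern described above. The helper functions (capture_gui_state, query_cloud_llm, execute_action) are hypothetical stubs, not any real agent's API; the point to notice is the cloud LLM call inside the loop, once per GUI state:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Action:
    name: str          # e.g. "tap", "type", "finish"
    target: str = ""   # identifier of the GUI element to act on

# Stubs standing in for the device and the cloud model; a real agent
# would wrap the accessibility tree and an API such as GPT-4 here.
def capture_gui_state() -> str:
    return "<serialized GUI tree>"

def query_cloud_llm(task: str, state: str,
                    history: List[Tuple[str, Action]]) -> Action:
    return Action(name="finish")   # placeholder decision

def execute_action(action: Action) -> None:
    pass                           # tap / type / scroll on the device

def step_wise_agent(task: str, max_steps: int = 20) -> bool:
    """One cloud LLM query per GUI state: the pattern AutoDroid-V2 avoids."""
    history: List[Tuple[str, Action]] = []
    for _ in range(max_steps):
        state = capture_gui_state()
        action = query_cloud_llm(task, state, history)  # per-step round trip
        if action.name == "finish":
            return True
        execute_action(action)
        history.append((state, action))
    return False
```

Every iteration ships the current GUI state to the cloud, which is exactly what drives the privacy exposure, traffic consumption, and serving costs noted above.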

Earlier attempts to automate mobile tasks relied heavily on template-based methods like Siri, Google Assistant, and Cortana, which used predefined templates to process user inputs. More advanced GUI-based automation emerged to handle complex tasks without depending on third-party APIs or extensive programming. While some researchers focused on enhancing Small Language Models (SLMs) through GUI-specific training and exploration-based knowledge acquisition, these approaches faced significant limitations. Script-based GUI agents particularly struggled with the dynamic nature of mobile apps, where UI states and elements frequently change, making knowledge extraction and script execution challenging.

Researchers from the Institute for AI Industry Research (AIR), Tsinghua University, have proposed AutoDroid-V2 to investigate how to build a powerful GUI agent upon the coding capabilities of SLMs. Unlike traditional step-wise GUI agents that make decisions one action at a time, AutoDroid-V2 uses a script-based approach that generates and executes multi-step scripts based on user instructions. This addresses two critical limitations of conventional approaches: the heavy reliance on costly, privacy-sensitive cloud LLMs, and the per-step query pattern that inflates traffic and latency.
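
To make the contrast concrete, the sketch below shows what a generated multi-step script might look like. The tap/type_text helpers and XPaths are hypothetical stand-ins, since the paper's actual script language is not reproduced here; what matters is that a single LLM generation covers the whole task:

```python
# A hypothetical multi-step script of the kind a script-based agent might
# generate for a task like "reply 'on my way' to the latest message".
# These helpers are illustrative stubs, not AutoDroid-V2's actual DSL.
def tap(xpath: str) -> None:
    print(f"tap {xpath}")

def type_text(xpath: str, text: str) -> None:
    print(f"type {text!r} into {xpath}")

# Generated once per task, then run without further LLM calls:
tap("//android.widget.TextView[@resource-id='message_list']/item[1]")
tap("//android.widget.Button[@content-desc='Reply']")
type_text("//android.widget.EditText[@hint='Type a message']", "on my way")
tap("//android.widget.Button[@content-desc='Send']")
```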

AutoDroid-V2’s architecture consists of two distinct stages: offline and online processing. In the offline stage, the system begins by constructing an app document through a comprehensive analysis of the app exploration history. This document serves as a foundation for script generation, incorporating AI-guided GUI state compression, element XPath auto-generation, and GUI dependency analysis to ensure both conciseness and precision. During the online stage, when a user submits a task request, the customized local LLM generates a multi-step script, which is then executed by a domain-specific interpreter designed to handle runtime execution reliably and efficiently.
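
A minimal sketch of what such an online-stage interpreter might do, assuming a hypothetical `driver` object exposing find/tap/type_text primitives (the paper's actual interpreter also exploits the GUI dependency analysis from the app document and handles richer error recovery):

```python
import time

class ScriptInterpreter:
    """Sketch of a domain-specific interpreter: wait for each element,
    then act. The `driver` abstraction over the device is assumed."""

    def __init__(self, driver, timeout: float = 5.0):
        self.driver = driver
        self.timeout = timeout

    def _wait_for(self, xpath: str):
        # Poll the live GUI tree until the element appears or we time out;
        # one simple way to tolerate dynamic UI states at runtime.
        deadline = time.time() + self.timeout
        while time.time() < deadline:
            element = self.driver.find(xpath)
            if element is not None:
                return element
            time.sleep(0.2)
        raise RuntimeError(f"element not found: {xpath}")

    def run(self, script: list[dict]) -> None:
        for step in script:                       # execute steps in order
            element = self._wait_for(step["xpath"])
            if step["op"] == "tap":
                self.driver.tap(element)
            elif step["op"] == "type":
                self.driver.type_text(element, step["text"])
```

Resolving each XPath against the live GUI before acting is what lets a pre-generated script survive the dynamic UI changes that plagued earlier script-based agents.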

AutoDroid-V2's performance is evaluated across two benchmarks, testing 226 tasks on 23 mobile apps against leading baselines including AutoDroid, SeeClick, CogAgent, and Mind2Web. It shows significant improvements, achieving a 10.5%-51.7% higher task completion rate while reducing computational demands, with 43.5x and 5.8x reductions in input and output token consumption respectively and 5.7-13.4x lower LLM inference latency compared to baselines. Tested across different LLMs (Llama3.2-3B, Qwen2.5-7B, and Llama3.1-8B), AutoDroid-V2 shows consistent performance, with success rates ranging from 44.6% to 54.4% and a stable reversed redundancy ratio between 90.5% and 93.0%.

In conclusion, the researchers introduced AutoDroid-V2, which represents a significant advancement in mobile task automation through its document-guided, script-based approach utilizing on-device SLMs. The experimental results demonstrate that this script-based methodology substantially improves the efficiency and performance of GUI agents, achieving accuracy comparable to cloud-based solutions while maintaining device-level privacy and security. Despite these achievements, the system faces limitations with apps that lack structured text representations of their GUIs, such as Unity-based and Web-based applications. However, this challenge could be addressed by integrating Vision Language Models (VLMs) to recover structured GUI representations from visual features.


Check out the Paper. All credit for this research goes to the researchers of this project.
