MarkTechPost@AI, November 3, 2024
OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

OpenWebVoyager is an open-source framework designed to address the many challenges of multimodal web navigation. It achieves self-optimizing learning through an iterative feedback loop, can process both images and text, and delivers significant success-rate gains on several web navigation benchmarks, representing a breakthrough in multimodal web navigation.

🎯 OpenWebVoyager is an open-source framework that fosters continuous, self-optimizing learning cycles in real-world web environments. Through imitation learning within an iterative feedback loop, the agent first acquires basic navigation skills and then further improves its performance by exploring new tasks.

💻 The framework uses the vision-language model Idefics2-8b-instruct as its backbone, enabling it to process both images and text and better understand real-world scenarios. It improves itself through an exploration-feedback-optimization cycle in which GPT-4o continuously evaluates the correctness of each trajectory and the agent is updated accordingly.

📈 In its implementation, OpenWebVoyager optimizes Idefics2-8b-instruct, accumulating tasks from 48 websites spanning diverse domains and processing the data in a multimodal fashion. After iterative optimization, its success rate improves significantly on several web navigation benchmarks.

Designing autonomous agents that can navigate complex web environments raises many challenges, particularly when such agents must incorporate both textual and visual information. Traditionally, agents have had limited capability because they are confined to synthetic, text-based environments with well-engineered reward signals, which restricts their applicability to real-world web navigation tasks. A central challenge is that enabling a generally capable agent to interpret multimodal content, combining visual and textual inputs, without explicit feedback signals remains one of the hardest problems in AI. Such agents must also learn and dynamically adapt to ever-changing online environments, which in many cases requires continuous optimization and self-improvement across diverse web interfaces and navigation tasks.

Existing methods for web navigation rely on large language models such as GPT-4o or other closed-source multimodal models. While these perform well in structured, text-only environments, their resilience in complex real-world scenarios remains low. A few approaches, such as WebVoyager and VisualWebArena, extend these models to multimodal settings by incorporating screenshots alongside text, but they still rely on closed-source models and synthetic training settings. Because of limited multimodal perception and a lack of visual grounding in their underlying representations, these models cannot generalize outside controlled environments. Another limitation of existing approaches is their dependence on well-defined reward signals, which are mostly absent in real-world tasks. Although open-source vision-language models such as BLIP-2-T5 and LLaVA have become increasingly accessible, their shallow contextual understanding makes them unsuitable for complex web navigation and limits their application to unsupervised learning in real-world, multimodal scenarios.

Researchers from Zhejiang University, Tencent AI Lab, and Westlake University introduce OpenWebVoyager, an open-source framework that fosters continuous, self-optimizing learning cycles in real-world web environments. The agent starts with imitation learning (IL): by imitating demonstrations of interactions with web pages, it first learns basic navigation skills, after which its performance improves further through an iterative feedback loop of exploring new tasks, gathering feedback, and optimizing on successful trajectories. With the vision-language model Idefics2-8b-instruct as its backbone, OpenWebVoyager can process images and text, allowing it to better understand real-world scenarios. The framework self-improves in an exploration-feedback-optimization cycle in which GPT-4o continuously evaluates the correctness of each trajectory and the agent is updated iteratively according to its performance. This allows independent learning and optimization, moving one step further toward scalable, adaptable autonomous web agents.
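The cycle described above can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the authors' code: `sample_queries`, `run_task`, `judge`, and `finetune` are hypothetical placeholders for the four stages (query sampling, agent rollout, GPT-4o trajectory evaluation, and supervised fine-tuning).

```python
from typing import Callable, List

# Minimal sketch of the exploration-feedback-optimization cycle.
# All names here are hypothetical placeholders, not the paper's API;
# in the paper the agent is Idefics2-8b-instruct and the judge is GPT-4o.
Trajectory = List[dict]  # one (observation, action) record per navigation step


def optimization_cycle(
    agent,                                           # current policy weights
    sample_queries: Callable[[], List[str]],         # exploration: draw new task queries
    run_task: Callable[[object, str], Trajectory],   # roll the agent out on one query
    judge: Callable[[Trajectory], bool],             # feedback: correctness check per trajectory
    finetune: Callable[[object, List[Trajectory]], object],  # optimization step
    num_cycles: int = 3,
):
    for _ in range(num_cycles):
        queries = sample_queries()
        trajectories = [run_task(agent, q) for q in queries]
        successes = [t for t in trajectories if judge(t)]  # keep only judged-correct runs
        agent = finetune(agent, successes)                 # update on successful trajectories
    return agent
```

The key design choice this sketch captures is that only trajectories judged successful feed back into training, so the agent bootstraps from its own best behavior rather than from an explicit reward signal.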

For implementation, OpenWebVoyager employs Idefics2-8b-instruct, a model optimized for handling textual and visual data. In the initial imitation learning phase, tasks were accumulated from 48 websites spanning diverse domains such as e-commerce, travel, and news, yielding 1,516 task-specific queries. The training data consists of multimodal web trajectories with basic operation instructions that guide the agent. To handle complex page layouts, OpenWebVoyager combines the complementary inputs of accessibility trees and screenshots, then passes through an iterative optimization cycle: each iteration consists of sampling new queries, checking trajectories for success, and retaining successful trajectories to improve the model. This self-instructional method allows OpenWebVoyager to attend to variable visual elements and make operational decisions based on dynamic, multimodal web page features. In addition, the model processes up to three screenshots per task, ensuring more complete visual-textual grounding when carrying out tasks.
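As a rough illustration of this multimodal input format, the sketch below bundles the accessibility-tree text with a bounded screenshot history; the class and field names are assumptions for illustration, not the paper's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WebObservation:
    # Hypothetical container for the agent's per-step input; the real
    # pipeline feeds Idefics2-8b-instruct, but these names are illustrative.
    task: str
    a11y_tree: str                       # textual accessibility-tree snapshot
    screenshots: List[bytes] = field(default_factory=list)

    MAX_SCREENSHOTS = 3  # the article notes up to three screenshots per task

    def add_screenshot(self, png: bytes) -> None:
        # Keep only the most recent screenshots so the visual context stays bounded.
        self.screenshots = (self.screenshots + [png])[-self.MAX_SCREENSHOTS:]

    def to_text_prompt(self) -> str:
        # Text channel for the vision-language model; the screenshots travel
        # separately through the model's image inputs.
        return f"Task: {self.task}\nAccessibility tree:\n{self.a11y_tree}"
```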

OpenWebVoyager's large improvements in success rate across several web navigation benchmarks show rapid growth through the iterative cycles. Starting from a 19.9% success rate after the imitation learning phase, the agent's performance rises to 25.8% after three optimization cycles on the WebVoyager test set. In evaluations on unseen tasks and domains with the Mind2Web cross-task and cross-website sets, the agent improves baseline success rates from 6.3% to 19.6% on previously encountered domains, while its success rate on new websites increases by almost 4%. These improvements over the baselines underline the effectiveness of the OpenWebVoyager approach, through which the agent develops web navigation capabilities with sustained accuracy and scalability across diverse web situations.

In conclusion, OpenWebVoyager represents a breakthrough in multimodal web navigation by creating an adaptable, self-optimizing framework that improves itself over iterative cycles. By combining imitation learning with exploration and automated feedback, OpenWebVoyager’s approach advances the scope of autonomous web agents, allowing for scalability across diverse domains without extensive retraining. This innovative framework holds the potential to improve real-world web navigation in fields ranging from e-commerce to information retrieval, marking a significant stride toward self-sufficient, multimodal AI agents in dynamic online environments.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter. Don’t forget to join our 55k+ ML SubReddit.



