MarkTechPost@AI, July 26, 2024
This AI Paper Introduces AssistantBench and SeePlanAct: A Benchmark and Agent for Complex Web-Based Tasks

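This paper introduces AssistantBench, a benchmark for evaluating how well web agents carry out complex web tasks, and SeePlanAct, a web agent that improves task performance by adding a planning component and a memory buffer. AssistantBench contains 214 complex tasks that span different domains and require interaction with the web. SeePlanAct builds on the SEEACT model and uses its planning component and memory buffer to strengthen web navigation and task execution, achieving a notable performance gain on AssistantBench.

👨‍💻 **The AssistantBench benchmark:** AssistantBench is a benchmark for evaluating how well web agents carry out complex web tasks. It contains 214 complex tasks that span different domains and require interaction with the web, such as searching for information, comparing products, and booking flights. These tasks require an agent to understand natural-language instructions, interact with web pages, extract information from them, and make decisions based on the task goal.

🤖 **The SeePlanAct web agent:** SeePlanAct (SPA) is an improved version of the SEEACT model that adds a planning component and a memory buffer to improve task performance. The planning component lets SPA devise a strategy for each task and adjust it dynamically based on its interactions with web elements. The memory buffer stores information collected during the task so that SPA can use it effectively for the task's entire duration. These improvements let SPA interact more robustly with web elements, navigate dynamically, and revise its plan as needed, providing a more effective way to handle complex web tasks.

📈 **Performance evaluation:** Results on the AssistantBench benchmark show that SPA clearly outperforms earlier models in accuracy, answer rate, and precision. SPA reached an accuracy score of 11 points, a marked improvement over the 4.2 points of the SEEACT model. SPA also showed higher precision, a gain of roughly 10 points on the questions it answered. The improvement is attributed mainly to SPA's stronger ability to navigate web environments and to make effective use of the information it collects. Even with these advances, the overall accuracy of the best-performing model remained at roughly 25 points, which highlights the ongoing challenge of building highly reliable web-based AI solutions.

🏆 **Conclusion:** This research underscores the major challenges in developing AI systems that can carry out complex, real-world web tasks. The introduction of AssistantBench and SPA is an important step toward addressing them. A large gap nevertheless remains before reliable, high-precision AI solutions for web navigation are achieved, which emphasizes the need for continued innovation and improvement in this area. The progress made by the research team from Tel Aviv University, the University of Pennsylvania, the Allen Institute for AI, the University of Washington, and Princeton University is encouraging, but it also highlights the need for sustained research and development to close the gap in web-based AI capabilities.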

Artificial intelligence (AI) is dedicated to developing systems capable of performing tasks that typically require human intelligence. One persistent challenge is creating systems that can manage complex, realistic tasks requiring extensive interaction with dynamic environments. These tasks often involve searching for and synthesizing information from the web, a process that current models struggle to accomplish with high accuracy and reliability. This gap in capabilities highlights the need for more advanced AI systems.

Existing methods for addressing web-based tasks include closed-book language models (LMs) and retrieval-augmented LMs. Closed-book models rely solely on pre-existing knowledge encoded within their parameters, often resulting in hallucinations where the model generates incorrect information. Retrieval-augmented models attempt to gather and utilize relevant data from the web. However, the quality and relevance of the retrieved information can vary significantly, limiting the overall effectiveness of these models.

To address these challenges, researchers from Tel Aviv University, the University of Pennsylvania, the Allen Institute for AI, the University of Washington, and Princeton University have introduced ASSISTANTBENCH, a new benchmark for evaluating how well web agents perform realistic, time-consuming web tasks. The benchmark consists of 214 diverse tasks that span various domains and require web-based interaction. The researchers also proposed SEEPLANACT (SPA), a novel web agent designed to enhance task performance by incorporating a planning component and a memory buffer.

SPA builds upon the existing SEEACT model, introducing several improvements to enhance web navigation and task execution. The planning component enables SPA to strategize its approach to each task, allowing it to re-plan and adjust its strategy dynamically based on interactions with web elements. The memory buffer retains information gathered during the task, enabling SPA to utilize this information effectively throughout the task’s duration. These enhancements allow SPA to interact more robustly with web elements, navigate dynamically, and adjust its plan as needed, providing a more effective solution for handling complex web tasks.
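To make the interplay of planning and memory concrete, here is a minimal sketch of a plan, remember, act loop over a toy in-memory "web". It is not SPA's implementation: the names `Step`, `run_agent`, and `scripted_policy` are hypothetical, the toy pages are invented for illustration, and a hand-written policy stands in for the language model that would produce the plan, the note, and the action in a real agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Toy "web" (an assumption for illustration): page name -> (page text, linked pages).
Web = Dict[str, Tuple[str, List[str]]]

@dataclass
class Step:
    plan: str                # plan refreshed after seeing the current page
    note: Optional[str]      # information worth keeping in memory, or None
    action: Tuple[str, str]  # ("goto", page_name) or ("answer", final_text)

# Stand-in for the language model: (task, plan, memory, page_text, links) -> Step
Policy = Callable[[str, str, List[str], str, List[str]], Step]

def run_agent(task: str, web: Web, start: str, policy: Policy,
              max_steps: int = 10) -> Optional[str]:
    plan = "no plan yet"
    memory: List[str] = []          # memory buffer that persists across pages
    page = start
    for _ in range(max_steps):
        text, links = web[page]
        step = policy(task, plan, memory, text, links)
        plan = step.plan            # re-planning happens on every step
        if step.note is not None:
            memory.append(step.note)
        kind, arg = step.action
        if kind == "answer":
            return arg
        if kind == "goto" and arg in links:
            page = arg              # an invalid link leaves the agent in place
    return None                     # step budget exhausted: abstain

def scripted_policy(task, plan, memory, text, links):
    if "price" in text:
        # On a gym page: remember the price, then head back to the index.
        return Step("return to index", text, ("goto", "home"))
    unvisited = [l for l in links if not any(m.startswith(l) for m in memory)]
    if unvisited:
        return Step(f"visit {unvisited[0]}", None, ("goto", unvisited[0]))
    cheapest = min(memory, key=lambda m: int(m.split()[-1]))
    return Step("answer", None, ("answer", cheapest.split()[0]))

web: Web = {
    "home": ("index of gyms", ["gym_a", "gym_b"]),
    "gym_a": ("gym_a price 30", ["home"]),
    "gym_b": ("gym_b price 25", ["home"]),
}
answer = run_agent("which gym is cheaper?", web, "home", scripted_policy)
```

In this toy run the agent visits both gym pages, stores each price in the memory buffer, and only answers once the buffer holds what the task needs. Without the buffer, the first price would be lost as soon as the agent navigated away from that page.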

Performance evaluations of SPA on the ASSISTANTBENCH benchmark showed significant improvements over previous models. SPA achieved an accuracy score of 11 points, a substantial increase compared to the 4.2 points achieved by the earlier SEEACT model. Moreover, SPA demonstrated higher precision, a gain of roughly 10 points on the questions it answered. This improvement was primarily due to SPA’s enhanced ability to navigate web environments and utilize gathered information effectively. Despite these advancements, the overall accuracy of the best-performing models remained at roughly 25 points, highlighting the continued challenges in developing highly reliable web-based AI solutions.

In more detailed performance metrics, SPA’s integration of planning and memory components allowed it to outperform other models in terms of answer rate and precision. SPA’s answer rate was 38.8%, compared to the 20% achieved by the earlier SEEACT model. The precision of SPA was also higher, at 29.0%, compared to the 19.6% of SEEACT. Combining SPA with a closed-book model, the ensemble model achieved the best overall performance, with an accuracy of 25.2 points, further emphasizing SPA’s effectiveness in improving task performance.
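The three reported metrics are related, and a small sketch helps show how. The definitions below are an assumption, since the article does not spell them out: answer rate is the share of tasks the agent answers at all, precision is the mean score over answered tasks only, and accuracy is the mean score over all tasks with abstentions counted as zero. Under these definitions accuracy equals answer rate times precision. The function name `summarize` is hypothetical.

```python
from typing import Dict, List, Optional

def summarize(scores: List[Optional[float]]) -> Dict[str, float]:
    """scores[i] is the task score in [0, 1], or None if the agent abstained."""
    if not scores:
        raise ValueError("need at least one task")
    answered = [s for s in scores if s is not None]
    answer_rate = len(answered) / len(scores)
    # Precision: average score over the tasks the agent actually answered.
    precision = sum(answered) / len(answered) if answered else 0.0
    # Accuracy: average over all tasks, abstentions contribute zero.
    accuracy = sum(answered) / len(scores)
    return {"answer_rate": answer_rate, "precision": precision, "accuracy": accuracy}

# Five toy tasks: three answered (scores 1.0, 0.5, 0.0) and two abstentions.
m = summarize([1.0, 0.5, None, None, 0.0])
```

Applied to the reported figures, 0.388 × 0.290 ≈ 0.113, which is in line with SPA's accuracy of 11 points. For SEEACT, 0.20 × 0.196 ≈ 0.039 against a reported 4.2 points, so the relation holds only approximately on the published numbers.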

To conclude, this research underscores the critical challenges in developing AI systems capable of performing realistic, time-consuming web tasks. The introduction of ASSISTANTBENCH and SPA represents a significant step forward in addressing these challenges. However, a considerable gap remains in achieving reliable, high-precision AI solutions for web navigation, emphasizing the need for continued innovation and improvement in this field. The advancements made by the research teams from Tel Aviv University, the University of Pennsylvania, the Allen Institute for AI, the University of Washington, and Princeton University are promising but highlight the necessity for ongoing research and development to bridge the gap in web-based AI capabilities.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.

The post This AI Paper Introduces AssistantBench and SeePlanAct: A Benchmark and Agent for Complex Web-Based Tasks appeared first on MarkTechPost.
