Baichuan Intelligence (百川智能) and Tianjin University Explore an Agent Framework and Top GAIA, the Leaderboard for Complex LLM Tasks

2024-07-23 18:57 Beijing

Baichuan Intelligence and Tianjin University recently released the "Sibyl System" agent framework, which took first place on the GAIA leaderboard.

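Academia began exploring agent technology long ago, but the technology of the time never allowed practical progress in general-purpose domains.

The arrival of large language models brought general-purpose, generalizable intelligence and has once again made agents a focus of industry attention.

A large language model combined with an agent can understand human instructions, draw up a plan, and call tools to complete the assigned work. This pairing has been described as the final link in bringing AI into real-world applications.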
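The instruction, plan, and tool-call cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Sibyl System implementation: the planner is a hard-coded stand-in for the language model, and the names `scripted_planner`, `run_agent`, `lookup`, `calculator`, and the `pint_in_ml` fact table are all hypothetical.

```python
import operator
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (tool name, tool argument)

# Arithmetic operators supported by the toy calculator tool.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expression: str) -> str:
    """Evaluate a simple 'a op b' expression, e.g. '473 * 2'."""
    left, symbol, right = expression.split()
    return str(OPS[symbol](float(left), float(right)))


def lookup(key: str) -> str:
    """Stand-in for a search or browsing tool, backed by a fixed toy table."""
    facts = {"pint_in_ml": "473"}  # rounded illustrative value
    return facts.get(key, "unknown")


TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator, "lookup": lookup}


def scripted_planner(instruction: str, observations: List[str]) -> Optional[Step]:
    """Hypothetical planner: a real agent would ask the LLM for the next step."""
    if len(observations) == 0:
        return ("lookup", "pint_in_ml")
    if len(observations) == 1:
        return ("calculator", f"{observations[0]} * 2")
    return None  # plan finished


def run_agent(
    instruction: str,
    planner: Callable[[str, List[str]], Optional[Step]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> str:
    """Repeatedly ask the planner for a step, run the tool, and record the result."""
    observations: List[str] = []
    for _ in range(max_steps):
        step = planner(instruction, observations)
        if step is None:
            break
        tool_name, argument = step
        observations.append(tools[tool_name](argument))
    return observations[-1] if observations else ""


answer = run_agent("How many millilitres are in two US pints?", scripted_planner, TOOLS)
print(answer)
```

Passing the planner in as a function keeps the loop independent of any particular model; the `max_steps` cap is one simple way to stop a plan that never terminates.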

What is GAIA?

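GAIA is a new benchmark proposed in November 2023 by Meta, Hugging Face, and AutoGPT. It mainly evaluates how agents plan and execute complex tasks.

The benchmark exposes capability gaps in existing models and points to directions for improvement in model and agent development.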

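Unlike broad knowledge benchmarks such as MMLU and BBH, GAIA's questions are closer to the real world:

They require the AI to combine reasoning, multimodal understanding (text, images, audio and video), web browsing, and tool use.

The questions are conceptually easy for humans yet extremely challenging for models: GPT-4 succeeded on only 15% of them in testing, while human participants reached 92%.

Solving a question involves a long chain of logic: most need roughly 10 steps and 1 to 3 tools.

Each question takes a long time to complete: a human needs roughly 6 to 17 minutes per question.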
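A success rate such as the 15% and 92% figures above is the share of questions whose final answer matches the reference answer. The sketch below shows one way to compute it; it is not the official GAIA scorer, and the normalization rules (case folding, whitespace collapsing, numeric comparison) as well as the function names are assumptions made for illustration.

```python
from typing import List


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(answer.strip().lower().split())


def is_match(predicted: str, expected: str) -> bool:
    """Compare numerically when both parse as numbers, else as normalized text."""
    try:
        return float(predicted.replace(",", "")) == float(expected.replace(",", ""))
    except ValueError:
        return normalize(predicted) == normalize(expected)


def success_rate(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(is_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)


# Toy data, invented for this example only.
print(success_rate(["90", " Paris ", "blue"], ["90.0", "paris", "red"]))
```

Comparing numbers as floats means "90" and "90.0" count as the same answer, which matters for questions that ask for a rounded value.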

GAIA's test set is divided into three difficulty levels. Examples follow:

Level 1:

What was the actual enrollment count of the clinical trial on H. pylori in acne vulgaris patients from Jan-May 2018 as listed on the NIH website?

Level 2:

If this whole pint is made up of ice cream, how many percent above or below the US federal standards for butterfat content is it when using the standards as reported by Wikipedia in 2020? Answer as + or - a number rounded to one decimal place.

Level 3:

I was referencing each of the tables in the file from papers that were cited by the "Trans fatty acid contents in chocolates and chocolate wafers in Turkey" paper. I lost my own reference sheet and need to know which of the papers each table came from. The file may not use the full table caption. If the references in the "Trans fatty acid" paper bibliography were numbered starting with 1, give me the numbers in the order that they would be used to fill the cells in the Excel file from top to bottom, as a comma separated list.
