The AI detective: The Needle in a Haystack test and how Gemini 1.5 Pro solves it

Google's Gemini 1.5 Pro excels at the "Needle in a Haystack" test, accurately retrieving specific information from massive amounts of text, video, and audio data spanning up to 1 million tokens. The test evaluates a large language model's (LLM) ability to retrieve specific information from its context window, and Gemini 1.5 Pro performs exceptionally well here, demonstrating its strength in handling long-context information.

😊 **The Needle in a Haystack test: evaluating an LLM's information retrieval ability** The test embeds a random statement (the "needle") in a long context (the "haystack") and then prompts the LLM to retrieve it. The process involves inserting the needle into the context, prompting the LLM to retrieve the specific statement, measuring performance by iterating over different context lengths and document depths, and finally evaluating the results with detailed scoring and averaging.

🤩 **Gemini 1.5 Pro: the "master detective" of information retrieval** Google DeepMind's research paper shows that Gemini 1.5 Pro achieves near-perfect recall (>99.7%) of specific information (the "needle") across up to 1 million tokens of text, video, and audio data. Even when the context is extended to 10 million text tokens, 9.7 million audio tokens, and 9.9 million video tokens, Gemini 1.5 Pro maintains excellent recall. Today, Gemini 1.5 Pro supports a 2 million token context window, the largest of any model provider.

🎥 **Needle in a haystack in video: Gemini 1.5 Pro spots key information in video** Gemini 1.5 Pro can retrieve a "secret word" from random frames in a video up to 10.5 hours long, with Gemini 1.5 Flash also achieving near-perfect recall (99.8%). It can even identify a scene in a video from a hand-drawn sketch, demonstrating its multimodal capabilities. The technology has great potential in healthcare, sports, content creation, and other fields, for example analyzing lengthy surgical recordings, analyzing game activity and injuries, and streamlining video editing workflows.

🎙️ **Needle in a haystack in audio: Gemini 1.5 Pro spots key information in audio** Gemini 1.5 Pro and Gemini 1.5 Flash achieve 100% accuracy in retrieving a hidden keyword from audio signals up to 107 hours (nearly five days) long. This is useful for improving the accuracy of audio transcription and captioning in noisy environments, identifying specific keywords in recorded legal conversations, and performing sentiment analysis on customer support calls.

💬 **Multi-round co-reference resolution: Gemini 1.5 Pro remembers key information across long conversations** The multi-round co-reference resolution test challenges AI models to reproduce specific responses from earlier in a long conversation. It is like asking someone to recall a particular comment from a conversation that happened days ago, which is difficult even for humans. Gemini 1.5 Pro and Gemini 1.5 Flash excel at this task, maintaining 75% accuracy even when the context window stretches to 1 million tokens. This demonstrates their ability to reason, disambiguate, and maintain context over extended periods. The capability has significant real-world implications wherever AI systems must interact with users over long periods, maintain context, and provide accurate responses, for example customer service chatbots handling complex inquiries that require referencing previous interactions and providing consistent, accurate information.

🎯 **Multiple needles in a haystack: Gemini 1.5 Pro identifies several pieces of key information** Across 1 million tokens of data, Gemini 1.5 Pro can identify multiple "needles" while maintaining 60% recall. Although performance drops slightly compared with the single-needle task, this highlights the model's ability to handle more complex retrieval scenarios, such as identifying and extracting multiple pieces of information from large and potentially noisy datasets.

📊 **Gemini 1.5 Pro vs. GPT-4: Gemini 1.5 Pro comes out ahead** Gemini 1.5 Pro outperforms GPT-4 on the "multiple needles in a haystack" task, which requires retrieving 100 unique "needles" in a single turn. Gemini 1.5 Pro maintains high recall (>99.7%) up to 1 million tokens and still performs well at 10 million tokens (99.2%), while GPT-4 Turbo is limited to a 128k token context length. On this task, GPT-4 Turbo's performance "largely oscillates" as context length grows, with an average recall of about 50% at its maximum context length.

🧠 **Gemini 1.5 Pro's secret weapon: the mixture-of-experts model** Gemini 1.5 Pro's outstanding performance stems from its advanced architecture, multimodal capabilities, and innovative training techniques. It makes significant architectural changes by adopting a mixture-of-experts (MoE) model built on the Transformer architecture. MoE models use a learned routing function (think of it as a dispatcher in a detective agency) to direct different inputs to specialized components within the model. This allows the model to expand its overall capability while using only the resources needed for a given task.

🔮 **The future of AI: finding "needles" in ever-larger "haystacks"** AI's true value lies not just in its ability to process information, but in its capacity to understand and engage in meaningful conversations. These Needle in a Haystack tests show that Gemini 1.5 Pro and Gemini 1.5 Flash are pushing the boundaries of what is possible, proving they can handle even the most complex and lengthy dialogues. It is not just about responding; it is about understanding and connecting across modalities, a major step towards AI that behaves more like a truly intelligent conversational partner.

🤩 **Try your own Needle in a Haystack test with Gemini 1.5 Pro's 2 million token context window on Vertex AI.**

Imagine a vast library filled with countless books, each containing a labyrinth of words and ideas. Now, picture a detective tasked with finding a single, crucial sentence hidden somewhere within this literary maze. This is the essence of the "Needle in a Haystack" test for AI models, a challenge that pushes the boundaries of their information retrieval capabilities.

Generated using Imagen 2. Prompt: A detective looking for a needle in a haystack. The detective is mostly covered by shadows holding a magnifying glass.

In the realm of artificial intelligence, this test is not about finding a physical needle; it measures how well a large language model (LLM) can retrieve specific information from large amounts of data in its context window. It's a trial by fire for LLMs, assessing their ability to sift through a sea of data and pinpoint the exact information needed.

The test embeds a random statement ("needle") within a long context ("haystack") and prompts the LLM to retrieve it; a minimal code sketch of this loop follows the list below. Key steps include:

  • Insert the needle: Place a random fact or statement within a long context window.

  • Prompt the LLM: Ask the model to retrieve the specific statement.

  • Measure performance: Iterate through different context lengths and document depths.

  • Score the results: Provide detailed scoring and calculate an average.
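
To make the procedure concrete, here is a minimal sketch of the loop in Python. The `query_llm` helper, the filler corpus, the needle sentence, and the exact-match scoring are all illustrative placeholders, not the setup used in the paper; real harnesses measure length in tokens rather than characters and often use graded, judge-based scoring.

```python
def build_haystack(filler_text: str, needle: str, context_len: int, depth: float) -> str:
    """Truncate filler text to roughly context_len characters and insert the
    needle at a relative depth between 0.0 (start) and 1.0 (end)."""
    haystack = filler_text[:context_len]
    insert_at = int(len(haystack) * depth)
    return haystack[:insert_at] + " " + needle + " " + haystack[insert_at:]


def run_needle_test(filler_text, query_llm, context_lens, depths):
    """Iterate over context lengths and needle depths, score each retrieval,
    and return the average score."""
    needle = "The secret ingredient in the detective's coffee is cardamom."
    scores = []
    for context_len in context_lens:
        for depth in depths:
            context = build_haystack(filler_text, needle, context_len, depth)
            prompt = (
                context
                + "\n\nUsing only the document above, what is the secret "
                + "ingredient in the detective's coffee?"
            )
            answer = query_llm(prompt)
            scores.append(1.0 if "cardamom" in answer.lower() else 0.0)
    return sum(scores) / len(scores)


# Example usage, where query_llm wraps whichever model API you are testing:
# recall = run_needle_test(corpus, query_llm,
#                          context_lens=[10_000, 100_000, 1_000_000],
#                          depths=[0.0, 0.25, 0.5, 0.75, 1.0])
```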

The 2 million token challenge

An AI model's context window is like its short-term memory. Google’s Gemini 1.5 Pro has an industry-leading 2 million token context window, roughly equivalent to 1.5 million words or 5,000 pages of text! This is transformative for AI applications requiring understanding and responding to lengthy inputs.
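
For reference, the rough arithmetic behind those figures, assuming the common rules of thumb of about 0.75 words per token and about 300 words per printed page (both ratios vary with language and formatting):

```python
tokens = 2_000_000
words = tokens * 0.75   # ~0.75 words per token (rule of thumb)
pages = words / 300     # ~300 words per page (rule of thumb)
print(f"{words:,.0f} words, roughly {pages:,.0f} pages")  # 1,500,000 words, roughly 5,000 pages
```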

However, a large context window also presents challenges. More information makes it harder to identify and focus on relevant details. So we use the Needle in the Haystack test to measure recall, and Google's Gemini 1.5 Pro has emerged as a star performer.

Google Gemini 1.5 Pro: The master detective

In Google DeepMind’s research paper, Gemini 1.5 Pro demonstrates near-perfect recall (>99.7%) of specific information ("needle") within a vast context ("haystack") of up to 1 million tokens across text, video, and audio modalities. This exceptional recall persists even with contexts extended to 10 million tokens for text, 9.7 million for audio, and 9.9 million for video. While this was an internal test, Gemini 1.5 Pro supports a 2M token context window (the largest of any model provider today).

Gemini 1.5 Pro achieves near-perfect “needle” recall (>99.7%) up to 1M tokens of “haystack” in all modalities, i.e., text, video and audio.

Let’s test all the haystacks

The following benchmark data showcases the impressive advancements made with Gemini 1.5 Pro, particularly in handling long-context text, video, and audio. It not only holds its own against the February 2024 1.5 Pro release but also demonstrates significant improvements over its predecessors, 1.0 Pro and 1.0 Ultra.

Gemini 1.5 Pro win-rates compared to Gemini 1.5 Pro from the February 2024 release, as well as the Gemini 1.0 family. Gemini 1.5 Pro maintains high levels of performance even as its context window increases.

Let’s dive deeper.

Video Haystack: Gemini 1.5 Pro retrieved “a secret word” from random frames within a 10.5-hour video, with Gemini 1.5 Flash also achieving near-perfect recall (99.8%) for videos up to 2 million tokens. The model even identified a scene from a hand-drawn sketch, showcasing its multimodal capabilities!

When prompted with a 45-minute Buster Keaton movie “Sherlock Jr.” (1924) (2,674 frames at 1FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame and provides the corresponding timestamp. At the bottom right, the model identifies a scene in the movie from a hand-drawn sketch.

This has high potential for fields like healthcare to analyze lengthy surgical recordings, sports to analyze game activities and injuries, or content creation to streamline the video editing process. 
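
As a concrete sketch of the video use case, the snippet below asks a question about a long video through the Vertex AI Python SDK. The project, location, bucket path, prompt, and model ID are placeholders to adapt to your own environment:

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

# Placeholder project, region, and video URI; substitute your own.
vertexai.init(project="my-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")

video = Part.from_uri("gs://my-bucket/long_recording.mp4", mime_type="video/mp4")
response = model.generate_content([
    video,
    "Find the moment when a piece of paper is taken out of a pocket. "
    "What text is written on it, and at what timestamp does this happen?",
])
print(response.text)
```

The same pattern covers the audio case: pass a `Part` referencing an audio file (with the matching MIME type) instead of a video.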

Audio Haystack: Both Gemini 1.5 Pro and Flash exhibited 100% accuracy in retrieving a secret keyword hidden within an audio signal of up to 107 hours (nearly five days!). You can imagine this being useful for improving the accuracy of audio transcription and captioning in noisy environments, identifying specific keywords during recorded legal conversations, or sentiment analysis during customer support calls.

Multi-round co-reference resolution (MRCR): The MRCR test throws a curveball at AI models with lengthy, multi-turn conversations, asking them to reproduce specific responses from earlier in the dialogue. It's like asking someone to remember a particular comment from a conversation that happened days ago — a challenging task even for humans. Gemini 1.5 Pro and Flash excelled, maintaining 75% accuracy even when the context window stretched to 1 million tokens! This showcases their ability to reason, disambiguate, and maintain context over extended periods.   

This capability has significant real-world implications, particularly in scenarios where AI systems need to interact with users over extended periods, maintaining context and providing accurate responses. Imagine customer service chatbots handling intricate inquiries that require referencing previous interactions and providing consistent and accurate information.   
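
To make the task concrete, here is a minimal sketch of an MRCR-style probe, again assuming a generic `query_llm(prompt)` helper. The transcript content, the wording, and the simple containment check are invented for illustration; the published benchmark has its own prompts and scoring.

```python
def mrcr_probe(query_llm, n_turns: int = 500, target: int = 137) -> bool:
    """Build a long multi-turn transcript, then ask the model to reproduce one
    specific earlier assistant reply verbatim."""
    turns = []
    for i in range(1, n_turns + 1):
        turns.append(f"User: Write a one-line slogan for product #{i}.")
        turns.append(f"Assistant: Product #{i}: quality you can count on, {i} times over.")
    transcript = "\n".join(turns)
    prompt = (
        transcript
        + f"\n\nUser: Reproduce, word for word, your earlier reply about product #{target}."
    )
    expected = f"Product #{target}: quality you can count on, {target} times over."
    return expected in query_llm(prompt)
```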

Multiple needles in a haystack: While finding a single needle in a haystack is impressive, Gemini 1.5 tackles the challenging task of finding multiple needles in a haystack. Even when faced with 1 million tokens, Gemini 1.5 Pro maintains a remarkable 60% recall rate. This performance, while showing a slight decrease compared to the single-needle task, highlights the model's capacity to handle more complex retrieval scenarios, where multiple pieces of information need to be identified and extracted from a large and potentially noisy dataset.   
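
Recall in the multi-needle setting is simply the fraction of planted needles that come back in the model's answer; a minimal scoring sketch (the needle format and helper are assumptions):

```python
def multi_needle_recall(answer: str, needles: list[str]) -> float:
    """Return the fraction of planted needles that appear in the answer."""
    found = sum(1 for needle in needles if needle.lower() in answer.lower())
    return found / len(needles)

# Example: if 60 of 100 planted needles are returned, recall is 0.6.
```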

Comparison to GPT-4: Gemini 1.5 Pro outperforms GPT-4 in a “multiple needles-in-haystack” task, which requires retrieving 100 unique needles in a single turn. It maintains high recall (>99.7%) up to 1 million tokens and still performs well at 10 million tokens (99.2%), while GPT-4 Turbo is limited by its 128k token context length. GPT-4 Turbo's performance on this task “largely oscillates” with longer context lengths, with an average recall of about 50% at its maximum context length.

Retrieval performance of the “multiple needles-in-haystack” task, which requires retrieving 100 unique needles in a single turn. When comparing Gemini 1.5 Pro to GPT-4 Turbo we observe higher recall at shorter context lengths, and a very small decrease in recall towards 1M tokens.

Gemini 1.5 Pro's secret weapon

What makes Gemini 1.5 Pro such a master detective? It's the combination of advanced architecture, multimodal capabilities, and innovative training techniques. It incorporates significant architectural changes, using a mixture-of-experts (MoE) model built on the Transformer architecture. MoE models utilize a learned routing function (think of it like a dispatcher in a detective agency) to direct different parts of the input data to specialized components within the model. This allows the model to expand its overall capabilities while only using the necessary resources for a given task.
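
Gemini's exact expert layout is not public, so purely as a toy illustration of the routing idea, here is a minimal top-k gated MoE layer in NumPy; the gate, the experts, and the dimensions are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)


def moe_layer(x: np.ndarray, experts: list, gate_w: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Minimal mixture-of-experts layer: a learned gate (here a linear map plus
    softmax) scores every expert for each token, only the top_k experts run,
    and their outputs are combined weighted by the gate's scores."""
    out = np.zeros_like(x)
    for t, token in enumerate(x):               # route each token independently
        logits = gate_w @ token                 # one score per expert
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        top = np.argsort(probs)[-top_k:]        # indices of the chosen experts
        weights = probs[top] / probs[top].sum()
        out[t] = sum(w * experts[i](token) for i, w in zip(top, weights))
    return out


# Four toy "experts": random linear maps over an 8-dimensional token vector.
dim, n_experts = 8, 4
expert_mats = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]
experts = [lambda v, W=W: W @ v for W in expert_mats]
gate_w = rng.normal(size=(n_experts, dim))

tokens = rng.normal(size=(5, dim))              # a five-token "sequence"
print(moe_layer(tokens, experts, gate_w).shape) # (5, 8)
```

Only two of the four experts run for any given token, which is the point: total capacity grows with the number of experts, while per-token compute stays roughly constant.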

The future of AI: finding needles in ever-larger haystacks

The true measure of AI lies not just in its ability to process information, but in its capacity to understand and engage in meaningful conversations. These Needle in a Haystack tests show that Gemini 1.5 Pro and Flash are pushing the boundaries of what's possible, demonstrating that they can navigate even the most complex and lengthy dialogues. It's not just about responding; it's about understanding and connecting across modalities, a giant leap towards AI that feels less like a machine and more like a truly intelligent conversational partner.

Try your own Needle in a Haystack test using Gemini 1.5 Pro’s 2M token context window today on Vertex AI.
