MarkTechPost@AI — September 21, 2024
MMSearch Engine: AI Search with Advanced Multimodal Capabilities to Accurately Process and Integrate Text and Visual Queries for Enhanced Search Results

The MMSearch Engine is an innovative search tool that handles multimodal search queries, integrating text and visual inputs to more accurately understand user intent and deliver more comprehensive results. The engine is built on a three-step pipeline: it first reformulates the query into a form better suited to search engines, then reranks the retrieved results, and finally produces a summary that combines textual and visual information. Compared with traditional search engines, MMSearch handles complex multimodal queries far more effectively, excelling in particular at queries involving both images and text.

🤔 The MMSearch Engine is a multimodal search system that processes both text and visual inputs, enabling it to understand user intent more accurately and return more comprehensive results. Compared with traditional search engines, it handles complex multimodal queries far more effectively, particularly those involving both images and text.

🚀 The MMSearch Engine follows a three-step pipeline: first, it reformulates the query into a form better suited to search engines, for example translating image data into a text query; second, it reranks the retrieved results, prioritizing the websites that offer the most relevant information; finally, it combines textual and visual information into a summary, ensuring the response covers every aspect of the query.

📊 In terms of performance, the MMSearch Engine shows marked improvements on multimodal queries. The researchers evaluated the system on 300 queries spanning 14 subfields, including technology, sports, and finance. MMSearch significantly outperformed Perplexity Pro, a leading commercial AI search engine. For instance, GPT-4o equipped with MMSearch achieved the highest overall score on multimodal search tasks, surpassing Perplexity Pro in end-to-end evaluation, especially on complex image-based queries. Across the 14 subfields, MMSearch handled over 2,900 unique images, ensuring the data provided was relevant and well matched to each query.

🌟 The MMSearch Engine marks a major advance in multimodal search technology. By addressing the limitations of text-only queries and introducing a robust system for handling both textual and visual data, the researchers have delivered a tool that could reshape how AI search engines operate. Its success in processing over 2,900 images and producing accurate results across 300 unique queries demonstrates its potential in both academic and commercial settings. Combining image data with advanced LLM capabilities has yielded significant performance gains, positioning MMSearch as a leading multimodal search engine.

Traditional search engines have predominantly relied on text-based queries, limiting their ability to process and interpret the increasingly complex information found online today. Many modern websites feature both text and images. Yet, the ability of conventional search engines to handle these multimodal queries, those that require an understanding of both visual and textual content, remains lacking. Large Language Models (LLMs) have shown great promise in enhancing the accuracy of textual search results. However, they still fall short when fully addressing queries involving images, videos, or other non-textual media.

One of the major challenges in search technology is bridging the gap between how search engines process textual data and the growing need to interpret visual information. Users today often seek answers that require more than text; they may upload images or screenshots, expecting AI to retrieve relevant content based on these inputs. However, current AI search engines remain text-centric and struggle to grasp the depth of image-text relationships that could improve the quality and relevance of search results. This limitation constrains the effectiveness of such engines and prevents them from delivering cohesive results, particularly in scenarios where visual context is as important as textual content.

Current approaches to multimodal search remain fragmented. While tools like Google Lens can perform rudimentary image searches, they do not efficiently combine image recognition with comprehensive web data searches. The gap between interpreting visual inputs and connecting them with relevant text-based results limits the overall capability of AI-powered search engines. Moreover, the performance of these tools is further constrained by the lack of real-time processing for multimodal queries. Despite the rapid evolution of LLMs, there is still no search engine that can cohesively process both text and images in a unified manner.

A research team from CUHK MMLab, ByteDance, CUHK MiuLar Lab, Shanghai AI Laboratory, Peking University, Stanford University, and Sensetime Research introduced the MMSearch Engine. This new tool transforms the search landscape by empowering any LLM to handle multimodal search queries. Unlike traditional engines, MMSearch incorporates a structured pipeline that processes text and visual inputs simultaneously. The researchers developed this system to optimize how LLMs handle the complexities of multimodal data, thereby improving the accuracy of search results. The MMSearch Engine is built to reprocess user queries, analyze relevant websites, and summarize the most informative responses based on text and images.

The MMSearch Engine is based on a three-step process designed to address the shortcomings of existing tools. First, the engine reformulates queries into a more conducive format for search engines. For example, if a query includes an image, MMSearch translates the visual data into meaningful text queries, making it easier for LLMs to interpret. Second, it reranks the websites that the search engine retrieves, prioritizing those that offer the most relevant information. Finally, the system summarizes the content by integrating visual and textual data, ensuring the response covers all aspects of the query. Notably, this multi-stage interaction ensures a robust search experience for users who require image and text-based results.
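The three-stage pipeline above can be sketched in code. Note that this is a minimal illustrative sketch, not the paper's actual implementation: the `lmm.generate` interface, the prompts, the `Page` fields, and the `search_engine.search` call are all assumed for illustration.

```python
# Illustrative sketch of MMSearch's three-stage pipeline:
# requery -> rerank -> summarize. All interfaces here are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    url: str
    snippet: str     # textual content scraped from the page
    screenshot: str  # path to a page screenshot, so the LMM can "see" the layout


def mmsearch(lmm, search_engine, question: str, image: Optional[str] = None) -> str:
    # Stage 1 -- requery: rewrite the (possibly image-grounded) question
    # into a plain-text query a conventional search engine can handle.
    query = lmm.generate(
        prompt=f"Rewrite this question as a concise web search query:\n{question}",
        images=[image] if image else [],
    )

    # Stage 2 -- rerank: have the LMM judge each candidate result by its
    # snippet and screenshot, and pick the most relevant one.
    candidates = search_engine.search(query, top_k=8)
    listing = "\n".join(f"[{i}] {p.url}: {p.snippet}" for i, p in enumerate(candidates))
    choice = lmm.generate(
        prompt=f"Question: {question}\nPick the index of the most relevant result:\n{listing}",
        images=[p.screenshot for p in candidates],
    )
    best = candidates[int(choice.strip())]

    # Stage 3 -- summarize: answer the question from the chosen page,
    # combining its text with its visual content (and the user's image, if any).
    return lmm.generate(
        prompt=f"Answer '{question}' using this page:\n{best.snippet}",
        images=[best.screenshot] + ([image] if image else []),
    )
```

Because each stage is a separate LMM call, any language model with vision support can be dropped into the `lmm` slot, which is how the framework lets "any LLM" handle multimodal search.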

In terms of performance, the MMSearch Engine demonstrates considerable improvements over existing search tools. The researchers evaluated the system on 300 queries spanning 14 subfields, including technology, sports, and finance. MMSearch performed significantly better than Perplexity Pro, a leading commercial AI search engine. For instance, the MMSearch-enhanced version of GPT-4o achieved the highest overall score in multimodal search tasks. It surpassed Perplexity Pro in an end-to-end evaluation, particularly in its ability to handle complex image-based queries. Across the 14 subfields, MMSearch handled over 2,900 unique images, ensuring that the data provided was relevant and well matched to the query.

The detailed results of the study show that GPT-4o equipped with MMSearch achieved a notable 62.3% overall score in handling multimodal queries. This performance included querying, reranking, and summarizing content based on text and images. The comprehensive dataset, collected from various sources, was designed to exclude any information that could overlap with the LLM’s pre-existing knowledge, ensuring that the evaluation focused purely on the engine’s ability to process new, real-time data. Furthermore, MMSearch outperformed Perplexity Pro in reranking tasks, demonstrating its superior capacity to rank websites based on multimodal content.

In conclusion, the MMSearch Engine represents a significant advancement in multimodal search technology. By addressing the limitations of text-only queries and introducing a robust system for handling both textual and visual data, the researchers have provided a tool that could reshape how AI search engines operate. The system’s success in processing over 2,900 images and generating accurate search results across 300 unique queries showcases its potential in academic and commercial settings. Combining image data with advanced LLM capabilities has led to significant performance improvements, positioning MMSearch as a leading solution for the next generation of AI search engines.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.

