MarkTechPost@AI, January 12
RAG-Check: A Novel AI Framework for Hallucination Detection in Multi-Modal Retrieval-Augmented Generation Systems

RAG-check is a comprehensive method for evaluating multi-modal RAG systems, designed to address the hallucination problem in language models. It comprises three key components and uses two primary evaluation metrics, and performance varies significantly across RAG system configurations.

🎯 RAG-check comprises three components, including a neural network that evaluates the relevancy of retrieved data to the user query

📊 The architecture uses two evaluation metrics: the Relevancy Score and the Correctness Score

💪 Performance differs significantly across RAG system configurations, with GPT-4o superior in several respects

🌟 The RS model raises relevancy scores but increases computational cost

Large Language Models (LLMs) have revolutionized generative AI, showing remarkable capabilities in producing human-like responses. However, these models face a critical challenge known as hallucination: the tendency to generate incorrect or irrelevant information. This issue poses significant risks in high-stakes applications such as medical evaluation, insurance claim processing, and autonomous decision-making, where accuracy is paramount. The hallucination problem extends beyond text-based models to vision-language models (VLMs) that process images and text queries. Despite the development of robust VLMs such as LLaVA, InstructBLIP, and VILA, these systems still struggle to generate accurate responses grounded in image inputs and user queries.

Existing research has introduced several methods to address hallucination in language models. For text-based systems, FactScore improved accuracy by breaking long statements into atomic units for better verification. Lookback Lens introduced an attention-score analysis approach to detect context hallucination, while MARS implemented a weighted system focusing on crucial statement components. For RAG systems specifically, RAGAS and LlamaIndex emerged as evaluation tools, with RAGAS focusing on response accuracy and relevance using human evaluators, while LlamaIndex employs GPT-4 for faithfulness assessment. However, no existing work provides hallucination scores specifically for multi-modal RAG systems, where the context includes multiple pieces of multi-modal data.

Researchers from the University of Maryland, College Park, MD, and NEC Laboratories America, Princeton, NJ, have proposed RAG-check, a comprehensive method to evaluate multi-modal RAG systems. It consists of three key components designed to assess both relevance and accuracy. The first component is a neural network that evaluates the relevancy of each retrieved piece of data to the user query. The second component is an algorithm that segments and categorizes the RAG output into scorable (objective) and non-scorable (subjective) spans. The third component uses another neural network to evaluate the correctness of objective spans against the raw context, which can include both text and images converted to a text-based format through VLMs.
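The three-stage flow described above can be sketched in Python. In the paper the two scoring components are trained neural networks; the functions below are made-up stand-ins (keyword overlap for relevancy, substring matching for correctness) that only illustrate how the components fit together:

```python
# Hypothetical sketch of the three-component RAG-check pipeline.
# The scoring models are neural networks in the paper; here they are
# simple stand-in heuristics so the control flow is runnable.

def relevancy_score(query, context_piece):
    # stand-in for the RS network: 1.0 if query and context share a word
    q_words = set(query.lower().split())
    return 1.0 if q_words & set(context_piece.lower().split()) else 0.0

def segment_spans(answer):
    # stand-in segmenter: one span per sentence, labeled "objective"
    # (scorable) unless it voices an opinion
    spans = [s.strip() for s in answer.split(".") if s.strip()]
    return [(s, "subjective" if "I think" in s else "objective")
            for s in spans]

def correctness_score(span, contexts):
    # stand-in for the CS network: supported if any context contains it
    return 1.0 if any(span.lower() in c.lower() for c in contexts) else 0.0

def rag_check(query, contexts, answer):
    rs = [relevancy_score(query, c) for c in contexts]
    cs = [correctness_score(span, contexts)
          for span, kind in segment_spans(answer) if kind == "objective"]
    return {"RS": sum(rs) / len(rs),
            "CS": sum(cs) / len(cs) if cs else None}

report = rag_check(
    query="what color is the car",
    contexts=["the car is red", "a photo of a dog"],
    answer="the car is red. I think it looks nice.",
)
print(report)
```

Note how the subjective span ("I think it looks nice") is excluded from the correctness average, matching the paper's split between scorable and non-scorable spans.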

The RAG-check architecture uses two primary evaluation metrics, the Relevancy Score (RS) and the Correctness Score (CS), to assess different aspects of RAG system performance. To evaluate selection mechanisms, the system analyzes the relevancy scores of the top 5 retrieved images across a test set of 1,000 questions, providing insight into the effectiveness of different retrieval methods. For context generation, the architecture allows flexible integration of various model combinations: either separate VLMs (such as LLaVA or GPT-4) paired with LLMs (such as LLaMA or GPT-3.5), or unified MLLMs such as GPT-4. This flexibility enables a comprehensive evaluation of different model architectures and their impact on response-generation quality.
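The aggregation step of this protocol — averaging the relevancy of each question's top-5 retrieved images, then averaging over the test set — can be sketched as follows; the per-image scores are made-up numbers for illustration:

```python
import statistics

def mean_top5_relevancy(per_question_scores):
    """Average RS over the top-5 images of each question, then over
    all questions. per_question_scores: list of lists of RS values
    in [0, 1], one inner list per question, best images first."""
    per_q = [statistics.mean(scores[:5]) for scores in per_question_scores]
    return statistics.mean(per_q)

# invented scores for a two-question toy test set
test_set = [
    [0.9, 0.8, 0.7, 0.4, 0.2],   # question 1: top-5 image RS values
    [0.6, 0.5, 0.5, 0.3, 0.1],   # question 2
]
print(mean_top5_relevancy(test_set))  # → 0.5
```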

The evaluation results demonstrate significant performance variations across different RAG system configurations. When CLIP models are used as vision encoders with cosine similarity for image selection, average relevancy scores range from 30% to 41%. Applying the RS model to query-image pairs dramatically improves relevancy scores, to between 71% and 89.5%, though at the cost of a 35-fold increase in computation on an A100 GPU. GPT-4o emerges as the superior configuration for context generation and error rates, outperforming other setups by 20%. The remaining RAG configurations show comparable performance, with accuracy rates between 60% and 68%.
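The baseline selection step — ranking candidate images by cosine similarity between a query embedding and CLIP-style image embeddings — can be sketched with NumPy; the 4-dimensional toy embeddings below are invented for illustration (real CLIP embeddings are 512-dimensional or larger):

```python
import numpy as np

def cosine_top_k(query_emb, image_embs, k=5):
    """Rank images by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                    # cosine similarity per image
    top = np.argsort(-sims)[:k]       # indices of the k best matches
    return top, sims[top]

# toy embeddings: one query, six candidate images
query = np.array([1.0, 0.0, 0.0, 0.0])
images = np.array([
    [1.0, 0.1, 0.0, 0.0],   # near-duplicate of the query
    [0.0, 1.0, 0.0, 0.0],   # orthogonal (irrelevant)
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
])
idx, scores = cosine_top_k(query, images, k=5)
print(idx[0])  # → 0 (the near-duplicate ranks first)
```

The paper's finding is that this cheap similarity ranking is a weak proxy for true relevancy (30-41% RS on average), which is why the dedicated RS model, though roughly 35x more expensive, is worth the cost.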

In conclusion, the researchers proposed RAG-check, a novel evaluation framework for multi-modal RAG systems that addresses the critical challenge of hallucination detection across multiple images and text inputs. The framework's three-component architecture, comprising relevancy scoring, span categorization, and correctness assessment, shows significant improvements in performance evaluation. The results reveal that while the RS model substantially raises relevancy scores, from 41% to 89.5%, it comes with increased computational cost. Among the configurations tested, GPT-4o emerged as the most effective model for context generation, highlighting the potential of unified multi-modal language models for improving RAG system accuracy and reliability.



