arXiv:2508.05201v1 Announce Type: cross Abstract: Hallucination remains a critical challenge for deploying Large Language Models (LLMs) in finance. Accurate extraction and precise calculation from tabular data are essential for reliable financial analysis, since even minor numerical errors can undermine decision-making and regulatory compliance. Financial applications have unique requirements, often relying on context-dependent, numerical, and proprietary tabular data that existing hallucination benchmarks rarely capture. In this study, we develop a rigorous and scalable framework for evaluating intrinsic hallucinations in financial LLMs, in which evaluation is framed as a context-aware masked span prediction task over real-world financial documents. Our main contributions are: (1) a novel, automated dataset creation paradigm using a masking strategy; (2) a new hallucination evaluation dataset derived from S&P 500 annual reports; and (3) a comprehensive evaluation of intrinsic hallucination patterns in state-of-the-art LLMs on financial tabular data. Our work provides a robust methodology for in-house LLM evaluation and serves as a critical step toward building more trustworthy and reliable financial Generative AI systems.
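The abstract frames evaluation as a context-aware masked span prediction task over tabular financial text. The sketch below shows one plausible instantiation of that idea; the mask token, the numeric-span regex, the prompt wording, and the number-normalization rule are all illustrative assumptions rather than the paper's protocol, and `complete` stands in for any prompt-to-completion LLM callable.

```python
"""Hedged sketch of masked-span hallucination evaluation on table rows.
Assumptions (not from the paper): mask token, numeric regex, prompt text,
and normalization; `complete` is any LLM completion function."""

import random
import re
from typing import Callable, List, Optional, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical mask token
# Matches currency, percentages, and parenthesized negatives, e.g. "(6.3%)".
NUMBER_RE = re.compile(r"\(?-?\$?\d[\d,]*(?:\.\d+)?%?\)?")


def mask_numeric_span(text: str, rng: random.Random) -> Optional[Tuple[str, str]]:
    """Replace one randomly chosen numeric span with MASK_TOKEN.

    Returns (masked_text, gold_span), or None if the row has no numbers.
    """
    spans = list(NUMBER_RE.finditer(text))
    if not spans:
        return None
    m = rng.choice(spans)
    return text[: m.start()] + MASK_TOKEN + text[m.end() :], m.group()


def normalize_number(s: str) -> str:
    """Canonicalize formatting so '(1,234.5)' and '-1234.5' compare equal."""
    s = s.strip().replace("$", "").replace(",", "")
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]  # accounting-style negative
    s = s.rstrip("%")
    try:
        return f"{float(s):.6g}"
    except ValueError:
        return s


def evaluate(rows: List[str], complete: Callable[[str], str], seed: int = 0) -> float:
    """Score any prompt->completion callable on masked table rows.

    Returns the fraction of masked spans reproduced exactly after
    normalization; 1 - accuracy is an intrinsic hallucination rate.
    """
    rng = random.Random(seed)
    correct = total = 0
    for row in rows:
        masked = mask_numeric_span(row, rng)
        if masked is None:
            continue
        masked_text, gold = masked
        prompt = (
            "Fill in the masked value using only the surrounding context. "
            f"Reply with the value alone.\n\n{masked_text}"
        )
        total += 1
        correct += normalize_number(complete(prompt)) == normalize_number(gold)
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy rows standing in for annual-report table lines.
    rows = [
        "Revenue | 2022: $4,321.0 | 2023: $4,752.0 | Change: 10.0%",
        "Net income | 2022: $512.0 | 2023: $480.0 | Change: (6.3%)",
    ]
    # A trivial "model" that always answers 0 just exercises the interface.
    print("accuracy:", evaluate(rows, lambda prompt: "0"))
```

Because the gold span is recovered from the source document itself, scoring needs no human labels, which is what makes the masking paradigm scalable to proprietary, in-house corpora.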