Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025

cs.AI updates on arXiv.org 07月22日 12:44

本文介绍了一种针对胃肠道内窥镜视觉问答（VQA）的挑战解决方案，采用Florence模型作为VQA流程的主干，结合视觉编码器和文本编码器来解析内窥镜图像并生成临床相关答案。通过领域特定增强提高泛化能力，实验表明Florence模型在官方挑战指标上表现良好。

arXiv:2507.14544v1 Announce Type: cross Abstract: This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt the Florence model-a large-scale multimodal foundation model-as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the KASVIR dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: https://github.com/TiwariLaxuu/VQA-Florence.git

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签