MarkTechPost@AI, August 9, 2024
MedTrinity-25M: A Comprehensive Multimodal Medical Dataset with Advanced Annotations and Its Impact on Vision-Language Model Performance

MedTrinity-25M is a large-scale multimodal medical dataset containing over 25 million images, covering 10 modalities and 65 diseases. It uses an automated pipeline to generate detailed annotations from unpaired images, leveraging expert models and advanced MLLMs. The dataset's rich multi-granular annotations support a variety of tasks, including captioning, report generation, and classification. Models pretrained on MedTrinity-25M achieved state-of-the-art results on VQA tasks, highlighting its effectiveness for training multimodal medical AI models.

🤔 MedTrinity-25M is a large-scale multimodal medical dataset containing over 25 million images, covering 10 modalities and 65 diseases. It uses an automated pipeline to generate detailed annotations from unpaired images, leveraging expert models and advanced MLLMs. Its rich multi-granular annotations support a variety of tasks, including captioning, report generation, and classification.

🔍 The MedTrinity-25M dataset generates its annotations with an automated pipeline, avoiding the reliance on paired image-text data that limited earlier datasets. The pipeline leverages expert models and advanced MLLMs to produce detailed multi-granular annotations for each image, including the imaging modality, disease type, and bounding boxes and segmentation masks for regions of interest (ROIs).

📈 Models pretrained on MedTrinity-25M achieved state-of-the-art results on VQA tasks, highlighting the dataset's effectiveness for training multimodal medical AI models. The researchers evaluated LLaVA-Med++ on biomedical visual question answering (VQA) using the VQA-RAD, SLAKE, and PathVQA datasets; LLaVA-Med++ pretrained on MedTrinity-25M outperformed the baseline model by approximately 10.75%, 6.1%, and 13.25% on VQA-RAD, SLAKE, and PathVQA, respectively.

🚀 The release of MedTrinity-25M provides a valuable data resource for medical AI research, helping the development of more advanced multimodal models for a range of medical vision-language tasks such as medical imaging diagnosis, pathology analysis, and drug discovery.

💡 As medical AI develops rapidly, high-quality medical datasets are essential for training powerful and reliable models. The arrival of MedTrinity-25M opens up new possibilities for medical AI research and lays a foundation for future advances in medical image analysis and diagnostic technology.

Large-scale multimodal foundation models have achieved notable success in understanding complex visual patterns and natural language, generating interest in their application to medical vision-language tasks. Progress has been made by creating medical datasets with image-text pairs and fine-tuning general domain models on these datasets. However, these datasets have limitations. They lack multi-granular annotations that link local and global information within medical images, which is crucial for identifying specific lesions from regional details. Additionally, current methods for constructing these datasets rely heavily on pairing medical images with reports or captions, limiting their scalability.

Researchers from UC Santa Cruz, Harvard University, and Stanford University have introduced MedTrinity-25M, a large-scale multimodal medical dataset containing over 25 million images across ten modalities. This dataset includes detailed multi-granular annotations for more than 65 diseases, encompassing global information like disease type and modality and local annotations such as bounding boxes and segmentation masks for regions of interest (ROIs). Using an automated pipeline, the researchers generated these comprehensive annotations without relying on paired text descriptions, enabling advanced multimodal tasks and supporting large-scale pretraining of medical AI models.

Medical multimodal foundation models have seen growing interest due to their ability to understand complex visual and textual features, leading to advances in medical vision-language tasks. Models like Med-Flamingo and Med-PaLM have been fine-tuned on medical datasets to enhance their performance, but the scale of available training data often limits these models. To address this, researchers have focused on constructing large medical datasets; however, datasets like MIMIC-CXR and RadGenome-Chest CT are constrained by the labor-intensive process of pairing images with detailed textual descriptions. In contrast, the MedTrinity-25M dataset uses an automated pipeline to generate comprehensive multi-granular annotations for unpaired images, offering a significantly larger and more detailed dataset.

The MedTrinity-25M dataset features over 25 million images organized into triplets of {image, ROI, description}. Images span ten modalities and cover 65 diseases, sourced from repositories like TCIA and Kaggle. ROIs are highlighted with masks or bounding boxes, pinpointing abnormalities or key anatomical features. Multi-granular textual descriptions detail the image modality, disease, and ROI specifics. The dataset construction involves generating coarse captions, identifying ROIs with models like SAT and BA-Transformer, and leveraging medical knowledge for accurate descriptions. MedTrinity-25M stands out for its scale, diversity, and detailed annotations compared to other datasets.
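To make the triplet structure concrete, here is a minimal Python sketch of what one {image, ROI, description} record and its assembly might look like. The dataclass fields, function names, and returned values are illustrative assumptions, not the actual MedTrinity-25M schema or pipeline code; the placeholder functions only stand in for the expert models, the ROI localizers (e.g., SAT, BA-Transformer), and the MLLM described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ROI:
    """A region of interest: a bounding box and/or a path to a segmentation mask."""
    bbox: Optional[Tuple[float, float, float, float]] = None  # (x_min, y_min, x_max, y_max)
    mask_path: Optional[str] = None


@dataclass
class Triplet:
    """One {image, ROI, description} record in the spirit of MedTrinity-25M (hypothetical schema)."""
    image_path: str
    modality: str                  # e.g. "MRI", "CT", "histopathology"
    disease: str                   # global disease label
    rois: List[ROI] = field(default_factory=list)
    description: str = ""          # multi-granular text: modality, disease, ROI details


# Placeholders standing in for the real pipeline stages -- not actual model APIs.
def coarse_caption(image_path: str) -> Tuple[str, str]:
    """Expert models / source metadata would infer modality and disease here."""
    return "MRI", "glioma"


def localize_rois(image_path: str) -> List[ROI]:
    """ROI localization; the paper mentions models such as SAT and BA-Transformer."""
    return [ROI(bbox=(48.0, 60.0, 112.0, 130.0))]


def generate_description(modality: str, disease: str, rois: List[ROI]) -> str:
    """An MLLM would write the multi-granular description from the image, ROIs, and medical knowledge."""
    return (f"{modality} image with findings consistent with {disease}; "
            f"{len(rois)} region(s) of interest highlighted.")


def build_triplet(image_path: str) -> Triplet:
    """Assemble one record: coarse caption -> ROI localization -> description."""
    modality, disease = coarse_caption(image_path)
    rois = localize_rois(image_path)
    return Triplet(image_path, modality, disease, rois,
                   generate_description(modality, disease, rois))


if __name__ == "__main__":
    print(build_triplet("example_scan_0001.png"))
```

Loading the released dataset would of course go through whatever format and loader the authors publish; this sketch only illustrates how local (ROI-level) and global (image-level) information can live in a single record.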

The study evaluated LLaVA-Med++ on biomedical Visual Question Answering (VQA) tasks using the VQA-RAD, SLAKE, and PathVQA datasets to assess the impact of pretraining on MedTrinity-25M. Initial pretraining followed LLaVA-Med’s methodology, with additional fine-tuning on the VQA datasets for three epochs. Results show that LLaVA-Med++ with MedTrinity-25M pretraining outperforms the baseline model by approximately 10.75% on VQA-RAD, 6.1% on SLAKE, and 13.25% on PathVQA. It achieves state-of-the-art results on two of the three benchmarks and ranks third on the remaining one, demonstrating significant performance improvements from MedTrinity-25M pretraining.
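Improvements like these can be read either as absolute percentage-point differences or as gains relative to the baseline score. The short snippet below, using purely hypothetical accuracy values rather than the paper's reported numbers, shows how the two readings differ.

```python
# Hypothetical baseline / pretrained accuracies, for illustration only;
# these are NOT the scores reported in the paper.
results = {
    # benchmark: (baseline_accuracy, pretrained_accuracy)
    "VQA-RAD": (0.60, 0.70),
    "SLAKE":   (0.80, 0.85),
    "PathVQA": (0.50, 0.60),
}

for benchmark, (baseline, pretrained) in results.items():
    absolute_gain = (pretrained - baseline) * 100              # percentage points
    relative_gain = (pretrained - baseline) / baseline * 100   # percent of the baseline score
    print(f"{benchmark}: +{absolute_gain:.1f} points absolute, +{relative_gain:.1f}% relative")
```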

The study presents MedTrinity-25M, a vast multimodal medical dataset with over 25 million image-ROI-description triplets from 90 sources, spanning ten modalities and covering over 65 diseases. Unlike previous methods reliant on paired image-text data, MedTrinity-25M is created using an automated pipeline that generates detailed annotations from unpaired images, leveraging expert models and advanced MLLMs. The dataset’s rich multi-granular annotations support a variety of tasks, including captioning, report generation, and classification. LLaVA-Med++, pretrained on MedTrinity-25M, achieved state-of-the-art results in VQA tasks, highlighting the dataset’s effectiveness for training multimodal medical AI models.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.



