MarkTechPost@AI · February 19
ViLa-MIL: Enhancing Whole Slide Image Classification with Dual-Scale Vision-Language Multiple Instance Learning

The ViLa-MIL model combines vision-language models with multiple instance learning to significantly improve whole slide image (WSI) classification, particularly in pathology. The model uses a large language model to generate pathology-specific descriptive text prompts and extracts features at two scales: the low scale captures global tumor structure, while the high scale captures fine cellular detail. In addition, a prototype-guided patch decoder progressively aggregates patch features by clustering similar patches, reducing computational complexity and improving feature representation. Experiments show that ViLa-MIL outperforms existing MIL and VLM methods across multiple cancer subtyping datasets, with particularly strong results in few-shot settings, marking a notable step forward for AI-assisted cancer diagnosis.

🔬 ViLa-MIL adopts dual-scale vision-language multiple instance learning with pathology-specific descriptive text prompts to efficiently transfer vision-language model knowledge to digital pathology.

🧬 The model uses a frozen GPT-3.5 large language model to generate class-specific descriptive prompts at two scales, combined with learnable vectors for effective feature representation. Low-scale prompts highlight global tumor structures, while high-scale prompts highlight finer cellular details, improving feature discrimination.

🧮 A prototype-guided patch decoder progressively aggregates patch features by clustering similar patches into learnable prototype vectors, minimizing computational complexity and improving feature representation. A context-guided text decoder further refines the text descriptions using multi-granular image context, enabling more effective fusion of the visual and textual modalities.

📊 Experiments show that ViLa-MIL delivers significant gains in AUC, F1 score, and accuracy on the TIHD-RCC, TCGA-RCC, and TCGA-Lung datasets, demonstrating robustness in both single-center and multi-center testing. Compared with state-of-the-art methods, it improves AUC by 1.7% to 7.2% and F1 score by 2.1% to 7.3%.

Whole Slide Image (WSI) classification in digital pathology presents several critical challenges due to the immense size and hierarchical nature of WSIs. WSIs contain billions of pixels, so processing them directly is computationally infeasible. Current approaches based on multiple instance learning (MIL) perform well but depend heavily on large amounts of bag-level annotated data, which is difficult to acquire, particularly for rare diseases. Moreover, they rely solely on image features and generalize poorly under data-distribution shifts across hospitals. Recent advances in Vision-Language Models (VLMs) introduce linguistic priors through large-scale pretraining on image-text pairs; however, existing approaches fail to capture domain-specific pathological knowledge. The computational cost of pretrained models and their poor fit with the hierarchical structure of pathology slides are additional setbacks. Overcoming these challenges is essential for advancing AI-based cancer diagnosis and reliable WSI classification.

MIL-based methods generally adopt a three-stage pipeline: patch cropping from WSIs, feature extraction with a pre-trained encoder, and patch-level to slide-level feature aggregation to make predictions. Although these methods are effective for pathology-related tasks such as cancer subtyping and staging, their dependence on large annotated datasets and sensitivity to data-distribution shifts limit their practicality. VLM-based models such as CLIP and BiomedCLIP exploit language priors by training on large-scale image-text pairs gathered from online databases. These models, however, depend on general, handcrafted text prompts that lack the subtlety of pathological diagnosis. In addition, transferring vision-language knowledge to WSIs is inefficient because of their hierarchical, gigapixel-scale nature, which imposes prohibitive computational costs and requires dataset-specific fine-tuning.
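
To make the three-stage MIL pipeline concrete, here is a minimal sketch of a generic attention-based aggregator in PyTorch. The feature dimension, attention design, and class count are illustrative assumptions, not any specific published model.

```python
# Minimal sketch of the standard three-stage MIL pipeline described above
# (patch cropping -> frozen encoder features -> attention-based aggregation).
import torch
import torch.nn as nn


class AttentionMIL(nn.Module):
    """Attention-weighted pooling of patch features into a slide-level prediction."""

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256, num_classes: int = 3):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_patches, feat_dim), extracted by a frozen encoder such as ResNet-50.
        weights = torch.softmax(self.attn(patch_feats), dim=0)   # (num_patches, 1)
        slide_feat = (weights * patch_feats).sum(dim=0)          # (feat_dim,)
        return self.classifier(slide_feat)                       # (num_classes,)


# Usage: one "bag" of patch features per slide, supervised only by the slide-level label.
logits = AttentionMIL()(torch.randn(500, 1024))
```
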

Researchers from Xi’an Jiaotong University, Tencent YouTu Lab, and the Institute of High-Performance Computing Singapore introduce ViLa-MIL, a dual-scale vision-language multiple instance learning model that efficiently transfers vision-language model knowledge to digital pathology through pathology-specific descriptive text prompts and trainable decoders for the image and text branches. In contrast to the generic class-name prompts used by traditional vision-language methods, the model uses a frozen large language model to generate domain-specific descriptions at two resolutions. The low-scale prompt highlights global tumor structures, and the high-scale prompt highlights finer cellular details, improving feature discrimination. A prototype-guided patch decoder progressively aggregates patch features by clustering similar patches into learnable prototype vectors, minimizing computational complexity and improving feature representation. A context-guided text decoder further refines the text descriptions using multi-granular image context, facilitating a more effective fusion of the visual and textual modalities.
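
The text branch can be pictured as follows: a frozen LLM supplies one descriptive sentence per class at each scale, and learnable context vectors are joined with the frozen prompt-token embeddings before the CLIP text encoder. This is a hedged sketch; the prompt wording, number of context vectors, class names, and the `PromptLearner` interface are assumptions for illustration, not ViLa-MIL's released code.

```python
import torch
import torch.nn as nn

CLASSES = ["clear cell RCC", "papillary RCC", "chromophobe RCC"]

# Descriptive prompts a frozen LLM might return; low scale = global tumor structure,
# high scale = finer cellular detail (placeholders, not the paper's actual prompts).
LOW_SCALE_PROMPTS = {c: f"a low-magnification slide of {c} showing overall tumor architecture" for c in CLASSES}
HIGH_SCALE_PROMPTS = {c: f"a high-magnification slide of {c} showing cytoplasmic and nuclear detail" for c in CLASSES}


class PromptLearner(nn.Module):
    """Learnable context vectors prepended to frozen prompt-token embeddings."""

    def __init__(self, token_embeds: torch.Tensor, num_ctx: int = 4, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_ctx, dim) * 0.02)   # trainable
        self.register_buffer("tokens", token_embeds)                # frozen: (num_classes, seq_len, dim)

    def forward(self) -> torch.Tensor:
        n = self.tokens.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n, -1, -1)
        # Prepend the shared learnable context to each class's frozen prompt tokens.
        return torch.cat([ctx, self.tokens], dim=1)                 # (num_classes, num_ctx + seq_len, dim)


# Example: frozen token embeddings for 3 classes, 16 tokens each, dim 512;
# the result would be fed to the frozen CLIP text encoder.
prompt_embeds = PromptLearner(token_embeds=torch.randn(3, 16, 512))()   # (3, 20, 512)
```
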

The proposed model builds on CLIP and adds several components to adapt it to pathology tasks. Whole slide images are cropped into patches at 5× and 10× magnification, and features are extracted with a frozen ResNet-50 image encoder. A frozen GPT-3.5 language model generates class-specific descriptive prompts for the two scales, combined with learnable vectors to support effective feature representation. Patch features are progressively aggregated with a set of 16 learnable prototype vectors. Multi-granular patch and prototype features, in turn, refine the text embeddings, improving cross-modal alignment. Training uses a cross-entropy loss over equally weighted low- and high-scale similarity scores for robust classification.
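
A rough picture of the image branch and training objective follows, based on the details above (16 learnable prototypes, equally weighted low- and high-scale similarity scores under a cross-entropy loss). The cross-attention style of clustering, the temperature, and the shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeDecoder(nn.Module):
    """Aggregate patch features into a fixed set of learnable prototype vectors."""

    def __init__(self, dim: int = 512, num_prototypes: int = 16):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (1, num_patches, dim); prototypes act as queries over the patches.
        q = self.prototypes.unsqueeze(0)
        out, _ = self.cross_attn(q, patch_feats, patch_feats)      # (1, 16, dim)
        return out.mean(dim=1)                                     # slide-level feature: (1, dim)


def dual_scale_loss(low_img, high_img, low_txt, high_txt, label, tau: float = 0.07):
    # Cosine-similarity logits between the slide feature and each class's text feature.
    low = F.normalize(low_img, dim=-1) @ F.normalize(low_txt, dim=-1).T / tau
    high = F.normalize(high_img, dim=-1) @ F.normalize(high_txt, dim=-1).T / tau
    logits = 0.5 * (low + high)                                    # equal weighting of the two scales
    return F.cross_entropy(logits, label)


# Example: 500 low-scale and 2,000 high-scale patch features for one slide, 3 classes.
dec_low, dec_high = PrototypeDecoder(), PrototypeDecoder()
loss = dual_scale_loss(
    dec_low(torch.randn(1, 500, 512)), dec_high(torch.randn(1, 2000, 512)),
    low_txt=torch.randn(3, 512), high_txt=torch.randn(3, 512),
    label=torch.tensor([0]),
)
```
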

The method performs strongly across several cancer subtyping datasets, significantly outperforming current MIL-based and VLM-based methods in few-shot learning scenarios. It records substantial gains in AUC, F1 score, and accuracy on three diverse datasets (TIHD-RCC, TCGA-RCC, and TCGA-Lung), demonstrating its robustness in both single-center and multi-center evaluations. Compared with state-of-the-art approaches, it improves AUC by 1.7% to 7.2% and F1 score by 2.1% to 7.3%. The combination of dual-scale text prompts, a prototype-guided patch decoder, and a context-guided text decoder helps the framework learn discriminative morphological patterns even with few training instances. Strong generalization across datasets also suggests improved robustness to domain shift in cross-center testing. These results demonstrate the value of combining vision-language models with pathology-specific design for whole slide image classification.

By developing a new dual-scale vision-language learning framework, this research makes a substantial contribution to WSI classification through LLM-generated text prompts and prototype-based feature aggregation. The method enhances few-shot generalization, reduces computational cost, and promotes interpretability, addressing core challenges in pathology AI. By demonstrating effective vision-language model transfer to digital pathology, the work is a valuable contribution to AI-assisted cancer diagnosis, with the potential to generalize to other medical imaging tasks.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.

