MarkTechPost@AI September 13, 2024
MedUnA: Efficient Medical Image Classification through Unsupervised Adaptation of Vision-Language Models

MedUnA is an unsupervised adaptation method for medical image classification that leverages the capabilities of vision-language models (VLMs), achieving efficient classification by aligning textual descriptions with images. MedUnA uses two-stage training: a pre-trained large language model (LLM) first generates textual descriptions, and an adapter is then trained to align textual and visual information, ultimately enabling unsupervised learning. MedUnA adapts effectively to medical image classification tasks, reduces reliance on labeled data, and improves scalability.

📑 **MedUnA uses two-stage training:** First, a pre-trained large language model (LLM) generates textual descriptions, and a text encoder embeds these descriptions into a vector space. A cross-modal adapter is then trained by minimizing the cross-entropy between the generated logits and the ground-truth labels. In the second stage, the adapter is trained further in an unsupervised manner, with weakly and strongly augmented versions of the input image processed by two separate branches. The strong branch uses a learnable prompt vector, and training minimizes the discrepancy between the outputs of the two branches.

📢 **MedUnA significantly improves classification accuracy:** Experiments on five public medical datasets show that MedUnA outperforms zero-shot MedCLIP in most cases, with particularly strong results on diseases such as tuberculosis, diabetic retinopathy, and skin cancer.

📣 **MedUnA reduces reliance on labeled data:** Unlike traditional approaches that require extensive pre-training data, MedUnA exploits the existing alignment between visual and textual embeddings, avoiding large-scale pre-training. It trains an adapter and prompt vector on unlabeled images and automatically generated LLM descriptions, achieving efficient classification performance.

📤 **MedUnA is scalable:** The method adapts effectively to a wide range of medical image classification tasks and can be extended easily to new datasets and diseases.

📥 **MedUnA improves the interpretability of medical image classification:** t-SNE plots show that MedUnA produces more distinct clusters, indicating that the method separates disease categories more clearly and thereby improves both classification accuracy and interpretability.

Supervised learning in medical image classification faces challenges due to the scarcity of labeled data, as expert annotations are difficult to obtain. Vision-Language Models (VLMs) address this issue by leveraging visual-text alignment, allowing unsupervised learning, and reducing reliance on labeled data. Pre-training on large medical image-text datasets enables VLMs to generate accurate labels and captions, lowering annotation costs. Active learning prioritizes key samples for expert annotation, while transfer learning fine-tunes pre-trained models on specific medical datasets. VLMs also generate synthetic images and annotations, enhancing data diversity and model performance in medical imaging tasks.

Researchers from Mohamed Bin Zayed University of AI and Inception Institute of AI propose MedUnA, a Medical Unsupervised Adaptation method for image classification. MedUnA employs two-stage training: Adapter Pre-training using text descriptions generated by an LLM aligned with class labels, followed by Unsupervised Learning. The adapter integrates with MedCLIP’s visual encoder, utilizing entropy minimization to align visual and text embeddings. MedUnA addresses the modality gap between textual and visual data, improving classification performance without extensive pre-training. This method efficiently adapts vision-language models for medical tasks, reducing reliance on labeled data and enhancing scalability.
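
To make the adapter idea concrete, the following is a minimal PyTorch sketch of a lightweight cross-modal adapter that maps frozen visual-encoder features into the text-embedding space and scores them against class text embeddings. The class and parameter names (`CrossModalAdapter`, `visual_dim`, `text_dim`) are illustrative assumptions, not MedUnA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAdapter(nn.Module):
    """Lightweight adapter that projects frozen visual features into the
    shared text-embedding space and scores them against class text embeddings."""

    def __init__(self, visual_dim: int, text_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, feats: torch.Tensor, class_text_embeds: torch.Tensor) -> torch.Tensor:
        # feats: (B, visual_dim) features from the frozen visual encoder
        # class_text_embeds: (C, text_dim) embeddings of LLM-generated class descriptions
        proj = F.normalize(self.net(feats), dim=-1)
        text = F.normalize(class_text_embeds, dim=-1)
        return proj @ text.t()  # cosine-similarity logits, shape (B, C)
```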

A common method for using VLMs in medical imaging involves extensive pre-training on large datasets, followed by fine-tuning for tasks like classification, segmentation, and report generation. Unlike these resource-intensive strategies, MedUnA leverages the existing alignment between visual and textual embeddings to avoid large-scale pre-training. It uses unlabeled images and auto-generated descriptions from an LLM for disease categories. A lightweight adapter and prompt vector are trained to minimize self-entropy, ensuring confident performance across multiple data augmentations. MedUnA offers improved efficiency and performance without the need for extensive pre-training.
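
As a rough illustration of the self-entropy objective mentioned above, the snippet below averages the class probabilities predicted for several augmented views of the same image and penalizes high entropy in that averaged prediction; the exact formulation and weighting used in MedUnA may differ.

```python
import torch
import torch.nn.functional as F

def self_entropy_loss(logits_per_view: torch.Tensor) -> torch.Tensor:
    """logits_per_view: (num_views, batch, num_classes) logits from augmented views."""
    # Average the softmax predictions across augmentations of the same image.
    probs = F.softmax(logits_per_view, dim=-1).mean(dim=0)
    # Penalize uncertain (high-entropy) average predictions.
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=-1)
    return entropy.mean()
```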

The methodology consists of two stages: adapter pre-training and unsupervised training. In Stage 1, textual descriptions for each class are generated using an LLM and embedded via a text encoder. A cross-modal adapter is trained by minimizing cross-entropy between the generated logits and ground-truth labels. In Stage 2, the adapter is further trained using medical images in an unsupervised manner, with weak and strong augmentations of the input passed through two branches. The strong branch uses a learnable prompt, and training minimizes the difference between the outputs of the two branches. Inference is performed using the optimized strong branch.
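
The sketch below outlines one possible shape of the two stages, reusing the `CrossModalAdapter` from the earlier snippet, a frozen `visual_encoder`, and a learnable prompt that is simply added to the class text embeddings in the strong branch. It is a simplified approximation under those assumptions, not the authors' code; at inference only the optimized strong branch would be used, consistent with the description above.

```python
import torch
import torch.nn.functional as F

def stage1_step(adapter, desc_embeds, class_text_embeds, labels, optimizer):
    """Stage 1: adapter pre-training on LLM-generated class descriptions (no images).
    desc_embeds and class_text_embeds are assumed to share the VLM's embedding dimension."""
    logits = adapter(desc_embeds, class_text_embeds)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def stage2_step(visual_encoder, adapter, learnable_prompt, class_text_embeds,
                weak_imgs, strong_imgs, optimizer):
    """Stage 2: unsupervised training on weak/strong augmentations of unlabeled images."""
    # Weak branch: frozen pass that provides soft pseudo-targets.
    with torch.no_grad():
        weak_probs = F.softmax(
            adapter(visual_encoder(weak_imgs), class_text_embeds), dim=-1)
    # Strong branch: the learnable prompt perturbs the class text embeddings.
    prompted_text = class_text_embeds + learnable_prompt  # broadcasts over classes
    strong_logits = adapter(visual_encoder(strong_imgs), prompted_text)
    # Pull the strong branch's predictions toward the weak branch's.
    loss = F.kl_div(F.log_softmax(strong_logits, dim=-1), weak_probs,
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```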

The experiments tested the proposed method on five public medical datasets covering diseases such as tuberculosis, pneumonia, diabetic retinopathy, and skin cancer. Text descriptions for the classes in each dataset were generated using GPT-3.5 and other language models and then fed into a text classifier. The method was evaluated with both CLIP and MedCLIP visual encoders, with MedCLIP performing better overall. Unsupervised learning was used to generate pseudo-labels for unlabeled images, and models were trained with the SGD optimizer. Results showed that MedUnA achieved superior accuracy compared to baseline models.

The study analyzes the experimental results, highlighting the performance of MedUnA compared to other methods like CLIP, MedCLIP, LaFTer, and TPT. MedUnA demonstrates notable accuracy improvements on several medical datasets, particularly outperforming zero-shot MedCLIP in most cases. Minimal improvement is observed on the Pneumonia dataset due to MedCLIP’s pre-training. Additionally, t-SNE plots indicate that MedUnA produces more distinct clustering, enhancing classification precision. The correlation between text classifier accuracy from various LLMs and MedUnA’s performance is also explored, along with an ablation study on the impact of different loss functions.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.

