MarkTechPost@AI November 15, 2024
Microsoft Released LLM2CLIP: A New AI Technique in which a LLM Acts as a Teacher for CLIP’s Visual Encoder

LLM2CLIP is a new AI technique that integrates a large language model (LLM) into the CLIP model to strengthen its visual representation learning. The method replaces CLIP's original text encoder and uses the LLM's rich knowledge to enhance CLIP's visual encoder, significantly improving performance on cross-modal tasks such as image-text retrieval. The researchers apply a caption contrastive fine-tuning technique that markedly improves the LLM's ability to discriminate between image captions, and ultimately build a powerful cross-modal model. LLM2CLIP is computationally efficient and low-cost, achieves strong results on multiple benchmarks, and demonstrates the great potential of combining LLMs with image-text models.

🤔 The core of LLM2CLIP is to integrate a large language model (LLM) into CLIP, replacing CLIP's original text encoder so that the LLM's rich knowledge can enhance CLIP's visual encoder.

💡 To address the problem that an LLM used as CLIP's text encoder struggles to distinguish image captions, the researchers propose a caption contrastive fine-tuning technique that significantly improves the LLM's caption discrimination ability.

🚀 Models fine-tuned with LLM2CLIP excel at tasks such as image-to-text and text-to-image retrieval, surpassing existing state-of-the-art models such as standard CLIP and EVA.

📈 LLM2CLIP effectively boosts CLIP's performance, improving the previous state-of-the-art EVA02 model by 16.5% on both long-text and short-text retrieval tasks.

🔄 LLM2CLIP can also be trained from scratch on datasets such as Laion-2B and Recaption-1B to obtain better results and performance, providing a new baseline for CLIP training and its wide range of applications.

In today’s world, CLIP is one of the most important multimodal foundation models. It maps visual and textual signals into a shared feature space using a simple contrastive learning loss on large-scale image-text pairs. As a retriever, CLIP supports many tasks, including zero-shot classification, detection, segmentation, and image-text retrieval. As a feature extractor, it has become dominant in virtually all cross-modal representation tasks, such as image understanding, video understanding, and text-to-image/video generation. Its strength comes mainly from its ability to connect images with natural language and to capture human knowledge, since, unlike plain vision encoders, it is trained on large-scale web data with detailed text descriptions. As large language models (LLMs) develop rapidly, the boundaries of language comprehension and generation are continually being pushed. Their strong text skills could help CLIP handle long, complex captions, a known weakness of the original CLIP, and their broad knowledge from large text corpora could make training more effective. However, although LLMs understand text well, they are trained for generation, so these abilities stay hidden in their internal states and their output features do not clearly separate different captions.
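For readers unfamiliar with how CLIP's objective works, here is a minimal PyTorch sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of matched image-text pairs. The feature tensors stand in for the outputs of the two encoders; this is an illustrative sketch, not CLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_feats, text_feats: (batch, dim) embeddings from the two encoders.
    Matching pairs share the same row index; all other rows act as negatives.
    """
    # L2-normalize so the dot product becomes cosine similarity.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (batch, batch) similarity matrix scaled by the temperature.
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```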

Recent developments have extended CLIP to handle other modalities, and its influence in the field keeps growing. New models like Llama3 have been used to extend CLIP’s caption length and improve its performance by leveraging the open-world knowledge of LLMs. However, incorporating LLMs into CLIP is not straightforward: in multiple experiments, directly using an LLM as CLIP’s text encoder led to reduced performance. Certain challenges therefore have to be overcome before the potential benefits of LLMs can be realized in CLIP.

Researchers from Tongji University and Microsoft Corporation conducted a detailed study and proposed LLM2CLIP, an approach that enhances visual representation learning by integrating large language models (LLMs). The method takes the straightforward but bold step of replacing CLIP’s original text encoder so that the LLM’s extensive knowledge can enhance CLIP’s visual encoder. It identifies the key obstacles of this idea and proposes a cost-effective fine-tuning strategy to overcome them.
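As a rough illustration of the core idea (not the authors' released code), the sketch below replaces CLIP's text tower with a frozen LLM whose pooled hidden states are passed through a small trainable projection into CLIP's embedding space. The model name, the mean pooling, and the adapter shape are assumptions made for the example.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LLMTextEncoder(nn.Module):
    """Illustrative text tower: a frozen LLM plus a trainable projection
    that maps pooled hidden states into CLIP's embedding space."""

    def __init__(self, llm_name="meta-llama/Meta-Llama-3-8B", clip_dim=768):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(llm_name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.llm = AutoModel.from_pretrained(llm_name)
        for p in self.llm.parameters():      # keep the LLM frozen
            p.requires_grad = False
        self.proj = nn.Linear(self.llm.config.hidden_size, clip_dim)

    def forward(self, captions):
        batch = self.tokenizer(captions, padding=True, truncation=True,
                               return_tensors="pt")
        hidden = self.llm(**batch).last_hidden_state          # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)         # mean-pool over tokens
        return self.proj(pooled)                              # (B, clip_dim)
```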

The LLM2CLIP method effectively improved the CLIP model by integrating large language models (LLMs) like Llama. Initially, LLMs struggled as text encoders for CLIP due to their inability to clearly distinguish image captions. Researchers introduced the caption contrastive fine-tuning technique to address this, greatly improving the LLM’s ability to separate captions. This fine-tuning led to a substantial performance boost, surpassing existing state-of-the-art models. The LLM2CLIP framework combined the improved LLM with the pretrained CLIP visual encoder, creating a powerful cross-modal model. The method used large LLMs but remained computationally efficient with minimal added costs.
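A hedged sketch of what such caption contrastive fine-tuning could look like: two captions of the same image are treated as a positive pair and captions of other images as negatives, and the LLM's caption embeddings are trained to pull positives together. The loss below is illustrative and not necessarily the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(emb_a, emb_b, temperature=0.05):
    """Contrastive loss between two caption views of the same images.

    emb_a, emb_b: (batch, dim) LLM embeddings of two different captions
    (e.g. an original and a rewritten caption) for the same image.
    Row i of emb_a should match row i of emb_b; other rows are negatives.
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy so both caption views learn to align.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```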

The experiments mainly focused on fine-tuning models for better image-text matching using datasets like CC-3M. For LLM2CLIP fine-tuning, three dataset sizes were tested: small (CC-3M), medium (CC-3M and CC-12M), and large (CC-3M, CC-12M, YFCC-15M, and Recaption-1B). Training with augmented captions improved performance, while using an untrained language model for CLIP worsened it. Models trained with LLM2CLIP outperformed standard CLIP and EVA in tasks like image-to-text and text-to-image retrieval, highlighting the advantage of integrating large language models with image-text models. 
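For context, image-to-text and text-to-image retrieval are usually scored with Recall@K over the cross-modal similarity matrix. The helper below shows one common way to compute it; it is an assumption about the evaluation setup, which the article does not spell out.

```python
import torch

def recall_at_k(image_feats, text_feats, k=1):
    """Recall@K for matched (image_i, text_i) pairs, in both directions.

    image_feats, text_feats: (N, dim) L2-normalized embeddings where the
    i-th image and i-th text form the ground-truth pair.
    """
    sims = image_feats @ text_feats.t()            # (N, N) cosine similarities
    gt = torch.arange(sims.size(0), device=sims.device)

    # Image -> text: does the correct caption appear in the top-k?
    i2t = (sims.topk(k, dim=1).indices == gt[:, None]).any(dim=1).float().mean()
    # Text -> image: same check on the transposed similarity matrix.
    t2i = (sims.t().topk(k, dim=1).indices == gt[:, None]).any(dim=1).float().mean()
    return i2t.item(), t2i.item()
```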

The method directly boosted the performance of the previous SOTA EVA02 model by 16.5% on both long-text and short-text retrieval tasks, transforming a CLIP model trained solely on English data into a state-of-the-art cross-lingual model. After integrating multimodal training with models like Llava 1.5, it performed better than CLIP on almost all benchmarks, showing significant overall improvements in performance.

In conclusion, the proposed method allows LLMs to assist in CLIP training. By adjusting parameters such as data distribution, caption length, or categories, the LLM can be tailored to fix CLIP’s limitations, letting it act as a more comprehensive teacher for a variety of tasks. In the proposed work, the LLM’s gradients were frozen during fine-tuning to maintain a large batch size for CLIP training. In future work, LLM2CLIP can be trained from scratch on datasets like Laion-2B and Recaption-1B for better results and performance. This work can serve as a baseline for future research on CLIP training and its wide range of applications.
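One simple way to realize the frozen-LLM detail is to run the frozen LLM over all captions once, cache the resulting features, and then train CLIP's visual encoder against the cached embeddings with a large batch size. The sketch below assumes the hypothetical LLMTextEncoder from the earlier example and is not the released training code.

```python
import torch

@torch.no_grad()
def precompute_text_features(llm_encoder, captions, batch_size=256):
    """Run the frozen LLM once over all captions and cache the features,
    so CLIP training can use a large batch size without LLM backprop."""
    feats = []
    for i in range(0, len(captions), batch_size):
        feats.append(llm_encoder(captions[i:i + batch_size]).cpu())
    return torch.cat(feats)
```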


Check out the Paper, Code, and Model on Hugging Face. All credit for this research goes to the researchers of this project.


