MarkTechPost@AI October 18, 2024
Researchers at Stanford University Propose Locality Alignment: A New Post-Training Stage for Vision Transformers (ViTs)

Vision-Language Models struggle with spatial reasoning tasks. Researchers at Stanford University propose a new solution called Locality Alignment: a post-training stage for Vision Transformers that strengthens local semantic extraction and improves performance on spatial reasoning tasks. The approach proves effective across multiple benchmarks.

🌐 Vision-Language Models perform poorly on spatial reasoning tasks such as object localization, counting, and relational question-answering, because Vision Transformers trained with image-level supervision struggle to encode local information effectively, which limits spatial understanding.

💡 Researchers at Stanford University propose the Locality Alignment solution, a post-training stage for Vision Transformers in which the MaskEmbed procedure masks parts of an image and trains the model to reconstruct them, learning each image patch's semantic contribution.

🎉 Locality Alignment proves effective on both vision-only and vision-language benchmarks, improving local semantic extraction and performance on tasks involving spatial understanding, all at low computational cost, making it a promising addition to VLM training methods.

Vision-Language Models (VLMs) struggle with spatial reasoning tasks like object localization, counting, and relational question-answering. This issue stems from Vision Transformers (ViTs) trained with image-level supervision, which often fail to encode localized information effectively, limiting spatial understanding.

Researchers from Stanford University propose a novel solution called Locality Alignment, which involves a post-training stage for Vision Transformers. This process aims to enhance the local semantic extraction capabilities of ViTs to improve their performance on spatial reasoning tasks. Their approach includes a fine-tuning procedure called MaskEmbed, which uses a masked reconstruction loss to learn the semantic contributions of each image patch. By leveraging the latent knowledge of local semantics present in pre-trained models, the authors aim to align and enhance locality understanding in a scalable, self-supervised manner. This technique does not require new labeled data, making it efficient and easy to implement.
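
The core objective can be illustrated with a short, self-contained PyTorch sketch. This is a hedged approximation under stated assumptions, not the authors' implementation: `ToyViT`, `maskembed_step`, and all hyperparameters are illustrative placeholders. The idea shown is that a frozen copy of the pre-trained ViT acts as a teacher encoding a patch-masked view of the image, while the fine-tuned student encodes the full image and a small decoder must reconstruct the teacher's output from the student's masked patch embeddings.

```python
# Hypothetical MaskEmbed-style masked-reconstruction objective (illustrative
# names and shapes, not the authors' code).
import copy
import torch
import torch.nn as nn


class ToyViT(nn.Module):
    """Stand-in for a pre-trained ViT backbone that returns per-patch embeddings."""
    def __init__(self, patch_dim=768, embed_dim=768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches):                       # patches: (B, N, patch_dim)
        return self.encoder(self.proj(patches))       # (B, N, embed_dim)


def maskembed_step(student, teacher, decoder, patches, mask_ratio=0.5):
    """One masked-reconstruction step: the student's surviving patch embeddings,
    passed through a small decoder, must predict the frozen teacher's encoding
    of the same masked view of the image."""
    B, N, _ = patches.shape
    # Random keep/drop pattern over patch positions (1 = kept, 0 = masked).
    keep = (torch.rand(B, N, device=patches.device) > mask_ratio).float().unsqueeze(-1)

    with torch.no_grad():
        # Teacher encodes the masked image view (dropped patches zeroed out)
        # and provides the reconstruction target.
        target = teacher(patches * keep)

    # Student encodes the full image; masking its patch embeddings with the same
    # pattern forces each kept embedding to carry that patch's local semantics.
    pred = decoder(student(patches) * keep)
    return ((pred - target) ** 2).mean()


# Usage: student and teacher start from the same pre-trained weights; the
# teacher is frozen, and only the student and decoder are fine-tuned.
student = ToyViT()
teacher = copy.deepcopy(student).eval()
for p in teacher.parameters():
    p.requires_grad_(False)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, nhead=8, batch_first=True), num_layers=1)

patches = torch.randn(2, 196, 768)                    # dummy patchified images
loss = maskembed_step(student, teacher, decoder, patches)
loss.backward()
```

Because the target comes from the pre-trained model itself, no new labels are needed; the reconstruction loss simply redistributes knowledge the backbone already has into its per-patch embeddings.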

The proposed locality alignment process begins by applying the MaskEmbed procedure to pre-trained vision backbones. MaskEmbed works by masking parts of the image and training the model to reconstruct the masked portions, which teaches the model how each image patch contributes to the overall representation. The training is conducted as a post-training phase on the ViT, which is then integrated into a full Vision-Language Model pipeline. The approach can be applied to models trained with image-level supervision, such as CLIP or SigLIP. Importantly, MaskEmbed uses self-supervision, reducing computational costs compared to traditional supervised approaches. In the overall VLM training pipeline, locality alignment is performed first, followed by fine-tuning for vision-language tasks.
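
To make the "integrated into a full Vision-Language Model pipeline" step concrete, here is a hedged sketch of a generic LLaVA-style VLM that consumes the locality-aligned backbone (reusing the `ToyViT` stand-in from the sketch above). `ToyVLM`, the linear projector, and the toy language model are illustrative assumptions, not the paper's architecture: patch embeddings from the aligned vision encoder are projected into the language model's token space and prepended to the text tokens.

```python
# Hypothetical wiring of a locality-aligned ViT into a VLM (illustrative only).
import torch
import torch.nn as nn


class ToyVLM(nn.Module):
    """Generic vision backbone -> linear projector -> language model stack."""
    def __init__(self, vision_backbone: nn.Module, vis_dim=768,
                 lm_dim=512, vocab_size=32000):
        super().__init__()
        self.vision = vision_backbone                 # e.g. the locality-aligned ViT
        self.projector = nn.Linear(vis_dim, lm_dim)   # maps patch embeddings to LM space
        self.tok_embed = nn.Embedding(vocab_size, lm_dim)
        layer = nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in language model
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, patches, text_ids):
        vis_tokens = self.projector(self.vision(patches))    # (B, N, lm_dim)
        txt_tokens = self.tok_embed(text_ids)                # (B, T, lm_dim)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)     # image tokens, then text
        return self.lm_head(self.lm(seq))                    # per-position vocab logits


# Usage with the ToyViT from the sketch above as the vision backbone:
vlm = ToyVLM(ToyViT())
logits = vlm(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 16)))
```

The design intent is that, after locality alignment, each visual token entering the language model carries faithful per-patch semantics, giving localization, counting, and relational questions more spatial signal to attend to.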

The effectiveness of locality alignment was tested using both vision-only and vision-language benchmarks. The locality-aligned ViTs showed improved performance in patch-level semantic segmentation tasks, particularly for models like CLIP and SigLIP that were trained with image-caption pairs. In the vision-language evaluations, VLMs trained with locality-aligned backbones demonstrated better performance across a range of benchmarks involving spatial understanding. Specifically, improvements were observed in tasks like object localization (RefCOCO, OCID-Ref), relational question-answering (VSR), and counting (TallyQA). The locality alignment approach improved local semantic extraction without sacrificing global image understanding, yielding significant performance improvements across multiple benchmarks.

Locality alignment effectively enhances the local semantic capabilities of vision backbones in Vision-Language Models. The MaskEmbed approach leverages self-supervision to improve local semantics in pre-trained ViTs, leading to better spatial reasoning performance. With low computational cost and consistent improvements, locality alignment is a promising addition to VLM training methods and may benefit other tasks requiring spatial understanding. The research emphasizes disentangling local and global semantics in vision backbones with a scalable approach.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.



