MarkTechPost@AI December 8, 2024
UC Berkeley Researchers Explore the Role of Task Vectors in Vision-Language Models

Researchers at the University of California, Berkeley experimentally analyzed how task vectors are encoded and transferred in vision-language models (VLMs). They found that VLMs map inputs into a shared task representation space, regardless of whether the task is defined by text examples, image examples, or explicit instructions. The researchers created six tasks to test whether VLMs exhibit task-vector-like behavior and observed how well task vectors transfer across modalities. By analyzing how token representations change inside the VLM, they revealed a three-phase process: encoding the input, forming a task representation, and generating the output. The study also evaluated the cross-modal transfer performance of task vectors in text and image in-context learning (ICL), finding that cross-modal transfer can significantly improve accuracy and that text-based task vectors are more effective than image-based ones. Integrating instruction-based and exemplar-based task vectors into a single vector makes task representations more efficient.

👁️‍🗨️ Vision-language models (VLMs) are important tools that use text to handle a variety of computer vision tasks, such as image recognition, text recognition in images (OCR), and object detection.

🧠 Current VLM approaches treat tasks as either text-based or image-based, focusing on one input type at a time, which overlooks the deeper potential of combining image and text information. In-context learning (ICL) lets a model adapt to a task from only a few examples; VLMs borrow this idea from LLMs and combine visual and textual data using either late-fusion or early-fusion approaches.

🧑‍🔬 The researchers experimentally analyzed how task vectors are encoded and transferred in VLMs and found that VLMs map inputs into a shared task representation space, regardless of whether the task is defined by text examples, image examples, or explicit instructions.

📊 The study created six tasks to test the VLMs' behavior and observed how task vectors transfer across modalities. The analysis shows a three-phase process inside VLMs: encoding the input, forming a task representation, and generating the output. Decoding the task vectors often summarizes the task concept and aligns the text and image modalities.

📈 The study evaluated the cross-modal transfer performance of task vectors and found that cross-modal transfer can significantly improve accuracy. Text-based task vectors are more effective than image-based ones, and integrating instruction-based and exemplar-based task vectors into a single vector makes the task representation more efficient.

Vision-and-language models (VLMs) are important tools that use text to handle different computer vision tasks. Tasks like recognizing images, reading text from images (OCR), and detecting objects can be framed as answering visual questions with text responses. While VLMs have shown success on such tasks, how they process and represent multimodal inputs like images and text to produce those answers remains unclear, which raises questions about the kind of representations that enable them to achieve such tasks.

The current methods in vision-and-language models treat tasks as either text-based or image-based, focusing on one input type at a time. This misses the deeper possibilities of combining information from images and text. In-context learning (ICL), a feature of large language models (LLMs), allows models to adapt to tasks with minimal examples, driven by mechanisms like attention heads or task vectors that encode tasks as latent activations. Vision-and-language models (VLMs), inspired by LLMs, combine visual and text data using either late-fusion (pre-trained components) or early-fusion (end-to-end training) methods. Studies revealed that task representations can transfer across modalities, and even VLMs without image ICL can use task vectors for better performance, highlighting similarities between image and text ICL processes. Combining image and text input can allow VLMs to perform complex tasks more effectively.
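To make the task-vector mechanism concrete, here is a minimal sketch of activation extraction and patching in the spirit described above. It is not the authors' code: the model and tokenizer objects, the layer index, and the `model.model.layers` module path are assumptions in a HuggingFace/PyTorch idiom, and real VLM architectures will differ.

```python
# Illustrative sketch only: extract a "task vector" (the last-token hidden
# state of an in-context prompt at one layer) and patch it into a bare query.
# Model, tokenizer, layer index, and module path are hypothetical.
import torch

LAYER = 15  # assumed intermediate transformer layer


def extract_task_vector(model, tokenizer, icl_prompt: str) -> torch.Tensor:
    """Run the ICL prompt and keep the last token's hidden state at LAYER."""
    inputs = tokenizer(icl_prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[LAYER] has shape (batch, seq_len, hidden_dim)
    return out.hidden_states[LAYER][0, -1, :].clone()


def run_with_patched_task(model, tokenizer, query: str, task_vec: torch.Tensor) -> str:
    """Forward a query with no examples, overwriting the last prompt token's
    activation at LAYER with the task vector (activation patching)."""

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > 1:          # patch only the full-prompt pass,
            hidden[:, -1, :] = task_vec  # not the single-token decode steps
        return output

    layer_module = model.model.layers[LAYER]  # module path assumed; varies by model
    handle = layer_module.register_forward_hook(hook)
    try:
        inputs = tokenizer(query, return_tensors="pt")
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=5)
    finally:
        handle.remove()
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```

Under this view, cross-modal transfer amounts to extracting the vector from a prompt in one modality (say, text-only examples) and patching it into a query in the other (an image question), which is the setup the Berkeley study examines.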

To investigate this, researchers from the University of California, Berkeley conducted experiments to analyze how task vectors are encoded and transferred in VLMs. They found that VLMs map inputs into a shared task representation space, regardless of whether the task is defined by text examples, image examples, or explicit instructions.

Researchers created six tasks to test whether VLMs exhibit task-vector-like behavior and to see how well task vectors transfer across different modalities, using text, images, or direct instructions to define the tasks. These vectors were then applied in cross-modal scenarios, such as using text examples to define a task but querying with images. Analyzing how token representations change inside the VLM revealed a three-phase process: encoding the input, forming a task representation, and generating the output. Decoding the task vectors often summarized the task concept and aligned the text and image modalities, although image-based tasks were decoded less clearly.
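The three-phase observation is the kind of result a logit-lens style probe can surface. The sketch below is my illustration rather than the paper's code: it decodes the last token's hidden state at every layer through the output embedding to see what it "reads as" at each depth, skips the model's final layer norm for brevity, and assumes a HuggingFace-style causal LM interface.

```python
# Logit-lens style probe: project each layer's last-token hidden state
# through the output embedding and look at the top decoded tokens.
# Simplified illustration; skips the final layer norm.
import torch


def decode_last_token_by_layer(model, tokenizer, prompt: str, top_k: int = 3):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    unembed = model.get_output_embeddings().weight  # (vocab, hidden)
    readouts = []
    for layer_idx, hidden in enumerate(out.hidden_states):
        vec = hidden[0, -1, :]                       # last-token hidden state
        top_ids = torch.topk(unembed @ vec, top_k).indices.tolist()
        readouts.append((layer_idx, [tokenizer.decode([i]) for i in top_ids]))
    return readouts
```

Read top to bottom, a trace like this would show the paper's three phases: early layers echo the input token, middle layers decode to a word summarizing the task, and late layers decode to the answer itself.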

The study evaluated the cross-modal transfer performance of task vectors from text and image in-context learning (ICL), revealing significant improvements. Cross-modal patching (xPatch) surpassed same-context examples (xBase), boosting accuracy by 14–33% over text ICL xBase and 8–13% over image ICL Patch. Text-based task vectors proved more efficient than image-based ones, as the latter involve extra recognition steps. Integrating instruction-based and exemplar-based task vectors into a single vector improves the task representation, reducing variance and increasing efficiency by 18%. Cross-modal transfer from text to image reached accuracies as high as 37–52% compared with the baselines. LLM-to-VLM transfers showed high similarity between the task vectors (cosine similarity: 0.89–0.95). The results thus highlight cross-modal patching and vector integration as keys to optimizing task performance.
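Two of the reported measurements reduce to simple vector operations once task vectors are extracted. The sketch below shows the cosine-similarity comparison between vectors from different sources, plus one plausible way to integrate exemplar- and instruction-based vectors; plain averaging is my assumption for illustration, not the paper's stated formula.

```python
# Comparing and merging task vectors. Averaging as the integration rule is
# an assumption for illustration, not necessarily the paper's method.
import torch
import torch.nn.functional as F


def task_vector_similarity(v_a: torch.Tensor, v_b: torch.Tensor) -> float:
    """Cosine similarity between two task vectors (e.g. LLM- vs VLM-derived)."""
    return F.cosine_similarity(v_a.unsqueeze(0), v_b.unsqueeze(0)).item()


def merge_task_vectors(v_exemplar: torch.Tensor, v_instruction: torch.Tensor) -> torch.Tensor:
    """Integrate exemplar-based and instruction-based task vectors into one."""
    return (v_exemplar + v_instruction) / 2


# e.g. task_vector_similarity(v_text_icl, v_image_icl) gives a value in [-1, 1];
# the article reports 0.89-0.95 for LLM-to-VLM transfers.
```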


In summary, VLMs can effectively encode and transfer task representations across modalities, which shows potential for more versatile and efficient multimodal models. The researchers offered possible explanations, such as shared structure between language and perception, or the models learning from the same underlying reality. They found that transferring tasks from text to images works better than from images to text, likely because VLM training focuses more on text. This work can thus serve as a baseline for further research and innovation.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.




Related tags

Vision-language models  Task vectors  Cross-modal learning  In-context learning  Artificial intelligence