MarkTechPost@AI July 12, 2024
Google DeepMind Unveils PaliGemma: A Versatile 3B Vision-Language Model (VLM) with Large-Scale Ambitions

Google DeepMind has released PaliGemma, a 3-billion-parameter vision-language model that combines the strengths of the PaLI series with those of the Gemma family of language models. PaliGemma performs strongly across a range of vision-language tasks such as image captioning and visual question answering, and also achieves excellent results on more specialized tasks such as chart understanding and OCR-related work.


Vision-language models have evolved significantly over the past few years, with two distinct generations emerging. The first generation, exemplified by CLIP and ALIGN, expanded on large-scale classification pretraining by utilizing web-scale data without requiring extensive human labeling. These models used caption embeddings obtained from language encoders to broaden the vocabulary for classification and retrieval tasks. The second generation, akin to T5 in language modeling, unified captioning and question-answering tasks through generative encoder-decoder modeling. Models like Flamingo, BLIP-2, and PaLI further scaled up these approaches. Recent developments have introduced an additional “instruction tuning” step to enhance user-friendliness. Alongside these advancements, systematic studies have aimed to identify the critical factors in vision-language models. 

Building on this progress, DeepMind researchers present PaliGemma, an open vision-language model combining the strengths of the PaLI vision-language model series with the Gemma family of language models. This innovative approach builds upon the success of previous PaLI iterations, which demonstrated impressive scaling capabilities and performance improvements. PaliGemma integrates a 400M SigLIP vision model with a 2B Gemma language model, resulting in a sub-3B vision-language model that rivals the performance of much larger predecessors like PaLI-X, PaLM-E, and PaLI-3. The Gemma component, derived from the same technology powering the Gemini models, contributes its auto-regressive decoder-only architecture to enhance PaliGemma’s capabilities. This fusion of advanced vision and language processing techniques positions PaliGemma as a significant advancement in multimodal AI.

PaliGemma’s architecture comprises three key components: a SigLIP ViTSo400m image encoder, a Gemma-2B v1.0 decoder-only language model, and a linear projection layer. The image encoder transforms input images into a sequence of tokens, while the language model processes text using its SentencePiece tokenizer. The linear projection layer aligns the dimensions of image and text tokens, allowing them to be concatenated. This simple yet effective design enables PaliGemma to handle various tasks, including image classification, captioning, and visual question-answering, through a flexible image+text in, text out API.
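
To make the three-part design concrete, here is a minimal PyTorch sketch of how projected image tokens can be prepended to text embeddings before being fed to a decoder-only language model. The class name, the placeholder encoder/decoder arguments, and the default dimensions (1152 for SigLIP-So400m, 2048 for Gemma-2B) are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn


class PaliGemmaLikeModel(nn.Module):
    """Minimal sketch of the three-component design: image encoder,
    linear projection, and decoder-only language model. The encoder and
    decoder passed in are stand-ins, not the actual SigLIP/Gemma weights."""

    def __init__(self, image_encoder, language_model, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.image_encoder = image_encoder    # e.g. a SigLIP ViT (placeholder)
        self.language_model = language_model  # e.g. a Gemma-2B decoder (placeholder)
        # Linear projection that aligns image-token width with the text embedding width.
        self.projection = nn.Linear(vision_dim, text_dim)

    def forward(self, pixel_values, text_embeddings):
        image_tokens = self.image_encoder(pixel_values)   # (B, N_img, vision_dim)
        image_tokens = self.projection(image_tokens)      # (B, N_img, text_dim)
        # Concatenate projected image tokens in front of the text embeddings and
        # let the decoder-only language model process the combined sequence.
        sequence = torch.cat([image_tokens, text_embeddings], dim=1)
        return self.language_model(sequence)
```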

The model’s input sequence structure is carefully designed for optimal performance. Image tokens are placed at the beginning, followed by a BOS token, prefix tokens (task description), a SEP token, suffix tokens (prediction), an EOS token, and PAD tokens. This arrangement allows for full attention across the entire input, enabling image tokens to consider the task context when updating their representations. The suffix, which forms the output, is covered by an auto-regressive mask to maintain the generation process’s integrity.
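
The masking scheme can be illustrated with a small helper that builds a boolean attention matrix: image and prefix tokens attend to each other bidirectionally, while suffix tokens are restricted to a causal pattern. This is a hypothetical sketch of the scheme described above, not code from the released model.

```python
import torch


def build_prefix_lm_mask(num_image_tokens: int, num_prefix_tokens: int, num_suffix_tokens: int) -> torch.Tensor:
    """Return a (total, total) boolean mask where True means attention is allowed.

    Image + prefix positions see the whole image+prefix block (full attention);
    suffix positions see that block plus earlier suffix positions (causal).
    """
    total = num_image_tokens + num_prefix_tokens + num_suffix_tokens
    prefix_len = num_image_tokens + num_prefix_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Every query position may attend to every image/prefix key position.
    mask[:, :prefix_len] = True
    # Suffix positions additionally attend causally within the suffix.
    for i in range(prefix_len, total):
        mask[i, prefix_len:i + 1] = True
    return mask


if __name__ == "__main__":
    # Tiny example: 3 image tokens, 2 prefix tokens, 2 suffix tokens.
    print(build_prefix_lm_mask(3, 2, 2).int())
```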

PaliGemma’s training process involves multiple stages to ensure comprehensive visual-language understanding. It begins with unimodal pretraining of individual components, followed by multimodal pretraining on a diverse mixture of tasks. Notably, the image encoder is not frozen during this stage, allowing for improved spatial and relational understanding. The training continues with a resolution increase stage, enhancing the model’s ability to handle high-resolution images and complex tasks. Finally, a transfer stage adapts the base model to specific tasks or use cases, demonstrating PaliGemma’s versatility and effectiveness across various applications.
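
As a rough outline, the staged recipe can be written down as a configuration sketch; the field names and values below are illustrative placeholders drawn from the description above, not the paper's exact hyperparameters.

```python
# Illustrative outline of the staged training recipe; not the actual training code.
TRAINING_STAGES = [
    {"stage": "unimodal_pretraining",
     "note": "image encoder and language model pretrained separately"},
    {"stage": "multimodal_pretraining", "resolution": 224, "freeze_image_encoder": False,
     "note": "joint training on a broad mixture of vision-language tasks"},
    {"stage": "resolution_increase", "resolutions": [448, 896],
     "note": "short stages adapting the model to higher-resolution inputs"},
    {"stage": "transfer",
     "note": "fine-tune the base checkpoint on a specific downstream task or use case"},
]
```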

The results demonstrate PaliGemma’s impressive performance across a wide range of visual-language tasks. The model excels in image captioning, achieving high scores on benchmarks like COCO-Captions and TextCaps. In visual question answering, PaliGemma shows strong performance on various datasets, including VQAv2, GQA, and ScienceQA. The model also performs well on more specialized tasks such as chart understanding (ChartQA) and OCR-related tasks (TextVQA, DocVQA). Notably, PaliGemma exhibits significant improvements when increasing image resolution from 224px to 448px and 896px, especially for tasks involving fine-grained details or text recognition. The model’s versatility is further demonstrated by its ability to handle video input tasks and image segmentation challenges.

Researchers also present the noteworthy findings from the PaliGemma research:

* Simple square resizing to 224×224 performs as well as more elaborate aspect-ratio-preserving techniques on segmentation tasks.
* The researchers introduce CountBenchQA, a new dataset that addresses limitations of TallyQA for evaluating the counting ability of VLMs.
* Discrepancies were found in previously published WidgetCaps numbers, invalidating some comparisons.
* Image annotations (e.g., a red box) are as effective as text prompts for indicating which widget to caption.
* RoPE interpolation of image tokens during the resolution upscaling (Stage 2) showed no clear benefit.
* PaliGemma unexpectedly generalizes zero-shot to 3D renders from Objaverse without task-specific training.
* The model achieves state-of-the-art performance on MMVP, significantly outperforming much larger models such as GPT-4V and Gemini.

This research introduces PaliGemma, a robust, compact, open base VLM that excels in transfer learning across diverse tasks. It demonstrates that smaller VLMs can achieve state-of-the-art performance on a wide spectrum of benchmarks, challenging the notion that larger models are always superior. By releasing the base model without instruction tuning, the researchers aim to provide a valuable foundation for further studies in instruction tuning and specific applications. This approach encourages a clearer distinction between base models and fine-tuned versions in VLM research, potentially opening new avenues for more efficient and versatile AI systems in the field of visual-language understanding.


Check out the Paper. All credit for this research goes to the researchers of this project.
