MarkTechPost@AI · December 6, 2024
Google DeepMind Just Released PaliGemma 2: A New Family of Open-Weight Vision Language Models (3B, 10B and 28B)

Google DeepMind has released PaliGemma 2, a new family of vision-language models (VLMs) with 3-billion, 10-billion, and 28-billion-parameter variants. The models support resolutions of 224x224, 448x448, and 896x896 pixels and have been tested on more than 30 transfer tasks, including image captioning, visual question answering, video tasks, and OCR-related tasks. PaliGemma 2 performs strongly across these tasks, with larger models and higher resolutions generally doing better: the 28-billion-parameter variant delivers the best results but demands the most compute. The open-weight release makes the models easy to adopt and improve, gives users greater flexibility, and advances the development of vision-language models.

🚀 **The PaliGemma 2 family spans multiple parameter scales and resolutions:** the series includes 3-billion, 10-billion, and 28-billion-parameter models supporting 224x224, 448x448, and 896x896-pixel resolutions, offering the flexibility and scalability to match different task requirements and compute budgets.

💡 **The models excel across many tasks and achieve leading results:** PaliGemma 2 has been tested on more than 30 transfer tasks, including image captioning, visual question answering, video tasks, and OCR-related tasks such as table structure recognition and molecular structure recognition. On the HierText OCR benchmark, the 896-pixel variant of PaliGemma 2 outperforms previous models in word-level recognition accuracy.

🔄 **Open weights make the models easy to adopt and improve:** PaliGemma 2 is released with open weights, so users can treat it as a drop-in replacement or upgrade for the original PaliGemma, enabling more flexible transfer learning and fine-tuning and encouraging broader application and development.

📊 **Model scale and resolution are closely tied to performance:** larger models and higher resolutions generally deliver better results but consume more compute. The 28-billion-parameter model, for example, performs best but requires the most resources, making it suited to scenarios where latency is not critical.

🌐 **PaliGemma 2 advances vision-language modeling:** by offering models at multiple scales and resolutions, PaliGemma 2 covers a broad range of applications and user needs, from resource-constrained deployments to high-performance research, providing flexible and effective building blocks for the future of AI.

Vision-language models (VLMs) have come a long way, but they still face significant challenges when it comes to effectively generalizing across different tasks. These models often struggle with diverse input data types, like images of various resolutions or text prompts that require subtle understanding. On top of that, finding a balance between computational efficiency and model scalability is no easy feat. These challenges make it hard for VLMs to be practical for many users, especially those who need adaptable solutions that perform consistently well across a wide range of real-world applications, from document recognition to detailed image captioning.

Google DeepMind recently introduced the PaliGemma 2 series, a new family of Vision-Language Models (VLMs) with parameter sizes of 3 billion (3B), 10 billion (10B), and 28 billion (28B). The models support resolutions of 224×224, 448×448, and 896×896 pixels. This release includes nine pre-trained models with different combinations of sizes and resolutions, making them versatile for a variety of use cases. Two of these models are also fine-tuned on the DOCCI dataset, which contains image-text caption pairs, and are available at 3B and 10B parameters with a resolution of 448×448 pixels. Since these models are open-weight, they can be easily adopted as a direct replacement or upgrade for the original PaliGemma, offering users more flexibility for transfer learning and fine-tuning.
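Because the released weights follow the same conventions as the original PaliGemma checkpoints on Hugging Face, getting a first caption out of a model takes only a few lines. Below is a minimal inference sketch using the transformers library; the checkpoint ID `google/paligemma2-3b-pt-224` is an assumption based on the release's naming scheme, so substitute whichever size/resolution variant you actually download.

```python
# Minimal inference sketch for a PaliGemma 2 checkpoint via transformers.
# The model ID below is assumed from the release's naming convention.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-224"  # assumed checkpoint name
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local RGB image
prompt = "caption en"              # PaliGemma-style task-prefix prompt

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=50)

# Drop the prompt tokens, then decode only the generated caption.
generation = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generation, skip_special_tokens=True))
```

Swapping in a 448px or 896px checkpoint requires no code changes beyond the model ID, since the processor picks up the matching image size from the checkpoint's configuration.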

Technical Details

PaliGemma 2 builds on the original PaliGemma model by incorporating the SigLIP-So400m vision encoder along with the Gemma 2 language models. The models are trained in three stages, using different image resolutions (224px, 448px, and 896px) to allow for flexibility and scalability based on the specific needs of each task. PaliGemma 2 has been tested on more than 30 transfer tasks, including image captioning, visual question answering (VQA), video tasks, and OCR-related tasks like table structure recognition and molecular structure identification. The different variants of PaliGemma 2 excel under different conditions, with larger models and higher resolutions generally performing better. For example, the 28B variant offers the highest performance, though it requires more computational resources, making it suitable for more demanding scenarios where latency is not a major concern.
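Since the open weights are intended for transfer learning, a common adaptation path is parameter-efficient fine-tuning rather than full-model training. The sketch below shows one such setup using the peft library's LoRA adapters; the checkpoint ID and the target module names (`q_proj` and friends, the standard Gemma attention projections) are assumptions rather than details confirmed by the paper.

```python
# Hedged sketch: a LoRA fine-tuning setup for a PaliGemma 2 checkpoint.
# Checkpoint ID and target module names are assumptions, not taken from
# the PaliGemma 2 paper.
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-448"  # assumed checkpoint name; pick per task
)

lora_config = LoraConfig(
    r=8,                    # low adapter rank keeps memory usage modest
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# get_peft_model freezes the base weights (vision encoder included) and
# leaves only the low-rank adapters trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The wrapped model can then be passed to a standard training loop or the transformers Trainer with an image-text dataset formatted for the task prefix you care about.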

The PaliGemma 2 series is notable for several reasons. First, offering models at different scales and resolutions allows researchers and developers to adapt performance according to their specific needs, computational resources, and desired balance between efficiency and accuracy. Second, the models have shown strong performance across a range of challenging tasks. For instance, PaliGemma 2 has achieved top scores in benchmarks involving text detection, optical music score recognition, and radiography report generation. In the HierText benchmark for OCR, the 896px variant of PaliGemma 2 outperformed previous models in word-level recognition accuracy, showing improvements in both precision and recall. Benchmark results also suggest that increasing model size and resolution generally leads to better performance across diverse tasks, highlighting the effective combination of visual and textual data representation.

Conclusion

Google’s release of PaliGemma 2 represents a meaningful step forward in vision-language models. By providing nine models across three scales with open-weight availability, PaliGemma 2 addresses a wide range of applications and user needs, from resource-constrained scenarios to high-performance research tasks. The versatility of these models and their ability to handle diverse transfer tasks make them valuable tools for both academic and industry applications. As more use cases integrate multimodal inputs, PaliGemma 2 is well-positioned to provide flexible and effective solutions for the future of AI.


Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project.

