MarkTechPost@AI · April 4, 13:25
Meet Open-Qwen2VL: A Fully Open and Compute-Efficient Multimodal Large Language Model

Open-Qwen2VL is a 2-billion-parameter multimodal large language model developed collaboratively by researchers from UC Santa Barbara, ByteDance, and NVIDIA. The model is designed to address reproducibility and resource constraints in MLLM research. The team has open-sourced the full suite of resources, including training code, data filtering scripts, WebDataset-formatted pretraining data, and model checkpoints. Built on the Qwen2.5-1.5B-Instruct LLM and the SigLIP-SO-400M vision encoder, Open-Qwen2VL uses efficient visual token handling and multimodal sequence packing to match or exceed the performance of much larger models under limited compute, providing a new research baseline for the academic community.

👁️‍🗨️ Open-Qwen2VL is built on the Qwen2.5-1.5B-Instruct LLM and the SigLIP-SO-400M vision encoder, for a total of 2 billion parameters.

💡 The model uses an adaptive average-pooling visual projector that reduces the number of visual tokens from 729 to 144 during pretraining, improving computational efficiency. The token count is restored to 729 during supervised fine-tuning to preserve image-understanding capability.

📦 To further improve training efficiency, Open-Qwen2VL implements multimodal sequence packing, concatenating multiple image-text pairs into sequences of roughly 4096 tokens to minimize padding and computational overhead.

📊 Using only 0.36% of the token count consumed by Qwen2-VL, Open-Qwen2VL performs on par with or better than Qwen2-VL on several benchmarks, scoring 80.9 on MMBench, for example.

🚀 Open-Qwen2VL demonstrates strong few-shot multimodal in-context learning, with accuracy gains of 3% to 12% from 0-shot to 8-shot settings on datasets such as GQA and TextVQA.

Multimodal Large Language Models (MLLMs) have advanced the integration of visual and textual modalities, enabling progress in tasks such as image captioning, visual question answering, and document interpretation. However, the replication and further development of these models are often hindered by a lack of transparency. Many state-of-the-art MLLMs do not release key components, including training code, data curation methodologies, and pretraining datasets. Furthermore, the substantial computational resources required for training these models pose a significant barrier, particularly for academic researchers with limited infrastructure. This lack of accessibility impedes reproducibility and slows the dissemination of new techniques within the research community.

Researchers from UC Santa Barbara, ByteDance, and NVIDIA Research introduce Open-Qwen2VL, a 2-billion-parameter multimodal large language model pre-trained on 29 million image-text pairs using approximately 220 A100-40G GPU hours. Open-Qwen2VL is designed to address reproducibility and resource constraints in MLLM research. The project provides a complete suite of open-source resources, including the training codebase, data filtering scripts, WebDataset-formatted pretraining data, and both base and instruction-tuned model checkpoints. This comprehensive release aims to support transparent experimentation and method development in the multimodal learning domain.

Open-Qwen2VL is based on the Qwen2.5-1.5B-Instruct LLM backbone, coupled with a SigLIP-SO-400M vision encoder. An Adaptive Average-Pooling Visual Projector reduces the number of visual tokens from 729 to 144 during pretraining, which improves computational efficiency. The token count is increased back to 729 during the supervised fine-tuning (SFT) stage. This low-to-high resolution strategy maintains image understanding capabilities while optimizing for resource usage.
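A minimal PyTorch sketch of this kind of pooling projector is shown below. The module name, the two-layer MLP, and the hidden sizes (1152 for SigLIP-SO-400M features, 1536 for the Qwen2.5-1.5B embedding space) are assumptions for illustration and may differ from the released implementation; the point is simply that pooling the 27×27 patch grid down to 12×12 is what reduces 729 visual tokens to 144.

```python
import torch
import torch.nn as nn

class AdaptiveAvgPoolProjector(nn.Module):
    """Illustrative projector: pools a 27x27 grid of visual tokens (729)
    down to 12x12 (144) and maps them into the LLM embedding space.
    Dimensions are assumed, not taken from the Open-Qwen2VL code."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 1536, out_grid: int = 12):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(out_grid)   # 27x27 -> 12x12
        self.proj = nn.Sequential(                   # simple MLP projector (assumed)
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = visual_tokens.shape                # (B, 729, D)
        side = int(n ** 0.5)                         # 27
        x = visual_tokens.transpose(1, 2).reshape(b, d, side, side)
        x = self.pool(x)                             # (B, D, 12, 12)
        x = x.flatten(2).transpose(1, 2)             # (B, 144, D)
        return self.proj(x)                          # (B, 144, llm_dim)


# Example: SigLIP-style patch features for one image
feats = torch.randn(1, 729, 1152)
print(AdaptiveAvgPoolProjector()(feats).shape)       # torch.Size([1, 144, 1536])
```

Under this reading, the low-to-high resolution strategy amounts to pooling aggressively during pretraining and restoring the full 729-token grid at SFT time, when image fidelity matters most.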

To further enhance training efficiency, Open-Qwen2VL implements multimodal sequence packing, allowing the concatenation of multiple image-text pairs into sequences of approximately 4096 tokens, thereby minimizing padding and computational overhead. The vision encoder parameters remain frozen during pretraining to conserve resources and are optionally unfrozen during SFT to improve downstream performance.
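One simple way to realize such packing is greedy first-fit bin packing over tokenized sample lengths, sketched below under a 4096-token budget. This is an illustrative assumption, not the project's actual data-loading code; the function and variable names are invented for the example.

```python
from typing import List

MAX_LEN = 4096   # target packed sequence length (from the paper's description)

def pack_sequences(sample_lengths: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Greedy first-fit packing: group samples (by index) so each packed
    sequence stays under max_len, minimizing padding across the batch."""
    bins: List[List[int]] = []   # sample indices per packed sequence
    loads: List[int] = []        # current token count per packed sequence
    for idx, length in enumerate(sample_lengths):
        for b, load in enumerate(loads):
            if load + length <= max_len:
                bins[b].append(idx)
                loads[b] += length
                break
        else:                    # no existing sequence fits: start a new one
            bins.append([idx])
            loads.append(length)
    return bins

# Each length = 144 pooled visual tokens + caption tokens for one image-text pair
lengths = [144 + 260, 144 + 1900, 144 + 512, 144 + 3800, 144 + 90]
print(pack_sequences(lengths))   # e.g. [[0, 1, 2, 4], [3]]
```

Because short captions no longer occupy full-length padded rows, far fewer padding tokens pass through the model per optimizer step.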

Open-Qwen2VL is trained on only 0.36% of the token count used in Qwen2-VL, yet demonstrates comparable or superior performance across several benchmarks. The model achieves a score of 80.9 on MMBench, and performs competitively on SEEDBench (72.5), MMStar (49.7), and MathVista (53.1). Ablation studies indicate that integrating a small subset (5M samples) of high-quality image-text pairs filtered using MLM-based techniques can result in measurable performance improvements, highlighting the importance of data quality over volume.

In addition, Open-Qwen2VL exhibits robust few-shot multimodal in-context learning capabilities. When evaluated on datasets such as GQA and TextVQA, the model shows 3% to 12% accuracy gains from 0-shot to 8-shot scenarios. Fine-tuning performance scales predictably with the size of the instruction tuning dataset, with performance gains plateauing around 8M examples from the MAmmoTH-VL-10M dataset.

Open-Qwen2VL introduces a reproducible and resource-efficient pipeline for training multimodal large language models. By systematically addressing the limitations of prior models in terms of openness and compute requirements, it enables broader participation in MLLM research. The model’s design choices—including efficient visual token handling, multimodal sequence packing, and judicious data selection—illustrate a viable path forward for academic institutions aiming to contribute to the field. Open-Qwen2VL establishes a reproducible baseline and provides a foundation for future work on scalable, high-performance MLLMs within constrained computational environments.


Check out the Paper, Model, Data and Code. All credit for this research goes to the researchers of this project.
