MarkTechPost@AI June 2, 23:55
MiMo-VL-7B: A Powerful Vision-Language Model to Enhance General Visual Understanding and Multimodal Reasoning

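Researchers at Xiaomi have introduced MiMo-VL-7B, a compact yet powerful vision-language model (VLM). The model consists of three key components: a native-resolution Vision Transformer encoder, a multi-layer perceptron projector for efficient cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks. MiMo-VL-7B is optimized through two sequential training processes: a pre-training phase and a post-training phase that introduces Mixed On-policy Reinforcement Learning (MORL). In a comprehensive evaluation across 50 tasks, MiMo-VL-7B performs strongly on general vision-language, document understanding, and multimodal reasoning tasks, and shows marked gains in GUI understanding and grounding.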

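👁️ The core architecture of MiMo-VL-7B has three parts: a Vision Transformer (ViT) that encodes visual inputs, a projector that maps the visual encodings into a latent space aligned with the LLM, and the LLM itself, which handles textual understanding and reasoning.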

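📚 MiMo-VL-7B goes through two main training phases. The first is pre-training, which covers projector warmup, vision-language alignment, general multimodal pre-training, and long-context supervised fine-tuning, and uses 2.4 trillion tokens. The second is post-training, which introduces Mixed On-policy Reinforcement Learning (MORL) and integrates multiple reward signals.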

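🚀 Evaluation results show that MiMo-VL-7B performs strongly across multiple benchmarks. On general vision-language tasks, MiMo-VL-7B-SFT and MiMo-VL-7B-RL score 64.6% and 66.7% on MMMU-val, respectively, surpassing larger models such as Gemma 3 27B. On document understanding and multimodal reasoning tasks, MiMo-VL-7B-RL also significantly outperforms other open-source models.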

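💻 MiMo-VL-7B performs strongly in GUI understanding and grounding. On challenging benchmarks such as Screenspot-Pro and OSWorld-G, its RL model outperforms all compared general-purpose VLMs and matches or exceeds GUI-specialized models.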

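💡 The researchers attribute MiMo-VL-7B's performance to carefully curated, high-quality pre-training datasets and the MORL framework. Key development insights include: incorporating reasoning data in later pre-training stages consistently improves performance, on-policy RL outperforms vanilla GRPO, and task interference poses challenges when MORL is applied across different capabilities.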

Vision-language models (VLMs) have become foundational components for multimodal AI systems, enabling autonomous agents to understand visual environments, reason over multimodal content, and interact with both digital and physical worlds. The significance of these capabilities has led to extensive research across architectural designs and training methodologies, resulting in rapid advancements in the field. Researchers from Xiaomi introduce MiMo-VL-7B, a compact yet powerful VLM comprising three key components: a native-resolution Vision Transformer encoder that preserves fine-grained visual details, a Multi-Layer Perceptron projector for efficient cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks.

MiMo-VL-7B undergoes two sequential training processes. The first is a four-stage pre-training phase, consisting of projector warmup, vision-language alignment, general multimodal pre-training, and long-context supervised fine-tuning, which consumes 2.4 trillion tokens from curated high-quality datasets. This yields the MiMo-VL-7B-SFT model. The second is the post-training phase, which introduces Mixed On-policy Reinforcement Learning (MORL), integrating diverse reward signals spanning perception accuracy, visual grounding precision, logical reasoning capabilities, and human preferences. This yields the MiMo-VL-7B-RL model. Key findings are that incorporating high-quality, broad-coverage reasoning data starting in the pre-training stage improves model performance, while achieving stable, simultaneous improvements across diverse capabilities remains challenging.

The MiMo-VL-7B architecture contains three components: (a) a Vision Transformer (ViT) for encoding visual inputs such as images and videos, (b) a projector that maps the visual encodings into a latent space aligned with the LLM, and (c) the LLM itself, which performs textual understanding and reasoning. Qwen2.5-ViT is adopted as the visual encoder to support native-resolution inputs. MiMo-7B-Base serves as the LLM backbone because of its strong reasoning capability, and a randomly initialized Multi-Layer Perceptron (MLP) serves as the projector. The pre-training dataset comprises 2.4 trillion tokens of diverse multimodal data, including image captions, interleaved data, Optical Character Recognition (OCR) data, grounding data, video content, GUI interactions, reasoning examples, and text-only sequences.
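To make the data flow through the three components concrete, here is a minimal sketch of a projector that maps visual tokens into the LLM's embedding width. It is an illustration only: the article says just that the projector is a randomly initialized MLP, so the two-layer shape, the GELU activation, the toy dimensions, and every function name below are assumptions, and the random vectors stand in for real ViT outputs and text embeddings.

```python
import math
import random


def gelu(x: float) -> float:
    # Tanh approximation of GELU (the activation choice is an assumption).
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def make_linear(n_in: int, n_out: int, rng: random.Random):
    # Randomly initialized weights, echoing the "randomly initialized" projector.
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(vec, layer):
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def project(visual_tokens, layer1, layer2):
    # Apply the same two-layer MLP to every visual token independently.
    return [linear([gelu(h) for h in linear(tok, layer1)], layer2) for tok in visual_tokens]


rng = random.Random(0)
VIT_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12  # toy sizes, not the real model's

layer1 = make_linear(VIT_DIM, HIDDEN_DIM, rng)
layer2 = make_linear(HIDDEN_DIM, LLM_DIM, rng)

# Stand-ins for ViT outputs (5 visual tokens) and text embeddings (3 tokens).
visual_tokens = [[rng.gauss(0, 1) for _ in range(VIT_DIM)] for _ in range(5)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(3)]

projected = project(visual_tokens, layer1, layer2)
llm_input = projected + text_embeddings  # one sequence in the LLM's embedding space

print(len(llm_input), len(llm_input[0]))  # 8 12
```

The point of the sketch is the shape change: each visual token leaves the encoder at one width and enters the LLM at another, so projected visual tokens and text embeddings can sit in the same input sequence.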

The post-training phase further improves MiMo-VL-7B on challenging reasoning tasks and on human preference alignment through the MORL framework, which integrates Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from Human Feedback (RLHF). RLVR uses rule-based reward functions for continuous self-improvement, so multiple verifiable reasoning and perception tasks are designed whose final answers can be validated precisely with predefined rules. RLHF is employed alongside this verifiable-reward framework to address human preference alignment and mitigate undesirable behaviors. MORL then optimizes the RLVR and RLHF objectives simultaneously.
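The idea of rule-based, verifiable rewards mixed with a preference signal can be sketched as follows. This is not the authors' implementation: the `Answer:` output format, the use of plain IoU for grounding, the placeholder preference scorer, and all function names are assumptions chosen for illustration. In a real system the preference score would come from a learned reward model.

```python
import re


def extract_final_answer(response: str):
    # Hypothetical output convention: the final answer follows "Answer:".
    match = re.search(r"Answer:\s*(.+)", response)
    return match.group(1).strip() if match else None


def answer_reward(response: str, reference: str) -> float:
    # Verifiable reward: 1.0 only if the extracted answer matches the reference.
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == reference.strip().lower() else 0.0


def iou(box_a, box_b) -> float:
    # Boxes are (x1, y1, x2, y2); returns intersection over union.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def placeholder_preference_score(response: str) -> float:
    # Placeholder for a learned reward model (assumption, not a real scorer).
    return 1.0 if response.strip() else 0.0


def reward(sample: dict) -> float:
    # Route each sample in a mixed batch to the reward that fits its task.
    if sample["task"] == "reasoning":
        return answer_reward(sample["response"], sample["reference"])
    if sample["task"] == "grounding":
        return iou(sample["response_box"], sample["reference_box"])
    if sample["task"] == "preference":
        return placeholder_preference_score(sample["response"])
    raise ValueError(f"unknown task: {sample['task']}")


batch = [
    {"task": "reasoning", "response": "The sum is 12.\nAnswer: 12", "reference": "12"},
    {"task": "grounding", "response_box": (0, 0, 10, 10), "reference_box": (5, 0, 15, 10)},
    {"task": "preference", "response": "Here is a concise summary."},
]
rewards = [reward(sample) for sample in batch]
print(rewards)
```

Routing by task type keeps every reward on a comparable 0-to-1 scale within one batch, which is one simple way to let a single policy update see verifiable and preference-based signals together.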

Comprehensive evaluation across 50 tasks demonstrates MiMo-VL-7B's state-of-the-art performance among open-source models. On general vision-language tasks, MiMo-VL-7B-SFT and MiMo-VL-7B-RL obtain 64.6% and 66.7% on MMMU-val, respectively, outperforming larger models such as Gemma 3 27B. For document understanding, MiMo-VL-7B-RL scores 56.5% on CharXiv-RQ, exceeding Qwen2.5-VL by 14.0 points and InternVL3 by 18.9 points. In multimodal reasoning tasks, both the RL and SFT models substantially outperform open-source baselines, with MiMo-VL-7B-SFT even surpassing much larger models, including Qwen2.5-VL-72B and QVQ-72B-Preview. The RL variant achieves further improvements, raising MathVision accuracy from 57.9% to 60.4%.
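The margins quoted above imply baseline scores that the article does not state directly. A short calculation recovers them, along with the SFT-to-RL gains. The derived baseline values are inferred from the reported margins, not independently reported figures.

```python
# Scores and margins as quoted in the article (percent / percentage points).
charxiv_rl = 56.5
margins = {"Qwen2.5-VL": 14.0, "InternVL3": 18.9}

# Implied baseline = MiMo-VL-7B-RL score minus the reported margin.
implied_baselines = {name: round(charxiv_rl - gap, 1) for name, gap in margins.items()}

# SFT -> RL gains on the two benchmarks where both scores are quoted.
mmmu_gain = round(66.7 - 64.6, 1)
mathvision_gain = round(60.4 - 57.9, 1)

print(implied_baselines)  # {'Qwen2.5-VL': 42.5, 'InternVL3': 37.6}
print(mmmu_gain, mathvision_gain)  # 2.1 2.5
```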

MiMo-VL-7B demonstrates exceptional GUI understanding and grounding capabilities, with the RL model outperforming all compared general VLMs and achieving comparable or superior performance to GUI-specialized models on challenging benchmarks like Screenspot-Pro and OSWorld-G. The model achieves the highest Elo rating among all evaluated open-source VLMs, ranking first across models spanning 7B to 72B parameters and closely approaching proprietary models like Claude 3.7 Sonnet. MORL provides a significant 22+ point boost to the SFT model, validating the effectiveness of the training methodology and highlighting the competitive capability of this general-purpose VLM approach.
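For a sense of scale, and assuming the 22+ point boost refers to the Elo rating mentioned in the same paragraph, a rating gap can be turned into an expected head-to-head preference rate. The sketch below uses the conventional Elo formula with a 400-point scale. The article does not say how its Elo ratings were computed, so treat this as an interpretation aid under that assumption.

```python
def elo_expected_score(rating_diff: float, scale: float = 400.0) -> float:
    # Conventional Elo expectation for the higher-rated side;
    # the 400-point scale is an assumption, not taken from the article.
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))


p = elo_expected_score(22.0)
print(f"{p:.3f}")
```

Under these assumptions, a 22-point gap corresponds to the RL model being preferred in roughly 53% of pairwise comparisons against the SFT model, a modest but consistent edge.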

In conclusion, the researchers introduced the MiMo-VL-7B models, which achieve state-of-the-art performance through curated, high-quality pre-training datasets and the MORL framework. Key development insights include consistent performance gains from incorporating reasoning data in later pre-training stages, the advantages of on-policy RL over vanilla GRPO, and the challenges of task interference when applying MORL across diverse capabilities. The researchers have open-sourced their comprehensive evaluation suite to promote transparency and reproducibility in multimodal research. This work advances capable open-source vision-language models and provides valuable insights for the community.


Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post MiMo-VL-7B: A Powerful Vision-Language Model to Enhance General Visual Understanding and Multimodal Reasoning appeared first on MarkTechPost.
