MarkTechPost@AI · July 18, 10:37
Mirage: Multimodal Reasoning in VLMs Without Rendering Images

This article introduces Mirage, a new framework that addresses a limitation of vision-language models (VLMs): when a task requires visual thinking, they still rely solely on textual reasoning. Inspired by human mental imagery, Mirage lets VLMs interleave compact visual cues into their text outputs without generating full images. The framework is trained in two stages: first, compressed visual features (called latent tokens) are grounded in the reasoning process using helper images and joint supervision; then this constraint is relaxed, allowing the model to generate its own latent tokens and use them to guide reasoning, enabling interleaved multimodal reasoning. Finally, the model is further refined with reinforcement learning. Evaluations on spatial reasoning tasks show that Mirage significantly outperforms text-only and multimodal baselines, demonstrating its effectiveness at improving model understanding and accuracy.

💡 The core innovation of Mirage is to mimic the human process of mental imagery: the VLM reasons through internally generated, task-relevant, compact visual cues, overcoming the bottleneck of text-only understanding. This avoids the computational cost and training difficulty of generating full images while remaining effective on complex multimodal tasks.

🔧 Mirage is trained in two key stages. In the first stage, the model anchors compressed visual features (latent tokens) to the reasoning flow through helper images and joint supervision, ensuring the visual information is properly grounded. In the second stage, the model is allowed to freely generate and use these latent tokens to guide its reasoning, achieving the interleaved integration of multimodal information that underlies its distinctive reasoning ability.

🚀 In evaluations on spatial reasoning tasks such as visual puzzles and geometry problems, Mirage performs strongly, consistently outperforming text-only models and existing multimodal models. Even on tasks that demand complex planning, such as maze solving, Mirage delivers strong results, demonstrating its effectiveness at improving accuracy and reasoning ability while remaining robust even with smaller datasets and model sizes.

⚠️ Although Mirage marks notable progress in multimodal reasoning, challenges remain, including extending the approach to a broader range of tasks and further improving the quality of the synthetic data used for training. Progress on these fronts will help fully unlock the potential of VLMs for visual reasoning.

While VLMs are strong at understanding both text and images, they often rely solely on text when reasoning, limiting their ability to solve tasks that require visual thinking, such as spatial puzzles. People naturally visualize solutions rather than describing every detail, but VLMs struggle to do the same. Although some recent models can generate both text and images, training them for image generation often weakens their ability to reason. Producing images also doesn’t support step-by-step visual reasoning. As a result, unlocking the full potential of VLMs for complex, visually grounded thinking remains a key challenge in the field. 

CoT prompting encourages models to reason through problems step by step using examples with intermediate explanations. This idea has been extended to multimodal tasks, where visual information is integrated into the reasoning flow. Methods like ICoT embed image regions within text sequences, whereas Visual CoT utilizes visual annotations to train models for improved spatial understanding. Some recent models can generate both text and images simultaneously; however, they require heavy supervision and incur high computational costs. Separately, researchers are exploring ways to embed reasoning internally within models by guiding their hidden states, using special tokens or latent representations instead of explicit reasoning steps. 

Researchers from the University of Massachusetts Amherst and MIT propose an approach inspired by how humans use mental imagery, which involves forming simple, task-relevant visuals internally while thinking. They introduce Mirage, a framework that enables VLMs to interleave visual reasoning directly into their text outputs without generating full images. Instead, the model inserts compact visual cues derived from its hidden states. It’s trained in two phases: first with both text and visual supervision, then with text-only guidance. Reinforcement learning further refines its reasoning skills. Mirage enables VLMs to think more like humans, thereby improving their performance on complex, multimodal tasks. 
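To make the interleaving concrete, here is a minimal decoding sketch, assuming a Hugging Face-style causal VLM that exposes hidden states. The special LATENT_TOKEN_ID trigger, the project_to_latent head, and the hidden size are illustrative assumptions, not the authors' actual API.

```python
import torch

# Illustrative only: a greedy decoding loop where, instead of rendering an
# image, the model inserts a compact visual cue derived from its own hidden
# state whenever it emits a (hypothetical) special "think visually" token.

LATENT_TOKEN_ID = 32001                            # assumed id of the latent-cue token
project_to_latent = torch.nn.Linear(4096, 4096)    # assumed projection head, hidden dim 4096

def interleaved_decode(model, tokenizer, input_ids, max_steps=256):
    embeds = model.get_input_embeddings()(input_ids)            # (1, T, d)
    for _ in range(max_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        next_id = out.logits[:, -1].argmax(dim=-1)               # greedy for brevity
        if next_id.item() == LATENT_TOKEN_ID:
            # Compact visual cue: project the last hidden state and feed it
            # back as the next "token" embedding -- no pixels are generated.
            cue = project_to_latent(out.hidden_states[-1][:, -1:, :])
            embeds = torch.cat([embeds, cue], dim=1)
        elif next_id.item() == tokenizer.eos_token_id:
            break
        else:
            tok = model.get_input_embeddings()(next_id).unsqueeze(1)   # (1, 1, d)
            embeds = torch.cat([embeds, tok], dim=1)
    return embeds
```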

Mirage is a framework inspired by human mental imagery that enables VLMs to reason using compact visual cues instead of generating full images. It employs two training stages: first, it grounds compressed visual features, known as latent tokens, within the reasoning process using helper images and joint supervision. Then, it relaxes this constraint, allowing the model to generate its latent tokens and use them to guide reasoning. This setup enables interleaved multimodal reasoning. A final reinforcement learning stage further fine-tunes the model using accuracy and formatting rewards, encouraging both correct answers and structured thought processes. 
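The two stages and the final reward can be summarized with the loss sketch below. It assumes latent_pred are the model's predicted latent tokens and helper_feats are compressed features of the auxiliary helper images; the exact loss terms and weights are assumptions for illustration, not the paper's formulation.

```python
import torch.nn.functional as F

def stage1_loss(text_logits, text_targets, latent_pred, helper_feats, alpha=1.0):
    """Stage 1 (assumed form): text cross-entropy plus an alignment term that
    grounds predicted latent tokens to compressed helper-image features."""
    ce = F.cross_entropy(text_logits.view(-1, text_logits.size(-1)),
                         text_targets.view(-1))
    align = 1.0 - F.cosine_similarity(latent_pred, helper_feats, dim=-1).mean()
    return ce + alpha * align

def stage2_loss(text_logits, text_targets):
    """Stage 2: the grounding constraint is dropped; only the text is
    supervised and latent tokens evolve freely to support reasoning."""
    return F.cross_entropy(text_logits.view(-1, text_logits.size(-1)),
                           text_targets.view(-1))

def rl_reward(pred_answer, gold_answer, is_well_formatted):
    """Final RL stage: reward correct answers and a structured thought format
    (the 0.5 weighting is an illustrative choice)."""
    return float(pred_answer == gold_answer) + 0.5 * float(is_well_formatted)
```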

The study evaluates the model on four spatial reasoning tasks, such as visual puzzles and geometry problems, using a small dataset of 1,000 training samples. To support reasoning, it generates synthetic helper images and thought steps, mimicking how humans use sketches and cues to facilitate thought processes. The model consistently outperforms both text-only and multimodal baselines, even in tasks that require extensive planning, such as maze solving. A smaller version of the model also yields strong results, demonstrating that the method is robust. Ablation studies confirm that grounding latent visual tokens first, followed by flexible training, is key. Overall, interleaving visual and text reasoning without real images boosts both understanding and accuracy. 
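The article does not detail the data pipeline, but purely as a toy illustration, a synthetic training sample pairing a question with a helper "sketch" and intermediate thought steps might look like the following; every name and format here is a hypothetical stand-in.

```python
import numpy as np

def make_maze_sample():
    """Toy synthetic sample: a maze question, a rasterized helper image,
    hand-written thought steps, and the answer (illustrative only)."""
    grid = np.array([[0, 0, 1],
                     [1, 0, 1],
                     [1, 0, 0]])                              # 0 = free cell, 1 = wall
    path = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]           # known solution
    helper_image = np.kron(1 - grid, np.ones((8, 8)))         # coarse raster as a stand-in sketch
    thoughts = [f"step {i}: move to cell {cell}" for i, cell in enumerate(path)]
    return {"question": "Find a path from the top-left to the bottom-right cell.",
            "helper_image": helper_image,
            "thoughts": thoughts,
            "answer": " -> ".join(str(c) for c in path)}

sample = make_maze_sample()
print(sample["thoughts"][0], "...", sample["answer"])
```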

In conclusion, inspired by how humans use mental imagery to reason, the study introduces a lightweight approach that lets VLMs think visually, without ever generating actual images. By interleaving compact visual cues with text during decoding, the model learns to reason multimodally through a two-phase training process: first, anchoring these cues to real image features, then allowing them to evolve freely to support reasoning. A final reinforcement learning step sharpens performance. Tested on spatial reasoning tasks, the method consistently outperforms traditional text-only models. However, challenges remain in scaling to other tasks and improving the quality of the synthetic training data. 


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post Mirage: Multimodal Reasoning in VLMs Without Rendering Images appeared first on MarkTechPost.


Related tags: Mirage, Vision-Language Models, Multimodal Reasoning, Mental Imagery, AI