MarkTechPost@AI · July 18, 10:50
GLM-4.1V-Thinking: Advancing General-Purpose Multimodal Understanding and Reasoning

GLM-4.1V-Thinking is an advanced vision-language model (VLM) proposed by researchers from Zhipu AI and Tsinghua University, designed to advance general-purpose multimodal understanding and reasoning. By introducing Reinforcement Learning with Curriculum Sampling (RLCS), the model achieves notable gains across STEM problem solving, video understanding, content recognition, coding, grounding, GUI agents, and long document understanding. The open-sourced GLM-4.1V-9B-Thinking sets a new benchmark among models of comparable parameter scale and delivers performance on par with, and in some cases better than, closed-source models such as GPT-4o on a range of challenging tasks, with particularly clear advantages in long document understanding and STEM reasoning.

✨ **GLM-4.1V-Thinking's innovative architecture and techniques**: The model's core components are the AIMv2-Huge vision encoder and a GLM language model. It replaces the original 2D convolutions with 3D convolutions for temporal downsampling and adopts 2D-RoPE to support images of arbitrary resolution and aspect ratio, handling extreme aspect ratios beyond 200:1 and resolutions above 4K. RoPE is further extended to 3D-RoPE to strengthen spatial understanding in multimodal contexts, and temporal modeling for video is improved by adding a time index token after each frame and encoding timestamps.

🚀 **Advanced pre-training and fine-tuning strategy**: During pre-training, GLM-4.1V-Thinking combines large academic corpora with knowledge-rich interleaved image-text data, while retaining pure text data to preserve the model's core language capabilities, yielding higher pass@k performance than models of similar scale. The subsequent supervised fine-tuning stage uses a carefully curated long-CoT corpus so the model can produce long, coherent reasoning chains, covering both verifiable tasks such as STEM problem solving and non-verifiable tasks such as instruction following. Finally, in the RL phase, RLVR and RLHF are used for large-scale training across domains including STEM problem solving, grounding, OCR, GUI agents, and multimodal coding.

🏆 **Strong performance across a broad range of tasks**: GLM-4.1V-9B-Thinking outperforms all open-source models under 10B parameters on general visual question answering (VQA) and leads on STEM benchmarks including MMMU_Val, MMMU_Pro, VideoMMMU, and AI2D. In OCR and chart understanding, it sets new records on ChartQAPro and ChartMuseum. For long document understanding, it outperforms all other models on MMLongBench and establishes new state-of-the-art results on GUI agent and multimodal coding tasks. It also shows strong video understanding, leading on benchmarks such as VideoMME, MMVU, and MotionBench.

⚠️ **Limitations and future directions**: Despite its major advances in general-purpose multimodal reasoning, GLM-4.1V-Thinking still faces challenges, including inconsistent reasoning-quality gains from reinforcement learning, instability during training, and difficulties with complex cases. Future work should focus on better supervision and evaluation of model reasoning, developing reward models that can assess intermediate reasoning steps and detect hallucinations and logical inconsistencies. Exploring strategies to prevent reward hacking in subjective evaluation tasks is likewise crucial on the path toward general-purpose intelligence.

Vision-language models (VLMs) play a crucial role in today's intelligent systems by enabling detailed understanding of visual content. The complexity of multimodal intelligence tasks has grown, ranging from scientific problem solving to the development of autonomous agents. Demands on VLMs now go far beyond simple visual perception, with attention increasingly shifting toward advanced reasoning. While recent work shows that long-form reasoning and scalable RL significantly enhance LLMs' problem-solving abilities, efforts to improve VLM reasoning still focus mainly on specific domains. The open-source community currently lacks a multimodal reasoning model that outperforms traditional non-thinking models of comparable parameter scale across diverse tasks.

Researchers from Zhipu AI and Tsinghua University have proposed GLM-4.1V-Thinking, a VLM designed to advance general-purpose multimodal understanding and reasoning. The approach introduces Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the model's full potential, enabling improvements across STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. The researchers open-sourced GLM-4.1V-9B-Thinking, which sets a new benchmark among similarly sized models and delivers competitive, and in some cases superior, performance compared to proprietary models like GPT-4o on challenging tasks such as long document understanding and STEM reasoning.
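
Because the 9B checkpoint is open-sourced, it can in principle be queried with standard Hugging Face tooling. The snippet below is a minimal sketch, assuming the weights are published under a repository ID such as `THUDM/GLM-4.1V-9B-Thinking` and are compatible with the generic `image-text-to-text` pipeline; the exact repository name, model class, and chat template are assumptions rather than details from the article.

```python
# Hedged sketch: querying the open-sourced model through the generic
# Hugging Face "image-text-to-text" pipeline. The repository ID and
# pipeline compatibility are assumptions, not confirmed by the article.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="THUDM/GLM-4.1V-9B-Thinking",  # assumed repository ID
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
            {"type": "text", "text": "What trend does this chart show? Think step by step."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=512)
# Depending on the transformers version, generated_text may contain the
# full conversation (including the user turn) or only the model's reply.
print(out[0]["generated_text"])
```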

GLM-4.1V-Thinking contains three core components: a vision encoder, an MLP adapter, and an LLM decoder. It uses AIMv2-Huge as the vision encoder and GLM as the LLM, replacing the original 2D convolutions with 3D convolutions for temporal downsampling. The model integrates 2D-RoPE to support arbitrary image resolutions and aspect ratios, and it can process extreme aspect ratios over 200:1 and resolutions beyond 4K. The researchers extend RoPE to 3D-RoPE in the LLM to improve spatial understanding in multimodal contexts. For temporal modeling in videos, a time index token is added after each frame's tokens, with timestamps encoded as strings to help the model understand real-world temporal gaps between frames.
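
To make the temporal design concrete, here is a small PyTorch sketch of the two ideas just described: a 3D convolution that downsamples along the time axis, and timestamp tokens inserted after each frame's visual tokens. The channel sizes, kernel shape, and `<t=...>` timestamp format are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of temporal downsampling and timestamp insertion.
# Shapes, channel counts, and the timestamp token format are assumptions.
import torch
import torch.nn as nn

# (1) Temporal downsampling: a stride of 2 along the time axis halves the
# number of frames while leaving the spatial patch grid untouched.
temporal_downsample = nn.Conv3d(
    in_channels=1024, out_channels=1024,
    kernel_size=(2, 1, 1), stride=(2, 1, 1),
)
video_feats = torch.randn(1, 1024, 16, 24, 24)   # (B, C, T, H, W) patch features
print(temporal_downsample(video_feats).shape)    # -> torch.Size([1, 1024, 8, 24, 24])

# (2) Time index tokens: after each frame's visual tokens, append that
# frame's timestamp encoded as a plain string so the LLM can infer the
# real-world temporal gaps between sampled frames.
def interleave_timestamps(frame_token_ids, timestamps_s, tokenizer):
    """frame_token_ids: one list of visual token IDs per frame;
    timestamps_s: the corresponding frame timestamps in seconds."""
    sequence = []
    for ids, t in zip(frame_token_ids, timestamps_s):
        sequence.extend(ids)
        sequence.extend(tokenizer.encode(f"<t={t:.2f}s>", add_special_tokens=False))
    return sequence
```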

During pre-training, the researchers use a variety of datasets, combining large academic corpora with knowledge-rich interleaved image-text data. By including pure text data, the model's core language capabilities are preserved, resulting in better pass@k performance than other state-of-the-art pre-trained base models of similar size. The supervised fine-tuning stage transforms the base VLM into one capable of long CoT inference, using a curated long-CoT corpus that spans verifiable tasks, such as STEM problems, and non-verifiable tasks, such as instruction following. Finally, the RL phase employs a combination of RLVR and RLHF to conduct large-scale training across all multimodal domains, including STEM problem solving, grounding, optical character recognition, GUI agents, and more.
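
Two ingredients of this RL phase can be sketched in a few lines: a verifiable reward that scores a rollout by checking its final answer against a gold label, and a curriculum sampler, in the spirit of RLCS, that favors task pools the model currently solves at an intermediate rate. The `<answer>` tag format, the pass-rate weighting rule, and the task names below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a verifiable reward (RLVR) and a curriculum sampler
# (in the spirit of RLCS). Answer format, weighting rule, and task names
# are assumptions made for illustration only.
import random
import re

def verifiable_reward(rollout: str, gold_answer: str) -> float:
    """Score 1.0 if the rollout's final <answer>...</answer> matches the gold label."""
    match = re.search(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def curriculum_sample(task_pools, pass_rates, k):
    """Sample k prompts, up-weighting pools whose current pass rate is near 0.5,
    i.e. tasks that are neither trivially solved nor hopeless."""
    names = list(task_pools)
    weights = [max(1e-3, pass_rates[n] * (1.0 - pass_rates[n])) for n in names]
    picks = random.choices(names, weights=weights, k=k)
    return [random.choice(task_pools[name]) for name in picks]

# Toy usage with made-up task pools and pass rates.
pools = {"stem": ["stem_prompt_1", "stem_prompt_2"], "ocr": ["ocr_prompt_1"], "gui": ["gui_prompt_1"]}
rates = {"stem": 0.45, "ocr": 0.92, "gui": 0.10}
print(curriculum_sample(pools, rates, k=4))
print(verifiable_reward("thinking ... <answer>42</answer>", "42"))  # -> 1.0
```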

GLM-4.1V-9B-Thinking outperforms all competing open-source models under 10B parameters on General VQA tasks covering both single-image and multi-image settings. It achieves the highest performance on challenging STEM benchmarks, including MMMU_Val, MMMU_Pro, VideoMMMU, and AI2D. In the OCR and chart domains, the model sets new state-of-the-art scores on ChartQAPro and ChartMuseum. For long document understanding, GLM-4.1V-9B-Thinking outperforms all other models on MMLongBench, while establishing new state-of-the-art results in GUI agent and multimodal coding tasks. Lastly, the model shows robust video understanding, outperforming competing models on the VideoMME, MMVU, and MotionBench benchmarks.

In conclusion, the researchers introduced GLM-4.1V-Thinking, which represents a step toward general-purpose multimodal reasoning. The 9B-parameter model outperforms substantially larger models, including some exceeding 70B parameters. However, several limitations remain, such as inconsistent improvements in reasoning quality through RL, instability during training, and difficulties with complex cases. Future development should focus on improving supervision and evaluation of model reasoning, with reward models that evaluate intermediate reasoning steps and detect hallucinations and logical inconsistencies. Moreover, exploring strategies to prevent reward hacking in subjective evaluation tasks is crucial for achieving general-purpose intelligence.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.



