MarkTechPost@AI · July 18, 10:50
GLM-4.1V-Thinking: Advancing General-Purpose Multimodal Understanding and Reasoning

GLM-4.1V-Thinking is an advanced vision-language model (VLM) proposed by researchers from Zhipu AI and Tsinghua University, designed to advance general-purpose multimodal understanding and reasoning. By introducing Reinforcement Learning with Curriculum Sampling (RLCS), the model achieves notable gains across STEM problem solving, video understanding, content recognition, coding, grounding, GUI agents, and long document understanding. The open-sourced GLM-4.1V-9B-Thinking sets a new benchmark among models of comparable parameter scale and delivers performance on par with, and in some cases better than, closed-source models such as GPT-4o on a range of challenging tasks, with particularly clear advantages in long document understanding and STEM reasoning.

✨ **GLM-4.1V-Thinking's innovative architecture and techniques**: The model's core components are the AIMv2-Huge vision encoder and a GLM language model. It replaces the original 2D convolutions with 3D convolutions for temporal downsampling and adopts 2D-RoPE to support images of arbitrary resolution and aspect ratio, handling extreme aspect ratios beyond 200:1 and resolutions above 4K. RoPE is further extended to 3D-RoPE to strengthen spatial understanding in multimodal contexts, and temporal modeling for video is improved by adding a time index token after each frame and encoding timestamps.

🚀 **Advanced pre-training and fine-tuning strategy**: During pre-training, GLM-4.1V-Thinking combines large academic corpora with knowledge-rich interleaved image-text data, while retaining pure text data to preserve the model's core language capabilities, yielding higher pass@k performance than models of similar scale. The subsequent supervised fine-tuning stage uses a carefully curated long-CoT corpus so the model can produce long, coherent reasoning chains, covering both verifiable tasks such as STEM problem solving and non-verifiable tasks such as instruction following. Finally, in the RL phase, RLVR and RLHF are used for large-scale training across domains including STEM problem solving, grounding, OCR, GUI agents, and multimodal coding.

🏆 **Strong performance across a broad range of tasks**: GLM-4.1V-9B-Thinking outperforms all open-source models under 10B parameters on general visual question answering (VQA) and leads on STEM benchmarks including MMMU_Val, MMMU_Pro, VideoMMMU, and AI2D. In OCR and chart understanding, it sets new records on ChartQAPro and ChartMuseum. For long document understanding, it outperforms all other models on MMLongBench and establishes new state-of-the-art results on GUI agent and multimodal coding tasks. It also shows strong video understanding, leading on benchmarks such as VideoMME, MMVU, and MotionBench.

⚠️ **Limitations and future directions**: Despite its major advances in general-purpose multimodal reasoning, GLM-4.1V-Thinking still faces challenges, including inconsistent reasoning-quality gains from reinforcement learning, instability during training, and difficulties with complex cases. Future work should focus on better supervision and evaluation of model reasoning, developing reward models that can assess intermediate reasoning steps and detect hallucinations and logical inconsistencies. Exploring strategies to prevent reward hacking in subjective evaluation tasks is likewise crucial on the path toward general-purpose intelligence.

Vision-language models (VLMs) play a crucial role in today's intelligent systems by enabling detailed understanding of visual content. The complexity of multimodal intelligence tasks has grown, ranging from scientific problem solving to the development of autonomous agents. Demands on VLMs now go far beyond simple visual perception, with attention increasingly shifting toward advanced reasoning. While recent work shows that long-form reasoning and scalable RL significantly enhance LLMs' problem-solving abilities, efforts to improve VLM reasoning still focus mainly on specific domains. The open-source community currently lacks a multimodal reasoning model that outperforms traditional non-thinking models of comparable parameter scale across diverse tasks.

Researchers from Zhipu AI and Tsinghua University have proposed GLM-4.1V-Thinking, a VLM designed to advance general-purpose multimodal understanding and reasoning. The approach introduces Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the model's full potential, enabling improvements across STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. The researchers open-sourced GLM-4.1V-9B-Thinking, which sets a new benchmark among similarly sized models and delivers competitive, and in some cases superior, performance compared to proprietary models like GPT-4o on challenging tasks such as long document understanding and STEM reasoning.
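
Because the 9B checkpoint is open-sourced, it can in principle be queried with standard Hugging Face tooling. The snippet below is a minimal sketch, assuming the weights are published under a repository ID such as `THUDM/GLM-4.1V-9B-Thinking` and are compatible with the generic `image-text-to-text` pipeline; the exact repository name, model class, and chat template are assumptions rather than details from the article.

```python
# Hedged sketch: querying the open-sourced model through the generic
# Hugging Face "image-text-to-text" pipeline. The repository ID and
# pipeline compatibility are assumptions, not confirmed by the article.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="THUDM/GLM-4.1V-9B-Thinking",  # assumed repository ID
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
            {"type": "text", "text": "What trend does this chart show? Think step by step."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=512)
# Depending on the transformers version, generated_text may contain the
# full conversation (including the user turn) or only the model's reply.
print(out[0]["generated_text"])
```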

GLM-4.1V-Thinking contains three core components: a vision encoder, an MLP adapter, and an LLM decoder. It uses AIMv2-Huge as the vision encoder and GLM as the LLM, replacing the original 2D convolutions with 3D convolutions for temporal downsampling. The model integrates 2D-RoPE to support arbitrary image resolutions and aspect ratios, and it can process extreme aspect ratios over 200:1 and resolutions beyond 4K. The researchers extend RoPE to 3D-RoPE in the LLM to improve spatial understanding in multimodal contexts. For temporal modeling in videos, a time index token is added after each frame's tokens, with timestamps encoded as strings to help the model understand real-world temporal gaps between frames.
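
To make the temporal design concrete, here is a small PyTorch sketch of the two ideas just described: a 3D convolution that downsamples along the time axis, and timestamp tokens inserted after each frame's visual tokens. The channel sizes, kernel shape, and `<t=...>` timestamp format are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of temporal downsampling and timestamp insertion.
# Shapes, channel counts, and the timestamp token format are assumptions.
import torch
import torch.nn as nn

# (1) Temporal downsampling: a stride of 2 along the time axis halves the
# number of frames while leaving the spatial patch grid untouched.
temporal_downsample = nn.Conv3d(
    in_channels=1024, out_channels=1024,
    kernel_size=(2, 1, 1), stride=(2, 1, 1),
)
video_feats = torch.randn(1, 1024, 16, 24, 24)   # (B, C, T, H, W) patch features
print(temporal_downsample(video_feats).shape)    # -> torch.Size([1, 1024, 8, 24, 24])

# (2) Time index tokens: after each frame's visual tokens, append that
# frame's timestamp encoded as a plain string so the LLM can infer the
# real-world temporal gaps between sampled frames.
def interleave_timestamps(frame_token_ids, timestamps_s, tokenizer):
    """frame_token_ids: one list of visual token IDs per frame;
    timestamps_s: the corresponding frame timestamps in seconds."""
    sequence = []
    for ids, t in zip(frame_token_ids, timestamps_s):
        sequence.extend(ids)
        sequence.extend(tokenizer.encode(f"<t={t:.2f}s>", add_special_tokens=False))
    return sequence
```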

During pre-training, the researchers use a variety of datasets, combining large academic corpora with knowledge-rich interleaved image-text data. By including pure text data, the model's core language capabilities are preserved, resulting in better pass@k performance than other state-of-the-art pre-trained base models of similar size. The supervised fine-tuning stage transforms the base VLM into one capable of long CoT inference, using a curated long-CoT corpus that spans verifiable tasks, such as STEM problems, and non-verifiable tasks, such as instruction following. Finally, the RL phase employs a combination of RLVR and RLHF to conduct large-scale training across all multimodal domains, including STEM problem solving, grounding, optical character recognition, GUI agents, and more.
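
Two ingredients of this RL phase can be sketched in a few lines: a verifiable reward that scores a rollout by checking its final answer against a gold label, and a curriculum sampler, in the spirit of RLCS, that favors task pools the model currently solves at an intermediate rate. The `<answer>` tag format, the pass-rate weighting rule, and the task names below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a verifiable reward (RLVR) and a curriculum sampler
# (in the spirit of RLCS). Answer format, weighting rule, and task names
# are assumptions made for illustration only.
import random
import re

def verifiable_reward(rollout: str, gold_answer: str) -> float:
    """Score 1.0 if the rollout's final <answer>...</answer> matches the gold label."""
    match = re.search(r"<answer>(.*?)</answer>", rollout, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def curriculum_sample(task_pools, pass_rates, k):
    """Sample k prompts, up-weighting pools whose current pass rate is near 0.5,
    i.e. tasks that are neither trivially solved nor hopeless."""
    names = list(task_pools)
    weights = [max(1e-3, pass_rates[n] * (1.0 - pass_rates[n])) for n in names]
    picks = random.choices(names, weights=weights, k=k)
    return [random.choice(task_pools[name]) for name in picks]

# Toy usage with made-up task pools and pass rates.
pools = {"stem": ["stem_prompt_1", "stem_prompt_2"], "ocr": ["ocr_prompt_1"], "gui": ["gui_prompt_1"]}
rates = {"stem": 0.45, "ocr": 0.92, "gui": 0.10}
print(curriculum_sample(pools, rates, k=4))
print(verifiable_reward("thinking ... <answer>42</answer>", "42"))  # -> 1.0
```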

GLM-4.1V-9B-Thinking outperforms all competing open-source models under 10B parameters on General VQA tasks covering both single-image and multi-image settings. It achieves the highest performance on challenging STEM benchmarks, including MMMU_Val, MMMU_Pro, VideoMMMU, and AI2D. In the OCR and chart domains, the model sets new state-of-the-art scores on ChartQAPro and ChartMuseum. For long document understanding, GLM-4.1V-9B-Thinking outperforms all other models on MMLongBench, while establishing new state-of-the-art results in GUI agent and multimodal coding tasks. Lastly, the model shows robust video understanding, outperforming competing models on the VideoMME, MMVU, and MotionBench benchmarks.

In conclusion, the researchers introduced GLM-4.1V-Thinking, which represents a step toward general-purpose multimodal reasoning. The 9B-parameter model outperforms substantially larger models, including some exceeding 70B parameters. However, several limitations remain, such as inconsistent improvements in reasoning quality through RL, instability during training, and difficulties with complex cases. Future development should focus on improving supervision and evaluation of model reasoning, with reward models that evaluate intermediate reasoning steps and detect hallucinations and logical inconsistencies. Moreover, exploring strategies to prevent reward hacking in subjective evaluation tasks is crucial for achieving general-purpose intelligence.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.



