MarkTechPost@AI April 12, 11:45
Moonshot AI Released Kimi-VL: A Compact and Powerful Vision-Language Model Series Redefining Multimodal Reasoning, Long-Context Understanding, and High-Resolution Visual Processing

Moonshot AI has released the Kimi-VL series of vision-language models, whose compact architecture and strong performance redefine multimodal reasoning, long-context understanding, and high-resolution visual processing. Through an innovative MoE architecture, Kimi-VL handles high-resolution images and long texts effectively while remaining computationally efficient. It performs strongly across a wide range of benchmarks, especially on tasks that demand deep reasoning, such as OCR and UI understanding. The release of Kimi-VL marks a new advance in both efficiency and performance for multimodal AI.

🖼️ Kimi-VL uses an MoE architecture that activates only 2.8 billion parameters, achieving efficient use of compute while maintaining strong performance.

👁️‍🗨️ The MoonViT vision encoder processes high-resolution images natively, with no need for sub-image fragmentation, which makes it excel at tasks such as OCR and UI understanding.

📚 Kimi-VL supports a context window of up to 128K tokens, with 100% recall accuracy at 64K tokens and 87.0% accuracy at 128K on text/video tasks.

🧠 The Kimi-VL-Thinking model excels on reasoning-intensive benchmarks such as MMMU, MathVision, and MathVista, surpassing many larger VLMs.

📊 Kimi-VL scores 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, demonstrating its precision on perception-based evaluations.

⚙️ Pre-training used a total of 4.4T tokens spanning text, video, documents, and synthetic multimodal data.

🚀 Optimization relied on a customized Muon optimizer combined with memory-efficient strategies such as ZeRO-1.

Multimodal AI enables machines to process and reason across various input formats, such as images, text, videos, and complex documents. This domain has seen increased interest as traditional language models, while powerful, are inadequate when confronted with visual data or when contextual interpretation spans multiple input types. The real world is inherently multimodal, so systems that aim to assist with real-time tasks, analyze user interfaces, understand academic materials, or interpret complex scenes require intelligence that goes beyond textual reasoning. Newer models are now being developed to decode language and vision cues simultaneously, approaching tasks with improved contextual awareness, reasoning depth, and adaptability to different forms of data input.

A limitation in multimodal systems today lies in their inability to process long contexts efficiently and to generalize across high-resolution or diverse input structures without compromising performance. Many open-source models limit the input to a few thousand tokens or demand excessive computational resources to maintain performance at scale. These constraints result in models that may perform well on standard benchmarks but struggle with real-world applications that involve complex, multi-image inputs, extended dialogues, or academic tasks like OCR-based document analysis and mathematical problem-solving. There’s also a gap in reasoning ability, particularly long-horizon thinking, which prevents current systems from handling tasks that require step-by-step logic or deep contextual alignment between different data modalities.

Previous tools have attempted to address these challenges but often fell short in scalability or flexibility. The Qwen2.5-VL series and Gemma-3 models, while notable for their dense architectures, lack built-in support for reasoning through longer chains of thought. Models like DeepSeek-VL2 and Aria adopted mixture-of-experts (MoE) strategies but had fixed vision encoders that restricted their ability to adapt to various resolutions and forms of visual input. Also, these models typically supported only short context windows (4K tokens in DeepSeek-VL2) and had limited success in complex OCR or multi-image scenarios. As such, most existing systems failed to balance low resource consumption with the ability to tackle tasks involving long context and diverse visual data.

Researchers at Moonshot AI introduced Kimi-VL, a novel vision-language model utilizing an MoE architecture. The system activates only 2.8 billion parameters in its decoder, making it significantly lighter than many competitors while maintaining powerful multimodal capabilities. Two models based on this architecture have been released on Hugging Face: Kimi-VL-A3B-Instruct and Kimi-VL-A3B-Thinking. The model incorporates a native-resolution visual encoder named MoonViT and supports context windows of up to 128K tokens. It has three integrated components: the MoonViT encoder, an MLP projector that maps visual features to language embeddings, and the Moonlight MoE decoder. Researchers further developed an advanced version, Kimi-VL-Thinking, designed specifically for long-horizon reasoning tasks through chain-of-thought supervised fine-tuning and reinforcement learning. Together, these models aim to redefine efficiency benchmarks in vision-language reasoning.
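
For readers who want to experiment with the released checkpoints, below is a minimal loading sketch using the Hugging Face transformers library. The repository id, chat-message format, image path, and generation settings are assumptions based on common vision-language workflows rather than the official recipe, so the model cards remain the authoritative reference.

```python
# Minimal sketch: loading the released Instruct checkpoint from Hugging Face.
# Repo id, message schema, and generation settings are assumed, not official.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "moonshotai/Kimi-VL-A3B-Instruct"  # assumed repo id from the release naming

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("ui_screenshot.png")  # placeholder input image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what this screen shows."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated answer.
print(processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```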

The architectural innovation in Kimi-VL lies in its adaptability and processing capability. MoonViT processes high-resolution images in their original form, eliminating the need for sub-image fragmentation. To ensure spatial consistency across varied image resolutions, the model uses interpolated absolute positional embeddings combined with two-dimensional rotary positional embeddings across both height and width. These design choices allow MoonViT to preserve fine-grained detail even in large-scale image inputs. Outputs from the vision encoder are passed through a two-layer MLP that uses pixel shuffle operations to downsample spatial dimensions and convert features into LLM-compatible embeddings. On the language side, the 2.8B activated parameter MoE decoder supports 16B total parameters and integrates seamlessly with visual representations, enabling highly efficient training and inference across different input types. The entire training process used an enhanced Muon optimizer with weight decay and ZeRO-1-based memory optimization for handling the large parameter count.
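
To make the projector step concrete, the sketch below shows one common way a pixel-shuffle style spatial downsampling followed by a two-layer MLP can convert vision-encoder patch features into LLM-ready embeddings. The merge factor, hidden sizes, and activation are illustrative placeholders, not Moonshot AI's released implementation.

```python
import torch
import torch.nn as nn

class PixelShuffleMLPProjector(nn.Module):
    """Illustrative sketch: pixel-shuffle downsampling plus a two-layer MLP that
    maps vision features into the language model's embedding space."""

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 2048, scale: int = 2):
        super().__init__()
        self.scale = scale
        merged_dim = vit_dim * scale * scale  # channels grow as spatial tokens are merged
        self.mlp = nn.Sequential(
            nn.Linear(merged_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H, W, vit_dim) patch features from the vision encoder
        b, h, w, c = x.shape
        s = self.scale
        # Merge each s x s neighborhood of patches into one token, reducing the
        # number of visual tokens by s*s while widening the channel dimension.
        x = x.reshape(b, h // s, s, w // s, s, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // s) * (w // s), s * s * c)
        return self.mlp(x)  # (batch, H*W / s^2, llm_dim) tokens for the decoder

# Example: a 32x32 grid of ViT patches becomes 256 visual tokens for the LLM.
feats = torch.randn(1, 32, 32, 1024)
print(PixelShuffleMLPProjector()(feats).shape)  # torch.Size([1, 256, 2048])
```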

The training data composition reflects a focus on diverse multimodal learning. Starting with 2.0T tokens for ViT training using image-caption pairs, the team added another 0.1T to align the encoder with the decoder. Joint pre-training consumed 1.4T tokens, followed by 0.6T in cooldown and 0.3T in long-context activation, totaling 4.4T tokens. These stages included academic visual datasets, OCR samples, long video data, and synthetic mathematical and code-based QA pairs. For long-context learning, the model was progressively trained to handle sequences from 8K up to 128K tokens, using RoPE embeddings extended from a base frequency of 50,000 to 800,000. This allowed the model to maintain a token recall accuracy of 100% up to 64K tokens, with a slight drop to 87.0% at 128K, still outperforming most alternatives.
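
The long-context stage hinges on raising the RoPE base frequency, which lengthens the rotary wavelengths so that positions far beyond the original training range remain distinguishable. The sketch below illustrates that effect using the 50,000 and 800,000 base values reported for Kimi-VL; the head dimension is a placeholder assumption.

```python
import math
import torch

def rope_inverse_frequencies(head_dim: int, base: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies: theta_i = base ** (-2i / head_dim)."""
    return base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)

head_dim = 128            # placeholder; not specified by the long-context recipe itself
base_before = 50_000.0    # RoPE base before long-context activation (from the paper)
base_after = 800_000.0    # RoPE base after extension toward 128K contexts (from the paper)

freqs_before = rope_inverse_frequencies(head_dim, base_before)
freqs_after = rope_inverse_frequencies(head_dim, base_after)

# A larger base lowers every rotation frequency, so the slowest-rotating dimension
# wraps around far less often and distant positions stay distinguishable at 128K tokens.
wavelength_before = 2 * math.pi / freqs_before[-1].item()
wavelength_after = 2 * math.pi / freqs_after[-1].item()
print(f"longest wavelength @ base 50k:  {wavelength_before:,.0f} positions")
print(f"longest wavelength @ base 800k: {wavelength_after:,.0f} positions")
```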

Kimi-VL demonstrated strong results across a range of benchmarks. On the LongVideoBench, it scored 64.5; on MMLongBench-Doc, it achieved 35.1; and on the InfoVQA benchmark, it led with 83.2. On ScreenSpot-Pro, which tests understanding of UI screens, it scored 34.5. The Kimi-VL-Thinking variant excelled in reasoning-intensive benchmarks like MMMU (61.7), MathVision (36.8), and MathVista (71.3). For agent tasks such as OSWorld, the model matched or exceeded performance from larger models like GPT-4o while activating significantly fewer parameters. Its compact design and strong reasoning capabilities make it a leading candidate among open-source multimodal solutions.

Some Key Takeaways from the Research on Kimi-VL:

Kimi-VL activates only 2.8B parameters in its MoE decoder (16B total) yet matches or exceeds far larger VLMs on reasoning and agent benchmarks.

The MoonViT encoder processes high-resolution images natively, without sub-image fragmentation, which benefits OCR and UI-understanding tasks.

Context windows extend to 128K tokens, with 100% recall accuracy at 64K tokens and 87.0% at 128K.

Kimi-VL-Thinking, trained with chain-of-thought supervised fine-tuning and reinforcement learning, scores 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista.

Pre-training used 4.4T tokens spanning text, video, documents, OCR samples, and synthetic multimodal data, optimized with an enhanced Muon optimizer and ZeRO-1-based memory optimization.

Check out the Instruct Model and Reasoning Model on Hugging Face. All credit for this research goes to the researchers of this project.

