MarkTechPost@AI | August 7, 2024
MiniCPM-V 2.6: A GPT-4V Level Multimodal LLM for Single Image, Multi-Image, and Video on Your Phone

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. Built on SigLip-400M and Qwen2-7B with 8 billion parameters in total, it delivers significant performance gains and new features tailored for multi-image and video understanding, advancing substantially over its predecessor, MiniCPM-Llama3-V 2.5.

Key Features of MiniCPM-V 2.6:

- **Leading Performance:** MiniCPM-V 2.6 attains an average score of 65.2 on OpenCompass, a comprehensive evaluation across eight popular benchmarks. With its 8 billion parameters, it surpasses prominent proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.
- **Multi-Image Understanding and In-Context Learning:** Capable of conversation and reasoning over multiple images, MiniCPM-V 2.6 achieves state-of-the-art results on multi-image benchmarks including Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv. It also exhibits promising in-context learning abilities; see the image-chat sketch after this list.
- **Video Understanding:** Accepting video inputs, MiniCPM-V 2.6 provides conversation and dense captions for spatial-temporal information. It outperforms models such as GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on Video-MME, both with and without subtitles; a frame-sampling sketch also follows the list.
- **Strong OCR Capability:** Processing images with arbitrary aspect ratios at up to 1.8 million pixels, MiniCPM-V 2.6 sets a new standard on OCRBench, outperforming proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Leveraging the latest RLAIF-V and VisCPM techniques, it shows trustworthy behavior with significantly lower hallucination rates on Object HalBench, and it supports English, Chinese, German, French, Italian, and Korean.
- **Superior Efficiency:** Despite its compact size, MiniCPM-V 2.6 exhibits state-of-the-art token density, encoding a 1.8-million-pixel image into just 640 tokens (roughly 2,800 pixels per visual token), about 75% fewer than most models. This improves inference speed, first-token latency, memory usage, and power consumption, enabling efficient real-time video understanding on devices such as iPads.
- **Ease of Use:** MiniCPM-V 2.6 supports efficient CPU inference on local devices through llama.cpp and ollama, quantized models in int4 and GGUF formats in 16 sizes, high-throughput and memory-efficient inference with vLLM, domain-specific fine-tuning, quick local WebUI demos with Gradio, and an online web demo; a sketch of loading the int4 checkpoint appears after the concluding paragraph.
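
The Hugging Face checkpoint ships a custom `chat()` interface loaded with `trust_remote_code`. Below is a minimal sketch of single- and multi-image chat following the pattern on the model card; the file names are placeholders, and the exact `chat()` signature may differ between repository revisions.

```python
# Minimal sketch: single- and multi-image chat with MiniCPM-V 2.6 via
# Hugging Face transformers, following the model card's chat() pattern.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,       # the repo ships the custom chat() code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Single image: a user turn is a list mixing PIL images and strings.
image = Image.open("receipt.jpg").convert("RGB")  # placeholder path
msgs = [{"role": "user", "content": [image, "Transcribe all text in this image."]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))

# Multi-image: pass several images in the same turn.
img_a = Image.open("chart_2023.png").convert("RGB")  # placeholder paths
img_b = Image.open("chart_2024.png").convert("RGB")
msgs = [{"role": "user", "content": [img_a, img_b, "What changed between these charts?"]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```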
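
Video understanding reuses the same interface: sample frames from the clip and pass them as one multi-image turn. The sketch below follows the frame-sampling recipe from the model card, where `use_image_id=False` and a reduced `max_slice_nums` keep the visual token count manageable; the video path and frame budget are placeholders.

```python
# Sketch: video chat by uniformly sampling ~1 frame per second and sending
# the frames as a multi-image turn. Assumes `model` and `tokenizer` are
# loaded as in the previous snippet; "clip.mp4" is a placeholder path.
from PIL import Image
from decord import VideoReader, cpu  # pip install decord

MAX_NUM_FRAMES = 64  # stay within the context budget

vr = VideoReader("clip.mp4", ctx=cpu(0))
step = max(1, round(vr.get_avg_fps()))           # ~one frame per second
idx = list(range(0, len(vr), step))[:MAX_NUM_FRAMES]
frames = [Image.fromarray(f.astype("uint8")) for f in vr.get_batch(idx).asnumpy()]

msgs = [{"role": "user", "content": frames + ["Describe what happens in this video."]}]
answer = model.chat(
    image=None, msgs=msgs, tokenizer=tokenizer,
    use_image_id=False,   # per the model card, disable per-image IDs for video
    max_slice_nums=2,     # fewer slices per frame to save visual tokens
)
print(answer)
```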

MiniCPM-V 2.6 represents a significant leap in machine learning for visual understanding, offering unmatched performance, efficiency, and usability across single-image, multi-image, and video processing tasks.
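
For memory-constrained GPUs, the int4 quantized checkpoint can be loaded through the same `AutoModel` path. A minimal sketch, assuming the int4 variant is published as `openbmb/MiniCPM-V-2_6-int4` (verify the exact repo name on the Hub; it may also require `bitsandbytes` and `accelerate` to be installed):

```python
# Sketch: loading the int4-quantized checkpoint to reduce GPU memory.
# The repo name is an assumption based on the published int4 variant; the
# quantized model exposes the same chat() interface as full precision.
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6-int4"  # assumed repo name; verify before use
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```

For CPU-only deployment, the GGUF builds target llama.cpp and ollama instead, as noted in the feature list above.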


Check out the HF Model and GitHub. All credit for this research goes to the researchers of this project.
