MarkTechPost@AI | August 7, 2024
MiniCPM-V 2.6: A GPT-4V Level Multimodal LLM for Single Image, Multi-Image, and Video on Your Phone

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. Built on SigLip-400M and Qwen2-7B with 8 billion parameters in total, it delivers significant performance gains and new features tailored for multi-image and video understanding, advancing substantially over its predecessor, MiniCPM-Llama3-V 2.5.

Key Features of MiniCPM-V 2.6:

- **Leading Performance:** MiniCPM-V 2.6 attains an average score of 65.2 on OpenCompass, a comprehensive evaluation across eight popular benchmarks. With its 8 billion parameters, it surpasses prominent proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.
- **Multi-Image Understanding and In-Context Learning:** Capable of conversation and reasoning over multiple images, MiniCPM-V 2.6 achieves state-of-the-art results on multi-image benchmarks including Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv. It also exhibits promising in-context learning abilities; see the image-chat sketch after this list.
- **Video Understanding:** Accepting video inputs, MiniCPM-V 2.6 provides conversation and dense captions for spatial-temporal information. It outperforms models such as GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on Video-MME, both with and without subtitles; a frame-sampling sketch also follows the list.
- **Strong OCR Capability:** Processing images with arbitrary aspect ratios at up to 1.8 million pixels, MiniCPM-V 2.6 sets a new standard on OCRBench, outperforming proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Leveraging the latest RLAIF-V and VisCPM techniques, it shows trustworthy behavior with significantly lower hallucination rates on Object HalBench, and it supports English, Chinese, German, French, Italian, and Korean.
- **Superior Efficiency:** Despite its compact size, MiniCPM-V 2.6 exhibits state-of-the-art token density, encoding a 1.8-million-pixel image into just 640 tokens (roughly 2,800 pixels per visual token), about 75% fewer than most models. This improves inference speed, first-token latency, memory usage, and power consumption, enabling efficient real-time video understanding on devices such as iPads.
- **Ease of Use:** MiniCPM-V 2.6 supports efficient CPU inference on local devices through llama.cpp and ollama, quantized models in int4 and GGUF formats in 16 sizes, high-throughput and memory-efficient inference with vLLM, domain-specific fine-tuning, quick local WebUI demos with Gradio, and an online web demo; a sketch of loading the int4 checkpoint appears after the concluding paragraph.
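
The Hugging Face checkpoint ships a custom `chat()` interface loaded with `trust_remote_code`. Below is a minimal sketch of single- and multi-image chat following the pattern on the model card; the file names are placeholders, and the exact `chat()` signature may differ between repository revisions.

```python
# Minimal sketch: single- and multi-image chat with MiniCPM-V 2.6 via
# Hugging Face transformers, following the model card's chat() pattern.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,       # the repo ships the custom chat() code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Single image: a user turn is a list mixing PIL images and strings.
image = Image.open("receipt.jpg").convert("RGB")  # placeholder path
msgs = [{"role": "user", "content": [image, "Transcribe all text in this image."]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))

# Multi-image: pass several images in the same turn.
img_a = Image.open("chart_2023.png").convert("RGB")  # placeholder paths
img_b = Image.open("chart_2024.png").convert("RGB")
msgs = [{"role": "user", "content": [img_a, img_b, "What changed between these charts?"]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```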
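
Video understanding reuses the same interface: sample frames from the clip and pass them as one multi-image turn. The sketch below follows the frame-sampling recipe from the model card, where `use_image_id=False` and a reduced `max_slice_nums` keep the visual token count manageable; the video path and frame budget are placeholders.

```python
# Sketch: video chat by uniformly sampling ~1 frame per second and sending
# the frames as a multi-image turn. Assumes `model` and `tokenizer` are
# loaded as in the previous snippet; "clip.mp4" is a placeholder path.
from PIL import Image
from decord import VideoReader, cpu  # pip install decord

MAX_NUM_FRAMES = 64  # stay within the context budget

vr = VideoReader("clip.mp4", ctx=cpu(0))
step = max(1, round(vr.get_avg_fps()))           # ~one frame per second
idx = list(range(0, len(vr), step))[:MAX_NUM_FRAMES]
frames = [Image.fromarray(f.astype("uint8")) for f in vr.get_batch(idx).asnumpy()]

msgs = [{"role": "user", "content": frames + ["Describe what happens in this video."]}]
answer = model.chat(
    image=None, msgs=msgs, tokenizer=tokenizer,
    use_image_id=False,   # per the model card, disable per-image IDs for video
    max_slice_nums=2,     # fewer slices per frame to save visual tokens
)
print(answer)
```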

MiniCPM-V 2.6 represents a significant leap in machine learning for visual understanding, offering unmatched performance, efficiency, and usability across single-image, multi-image, and video processing tasks.
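
For memory-constrained GPUs, the int4 quantized checkpoint can be loaded through the same `AutoModel` path. A minimal sketch, assuming the int4 variant is published as `openbmb/MiniCPM-V-2_6-int4` (verify the exact repo name on the Hub; it may also require `bitsandbytes` and `accelerate` to be installed):

```python
# Sketch: loading the int4-quantized checkpoint to reduce GPU memory.
# The repo name is an assumption based on the published int4 variant; the
# quantized model exposes the same chat() interface as full precision.
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6-int4"  # assumed repo name; verify before use
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```

For CPU-only deployment, the GGUF builds target llama.cpp and ollama instead, as noted in the feature list above.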


Check out the HF Model and GitHub. All credit for this research goes to the researchers of this project.
