DeepSeek Releases Open-Source Multimodal AI Model Janus-Pro, Surpassing DALL-E 3 and Stable Diffusion

TMTPOST -- In the early hours of Tuesday, the AI community was abuzz as Hugging Face announced the release of DeepSeek's latest open-source multimodal AI model, Janus-Pro. Available in two configurations with 1 billion and 7 billion parameters, the model marks a significant leap in AI capabilities.

The Janus-Pro-7B model outperformed OpenAI's DALL-E 3 and Stable Diffusion on the text-to-image benchmarks GenEval and DPG-Bench, underlining its strength in image generation alongside its image-understanding capabilities.

Janus-Pro combines several recent advances in multimodal AI. Its ability to process and understand images is powered by the SigLIP-L vision encoder, while its image generation draws on LlamaGen. The two model sizes cater to a range of computational needs.

The launch comes at a time when the highly anticipated image-generation capabilities of OpenAI's multimodal model GPT-4o remain unavailable to the public, adding to the excitement surrounding Janus-Pro's open-source debut.

DeepSeek has been at the forefront of multimodal generative AI research. The company launched its original Janus model in late 2024 as a unified framework for understanding and generating multimodal content. Janus was built on DeepSeek-LLM-1.3b-base, a language model trained on a corpus of roughly 500 billion text tokens. Its design decoupled visual encoding to serve both understanding and generation tasks, employing techniques such as SigLIP-L for visual input and rectified flow for image generation.

This progress culminated in Janus-Pro, an enhanced autoregressive framework with significant architectural refinements. By decoupling visual encoding into independent pathways, Janus-Pro eases the conflict between the visual encoder's roles in understanding and generation while retaining a single, unified Transformer architecture. This modularity improves flexibility and task-specific performance.
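The idea of separate visual pathways feeding one shared model can be illustrated with a toy sketch. This is not code from the Janus-Pro repository: every name below (`encode_for_understanding`, `encode_for_generation`, `shared_backbone`, `EMBED_DIM`) is hypothetical, the encoders are trivial stand-ins for a semantic encoder and a discrete image tokenizer, and the backbone is a placeholder rather than a Transformer.

```python
from typing import Callable, Dict, List

EMBED_DIM = 4  # toy embedding width shared by both pathways (assumed value)


def encode_for_understanding(pixels: List[float]) -> List[List[float]]:
    """Toy stand-in for a semantic encoder: one continuous vector per pixel."""
    return [[p] * EMBED_DIM for p in pixels]


def encode_for_generation(pixels: List[float], codebook_size: int = 8) -> List[List[float]]:
    """Toy stand-in for a discrete tokenizer: quantize to an id, then embed the id."""
    ids = [min(int(p * codebook_size), codebook_size - 1) for p in pixels]
    return [[float(i)] * EMBED_DIM for i in ids]


# Each task gets its own visual pathway.
PATHWAYS: Dict[str, Callable[[List[float]], List[List[float]]]] = {
    "understanding": encode_for_understanding,
    "generation": encode_for_generation,
}


def shared_backbone(sequence: List[List[float]]) -> int:
    """Placeholder for the single shared model: checks widths, returns sequence length."""
    assert all(len(vec) == EMBED_DIM for vec in sequence)
    return len(sequence)


def run(pixels: List[float], task: str) -> int:
    """Route the image through the task-specific pathway, then the shared backbone."""
    return shared_backbone(PATHWAYS[task](pixels))
```

Calling `run([0.1, 0.5, 0.9], "understanding")` and `run([0.1, 0.5, 0.9], "generation")` sends the same input through different encoders, but both calls end in the same `shared_backbone`. That is the sense in which the visual encoding is decoupled while the architecture stays unified.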

Janus-Pro is built on DeepSeek-LLM-1.5b-base and DeepSeek-LLM-7b-base and was trained with HAI-LLM, a high-performance distributed training framework built on PyTorch. Training ran on clusters of 16 to 32 nodes, each equipped with 8 Nvidia A100 GPUs, and took 7 to 14 days depending on the model size.
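For a rough sense of scale, the figures above can be turned into GPU counts and GPU-days. The short calculation below is only a back-of-the-envelope sketch: the helper `gpu_days` is a hypothetical name, and pairing the 16-node cluster with the 7-day run and the 32-node cluster with the 14-day run is an assumption, since the article gives only the ranges.

```python
GPUS_PER_NODE = 8  # from the article: 8 Nvidia A100 GPUs per node


def gpu_days(nodes: int, days: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Total GPU-days = nodes x GPUs per node x wall-clock days."""
    return nodes * gpus_per_node * days


# Assumed pairing (not stated in the article): smaller cluster with shorter run.
small_gpus = 16 * GPUS_PER_NODE   # 128 GPUs
large_gpus = 32 * GPUS_PER_NODE   # 256 GPUs
small_run = gpu_days(16, 7)       # 896 GPU-days
large_run = gpu_days(32, 14)      # 3584 GPU-days

print(f"{small_gpus} GPUs for 7 days  -> {small_run} GPU-days")
print(f"{large_gpus} GPUs for 14 days -> {large_run} GPU-days")
```

Under that assumed pairing, the larger configuration would use about four times the GPU-days of the smaller one.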

The complete Janus-Pro codebase is now available in DeepSeek's Janus repository on GitHub.

DeepSeek’s rapid advancements in multimodal AI may heighten competition with industry giants such as OpenAI, Meta, and Nvidia. However, the company has faced challenges, including recent large-scale cyberattacks on its online services. To mitigate these issues, DeepSeek has temporarily restricted new user registrations outside China, requiring international users to register using virtual numbers.

With Janus-Pro setting new standards for multimodal AI, the industry eagerly anticipates further developments, including potential advancements in text-to-image and text-to-video capabilities.
