MarkTechPost@AI, October 18, 2024
DeepSeek AI Releases Janus: A 1.3B Multimodal Model with Image Generation Capabilities

Janus is a novel autoregressive framework proposed by researchers from DeepSeek-AI and partner institutions. It unifies multimodal understanding and generation by employing two distinct visual encoding pathways. The model addresses the suboptimal performance of existing approaches, which stems from the differing requirements of understanding and generation tasks, and delivers strong results on both multimodal understanding and visual generation with high efficiency and accuracy.

🎯 Janus adopts two distinct visual encoding pathways, with a dedicated path for understanding and another for generation, both processed by a unified transformer. This alleviates the inherent conflicts of existing models and increases flexibility.

📋 Janus's architecture consists of two main components, an understanding encoder and a generation encoder, which handle multimodal inputs in different ways. For multimodal understanding it uses high-dimensional semantic feature extraction; for visual generation it uses a VQ tokenizer that converts visual data into discrete representations.

📈 Janus is trained in three stages: training the adaptors, unified pretraining, and supervised fine-tuning. This strengthens its multimodal capabilities while maintaining consistency across tasks. Experimental results show that Janus significantly outperforms prior models on a variety of benchmarks.

🎉 Janus excels at both multimodal understanding and visual generation, achieving strong results on multiple multimodal benchmarks, for example 69.4 on MMBench, 63.7 on SEED-Bench, and 87.0 on POPE, along with a Fréchet Inception Distance (FID) of 8.53 on MSCOCO-30K.

Multimodal AI models are powerful tools capable of both understanding and generating visual content. However, existing approaches often use a single visual encoder for both tasks, which leads to suboptimal performance due to the fundamentally different requirements of understanding and generation. Understanding requires high-level semantic abstraction, while generation focuses on local details and global consistency. This mismatch results in conflicts that limit the overall efficiency and accuracy of the model.

Researchers from DeepSeek-AI, the University of Hong Kong, and Peking University propose Janus, a novel autoregressive framework that unifies multimodal understanding and generation by employing two distinct visual encoding pathways. Unlike prior models that use a single encoder, Janus introduces a specialized pathway for each task, both of which are processed through a unified transformer. This design alleviates the conflicts inherent in prior models and provides enhanced flexibility, allowing each task to use the encoding method that suits it best. The name "Janus" aptly reflects this duality, evoking the two-faced Roman god of transitions and coexistence.

The architecture of Janus consists of two main components: an Understanding Encoder and a Generation Encoder, each tasked with handling multimodal inputs differently. For multimodal understanding, Janus uses a high-dimensional semantic feature extraction approach through SigLIP, transforming the features into a sequence compatible with the language model. For visual generation, Janus utilizes a VQ tokenizer that converts visual data into discrete representations, enabling detailed image synthesis. Both tasks are processed by a shared transformer, enabling the model to operate in an autoregressive fashion. This approach allows the model to decouple the requirements of each visual task, simplifying implementation and improving scalability.
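To make the decoupled design concrete, here is a minimal PyTorch-style sketch of the idea. The module names (`UnderstandingEncoder`, `GenerationEncoder`, `JanusSketch`), the dimensions, and the simple stand-in encoders are illustrative assumptions rather than the authors' implementation; in the actual model the understanding path is a SigLIP vision encoder and the generation path is a VQ tokenizer, and causal masking plus the language-model head are omitted here for brevity.

```python
# Minimal sketch (not the official implementation): two independent visual
# encoding pathways project into the same embedding space and are consumed
# by one shared transformer.
import torch
import torch.nn as nn

D_MODEL = 1024  # assumed hidden size of the shared language model


class UnderstandingEncoder(nn.Module):
    """Stand-in for the SigLIP-based semantic encoder (understanding path)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 256, kernel_size=16, stride=16)  # patchify
        self.adaptor = nn.Linear(256, D_MODEL)  # maps features into LLM space

    def forward(self, images):                                 # (B, 3, H, W)
        feats = self.backbone(images).flatten(2).transpose(1, 2)  # (B, N, 256)
        return self.adaptor(feats)                             # (B, N, D_MODEL)


class GenerationEncoder(nn.Module):
    """Stand-in for the VQ-tokenizer path: discrete image tokens -> embeddings."""
    def __init__(self, codebook_size=16384):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, D_MODEL)

    def forward(self, image_token_ids):        # (B, N) discrete VQ token ids
        return self.codebook(image_token_ids)  # (B, N, D_MODEL)


class JanusSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.und_enc = UnderstandingEncoder()
        self.gen_enc = GenerationEncoder()
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.shared_transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_emb, images=None, image_token_ids=None):
        # Route each modality through its task-specific pathway, then let the
        # single shared transformer process the unified token sequence.
        parts = [text_emb]
        if images is not None:
            parts.append(self.und_enc(images))
        if image_token_ids is not None:
            parts.append(self.gen_enc(image_token_ids))
        seq = torch.cat(parts, dim=1)
        return self.shared_transformer(seq)


# Example usage with random inputs.
model = JanusSketch()
text = torch.randn(2, 16, D_MODEL)
imgs = torch.randn(2, 3, 224, 224)
out = model(text_emb=text, images=imgs)
print(out.shape)  # torch.Size([2, 16 + 196, 1024])
```

The key point the sketch captures is that only the encoders differ between tasks; everything downstream of the projection is shared, which is what keeps the framework unified while removing the single-encoder conflict.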

Training proceeds in three stages: adaptor training, unified pretraining, and supervised fine-tuning. Together these stages build up the model's multimodal capabilities while maintaining consistency across tasks.
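The staged recipe can be pictured as a freeze/unfreeze schedule over the model's parameter groups. The sketch below reuses the hypothetical `JanusSketch` module from the previous snippet; which parameters train in each stage, and the learning rates, are assumptions for illustration rather than the paper's exact configuration.

```python
# Illustrative three-stage schedule (assumed, not the official recipe):
# stage 1 trains only the lightweight adaptors, stage 2 performs unified
# pretraining with the shared transformer unfrozen, stage 3 is supervised
# fine-tuning of the full model on instruction-style data.
from torch.optim import AdamW


def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model, stage):
    set_trainable(model, False)  # start each stage fully frozen
    if stage == 1:               # adaptor training
        set_trainable(model.und_enc.adaptor, True)
    elif stage == 2:             # unified pretraining
        set_trainable(model.und_enc.adaptor, True)
        set_trainable(model.gen_enc, True)
        set_trainable(model.shared_transformer, True)
    elif stage == 3:             # supervised fine-tuning
        set_trainable(model, True)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return AdamW(trainable, lr=1e-4 if stage < 3 else 2e-5)  # illustrative lrs


for stage in (1, 2, 3):
    optimizer = configure_stage(model, stage)
    # ... run this stage's training loop with its own data mixture ...
```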

The experimental results demonstrate that Janus significantly outperforms prior models across various benchmarks. In multimodal understanding, Janus achieved impressive results, surpassing LLaVA-v1.5 and other unified models while even matching or exceeding task-specific models in certain cases. Specifically, Janus obtained scores of 69.4, 63.7, and 87.0 on multimodal benchmarks such as MMBench, SEED-Bench, and POPE, respectively, outperforming larger models like Qwen-VL-Chat (7B). In visual generation tasks, Janus showed superior performance as well, achieving a Fréchet Inception Distance (FID) of 8.53 on MSCOCO-30K, demonstrating better consistency with user prompts than competing models such as DALL-E 2 and SDXL. Notably, these results show that Janus offers a balanced capability of understanding and generating visual content while being more parameter-efficient.
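For context, FID compares the statistics of Inception features extracted from generated images against a reference set; lower scores mean the generated distribution is closer to the real one. A minimal sketch of computing it, assuming the `torchmetrics` library (which needs `torch-fidelity` installed) and placeholder tensors in place of real MSCOCO-30K images, looks roughly like this:

```python
# Hedged sketch: computing FID between generated and reference images with
# torchmetrics (requires the torch-fidelity package). The random tensors stand
# in for MSCOCO-30K references and model generations; a real evaluation would
# load roughly 30K images on each side.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # standard Inception-v3 features

real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)    # reference distribution
fid.update(fake_images, real=False)   # generated distribution
print(float(fid.compute()))           # lower = closer to the reference set
```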

In conclusion, Janus presents a major step forward in developing unified multimodal AI models by resolving the conflicts between understanding and generation. Its decoupling approach proves to be both effective and efficient, allowing for high-quality semantic understanding alongside detailed visual generation. This flexibility makes Janus a promising candidate for future developments in multimodal AI, with potential applications extending into additional modalities, such as point clouds or audio data. The extensibility, flexibility, and robust performance of Janus highlight its potential to serve as an inspiration for the next generation of unified multimodal models.


Check out the Paper, Model Card on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.
