MarkTechPost@AI · 11 hours ago
BAAI Launches OmniGen2: A Unified Diffusion and Transformer Model for Multimodal AI

The Beijing Academy of Artificial Intelligence (BAAI) has released OmniGen2, a next-generation open-source multimodal generative model. Building on OmniGen, it unifies text-to-image generation, image editing, and subject-driven generation within a single transformer framework. OmniGen2 decouples text and image generation, introduces a reflection training mechanism, and develops the OmniContext benchmark to evaluate contextual consistency. It performs strongly across text-to-image generation, image editing, and in-context generation, lays a foundation for research on controllable, consistent image-text generation, and open-sources its models, datasets, and code.

🖼️ OmniGen2 adopts a decoupled multimodal architecture that separates text generation from image synthesis. It uses an autoregressive transformer for text and a diffusion transformer for images, and introduces a positioning strategy named Omni-RoPE to flexibly handle sequences, spatial coordinates, and modality distinctions, enabling high-fidelity image generation and editing.

🔄 A core innovation of OmniGen2 is its reflection mechanism, which integrates feedback loops during training so the model can analyze its own outputs, identify inconsistencies, and propose refinements. This mimics test-time self-correction and significantly improves instruction-following accuracy and visual coherence, especially on nuanced tasks such as modifying color, object count, or position.

📊 To rigorously evaluate in-context generation, OmniGen2 introduces the OmniContext benchmark, which covers three primary task types (SINGLE, MULTIPLE, and SCENE) across Character, Object, and Scene categories. OmniGen2 excels in this domain with an overall score of 7.18, outperforming leading models such as BAGEL and UniWorld-V1.

Beijing Academy of Artificial Intelligence (BAAI) introduces OmniGen2, a next-generation, open-source multimodal generative model. Expanding on its predecessor OmniGen, the new architecture unifies text-to-image generation, image editing, and subject-driven generation within a single transformer framework. It innovates by decoupling the modeling of text and image generation, incorporating a reflective training mechanism, and implementing a purpose-built benchmark—OmniContext—to evaluate contextual consistency.

A Decoupled Multimodal Architecture

Unlike prior models that use shared parameters across text and image modalities, OmniGen2 introduces two distinct pathways: an autoregressive transformer for text generation and a diffusion-based transformer for image synthesis. It also employs a novel positioning strategy named Omni-RoPE, which allows flexible handling of sequences, spatial coordinates, and modality distinctions, enabling high-fidelity image generation and editing.
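The article does not spell out the exact formulation of Omni-RoPE, so the following is only a hedged illustration of the idea of a multi-axis rotary embedding: separate slices of each attention head are rotated by a sequence index and by image row/column coordinates, which is one plausible way to jointly encode sequence order, spatial position, and modality (text tokens can simply keep their spatial axes at zero).

```python
# Hedged sketch of a multi-axis rotary embedding in the spirit of Omni-RoPE.
# The actual axis split and dimensions in OmniGen2 are assumptions here.
import torch

def rotary(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply standard 1D RoPE to the last dimension of x at integer positions pos."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (half,)
    angles = pos[:, None].float() * freqs                               # (tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def omni_rope_like(q: torch.Tensor, seq_id: torch.Tensor,
                   row: torch.Tensor, col: torch.Tensor) -> torch.Tensor:
    """q: (tokens, head_dim) with head_dim divisible by 6.
    Each third of the head is rotated by one axis: sequence index, image row,
    image column. Text tokens carry row = col = 0, so only the sequence axis
    varies for them, which distinguishes modalities implicitly."""
    d = q.shape[-1]
    assert d % 6 == 0, "head_dim must split into three even slices"
    third = d // 3
    return torch.cat([
        rotary(q[..., :third], seq_id),
        rotary(q[..., third:2 * third], row),
        rotary(q[..., 2 * third:], col),
    ], dim=-1)
```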

To preserve the pretrained text generation ability of the underlying MLLM (based on Qwen2.5-VL-3B), OmniGen2 feeds VAE-derived features only to the diffusion pathway. This avoids compromising the model’s text understanding and generation capabilities while maintaining rich visual representation for the image synthesis module.
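To make this routing concrete, here is a minimal, non-authoritative sketch of the decoupled forward pass. The names `frozen_mllm`, `vae_encoder`, `diffusion_transformer`, and `sampler` are hypothetical placeholders, not the OmniGen2 API; the point is only that VAE latents feed the diffusion branch alone, leaving the MLLM's text pathway untouched.

```python
# Sketch of the decoupled routing described above (placeholder callables).
def generate(prompt_tokens, reference_images, frozen_mllm, vae_encoder,
             diffusion_transformer, sampler):
    # 1) Autoregressive pathway: the (largely frozen) Qwen2.5-VL-based MLLM
    #    produces text output and hidden states that condition the image branch.
    mllm_out = frozen_mllm(prompt_tokens)

    # 2) VAE-derived features bypass the MLLM entirely, so its pretrained text
    #    understanding and generation abilities are not disturbed.
    vae_latents = [vae_encoder(img) for img in reference_images]

    # 3) Diffusion pathway denoises an image latent conditioned on both the
    #    MLLM hidden states and the reference-image latents.
    return sampler(diffusion_transformer,
                   condition=mllm_out.hidden_states,
                   reference_latents=vae_latents)
```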

Reflection Mechanism for Iterative Generation

One of the standout features in OmniGen2 is the reflection mechanism. By integrating feedback loops during training, the model is capable of analyzing its generated outputs, identifying inconsistencies, and proposing refinements. This process mimics test-time self-correction and significantly enhances instruction-following accuracy and visual coherence, especially for nuanced tasks like modifying color, object count, or positioning.
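A minimal sketch of what such a self-correction loop could look like at test time is shown below; the `generate` and `critique` callables and the feedback fields are illustrative assumptions, not the OmniGen2 interface.

```python
# Hedged sketch of a reflection-style self-correction loop.
def generate_with_reflection(instruction, generate, critique, max_rounds=3):
    image = generate(instruction)
    for _ in range(max_rounds):
        # The model inspects its own output against the instruction,
        # e.g. "the cup should be red" or "there should be three dogs".
        feedback = critique(instruction, image)
        if feedback.is_consistent:           # model decides to terminate
            break
        # Re-prompt with the original instruction plus the model's own
        # feedback, mimicking test-time self-correction on color, count,
        # or positioning errors.
        image = generate(f"{instruction}\nRevise: {feedback.text}")
    return image
```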

The reflection dataset was constructed using multi-turn feedback, enabling the model to learn how to revise and terminate generation based on content evaluation. This mechanism is particularly useful in bridging the quality gap between open-source and commercial models.

OmniContext Benchmark: Evaluating Contextual Consistency

To rigorously assess in-context generation, the team introduces OmniContext, a benchmark comprising three primary task types: SINGLE, MULTIPLE, and SCENE, across Character, Object, and Scene categories. OmniGen2 demonstrates state-of-the-art performance among open-source models in this domain, scoring 7.18 overall—outperforming other leading models like BAGEL and UniWorld-V1.

The evaluation uses three core metrics: Prompt Following (PF), Subject Consistency (SC), and Overall Score (geometric mean), each validated through GPT-4.1-based reasoning. This benchmarking framework emphasizes not just visual realism but semantic alignment with prompts and cross-image consistency.
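Since the Overall Score is defined as a geometric mean of the two sub-metrics, it can be reproduced directly; the sketch below assumes a 0-10 judging scale and illustrative per-metric values, as the article only reports the 7.18 aggregate.

```python
# Worked sketch of the OmniContext overall score (geometric mean of PF and SC).
from math import sqrt

def omnicontext_overall(pf: float, sc: float) -> float:
    """Geometric mean of Prompt Following (PF) and Subject Consistency (SC)."""
    return sqrt(pf * sc)

# Example values (assumptions): PF = 7.6, SC = 6.8 -> overall ≈ 7.19,
# close to the 7.18 reported for OmniGen2.
print(round(omnicontext_overall(7.6, 6.8), 2))
```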

Data Pipeline and Training Corpus

OmniGen2 was trained on 140M T2I samples and 10M proprietary images, supplemented by meticulously curated datasets for in-context generation and editing. These datasets were constructed using a video-based pipeline that extracts semantically consistent frame pairs and automatically generates instructions using Qwen2.5-VL models. The resulting annotations cover fine-grained image manipulations, motion variations, and compositional changes.
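A hedged sketch of such a video-based pipeline follows; the frame extractor, consistency filter, and Qwen2.5-VL captioner are stand-in callables, and the stride and thresholds are assumptions rather than the project's actual settings.

```python
# Sketch of a video-based pipeline for building image-editing pairs.
def build_editing_pairs(video_path, extract_frames, are_semantically_consistent,
                        qwen_vl_instruct, stride=30):
    frames = extract_frames(video_path)
    samples = []
    for i in range(0, len(frames) - stride, stride):
        src, tgt = frames[i], frames[i + stride]
        # Keep only pairs showing the same subject/scene with a meaningful change
        # (motion, pose, composition); discard shot cuts and unrelated frames.
        if not are_semantically_consistent(src, tgt):
            continue
        # Ask a Qwen2.5-VL model to describe the change as an edit instruction,
        # e.g. "turn the subject's head to the left".
        instruction = qwen_vl_instruct(src, tgt)
        samples.append({"source": src, "target": tgt, "instruction": instruction})
    return samples
```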

For training, the MLLM parameters remain largely frozen to retain general understanding, while the diffusion module is trained from scratch and optimized for joint visual-textual attention. A special token “<|img|>” triggers image generation within output sequences, streamlining the multimodal synthesis process.
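A minimal sketch of this setup is given below, with placeholder module and tokenizer objects; the actual parameter-freezing granularity and token handling in OmniGen2 may differ.

```python
# Sketch of the training setup described above (placeholder modules).
def prepare_for_training(mllm, diffusion_module):
    for p in mllm.parameters():              # keep general understanding intact
        p.requires_grad = False
    for p in diffusion_module.parameters():  # diffusion branch trained from scratch
        p.requires_grad = True
    return mllm, diffusion_module

def should_trigger_image(output_token_ids, tokenizer, trigger="<|img|>"):
    """Return True once the decoder has emitted the image-generation token."""
    trigger_id = tokenizer.convert_tokens_to_ids(trigger)  # assumes an HF-style tokenizer
    return trigger_id in output_token_ids
```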

Performance Across Tasks

OmniGen2 delivers strong results across multiple domains, including text-to-image generation, instruction-based image editing, and in-context (subject-driven) generation, where its 7.18 overall score on OmniContext is the best among open-source models.

Conclusion

OmniGen2 is a robust and efficient multimodal generative system that advances unified modeling through architectural separation, high-quality data pipelines, and an integrated reflection mechanism. By open-sourcing models, datasets, and code, the project lays a solid foundation for future research in controllable, consistent image-text generation. Upcoming improvements may focus on reinforcement learning to refine the reflection mechanism, broader multilingual coverage, and robustness to low-quality inputs.

The post BAAI Launches OmniGen2: A Unified Diffusion and Transformer Model for Multimodal AI appeared first on MarkTechPost.
