MarkTechPost@AI · May 9, 14:38
Ming-Lite-Uni: An Open-Source AI Framework Designed to Unify Text and Vision through an Autoregressive Multimodal Structure

Ming-Lite-Uni is an open-source framework from researchers at Inclusion AI and Ant Group, designed to unify text and vision through an autoregressive multimodal structure. The system is built on a large language model and a fine-tuned diffusion image generator, introduces multi-scale learnable tokens that serve as interpretable visual units, and uses a multi-scale alignment strategy to keep the different image scales consistent. The model performs well on tasks such as text-to-image generation, style transfer, and image editing, and its openly released weights and implementation support community research, positioning it as a prototype on the path toward general artificial intelligence.

🖼️ Ming-Lite-Uni introduces multi-scale learnable tokens that compress visual inputs into structured token sequences, with different scales capturing different levels of image detail, such as layout and texture.

🧠 The model adopts a multi-scale representation alignment strategy, aligning intermediate and output features through a mean squared error loss to ensure consistency across layers, which noticeably improves image reconstruction quality and generation evaluation scores.

🎨 The system was tested on a range of multimodal tasks, including text-to-image generation, style transfer, and image editing, and maintained strong visual quality even under abstract or stylized prompts.

📚 Ming-Lite-Uni was trained on more than 2.25 billion samples drawn from public sources such as LAION-5B, COYO, and Zero, supplemented with filtered datasets such as Midjourney and Wukong, which strengthens the model's performance.

🤝 The framework keeps the language model frozen and fine-tunes only the image generator, allowing faster updates and more efficient scaling, while the openly released model weights and implementation encourage the community to replicate and extend the work.

Multimodal AI is rapidly evolving toward systems that can understand, generate, and respond using multiple data types within a single conversation or task, such as text, images, and even video or audio. These systems are expected to function across diverse interaction formats, enabling more seamless human-AI communication. With users increasingly engaging AI for tasks like image captioning, text-based photo editing, and style transfer, it has become important for these models to process inputs and interact across modalities in real time. The frontier of research in this domain is focused on merging capabilities once handled by separate models into unified systems that can perform fluently and precisely.

A major obstacle in this area stems from the misalignment between language-based semantic understanding and the visual fidelity required in image synthesis or editing. When separate models handle different modalities, the outputs often become inconsistent, leading to poor coherence or inaccuracies in tasks that require both interpretation and generation. The visual model might excel at reproducing an image yet fail to grasp the nuanced instructions behind it, while the language model might understand the prompt but be unable to shape it visually. There is also a scalability concern when models are trained in isolation; this approach demands significant compute resources and retraining effort for each domain. The inability to seamlessly link vision and language into a coherent and interactive experience remains one of the fundamental problems in advancing intelligent systems.

In recent attempts to bridge this gap, researchers have combined architectures with fixed visual encoders and separate decoders that function through diffusion-based techniques. Tools such as TokenFlow and Janus integrate token-based language models with image generation backends, but they typically emphasize pixel accuracy over semantic depth. These approaches can produce visually rich content, yet they often miss the contextual nuances of user input. Others, like GPT-4o, have moved toward native image generation capabilities but still operate with limitations in deeply integrated understanding. The friction lies in translating abstract text prompts into meaningful and context-aware visuals in a fluid interaction without splitting the pipeline into disjointed parts.

Researchers from Inclusion AI and Ant Group introduced Ming-Lite-Uni, an open-source framework designed to unify text and vision through an autoregressive multimodal structure. The system features a native autoregressive model built on top of a fixed large language model and a fine-tuned diffusion image generator. This design is based on two core frameworks: MetaQueries and M2-omni. Ming-Lite-Uni introduces an innovative component of multi-scale learnable tokens, which act as interpretable visual units, and a corresponding multi-scale alignment strategy to maintain coherence between various image scales. The researchers released all the model weights and implementation openly to support community research, positioning Ming-Lite-Uni as a prototype moving toward general artificial intelligence.
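As a concrete illustration of this frozen-LLM plus trainable-generator split, here is a minimal PyTorch sketch. The wrapper class and module names (llm, visual_tokenizer, diffusion_head) are hypothetical, intended only to show which parameters stay frozen and which are updated, not to mirror the released implementation.

```python
import torch.nn as nn


class MingLiteUniSketch(nn.Module):
    """Hypothetical wrapper showing the frozen-LLM / trainable image-generator split."""

    def __init__(self, llm: nn.Module, visual_tokenizer: nn.Module, diffusion_head: nn.Module):
        super().__init__()
        self.llm = llm                            # fixed large language model
        self.visual_tokenizer = visual_tokenizer  # produces multi-scale visual tokens
        self.diffusion_head = diffusion_head      # fine-tuned diffusion image generator

        # Freeze the language model; only the image-generation path receives gradients.
        for param in self.llm.parameters():
            param.requires_grad = False

    def trainable_parameters(self):
        # Hand only these to the optimizer so the LLM is never updated.
        return list(self.visual_tokenizer.parameters()) + list(self.diffusion_head.parameters())
```

In such a setup the optimizer would be built from trainable_parameters() alone, which is what enables the faster updates and cheaper scaling described below.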

The core mechanism behind the model involves compressing visual inputs into structured token sequences across multiple scales, such as 4×4, 8×8, and 16×16 image patches, each representing different levels of detail, from layout to textures. These tokens are processed alongside text tokens using a large autoregressive transformer. Each resolution level is marked with unique start and end tokens and assigned custom positional encodings. The model employs a multi-scale representation alignment strategy that aligns intermediate and output features through a mean squared error loss, ensuring consistency across layers. This technique boosts image reconstruction quality by over 2 dB in PSNR and improves generation evaluation (GenEval) scores by 1.5%. Unlike other systems that retrain all components, Ming-Lite-Uni keeps the language model frozen and only fine-tunes the image generator, allowing faster updates and more efficient scaling.
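A rough sketch of how such multi-scale tokens, their scale-specific boundary markers, and the MSE-based alignment loss could be assembled is shown below. The grid sizes mirror the 4×4, 8×8, and 16×16 patches mentioned above, but the embedding dimension, class names, and initialization are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SCALES = [4, 8, 16]   # per-scale grids: 4x4, 8x8, 16x16 tokens (from the description above)
EMBED_DIM = 1024      # assumed embedding width


class MultiScaleVisualTokens(nn.Module):
    """Illustrative multi-scale learnable tokens with scale-specific start/end markers."""

    def __init__(self, dim: int = EMBED_DIM, scales=SCALES):
        super().__init__()
        self.scales = scales
        self.queries = nn.ParameterDict(
            {str(s): nn.Parameter(0.02 * torch.randn(s * s, dim)) for s in scales}
        )
        self.start_tok = nn.ParameterDict(
            {str(s): nn.Parameter(0.02 * torch.randn(1, dim)) for s in scales}
        )
        self.end_tok = nn.ParameterDict(
            {str(s): nn.Parameter(0.02 * torch.randn(1, dim)) for s in scales}
        )

    def forward(self, batch_size: int) -> torch.Tensor:
        # Concatenate [START_s, tokens_s, END_s] for every scale into one sequence
        # that can be interleaved with text tokens in the autoregressive transformer.
        pieces = []
        for s in self.scales:
            k = str(s)
            pieces.append(torch.cat([self.start_tok[k], self.queries[k], self.end_tok[k]], dim=0))
        seq = torch.cat(pieces, dim=0)                      # (total_tokens, dim)
        return seq.unsqueeze(0).expand(batch_size, -1, -1)  # (batch, total_tokens, dim)


def multiscale_alignment_loss(intermediate: torch.Tensor, final: torch.Tensor) -> torch.Tensor:
    """Mean squared error between intermediate-layer and output features,
    in the spirit of the multi-scale representation alignment strategy."""
    return F.mse_loss(intermediate, final.detach())
```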

The system was tested on various multimodal tasks, including text-to-image generation, style transfer, and detailed image editing using instructions like “make the sheep wear tiny sunglasses” or “remove two of the flowers in the image.” The model handled these tasks with high fidelity and contextual fluency. It maintained strong visual quality even when given abstract or stylistic prompts such as “Hayao Miyazaki’s style” or “Adorable 3D.” The training set spanned over 2.25 billion samples, combining LAION-5B (1.55B), COYO (62M), and Zero (151M), supplemented with filtered samples from Midjourney (5.4M), Wukong (35M), and other web sources (441M). Furthermore, it incorporated fine-grained datasets for aesthetic assessment, including AVA (255K samples), TAD66K (66K), AesMMIT (21.9K), and APDD (10K), which enhanced the model’s ability to generate visually appealing outputs according to human aesthetic standards.

The model combines semantic robustness with high-resolution image generation in a single pass. It achieves this by aligning image and text representations at the token level across scales, rather than depending on a fixed encoder-decoder split. The approach allows autoregressive models to carry out complex editing tasks with contextual guidance, which was previously hard to achieve. FlowMatching loss and scale-specific boundary markers support better interaction between the transformer and the diffusion layers. Overall, the model strikes a rare balance between language comprehension and visual output, positioning it as a significant step toward practical multimodal AI systems.
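For reference, a generic conditional flow-matching objective of the kind alluded to here can be sketched as follows; velocity_model, the straight-line interpolation between noise and data, and the conditioning argument are assumptions made for illustration, not details taken from the Ming-Lite-Uni codebase.

```python
import torch
import torch.nn.functional as F


def flow_matching_loss(velocity_model, noise: torch.Tensor, data: torch.Tensor,
                       cond: torch.Tensor) -> torch.Tensor:
    """Rectified-flow style flow-matching loss (illustrative sketch).

    noise, data: (B, C, H, W) tensors; velocity_model is a hypothetical network
    taking (x_t, t, cond) and predicting the velocity field.
    """
    # Sample a timestep per example and interpolate linearly between noise and data.
    t = torch.rand(data.size(0), device=data.device).view(-1, 1, 1, 1)
    x_t = (1.0 - t) * noise + t * data
    target_velocity = data - noise          # straight-line transport target
    pred_velocity = velocity_model(x_t, t.flatten(), cond)
    return F.mse_loss(pred_velocity, target_velocity)
```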

Several Key Takeaways from the Research on Ming-Lite-Uni:

- Ming-Lite-Uni unifies text and vision in a single autoregressive framework built on a frozen large language model and a fine-tuned diffusion image generator.
- Multi-scale learnable tokens (4×4, 8×8, and 16×16 patches) act as interpretable visual units spanning layout through texture.
- A multi-scale representation alignment strategy based on mean squared error improves reconstruction quality by over 2 dB in PSNR and GenEval scores by 1.5%.
- Training spanned more than 2.25 billion samples from LAION-5B, COYO, Zero, filtered Midjourney, Wukong, and other web sources, plus aesthetic datasets such as AVA, TAD66K, AesMMIT, and APDD.
- Keeping the language model frozen and fine-tuning only the image generator enables faster updates and more efficient scaling, and all model weights and code are released openly.

Check out the Paper, Model on Hugging Face and GitHub Page. Also, don’t forget to follow us on Twitter.

