MarkTechPost@AI · September 29, 2024
SELMA: A Novel AI Approach to Enhance Text-to-Image Generation Models Using Auto-Generated Data and Skill-Specific Learning Techniques

SELMA is a new approach that enhances text-to-image generation models without relying on human-annotated data: it uses LLMs to generate skill-specific text prompts and improves model performance through a multi-stage pipeline.

🎯 SELMA uses large language models (LLMs) to automatically generate skill-specific text prompts, ensuring dataset diversity. The T2I model then generates images from these prompts, producing a rich dataset with no human intervention.

🔧 SELMA applies Low-Rank Adaptation (LoRA) to fine-tune the T2I model on each skill-specific dataset, yielding multiple skill-specific expert models, which are then merged into a unified multi-skill T2I model.

📈 SELMA's performance was validated against state-of-the-art T2I models, improving text faithfulness and human-preference metrics across multiple benchmarks, e.g. +3.7 points on HPS, +0.4 on PickScore, and +0.39 on ImageReward.

💡 The study found that fine-tuning a stronger T2I model on images generated by a weaker model still brings performance gains, showing the potential for weak-to-strong generalization in T2I models.

Text-to-image (T2I) models have seen rapid progress in recent years, allowing the generation of complex images from natural language inputs. However, even state-of-the-art T2I models struggle to accurately capture and reflect all the semantics of a given prompt, producing images that miss crucial details, such as multiple subjects or specific spatial relationships. For instance, generating a composition like “a cat with wings flying over a field of donuts” is difficult because of the prompt's inherent complexity and specificity. As these models attempt to understand and replicate the nuances of text descriptions, their limitations become apparent. Moreover, improving these models is often hindered by the need for high-quality, large-scale annotated datasets, making the process both resource-intensive and laborious. The result is a bottleneck in achieving models that can generate consistently faithful and semantically accurate images across diverse scenarios.

A key problem the researchers address is that T2I models struggle to create images that are truly faithful to complex textual descriptions. This misalignment often results in missing objects, incorrect spatial arrangements, or inconsistent rendering of multiple elements. For example, when asked to generate an image of a park scene featuring a bench, a bird, and a tree, T2I models might fail to maintain the correct spatial relationships between these entities, leading to unrealistic images. Current solutions attempt to improve this faithfulness through supervised fine-tuning with annotated data or re-captioned text prompts. Although these methods show improvement, they rely heavily on the availability of extensive human-annotated data, which introduces high training costs and complexity. Thus, there is a pressing need for a solution that can enhance image faithfulness without depending on manual data annotation, which is both costly and time-consuming.

Many existing solutions have attempted to address these challenges. One popular approach is supervised fine-tuning methods, where T2I models are trained using high-quality image-text pairs or manually curated datasets. Another line of research focuses on aligning T2I models with human preference data through reinforcement learning. This involves ranking and scoring images based on how well they match textual descriptions and using these scores to fine-tune the models further. Although these methods have shown promise in improving alignment, they depend on extensive manual annotations and high-quality data. Moreover, integrating additional components, such as bounding boxes or object layouts, to guide image generation has been explored. However, these techniques often require significant human effort and data curation, making them impractical at scale.

Researchers from the University of North Carolina at Chapel Hill have introduced SELMA: Skill-Specific Expert Learning and Merging with Auto-Generated Data. SELMA presents a novel approach to enhance T2I models without relying on human-annotated data. This method leverages the capabilities of Large Language Models (LLMs) to generate skill-specific text prompts automatically. The T2I models then use these prompts to produce corresponding images, creating a rich dataset without human intervention. The researchers employ a method known as Low-Rank Adaptation (LoRA) to fine-tune the T2I models on these skill-specific datasets, resulting in multiple skill-specific expert models. By merging these expert models, SELMA creates a unified multi-skill T2I model that can generate high-quality images with improved faithfulness and semantic alignment.
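To make the data-generation idea concrete, here is a minimal sketch of the first two steps, assuming the Hugging Face diffusers library for the T2I model; generate_skill_prompts() is a hypothetical stand-in for the LLM call, since the article does not tie the method to a specific prompt-writing API.

```python
# Minimal sketch of SELMA-style auto-generated training data (no human labels).
# Assumes Hugging Face diffusers and a CUDA GPU; generate_skill_prompts() is a
# hypothetical placeholder for querying an LLM for diverse, skill-specific prompts.
import torch
from diffusers import StableDiffusionPipeline

def generate_skill_prompts(skill: str, n: int) -> list[str]:
    # Hypothetical helper: SELMA asks an LLM for n diverse prompts exercising
    # one skill; canned examples for a "spatial relations" skill stand in here.
    return [
        f"a bird perched on a bench to the left of a tall oak tree, variation {i}"
        for i in range(n)
    ]

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

dataset = []
for prompt in generate_skill_prompts("spatial relations", n=4):
    image = pipe(prompt).images[0]  # the T2I model renders its own training target
    dataset.append({"prompt": prompt, "image": image})
```

Because the model supervises itself, the value of the dataset comes from the diversity and skill coverage of the LLM-written prompts rather than from new pixel-level ground truth.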

SELMA operates through a four-stage pipeline. First, skill-specific prompts are generated using LLMs, which helps ensure diversity in the dataset. The second stage involves generating corresponding images based on these prompts using T2I models. Next, the model is fine-tuned using LoRA modules to specialize in each skill. Finally, these skill-specific experts are merged to produce a robust T2I model capable of handling diverse prompts. This merging process effectively reduces knowledge conflicts between different skills, resulting in a model that can generate more accurate images than traditional multi-skill models. On average, SELMA showed a +2.1% improvement in the TIFA text-image alignment benchmark and a +6.9% enhancement in the DSG benchmark, indicating its effectiveness in improving faithfulness.
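The last two stages can be sketched as follows, with assumptions labeled: the LoRA experts are attached via diffusers' PEFT integration (add_adapter), the per-skill training loop is elided, and the experts are merged by uniformly averaging their LoRA tensors, a model-soup-style illustration of the merging idea rather than necessarily the paper's exact rule.

```python
# Sketch of stages 3-4: per-skill LoRA experts, then a merge. Assumes
# diffusers >= 0.22 (PEFT integration) and peft; the uniform average below is
# an illustrative merging rule, not necessarily the one used in the paper.
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model_state_dict

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2", subfolder="unet"
)

skills = ["counting", "spatial_relations", "long_text"]
expert_weights = []
for skill in skills:
    # Stage 3: attach a fresh low-rank adapter for this skill; the base UNet
    # stays frozen and only the adapter is trained on the skill's dataset.
    unet.add_adapter(
        LoraConfig(r=4, lora_alpha=4,
                   target_modules=["to_q", "to_k", "to_v", "to_out.0"]),
        adapter_name=skill,
    )
    # ... fine-tuning loop on the skill-specific (prompt, image) pairs elided ...
    expert_weights.append(get_peft_model_state_dict(unet, adapter_name=skill))

# Stage 4: merge the skill experts into one multi-skill adapter by averaging
# corresponding LoRA tensors across experts.
merged = {
    name: torch.stack([w[name] for w in expert_weights]).mean(dim=0)
    for name in expert_weights[0]
}
```

Merging in LoRA weight space, rather than training one adapter on all skills jointly, is what lets SELMA reduce knowledge conflicts between skills.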

The performance of SELMA was validated against state-of-the-art T2I models, such as Stable Diffusion v1.4, v2, and XL. Empirical results demonstrated that SELMA improved text faithfulness and human preference metrics across multiple benchmarks, including PickScore, ImageReward, and Human Preference Score (HPS). For example, fine-tuning with SELMA improved HPS by 3.7 points, PickScore by 0.4, and ImageReward by 0.39. Notably, fine-tuning with auto-generated datasets performed comparably to fine-tuning with ground-truth data, suggesting that SELMA is a cost-effective alternative to extensive manual annotation. The researchers also found that fine-tuning a strong T2I model, such as SDXL, on images generated by a weaker model, such as SD v2, led to performance gains, pointing to the potential for weak-to-strong generalization in T2I models.

Key Takeaways from the SELMA Research:

- SELMA improves T2I faithfulness with auto-generated data: an LLM writes skill-specific prompts and the T2I model renders the paired images, removing the need for human annotation.
- Skill-specific LoRA experts are fine-tuned separately and then merged, reducing knowledge conflicts between skills.
- On average, SELMA yields a +2.1% improvement on the TIFA text-image alignment benchmark and +6.9% on DSG.
- Fine-tuning improved HPS by 3.7 points, PickScore by 0.4, and ImageReward by 0.39, and auto-generated data performed comparably to ground-truth data.
- Fine-tuning a strong model (SDXL) on images from a weaker one (SD v2) still brings gains, suggesting weak-to-strong generalization.

In conclusion, SELMA offers a robust and efficient approach to enhance the faithfulness and semantic alignment of T2I models. By leveraging auto-generated data and a novel merging mechanism for skill-specific experts, SELMA eliminates the need for costly human-annotated data. This method addresses the key limitations of current T2I models and sets the stage for future advancements in text-to-image generation.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.



