MarkTechPost@AI · May 16, 13:35
DanceGRPO: A Unified Framework for Reinforcement Learning in Visual Generation Across Multiple Paradigms and Tasks

Researchers from ByteDance Seed and the University of Hong Kong have proposed DanceGRPO, a unified framework for improving visual generation models, covering both diffusion models and rectified flows. The framework applies to a range of tasks, including text-to-image, text-to-video, and image-to-video generation. DanceGRPO integrates four foundation models and five reward models spanning image/video aesthetics, text-image alignment, video motion quality, and binary reward assessment. Experiments show that DanceGRPO outperforms other methods on key benchmarks by up to 181%. By aligning efficiently with human preferences and scaling robustly to complex multi-task settings, the framework achieves superior performance.

🎨 DanceGRPO is a unified framework designed to improve visual generation through reinforcement learning. It works with both diffusion models and rectified flows and handles tasks such as text-to-image, text-to-video, and image-to-video generation.

🏆 The framework integrates four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReels-I2V) and five specialized reward models used to optimize visual generation quality: image aesthetics, text-image alignment, video aesthetic quality, video motion quality, and a thresholded binary reward.

📈 Experiments show that DanceGRPO significantly outperforms existing methods on multiple benchmarks. For example, on Stable Diffusion v1.4 the HPS score rises from 0.239 to 0.365 and the CLIP score from 0.363 to 0.395. On HunyuanVideo-T2I, the mean reward under the HPS-v2.1 model improves from 0.23 to 0.33.

🎬 By optimizing HunyuanVideo, DanceGRPO achieves relative improvements of 56% and 181% on visual and motion quality metrics, respectively, and up to a 91% relative improvement on the motion quality metric of the VideoAlign reward model.

Recent advances in generative models, especially diffusion models and rectified flows, have revolutionized visual content creation with enhanced output quality and versatility. Human feedback integration during training is essential for aligning outputs with human preferences and aesthetic standards. Current approaches like ReFL methods depend on differentiable reward models that introduce VRAM inefficiency for video generation. DPO variants achieve only marginal visual improvements. Further, RL-based methods face challenges including conflicts between ODE-based sampling of rectified flow models and Markov Decision Process formulations, instability when scaling beyond small datasets, and a lack of validation for video generation tasks.

LLM alignment typically relies on Reinforcement Learning from Human Feedback (RLHF), which trains reward functions on comparison data to capture human preferences. Policy gradient methods have proven effective but are computationally intensive and require extensive tuning, while Direct Preference Optimization (DPO) offers cost efficiency but delivers inferior performance. DeepSeek-R1 recently showed that large-scale RL with specialized reward functions can guide LLMs toward self-emergent thought processes. In visual generation, current approaches include DPO-style methods, direct backpropagation of reward signals as in ReFL, and policy gradient-based methods such as DPOK and DDPO. Production models primarily rely on DPO and ReFL because policy gradient methods are unstable in large-scale applications.
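The comparison-based reward training mentioned above is typically implemented with a Bradley-Terry style pairwise objective. The minimal sketch below illustrates that idea, assuming a generic reward_model that scores batches of preferred and rejected samples; the names are placeholders and not taken from the DanceGRPO paper.

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen, rejected):
    """Bradley-Terry style loss: push the reward model to score the
    human-preferred sample higher than the rejected one."""
    r_chosen = reward_model(chosen)      # shape: (batch,)
    r_rejected = reward_model(rejected)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```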

Researchers from ByteDance Seed and the University of Hong Kong have proposed DanceGRPO, a unified framework adapting Group Relative Policy Optimization to visual generation paradigms. This solution operates seamlessly across diffusion models and rectified flows, handling text-to-image, text-to-video, and image-to-video tasks. The framework integrates with four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReels-I2V) and five reward models covering image/video aesthetics, text-image alignment, video motion quality, and binary reward assessments. DanceGRPO outperforms baselines by up to 181% on key benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval.
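The core idea GRPO contributes, and that DanceGRPO carries over to visual generation, is to compute advantages relative to a group of samples drawn for the same prompt rather than training a separate value network. The sketch below is a minimal, illustrative version of that normalization; the function name and exact normalization details are assumptions, not code from the paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize rewards within a group of samples generated from one prompt.

    rewards: shape (group_size,), one scalar reward per sampled image/video.
    Each sample's advantage is its reward relative to the group mean, scaled
    by the group standard deviation, so no learned critic is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four videos sampled for the same prompt and scored by a reward model
rewards = torch.tensor([0.23, 0.31, 0.28, 0.35])
advantages = group_relative_advantages(rewards)
# Samples scoring above the group mean receive positive advantages and are
# reinforced; samples below the mean receive negative advantages.
```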

The architecture utilizes five specialized reward models to optimize visual generation quality: image aesthetics, text-image alignment, video aesthetic quality, video motion quality, and a thresholded binary reward.
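As a concrete reading of the last item, a thresholded binary reward can be obtained by wrapping any continuous scorer so that it emits 0 or 1. The sketch below only illustrates that idea; the wrapper name, the placeholder scorer, and the 0.5 cutoff are hypothetical and not taken from the paper.

```python
from typing import Callable

def binary_threshold_reward(score_fn: Callable[[object], float],
                            threshold: float = 0.5) -> Callable[[object], float]:
    """Wrap a continuous reward model so it returns a 0/1 reward."""
    def reward(sample) -> float:
        return 1.0 if score_fn(sample) >= threshold else 0.0
    return reward

# Usage with a stand-in aesthetics scorer (placeholder, not a real model)
fake_aesthetics_score = lambda sample: 0.62
binary_reward = binary_threshold_reward(fake_aesthetics_score, threshold=0.5)
print(binary_reward("a generated image"))  # -> 1.0
```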

DanceGRPO shows significant improvements in reward metrics for Stable Diffusion v1.4, raising the HPS score from 0.239 to 0.365 and the CLIP score from 0.363 to 0.395. Pick-a-Pic and GenEval evaluations confirm the method's effectiveness, with DanceGRPO outperforming all competing approaches. For HunyuanVideo-T2I, optimization with the HPS-v2.1 model increases the mean reward score from 0.23 to 0.33, showing enhanced alignment with human aesthetic preferences. On HunyuanVideo, despite excluding text-video alignment due to instability, the methodology achieves relative improvements of 56% and 181% in visual and motion quality metrics, respectively. Using the VideoAlign reward model's motion quality metric, DanceGRPO achieves a substantial 91% relative improvement in that dimension.

In this paper, researchers have introduced DanceGRPO, a unified framework for enhancing diffusion models and rectified flows across text-to-image, text-to-video, and image-to-video tasks. It addresses critical limitations of prior methods by bridging the gap between language and visual modalities, achieving superior performance through efficient alignment with human preferences and robust scaling to complex, multi-task settings. Experiments demonstrate substantial improvements in visual fidelity, motion quality, and text-image alignment. Future work will explore GRPO’s extension to multimodal generation, further unifying optimization paradigms across Generative AI.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

