MarkTechPost@AI · January 1
This AI Paper from NVIDIA and SUTD Singapore Introduces TANGOFLUX and CRPO: Efficient and High-Quality Text-to-Audio Generation with Flow Matching

TANGOFLUX is an advanced text-to-audio generation model introduced by researchers from the Singapore University of Technology and Design (SUTD) and NVIDIA. It uses the CLAP-Ranked Preference Optimization (CRPO) framework to iteratively refine audio generation, ensuring close alignment with textual descriptions. The model adopts a hybrid architecture combining Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT) blocks and supports variable-duration audio generation. Unlike traditional diffusion-based models, TANGOFLUX uses a flow-matching framework, which reduces the number of computation steps and improves generation efficiency. Experiments show that TANGOFLUX outperforms other models on multiple metrics, such as CLAP score and FD score, excels in multi-event scenarios, and shows strong potential for real-time applications.

🚀 TANGOFLUX adopts the CLAP-Ranked Preference Optimization (CRPO) framework, iteratively refining audio generation so that the output audio closely matches the text description; this is its core innovation.

⏱️ TANGOFLUX uses a hybrid architecture combining Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT) blocks together with a flow-matching framework. Compared with traditional diffusion models, this greatly reduces the number of computation steps and improves generation efficiency, taking only 3.7 seconds to generate 30 seconds of audio.

🎯 On performance, TANGOFLUX surpasses other models across multiple metrics, achieving a CLAP score of 0.48 and an FD score of 75.1. It is especially strong on complex text prompts containing multiple independent events, demonstrating excellent detail capture and handling of temporal relationships.

🗣️ Human evaluation also confirms TANGOFLUX's superiority: it received the highest ratings on subjective metrics such as overall quality and prompt relevance, indicating that its generated audio excels in clarity and alignment with the text prompt.

Text-to-audio generation has transformed how audio content is created, automating processes that traditionally required significant expertise and time. This technology enables the conversion of textual prompts into diverse and expressive audio, streamlining workflows in audio production and creative industries. Bridging textual input with realistic audio outputs has opened possibilities in applications like multimedia storytelling, music, and sound design.

One of the significant challenges in text-to-audio systems is ensuring that generated audio aligns faithfully with textual prompts. Current models often fail to fully capture intricate details, leading to inconsistent outputs: some omit essential elements, while others introduce unintended audio artifacts. The lack of standardized methods for optimizing these systems further exacerbates the problem. Unlike language models, text-to-audio systems do not benefit from robust alignment strategies, such as reinforcement learning from human feedback, leaving much room for improvement.

Previous approaches to text-to-audio generation relied heavily on diffusion-based models, such as AudioLDM and Stable Audio Open. While these models deliver decent quality, they come with limitations. Their reliance on extensive denoising steps makes them computationally expensive and time-intensive. Furthermore, many models are trained on proprietary datasets, which limits their accessibility and reproducibility. These constraints hinder their scalability and ability to handle diverse and complex prompts effectively.

To address these challenges, researchers from the Singapore University of Technology and Design (SUTD) and NVIDIA introduced TANGOFLUX, an advanced text-to-audio generation model. The model is designed for efficiency and high-quality output, achieving significant improvements over previous methods. TANGOFLUX uses the CLAP-Ranked Preference Optimization (CRPO) framework to iteratively refine audio generation and keep outputs aligned with textual descriptions. Its compact architecture and innovative training strategies allow it to perform exceptionally well while requiring fewer parameters.

TANGOFLUX integrates advanced methodologies to achieve state-of-the-art results. It employs a hybrid architecture combining Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT) blocks, enabling it to handle variable-duration audio generation. Unlike traditional diffusion-based models, which depend on many denoising steps, TANGOFLUX uses a flow-matching framework to learn a direct, rectified path from noise to output. This rectified-flow approach reduces the number of computational steps required for high-quality audio generation. During training, the system incorporates textual and duration conditioning to capture both the nuances of the input prompt and the desired length of the audio output. For alignment, CRPO uses the CLAP model to score how well generated audio matches the textual prompt, ranking candidates into preference pairs that are optimized iteratively, a process inspired by alignment techniques used in language models.
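To make the rectified-flow idea concrete, here is a minimal sketch of a flow-matching training step in PyTorch. The `model` callable stands in for the DiT/MMDiT stack, and `text_emb` and `duration` are assumed conditioning inputs; this is an illustrative sketch, not TANGOFLUX's actual code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, text_emb, duration):
    """x1: clean audio latents of shape (batch, seq, dim)."""
    x0 = torch.randn_like(x1)                           # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                          # straight-line (rectified) interpolant
    v_target = x1 - x0                                  # constant velocity along that path
    v_pred = model(xt, t.view(-1), text_emb, duration)  # hypothetical DiT/MMDiT interface
    return F.mse_loss(v_pred, v_target)
```

Because the interpolant is a straight line, the regression target is a constant velocity, which is what later permits sampling in very few integration steps.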

In terms of performance, TANGOFLUX outshines its predecessors across multiple metrics. It generates 30 seconds of audio in just 3.7 seconds using a single A40 GPU, demonstrating exceptional efficiency. The model achieves a CLAP score of 0.48 and an FD score of 75.1, both indicative of high-quality and text-aligned audio outputs. Compared to Stable Audio Open, which achieves a CLAP score of 0.29, TANGOFLUX significantly improves alignment accuracy. In multi-event scenarios, where prompts include multiple distinct events, TANGOFLUX excels, showcasing its ability to capture intricate details and temporal relationships effectively. The system’s robustness is further highlighted by its ability to maintain performance even with reduced sampling steps, a feature that enhances its practicality in real-time applications.
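The robustness to reduced sampling steps follows from the near-straight learned path: integrating the velocity field with a handful of Euler steps can already land close to the data distribution. Below is a hedged sketch of such a sampler; the interface mirrors the hypothetical `model` above and is not the paper's implementation.

```python
import torch

@torch.no_grad()
def sample(model, text_emb, duration, shape, steps=8, device="cpu"):
    x = torch.randn(shape, device=device)              # start from pure noise at t = 0
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i].expand(shape[0])                     # broadcast current time to the batch
        v = model(x, t, text_emb, duration)            # predicted velocity field
        x = x + (ts[i + 1] - ts[i]) * v                # one Euler step along the flow
    return x                                           # audio latents at t = 1
```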

Human evaluations corroborate these results, with TANGOFLUX scoring the highest in subjective metrics such as overall quality and prompt relevance. Annotators consistently rated its outputs as clearer and better aligned than those of other models such as AudioLDM and Tango 2. The researchers also emphasized the importance of the CRPO framework, which enabled the creation of a preference dataset that outperformed alternatives such as BATON and Audio-Alpaca. By generating new synthetic data during each training iteration, the model avoided the performance degradation typically associated with static offline datasets.
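As a rough illustration of the CRPO loop described above, the sketch below generates several candidates per prompt, ranks them with a CLAP similarity score, and feeds the best/worst pair into a DPO-style preference loss. `generate`, `clap_score`, and the loss formulation are assumptions for illustration; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def build_preference_pair(generate, clap_score, prompt, n_candidates=5):
    """Rank sampled candidates by CLAP text-audio similarity."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    scores = torch.tensor([clap_score(prompt, c) for c in candidates])
    return candidates[scores.argmax()], candidates[scores.argmin()]  # (winner, loser)

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO-style objective on (winner, loser) log-likelihoods against a
    # frozen reference model; CRPO's exact formulation may differ.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```

Rebuilding this preference dataset at every training iteration is what keeps the optimization on-policy and avoids the degradation the authors attribute to static offline datasets.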

The research successfully addresses critical limitations in text-to-audio systems by introducing TANGOFLUX, which combines efficiency with superior performance. Its innovative use of rectified flow and preference optimization sets a benchmark for future advancements in the field. This development enhances the quality and alignment of generated audio and demonstrates scalability, making it a practical solution for widespread adoption. The work of SUTD and NVIDIA represents a significant leap forward in text-to-audio technology, pushing the boundaries of what is achievable in this rapidly evolving domain.


Check out the Paper, Code Repo, and Pre-Trained Model. All credit for this research goes to the researchers of this project.



Related tags

TANGOFLUX · Text-to-Audio · CRPO · Flow Matching · Audio Generation