Interview with Yuki Mitsufuji: Text-to-sound generation

Yuki Mitsufuji, Lead Research Scientist at Sony AI, and his team presented SoundCTM at ICLR 2025, a technique aimed at the slow speed and inconsistent quality of existing text-to-sound (T2S) generative models. SoundCTM unifies score-based and consistency models, enabling fast 1-step generation as well as high-quality multi-step deterministic sampling, and giving multimedia content creators an unprecedented degree of flexibility and control. The model can generate complex, full-band sound while keeping the semantic content consistent across different numbers of sampling steps, which matters for content creation: it markedly speeds up creative iteration and improves the quality of the final sound.

🚀 **SoundCTM's innovation**: Sony AI's SoundCTM (Sound Consistency Trajectory Models) unifies score-based and consistency models to give text-to-sound (T2S) generation a high degree of flexibility. It switches seamlessly between fast 1-step generation and high-quality multi-step deterministic sampling, resolving the speed-versus-quality trade-off of existing T2S models and letting creators match and refine sounds efficiently.

💡 **Pain points of existing T2S models**: Conventional T2S models, especially diffusion-based ones, can generate high-quality sound but are slow, which hinders rapid experimentation. 1-step distillation models are fast, yet their quality often falls short of professional standards, and while multi-step sampling improves quality, it tends to let the semantic content drift. SoundCTM's architecture keeps generation fast while raising sound quality and preserving semantic consistency.

🔬 **Method and evaluation**: SoundCTM builds on CTM (Consistency Trajectory Models) research from computer vision and extends it to the audio domain, generating full-band sound with speed, clarity, and control. The team introduced a novel feature distance for the distillation loss and adopted a ν-sampling strategy. Objective metrics (Fréchet Distance, KL divergence, CLAP score) and subjective listening tests show that SoundCTM-DiT-1B is the first large-scale distillation model to achieve notable 1-step and multi-step full-band T2S generation, with multi-step deterministic sampling that preserves semantic consistency.

🌐 **What it means for content creation**: SoundCTM marks a major step forward for audio creation in games, film, and other media. Creators can try out sound ideas quickly and then refine detail and quality without changing the sound's meaning, which greatly improves workflow efficiency, broadens creative possibilities, and makes sound design more precise and expressive.

Earlier this year, we spoke to Yuki Mitsufuji, Lead Research Scientist at Sony AI, about work concerning different aspects of image generation. Yuki and his team have since extended their work to sound generation, presenting work at ICLR 2025 entitled: SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation. We caught up with Yuki to find out more.

In our previous interview, you mentioned that real-time sound generation was one of the projects you were working on. What were the problems with the existing text-to-sound generators that you were trying to solve with your work?

Creating sounds for different types of multimedia, such as video games and movies, takes a lot of experimenting, as artists try to match sounds to their evolving creative ideas. New high-quality diffusion-based Text-to-Sound (T2S) generative models can help with this process, but they are often slow, which makes it harder for creators to experiment quickly. Existing T2S distillation models address this limitation through 1-step generation, but often the quality isn’t good enough for professional use. Additionally, while multi-step sampling in the aforementioned distillation models improves sample quality, the semantic content changes because they don’t produce consistent results each time.

Could you tell us about the model that you’ve introduced – what are the main contributions of this work?

We proposed Sound Consistency Trajectory Models (SoundCTM), which allows flexible transitions between high-quality 1-step sound generation and superior sound quality through multi-step deterministic sampling. SoundCTM combines score-based diffusion and consistency models into a single architecture that supports both fast one-step sampling and high-fidelity multi-step generation for audio. This can empower creators to try out ideas quickly, match the sound to what they have in mind, and then improve the sound quality without changing its meaning.
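To make this concrete, the sketch below (not Sony AI's code) shows how a consistency-trajectory-style sampler can expose both modes through a single interface. `g_theta` is an assumed trained network that jumps from any time `t` to any earlier time `s` along the model's deterministic trajectory, conditioned on a text embedding; the noise initialisation follows an EDM-style convention and is likewise an assumption.

```python
import torch

def ctm_sample(g_theta, text_emb, latent_shape, timesteps, device="cpu"):
    """Deterministic trajectory sampling sketch.

    g_theta(x_t, t, s, text_emb) is assumed to predict the point at time s
    on the trajectory passing through x_t at time t. With timesteps = [T, 0]
    this is 1-step generation; a longer decreasing schedule gives multi-step
    deterministic sampling along the same trajectory, so the semantics of the
    output should not drift as steps are added.
    """
    # Start from pure noise at the largest time (EDM-style: std ~ T).
    x = torch.randn(latent_shape, device=device) * timesteps[0]
    for t, s in zip(timesteps[:-1], timesteps[1:]):
        x = g_theta(x, t, s, text_emb)  # "anytime-to-anytime" jump t -> s
    return x  # generated audio latent at the final time
```

Calling this with `timesteps=[sigma_max, 0.0]` corresponds to the fast 1-step mode, while a longer decreasing schedule (for example 17 boundary times) gives the 16-step deterministic mode referred to later in the interview.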

How did you go about developing the model – what was the methodology?

SoundCTM builds directly on our previous computer vision CTM (Consistency Trajectory Models) research, which reimagined how diffusion models can learn from the trajectory of data as it transforms over time. By extending CTM into the audio domain, SoundCTM makes it possible to generate complex, full-band sound with speed, clarity, and control, while avoiding the training bottlenecks that slow down other models.

To develop SoundCTM, we addressed the limitations of the CTM framework by proposing a novel feature distance for distillation loss, a strategy for distilling CFG trajectories, and a ν-sampling that combines text-conditional and unconditional student jumps.
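The interview does not spell out the exact form of the ν-sampling rule, but a natural reading of "combining text-conditional and unconditional student jumps" is a blend of the two, sketched below purely as an illustration; `null_emb` stands in for the embedding of an empty prompt.

```python
def nu_jump(g_theta, x_t, t, s, text_emb, null_emb, nu=0.7):
    """Hypothetical nu-sampling step (illustrative only).

    Blends the text-conditional student jump with the unconditional one,
    in the spirit of classifier-free guidance applied to a distilled model.
    """
    cond = g_theta(x_t, t, s, text_emb)    # jump guided by the text prompt
    uncond = g_theta(x_t, t, s, null_emb)  # jump with the null/empty prompt
    return nu * cond + (1.0 - nu) * uncond
```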

How did you evaluate the model, and what were the results?

Through our research, we demonstrate that SoundCTM-DiT-1B is the first large-scale distillation model to achieve notable 1-step and multi-step full-band text-to-sound generation.

When evaluating the model, in addition to standard objective metrics such as Fréchet Distance (FD), Kullback–Leibler divergence (KL), and CLAP score evaluated in full-band settings, we conducted subjective listening tests. A unique aspect of our evaluation was the use of sample-wise reconstruction error in the CLAP audio encoder’s feature space to compare outputs from 1-step and 16-step generations.
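As a rough illustration of that check (with an assumed CLAP audio encoder that returns one embedding per clip; the exact distance used in the paper may differ), the comparison could look like this:

```python
import torch

def semantic_drift(clap_audio_encoder, audio_1step, audio_multistep):
    """Per-sample distance between CLAP audio embeddings of the 1-step and
    multi-step outputs generated from the same prompt and initial noise."""
    with torch.no_grad():
        e1 = clap_audio_encoder(audio_1step)      # [batch, dim] embeddings, 1-step outputs
        em = clap_audio_encoder(audio_multistep)  # [batch, dim] embeddings, e.g. 16-step outputs
    return (e1 - em).pow(2).sum(dim=-1)           # lower values = less semantic drift
```

Low values across samples indicate that adding steps refines quality without changing what the sound depicts, which is the property reported in the next answer.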

This approach allowed us to objectively verify whether semantic content remained consistent between 1-step and multi-step generations. Our findings revealed that only our unique multi-step deterministic sampling preserved semantic consistency when compared to 1-step generation. This is a significant result that, to our knowledge, has not yet been achieved by any other distillation-based sound generator.

While this outcome is theoretically expected, our empirical validation adds strong support—especially in the context of content creation, where semantic fidelity is crucial.

Audio samples are available here.

Read the work in full

SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation, Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji.

About Yuki Mitsufuji

Yuki Mitsufuji is a Lead Research Scientist at Sony AI. In addition to his role at Sony AI, he is a Distinguished Engineer for Sony Group Corporation and the Head of Creative AI Lab for Sony R&D. Yuki holds a PhD in Information Science & Technology from the University of Tokyo. His groundbreaking work has made him a pioneer in foundational music and sound research, spanning sound separation and generative models that can be applied to music, sound, and other modalities.
