MarkTechPost@AI · May 16, 02:40
Stability AI Introduces Adversarial Relativistic-Contrastive (ARC) Post-Training and Stable Audio Open Small: A Distillation-Free Breakthrough for Fast, Diverse, and Efficient Text-to-Audio Generation Across Devices

Researchers at Stability AI have introduced a post-training method called Adversarial Relativistic-Contrastive (ARC) and the Stable Audio Open Small model, enabling fast, diverse, and efficient text-to-audio generation without traditional model distillation or classifier-free guidance. ARC augments a pre-trained rectified-flow generator with a relativistic adversarial loss and a contrastive discriminator loss, reducing the number of generation steps while maintaining strong alignment with text prompts. Combined with the Stable Audio Open framework, the system can generate 12 seconds of 44.1 kHz stereo audio in just 75 milliseconds on an H100 GPU, and in about 7 seconds on a mobile device. This paves the way for real-time creative applications on devices such as mobile audio tools and embedded systems.

🚀 ARC post-training avoids model distillation and classifier-free guidance, relying instead on adversarial and contrastive losses, which simplifies the training pipeline and improves efficiency.

⏱️ ARC can generate 12 seconds of 44.1 kHz stereo audio in 75 milliseconds on an H100 GPU and in about 7 seconds on a mobile CPU, demonstrating exceptional generation speed.

🎶 The Stable Audio Open Small model has 497 million parameters, supports 8-step generation, and is compatible with mobile deployment, making it well suited to resource-constrained environments.

📱 On a Vivo X200 Pro phone, dynamic Int8 quantization cut inference latency from 15.3 seconds to 6.6 seconds and halved memory usage, underscoring its practicality on mobile devices.

Text-to-audio generation has emerged as a transformative approach for synthesizing sound directly from textual prompts, with practical uses in music production, gaming, and virtual experiences. Under the hood, these models typically employ Gaussian flow-based techniques such as diffusion or rectified flows, which model the incremental steps that transition from random noise to structured audio. While these methods are highly effective at producing high-quality soundscapes, their slow inference has posed a barrier to real-time interactivity, which is particularly limiting when creative users expect instrument-like responsiveness from these tools.
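To make that cost concrete, a rectified-flow sampler integrates a learned velocity field from noise toward data, paying one network evaluation per step. The sketch below is a minimal illustration of this loop; the `velocity(x, t)` interface is an assumption for exposition, not any particular model's API.

```python
import torch

def rectified_flow_sample(velocity, shape, steps=50, device="cpu"):
    """Minimal Euler sampler for a rectified flow: integrate the learned
    velocity field from pure noise (t=0) toward data (t=1). Each of the
    `steps` iterations costs one full network evaluation, which is the
    latency bottleneck described above."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + velocity(x, t) * dt  # one model call per step
    return x
```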

Latency is the primary issue with these systems. Current text-to-audio models can take several seconds or even minutes to generate a few seconds of audio, and the core bottleneck lies in their step-based inference architecture, which requires between 50 and 100 iterations per output. Previous acceleration strategies have focused on distillation, in which smaller student models are trained under the supervision of larger teacher models to replicate multi-step inference in fewer steps. However, distillation is computationally expensive: it demands large-scale storage for intermediate training outputs or requires several models to be held in memory simultaneously, which hinders adoption, especially on mobile or edge devices. Such methods also tend to sacrifice output diversity and introduce over-saturation artifacts.

A few adversarial post-training methods have attempted to bypass the cost of distillation, but their success has been limited. Most existing implementations rely on partial distillation for initialization or do not scale well to complex audio synthesis, and fully adversarial solutions remain rare in audio. Tools like Presto integrate adversarial objectives but still depend on teacher models and CFG-based training for prompt adherence, which restricts their generative diversity.

Researchers from UC San Diego, Stability AI, and Arm introduced Adversarial Relativistic-Contrastive (ARC) post-training. This approach sidesteps the need for teacher models, distillation, or classifier-free guidance. Instead, ARC enhances an existing pre-trained rectified flow generator by integrating two novel training objectives: a relativistic adversarial loss and a contrastive discriminator loss. These help the generator produce high-fidelity audio in fewer steps while maintaining strong alignment with text prompts. When paired with the Stable Audio Open (SAO) framework, the result was a system capable of generating 12 seconds of 44.1 kHz stereo audio in only 75 milliseconds on an H100 GPU and around 7 seconds on mobile devices.
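The paper defines these losses precisely; as a rough, non-authoritative illustration of the two objectives, the PyTorch sketch below pairs a log-sigmoid relativistic loss with an InfoNCE-style contrastive ranking. The function names and exact formulations here are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def relativistic_losses(d_real, d_fake):
    """Relativistic adversarial objective (illustrative): the discriminator
    is trained to score a real clip above a generated clip for the same
    prompt, and the generator is trained to reverse that ordering.
    d_real, d_fake: discriminator logits, shape (batch,)."""
    loss_d = F.softplus(d_fake - d_real).mean()  # = -log sigmoid(d_real - d_fake)
    loss_g = F.softplus(d_real - d_fake).mean()  # = -log sigmoid(d_fake - d_real)
    return loss_d, loss_g

def contrastive_discriminator_loss(scores):
    """Contrastive objective (illustrative): scores[i, j] is the
    discriminator's score for real audio clip i paired with prompt j.
    Matched pairs lie on the diagonal and should outrank the mismatched
    pairs in the same row."""
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```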

Alongside ARC, the researchers introduced Stable Audio Open Small, a compact and efficient version of SAO tailored for resource-constrained environments. The model contains 497 million parameters and is built on a latent diffusion transformer with three main components: a waveform-compressing autoencoder, a T5-based text embedding system for semantic conditioning, and a DiT (Diffusion Transformer) that operates within the autoencoder's latent space. Stable Audio Open Small can generate stereo audio up to 11 seconds long at 44.1 kHz. It is designed to be deployed through the ‘stable-audio-tools’ library and supports ping-pong sampling, enabling efficient few-step generation. The model demonstrated exceptional inference efficiency, generating audio in under 7 seconds on a Vivo X200 Pro phone after dynamic Int8 quantization, which also cut RAM usage from 6.5 GB to 3.6 GB. This makes it especially viable for on-device creative applications such as mobile audio tools and embedded systems.
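For context, a generation call through stable-audio-tools might look like the sketch below. It follows the library's public README pattern, but the model id, the "pingpong" sampler identifier, and the quantization step are assumptions for illustration, not verified settings.

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed Hugging Face id for the small checkpoint.
model, config = get_pretrained_model("stabilityai/stable-audio-open-small")

# Optional: dynamic Int8 quantization of linear layers for CPU/mobile-style
# inference, as described above (illustrative, not the paper's exact recipe).
if device == "cpu":
    model = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
model = model.to(device)

conditioning = [{"prompt": "warm analog synth arpeggio, 120 BPM",
                 "seconds_start": 0, "seconds_total": 11}]

# Few-step generation; the article reports 8 steps with ping-pong sampling.
audio = generate_diffusion_cond(
    model,
    steps=8,
    cfg_scale=1,  # ARC removes classifier-free guidance
    conditioning=conditioning,
    sample_size=config["sample_size"],
    sampler_type="pingpong",  # assumed sampler identifier
    device=device,
)

# Collapse the batch dimension, peak-normalize, and save a stereo WAV.
audio = rearrange(audio, "b d n -> d (b n)").to(torch.float32)
audio = audio / audio.abs().max().clamp(min=1e-8)
torchaudio.save("output.wav", audio.cpu(), config["sample_rate"])
```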

The ARC training approach replaces the traditional L2 loss with an adversarial formulation in which generated and real samples, paired with identical prompts, are evaluated by a discriminator trained to distinguish between them. A contrastive objective teaches the discriminator to rank accurate audio-text pairs above mismatched ones, improving prompt relevance. Together, these paired objectives eliminate the need for CFG while achieving better prompt adherence. ARC also adopts ping-pong sampling, which refines the audio output through alternating denoising and re-noising cycles, reducing inference steps without compromising quality.
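Concretely, ping-pong sampling can be pictured as the loop below: at each step the model jumps to a clean estimate, then fresh noise is injected at the next, lower level of the schedule. This is a minimal sketch under an assumed `denoise(x, sigma)` interface that returns the clean-signal estimate; it is not the authors' implementation.

```python
import torch

def ping_pong_sample(denoise, shape, sigmas, device="cpu"):
    """Alternate denoising and re-noising (illustrative). `sigmas` is a
    decreasing noise schedule whose last entry is 0, so the final
    iteration returns the clean estimate itself."""
    x = sigmas[0] * torch.randn(shape, device=device)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0 = denoise(x, sigma)                        # denoise: predict the clean latent
        x = x0 + sigma_next * torch.randn_like(x0)    # re-noise at the next level
    return x
```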

ARC’s performance was evaluated extensively. In objective tests, it achieved an FDopenl3 score of 84.43, a KLpasst score of 2.24, and a CLAP score of 0.27, indicating balanced quality and semantic precision. Diversity was notably strong, with a CLAP Conditional Diversity Score (CCDS) of 0.41. Real-Time Factor reached 156.42, reflecting outstanding generation speed, while GPU memory usage remained at a practical 4.06 GB. Subjectively, ARC scored 4.4 for diversity, 4.2 for quality, and 4.2 for prompt adherence in human evaluations involving 14 participants. Unlike distillation-based models like Presto, which scored higher on quality but dropped to 2.7 on diversity, ARC presented a more balanced and practical solution.

Several key takeaways from the research by Stability AI on Adversarial Relativistic-Contrastive (ARC) post-training and Stable Audio Open Small include:

- ARC post-training avoids distillation and classifier-free guidance entirely, relying on a relativistic adversarial loss paired with a contrastive discriminator loss.
- Generating 12 seconds of 44.1 kHz stereo audio takes roughly 75 ms on an H100 GPU and about 7 seconds on a mobile device.
- Stable Audio Open Small packs 497 million parameters, supports few-step generation via ping-pong sampling, and deploys through the ‘stable-audio-tools’ library.
- Dynamic Int8 quantization on a Vivo X200 Pro cut inference latency from 15.3 to 6.6 seconds and RAM usage from 6.5 GB to 3.6 GB.
- In evaluations, ARC reached a Real-Time Factor of 156.42, a CCDS of 0.41, and human ratings of 4.4 for diversity, 4.2 for quality, and 4.2 for prompt adherence.

In conclusion, the combination of ARC post-training and Stable Audio Open Small eliminates the reliance on resource-intensive distillation and classifier-free guidance, delivering a streamlined adversarial framework that accelerates inference without compromising output quality or prompt adherence. ARC enables fast, diverse, and semantically rich audio synthesis in both high-performance and mobile environments, and with Stable Audio Open Small optimized for lightweight deployment, this research lays the groundwork for integrating responsive generative audio tools into everyday creative workflows, from professional sound design to real-time applications on edge devices.


Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

