MarkTechPost@AI July 23, 2024
Stability AI Open-Sources Stable Audio Open: An Audio Generation Model with Variable-Length (up to 47s) Stereo Audio at 44.1kHz from Text Prompts

Stability AI has released Stable Audio Open, an open-source text-to-audio generation model that produces up to 47 seconds of stereo audio at a 44.1kHz sampling rate from text prompts. The model is trained on Creative Commons-licensed audio, ensuring ethical and legal data sourcing, and gives researchers a powerful tool.

😄 **Open weights:** Unlike many proprietary models, Stable Audio Open's weights are publicly available, so researchers and developers can inspect, modify, and extend the model because its design and parameters are openly released.

😎 **Ethical data use:** The model is trained exclusively on Creative Commons-licensed audio files, ensuring the ethical and legal compliance of the training data. By using Creative Commons-licensed data, the developers promote openness in data practices and avoid potential copyright issues.

🚀 **High-fidelity audio synthesis:** The model's architecture is designed to deliver accessible, high-quality audio synthesis. It generates high-quality stereo audio at a 44.1kHz sampling rate, ensuring the output meets strict standards for clarity and realism.

💪 **Thorough evaluation:** The model's performance was evaluated comprehensively to confirm that it matches or exceeds existing models. FDopenl3, one of the primary evaluation metrics, measures the realism of generated audio; the results show that the model holds its own against leading industry models, demonstrating its ability to generate high-quality audio.

📊 **Model comparison:** To assess the model's capabilities and identify areas for improvement, its performance was compared with other well-performing models. This comparative study demonstrates the new model's superior quality and usability.

In the field of Artificial Intelligence, open generative models stand out as a cornerstone of progress. These models are vital for advancing research and fostering creativity, as they allow fine-tuning and serve as benchmarks for new innovations. However, a significant challenge persists: many state-of-the-art text-to-audio models remain proprietary, limiting their accessibility for researchers.

Recently, a team of researchers from Stability AI introduced a new open-weight text-to-audio model trained exclusively on Creative Commons data. This approach is intended to guarantee openness and ethical data use while offering the AI community a powerful tool. Its key features are as follows:

    This new model has open weights, in contrast to numerous proprietary models. Researchers and developers can examine, alter, and build upon the model because its design and parameters are available to the general public.
    Only audio files with Creative Commons licenses were used to train the model. This decision guarantees the ethical and legal soundness of the training material. By using Creative Commons-licensed data, the developers encourage openness in data practices and steer clear of possible copyright issues.

The new model's architecture is designed to provide accessible, high-quality audio synthesis:

    The model uses a sophisticated architecture that delivers remarkable fidelity in text-to-audio generation. It can generate high-quality stereo sound at a sampling rate of 44.1kHz, guaranteeing that the resulting audio satisfies strict requirements for clarity and realism (a minimal usage sketch follows this list).
    A variety of Creative Commons-licensed audio files were used in the training process. This approach helps the model learn from a wide variety of soundscapes and guarantees that it can produce realistic and varied audio outputs.
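
For readers who want to try the model, below is a minimal generation sketch. It assumes the stable-audio-tools package and the stabilityai/stable-audio-open-1.0 checkpoint on Hugging Face; the function names, arguments, and sampler settings follow the publicly documented usage and may differ in other releases.

```python
# Minimal text-to-audio sketch for Stable Audio Open (assumed stable-audio-tools API).
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the open weights and read the model configuration
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]  # 44100 Hz
sample_size = model_config["sample_size"]  # samples per generation window
model = model.to(device)

# Text prompt plus timing conditioning (the model supports clips up to ~47 seconds)
conditioning = [{
    "prompt": "128 BPM tech house drum loop",
    "seconds_start": 0,
    "seconds_total": 30,
}]

# Run the diffusion sampler to produce a batch of stereo waveforms
output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="dpmpp-3m-sde",
    device=device,
)

# Collapse the batch dimension, peak-normalize, and save a 16-bit stereo WAV
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32)
output = (output / output.abs().max()).clamp(-1, 1)
torchaudio.save("output.wav", (output * 32767).to(torch.int16).cpu(), sample_rate)
```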

To make sure the new model matches or exceeds the standards set by previous models, its performance has been thoroughly assessed. One of the primary evaluation metrics is FDopenl3, a Fréchet Distance computed on OpenL3 audio embeddings that measures the realism of generated audio. The results show that the model performs on par with the industry's top models, demonstrating its capacity to generate high-quality audio. Its performance has also been compared with that of other well-performing models to evaluate its capabilities and pinpoint areas for development; this comparative study attests to the new model's quality and usability.
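
The article does not include Stability AI's evaluation code, but the sketch below illustrates how a Fréchet-distance metric such as FDopenl3 is typically computed: fit a Gaussian to embeddings of reference clips and of generated clips (here assumed to be OpenL3 embeddings extracted beforehand) and measure the distance between the two distributions.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two embedding sets.

    real_emb, gen_emb: arrays of shape (num_clips, embedding_dim), e.g.
    OpenL3 embeddings of reference and generated audio clips.
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the product of the two covariance matrices
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

A lower distance indicates that the statistics of the generated clips' embeddings are closer to those of real audio.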

In conclusion, generative audio technology has advanced significantly with the release of this open-weight text-to-audio model. By emphasizing openness, ethical data use, and high-quality audio synthesis, the model addresses many of the field's existing problems. It sets a new standard for text-to-audio generation and is a valuable resource for researchers, artists, and developers.


Check out the Paper, Model, and GitHub. All credit for this research goes to the researchers of this project.
