MarkTechPost@AI, November 26, 2024
NVIDIA AI Unveils Fugatto: A 2.5 Billion Parameter Audio Model that Generates Music, Voice, and Sound from Text and Audio Input

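NVIDIA has introduced Fugatto, an AI model with 2.5 billion parameters designed to generate and manipulate music, speech, and sound. Fugatto combines text prompts with advanced audio synthesis, making sound inputs highly flexible for creative experimentation, such as turning a piano melody into a sung vocal line or making a trumpet produce unexpected sounds. The model accepts text and optional audio inputs and can create and manipulate sound in ways that go beyond conventional audio generation models, allowing artists and developers to experiment in real time, generate new kinds of sounds, or fluidly modify existing audio. Fugatto’s architecture uses a Transformer with specific modifications, enabling tasks such as singing synthesis, sound transformation, and effects manipulation, which makes it suitable for a wide range of audio applications and opens new possibilities for creative audio production.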

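🤔 Fugatto is an AI model with 2.5 billion parameters that can generate and manipulate music, speech, and sound. It combines text prompts with advanced audio synthesis, making sound inputs highly flexible for creative experimentation.

🎤 Fugatto accepts text and optional audio inputs and can create and manipulate sound in ways that go beyond conventional audio generation models, such as turning a piano melody into a sung vocal line or making a trumpet produce unexpected sounds.

🎼 Fugatto’s architecture uses a Transformer with specific modifications, such as Adaptive Layer Normalization, enabling tasks like singing synthesis, sound transformation, and effects manipulation, which makes it suitable for a wide range of audio applications.

🔄 Fugatto uses an innovative data generation approach: in addition to regular datasets, it relies on a specialized dataset generation technique to create a variety of audio and transformation tasks, and uses large language models (LLMs) to enhance instruction generation so that the model better understands the relationship between audio and text prompts.

🎨 Fugatto introduces Composable Audio Representation Transformation (ComposableART), which extends classifier-free guidance to compositional instructions, giving users precise control over synthesis so they can blend different sounds and generate unique sonic phenomena.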

Creating, editing, and transforming music and sounds present both technical and creative challenges. Current AI models often struggle with versatility, specializing in narrow tasks or lacking the ability to generalize effectively. This limits AI-assisted production and hinders creative adaptability. For AI to genuinely contribute to music and audio production, it must be versatile, compositional, and responsive to creative prompts, allowing artists to craft unique sounds. There is a clear need for a generalist model that can navigate the nuances of audio and text interaction, perform creative transformations, and deliver high-quality output.

NVIDIA has introduced Fugatto, an AI model with 2.5 billion parameters designed for generating and manipulating music, voices, and sounds. Fugatto blends text prompts with advanced audio synthesis capabilities, making sound inputs highly flexible for creative experimentation—such as changing a piano line into a human voice singing or making a trumpet produce unexpected sounds.

The model supports both text and optional audio inputs, enabling it to create and manipulate sounds in ways that go beyond conventional audio generation models. This versatile approach allows for real-time experimentation, enabling artists and developers to generate new types of sounds or modify existing audio fluidly. NVIDIA’s emphasis on flexibility allows Fugatto to excel at tasks involving complex compositional transformations, making it a valuable tool for artists and audio producers.

Technical Details

Fugatto was trained with a data generation approach that extends beyond conventional supervised learning. Its training involved not just regular datasets but also a specialized dataset generation technique that creates a wide range of audio generation and transformation tasks. The approach uses large language models (LLMs) to enhance instruction generation, which helps the model interpret the relationship between audio and textual prompts. This dataset enrichment strategy lets Fugatto learn from diverse contexts, building a robust foundation for multitask learning.
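To make the idea of generated tasks concrete, here is a minimal Python sketch of how instruction and transformation examples could be produced programmatically. The templates, the `make_instruction` and `reverse_task` helpers, and the choice of audio reversal as the transformation are all hypothetical illustrations and are not taken from NVIDIA’s pipeline; the LLM paraphrasing step mentioned above is omitted.

```python
import random

# Hypothetical instruction templates; not taken from the Fugatto paper.
TEMPLATES = [
    "synthesize {description}",
    "generate audio of {description}",
    "create a sound: {description}",
]


def make_instruction(description, rng):
    """Pick one template at random and fill in the caption."""
    return rng.choice(TEMPLATES).format(description=description)


def reverse_task(samples):
    """Build one (instruction, input, target) example from a raw waveform.

    Reversal is used only because the target is trivial to compute;
    it stands in for any transformation with a known ground truth.
    """
    return {
        "instruction": "reverse the input audio",
        "input_audio": list(samples),
        "target_audio": list(reversed(samples)),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_instruction("a dog barking in the rain", rng))
    print(reverse_task([0.1, 0.2, 0.3]))
```

In a setup like this, an LLM could then rewrite each templated instruction into more varied phrasings, which is the role the article attributes to LLMs.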

A key innovation is the Composable Audio Representation Transformation (ComposableART), an inference-time technique developed to extend classifier-free guidance to compositional instructions. This enables Fugatto to combine, interpolate, or negate different audio generation instructions smoothly, opening new possibilities in sound creation. ComposableART provides a high level of control over synthesis, allowing users to navigate Fugatto’s sonic palette with precision, blending different sounds and generating unique sonic phenomena.
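As a rough illustration of what extending classifier-free guidance to several instructions can look like, the sketch below combines one unconditional prediction with multiple conditional predictions using per-instruction weights. This is a generic weighted-guidance formulation, not NVIDIA’s implementation; the function name `compose_guidance` and the use of plain Python lists in place of model outputs are assumptions made for the example.

```python
def compose_guidance(uncond, conds, weights):
    """Combine per-instruction predictions via weighted classifier-free guidance.

    uncond:  model output with no instruction (list of floats)
    conds:   one model output per instruction (list of lists of floats)
    weights: one guidance weight per instruction; a negative weight
             pushes the result away from that instruction

    Computes uncond + sum_i weights[i] * (conds[i] - uncond), element-wise.
    """
    if len(conds) != len(weights):
        raise ValueError("need one weight per conditional prediction")
    out = list(uncond)
    for cond, w in zip(conds, weights):
        for i, (c, u) in enumerate(zip(cond, uncond)):
            out[i] += w * (c - u)
    return out


if __name__ == "__main__":
    uncond = [0.0, 0.0]
    piano = [1.0, 0.0]   # stand-in for the prediction given instruction A
    voice = [0.0, 2.0]   # stand-in for the prediction given instruction B
    print(compose_guidance(uncond, [piano, voice], [1.0, 0.5]))
```

With a single instruction and weight `w`, this reduces to ordinary classifier-free guidance. Varying the weights between two instructions interpolates between them, and a negative weight corresponds to the negation described above.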

Fugatto’s architecture leverages Transformer models enhanced by specific modifications like Adaptive Layer Normalization, which helps maintain consistency across diverse inputs and supports compositional instructions better than existing models. This translates into a model capable of tasks like singing synthesis, sound transformations, and effects manipulations, making it suitable for a wide range of audio applications.
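For readers unfamiliar with Adaptive Layer Normalization, the following sketch shows the general technique: features are normalized as in ordinary layer normalization, then scaled and shifted by values derived from a conditioning signal. The `(1 + scale)` convention and the function signature are assumptions chosen for illustration; the article does not specify how Fugatto parameterizes this layer.

```python
import math


def adaptive_layer_norm(x, scale, shift, eps=1e-5):
    """Normalize x to zero mean and unit variance, then modulate it.

    scale and shift are assumed to come from a small network applied to
    the conditioning signal; here they are passed in directly.
    A scale of 0 and shift of 0 gives plain (non-affine) layer norm.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [
        (v - mean) * inv_std * (1.0 + g) + b
        for v, g, b in zip(x, scale, shift)
    ]


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0]
    print(adaptive_layer_norm(features, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
    print(adaptive_layer_norm(features, [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]))
```

Because the scale and shift depend on the condition rather than being fixed learned parameters, the same layer can behave differently for different instructions.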

Fugatto’s versatility lies in its ability to perform at the intersection of creativity and technology. Specialized models have traditionally required manual intervention or narrowly defined tasks, often lacking the flexibility needed for creative experimentation. Fugatto, however, can be adapted for numerous purposes, which brings its utility to the forefront in the audio creation landscape. Early tests of Fugatto show that it performs competitively with other specialized models on common benchmarks, but its real strength lies in emergent abilities.

The results have been promising: Fugatto’s evaluations indicate competitive or superior performance compared to specialized models for audio synthesis and transformation, including on tasks that involve synthesizing new sounds or following compositional instructions. For instance, it has demonstrated capabilities like creating novel sounds, such as synthesizing a saxophone with unusual characteristics or generating speech that integrates smoothly with background soundscapes, tasks that were previously challenging for other models.

Furthermore, Fugatto’s ability to generate emergent sounds—sonic phenomena that go beyond typical training data—opens new possibilities for creative sound design. Its use of ComposableART for compositional synthesis means users can merge multiple attributes dynamically, making it a valuable tool for audio producers seeking creative control.

Conclusion

Fugatto is a notable advancement in generative AI for audio, offering capabilities that challenge traditional limits and enhance creative sound manipulation. NVIDIA has integrated large language models with the intricacies of sound and music, resulting in a tool that is both powerful and versatile. Fugatto’s ability to manage nuanced audio tasks, from straightforward sound generation to complex compositional modifications, makes it a valuable contribution to the future of creative AI tools. This model has significant implications not only for artists but also for industries such as gaming, entertainment, and education, where AI tools are increasingly supporting and inspiring human creativity.


Check out the Paper and NVIDIA Blog. All credit for this research goes to the researchers of this project.

The post NVIDIA AI Unveils Fugatto: A 2.5 Billion Parameter Audio Model that Generates Music, Voice, and Sound from Text and Audio Input appeared first on MarkTechPost.
