NVIDIA Just Released Audio Flamingo 3: An Open-Source Model Advancing Audio General Intelligence

NVIDIA has released Audio Flamingo 3 (AF3), an open-source large audio-language model (LALM) that represents a major advance in audio understanding and reasoning. AF3 can process audio inputs up to 10 minutes long, supports multi-turn, multi-audio conversations, offers on-demand 'thinking', and even enables voice-to-voice interaction. The model surpasses existing models on multiple benchmarks, and NVIDIA has open-sourced the model weights, training recipes, inference code, and four datasets, opening new directions for audio AI research.

🎤 AF-Whisper: AF3 uses the AF-Whisper encoder, derived from Whisper-v3, which handles speech, ambient sound, and music in a single unified architecture, resolving the inconsistencies caused by the separate encoders of earlier LALMs. AF-Whisper leverages audio-caption datasets, synthesized metadata, and a dense 1280-dimensional embedding space aligned with text representations.

🤔 Chain-of-thought capability: AF3 can 'think'. Using the AF-Think dataset (250k examples), it performs chain-of-thought reasoning when prompted, explaining its reasoning steps before arriving at an answer, a key step toward transparent audio AI.

🗣️ Multi-turn, multi-audio chat: Through the AF-Chat dataset (75k dialogues), AF3 can hold multi-turn, contextual conversations involving multiple audio inputs. It also introduces voice-to-voice dialogue using a streaming text-to-speech module.

⏳ Long-audio reasoning: AF3 is the first fully open model able to reason over audio inputs up to 10 minutes long. Trained on LongAudio-XL (1.25M examples), it supports tasks such as meeting summarization, podcast understanding, sarcasm detection, and temporal grounding.

📈 Strong benchmark results: AF3 surpasses both open and closed models on more than 20 benchmarks, including MMAU (average): 73.14%, 2.14% higher than Qwen2.5-O; LibriSpeech (ASR): 1.57% WER, beating Phi-4-mm; and ClothoAQA: 91.1% (vs. 89.2% for Qwen2.5-O).

Heard about Artificial General Intelligence (AGI)? Meet its auditory counterpart—Audio General Intelligence. With Audio Flamingo 3 (AF3), NVIDIA introduces a major leap in how machines understand and reason about sound. While past models could transcribe speech or classify audio clips, they lacked the ability to interpret audio in a context-rich, human-like way—across speech, ambient sound, and music, and over extended durations. AF3 changes that.

With Audio Flamingo 3, NVIDIA introduces a fully open-source large audio-language model (LALM) that not only hears but also understands and reasons. Built on a five-stage curriculum and powered by the AF-Whisper encoder, AF3 supports long audio inputs (up to 10 minutes), multi-turn multi-audio chat, on-demand thinking, and even voice-to-voice interactions. This sets a new bar for how AI systems interact with sound, bringing us a step closer to AGI.
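
Conceptually, a unified encoder like AF-Whisper maps speech, ambient sound, and music into one embedding space that the language model can consume alongside text. The following is a minimal PyTorch sketch of that idea, not AF3's actual architecture: the projection layer and the 4096 text-embedding width are illustrative assumptions, and only the 1280-dimensional audio embedding size comes from the article.

```python
import torch
import torch.nn as nn

audio_dim, text_dim = 1280, 4096  # text width is an assumed value

# Stand-in for AF-Whisper output: 500 frames of 1280-dim audio features.
encoder_out = torch.randn(1, 500, audio_dim)

# A learned projection aligning audio features with the text embedding
# space, so the LLM can attend over audio features like text tokens.
project = nn.Linear(audio_dim, text_dim)

audio_tokens = project(encoder_out)
print(audio_tokens.shape)  # torch.Size([1, 500, 4096])
```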

The Core Innovations Behind Audio Flamingo 3

1. AF-Whisper, a unified audio encoder: AF3 uses AF-Whisper, a novel encoder adapted from Whisper-v3. It processes speech, ambient sounds, and music with a single architecture, solving a major limitation of earlier LALMs, whose separate encoders led to inconsistent representations. AF-Whisper leverages audio-caption datasets, synthesized metadata, and a dense 1280-dimensional embedding space aligned with text representations.

2. Chain-of-thought for audio, on-demand reasoning: Unlike static QA systems, AF3 is equipped with 'thinking' capabilities. Using the AF-Think dataset (250k examples), the model can perform chain-of-thought reasoning when prompted, explaining its inference steps before arriving at an answer, a key step toward transparent audio AI (a minimal prompt sketch follows this list).

3. Multi-turn, multi-audio conversations: Through the AF-Chat dataset (75k dialogues), AF3 can hold contextual conversations involving multiple audio inputs across turns. This mimics real-world interactions, where humans refer back to previous audio cues. AF3 also introduces voice-to-voice conversations using a streaming text-to-speech module.

4. Long audio reasoning: AF3 is the first fully open model capable of reasoning over audio inputs up to 10 minutes long. Trained with LongAudio-XL (1.25M examples), the model supports tasks like meeting summarization, podcast understanding, sarcasm detection, and temporal grounding.
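
To make the on-demand 'thinking' behavior concrete, here is a minimal, hypothetical prompting sketch. The generate() stub and the think-style instruction wording are illustrative assumptions, not AF3's actual interface; NVIDIA's released inference code defines the real entry points.

```python
def generate(prompt: str, audio_path: str) -> str:
    """Stub standing in for the model's audio-conditioned generation."""
    return f"<model output for {audio_path!r} given {prompt!r}>"

question = "Is the speaker being sarcastic in this clip?"

# Default mode: the model answers directly.
direct = generate(question, "meeting_clip.wav")

# On-demand thinking: an explicit instruction asks the model to lay out
# its reasoning over acoustic and lexical cues before answering.
thinking = generate("Think step by step, then answer: " + question,
                    "meeting_clip.wav")

print(direct)
print(thinking)
```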

State-of-the-Art Benchmarks and Real-World Capability

AF3 surpasses both open and closed models on over 20 benchmarks, including:

- MMAU (average): 73.14%, 2.14% higher than Qwen2.5-O
- LibriSpeech (ASR): 1.57% WER, outperforming Phi-4-mm
- ClothoAQA: 91.1% (vs. 89.2% for Qwen2.5-O)

These improvements aren’t just marginal; they redefine what’s expected from audio-language systems. AF3 also introduces benchmarking in voice chat and speech generation, achieving 5.94s generation latency (vs. 14.62s for Qwen2.5) and better similarity scores.

The Data Pipeline: Datasets That Teach Audio Reasoning

NVIDIA didn't just scale compute; they rethought the data. Among the four open-sourced datasets:

- AF-Think: 250k examples teaching on-demand chain-of-thought reasoning
- AF-Chat: 75k multi-turn, multi-audio dialogues
- LongAudio-XL: 1.25M examples for reasoning over audio inputs up to 10 minutes long

Each dataset is fully open-sourced, along with training code and recipes, enabling reproducibility and future research.
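
For instance, assuming the datasets are published on the Hugging Face Hub under NVIDIA's organization, loading one might look like the sketch below; the repository id and record fields are assumptions, so check the actual dataset cards for the real identifiers.

```python
from datasets import load_dataset

# Hypothetical repository id; verify against the released dataset cards.
ds = load_dataset("nvidia/AF-Think", split="train")

for example in ds.select(range(3)):
    # Expected fields (assumed): an audio reference, a question, a
    # reasoning trace, and a final answer.
    print(example)
```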

Open Source

AF3 is not just a model drop. NVIDIA released:

- the model weights
- the training recipes and code
- the inference code
- four open datasets, including AF-Think, AF-Chat, and LongAudio-XL

This transparency makes AF3 the most accessible state-of-the-art audio-language model. It opens new research directions in auditory reasoning, low-latency audio agents, music comprehension, and multi-modal interaction.
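
As a hedged starting point for experimentation, fetching the released weights might look like the following; the repository id is an assumption, and NVIDIA's released inference code defines how to actually run the model.

```python
from huggingface_hub import snapshot_download

# Hypothetical repository id for the released checkpoint.
local_dir = snapshot_download("nvidia/audio-flamingo-3")
print("Weights downloaded to:", local_dir)
```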

Conclusion: Toward General Audio Intelligence

Audio Flamingo 3 demonstrates that deep audio understanding is not just possible but reproducible and open. By combining scale, novel training strategies, and diverse data, NVIDIA delivers a model that listens, understands, and reasons in ways previous LALMs could not.


Check out the Paper, Code, and Model on Hugging Face. All credit for this research goes to the researchers of this project.
