Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

cs.AI updates on arXiv.org, July 14, 12:08

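This article introduces the audio-language model AF3, which uses novel strategies for cross-modal learning and offers advanced capabilities such as long-audio understanding and reasoning and multi-turn dialogue, marking a breakthrough in audio understanding and reasoning.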

arXiv:2507.08128v1 Announce Type: cross Abstract: We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

Related tags

Audio-language models, cross-modal learning, long-audio understanding
