NVIDIA AI Just Released the Largest Open-Source Speech AI Dataset and State-of-the-Art Models for European Languages

MarkTechPost@AI, 17 hours ago

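Nvidia has released Granary, the largest open-source speech dataset for European languages to date, along with two state-of-the-art models, Canary-1b-v2 and Parakeet-tdt-0.6b-v3. Granary contains about one million hours of audio across 25 European languages, with particular attention to lower-resource languages. Its pseudo-labeling pipeline reduces the cost of manual annotation, and the dataset supports both automatic speech recognition (ASR) and speech translation (AST), which substantially improves training efficiency. Canary-1b-v2 is a billion-parameter multilingual model that delivers high-quality recognition and translation between English and 24 European languages, with fast inference and a broad feature set. Parakeet-tdt-0.6b-v3 is designed for real-time multilingual speech recognition, supports automatic language detection, and is efficient and ready for commercial use. Together, these resources should significantly advance the development and adoption of multilingual speech AI applications in Europe.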

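🟢 Granary was developed by Nvidia with Carnegie Mellon University and Fondazione Bruno Kessler. It contains about one million hours of audio covering 25 European languages: 650,000 hours for speech recognition and 350,000 hours for speech translation. The dataset pays particular attention to languages with little annotated data, such as Croatian, Estonian, and Maltese. A pseudo-labeling pipeline processes unlabeled public audio to structure it and raise its quality, reducing the need for manual annotation. This speeds up model convergence, so developers need only about half as much data to reach a target accuracy.

🔵 Canary-1b-v2 is a billion-parameter encoder-decoder model trained on Granary that performs speech recognition and translation between English and 24 supported languages. It is strong on accuracy, multitask capability, and inference speed: its quality is comparable to models three times larger, with inference up to 10 times faster. It supports automatic punctuation, capitalization, and word- and segment-level timestamps, including timestamps for translated output. The architecture pairs a FastConformer encoder with a Transformer decoder and uses a unified SentencePiece tokenizer for all languages. It remains stable in noisy conditions.

🟣 Parakeet-tdt-0.6b-v3 is a 600-million-parameter multilingual speech recognition model designed for high-throughput or large-volume transcription across all 25 European languages. It extends the Parakeet family to full European coverage and detects the language automatically, transcribing input audio without extra prompting. It supports real-time processing, can transcribe up to 24 minutes of audio in a single inference pass, and prioritizes low latency, batch processing, and accuracy. It provides word-level timestamps, punctuation, and capitalization, and stays reliable on complex content such as numbers and song lyrics and on challenging audio.

🟡 The Granary dataset and model suite aim to accelerate the democratization of multilingual speech AI in Europe, making it easier for developers to build high-quality applications that support linguistic diversity, such as multilingual chatbots, customer service voice agents, and near-real-time translation services. The resources are openly available, which lowers the barrier to entry and encourages broader access to and innovation in speech AI.

🟢 Canary-1b-v2 reports an ASR word error rate (WER) of 7.15% on the AMI dataset and 10.82% on LibriSpeech Clean, and AST COMET scores of 79.3 for X→English and 84.56 for English→X. The model is released under the CC BY 4.0 license and is optimized for Nvidia GPU-accelerated systems, giving fast training and inference suited to large-scale production environments.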

Nvidia has taken a major leap in the development of multilingual speech AI, unveiling Granary, the largest open-source speech dataset for European languages, and two state-of-the-art models: Canary-1b-v2 and Parakeet-tdt-0.6b-v3. This release sets a new standard for accessible, high-quality resources in automatic speech recognition (ASR) and speech translation (AST), especially for underrepresented European languages.

Granary: The Foundation of Multilingual Speech AI

Granary is a massive, multilingual corpus developed in collaboration with Carnegie Mellon University and Fondazione Bruno Kessler. It delivers around one million hours of audio, with 650,000 hours for speech recognition and 350,000 for speech translation. The dataset covers 25 European languages—representing nearly all official EU languages, plus Russian and Ukrainian—with a critical focus on languages with limited annotated data, such as Croatian, Estonian, and Maltese.
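As a quick check on the figures above, the short sketch below adds the two subsets and prints each one's share of the total. The variable and function names are illustrative only, and the hour counts are the rounded figures quoted in this article rather than exact corpus statistics.

```python
# Rounded hour counts quoted in the article (approximate, not exact corpus statistics).
GRANARY_HOURS = {"asr": 650_000, "ast": 350_000}


def subset_shares(hours):
    """Return each subset's fraction of the total hours."""
    total = sum(hours.values())
    return {task: count / total for task, count in hours.items()}


total_hours = sum(GRANARY_HOURS.values())
shares = subset_shares(GRANARY_HOURS)

print(f"total: {total_hours:,} hours")
for task, share in shares.items():
    print(f"{task}: {share:.0%}")
```

The two subsets add up to the "around one million hours" headline figure, split roughly 65/35 between recognition and translation.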

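Key features:

Largest open-source speech dataset: around one million hours of audio across 25 European languages.

Pseudo-labeling pipeline: processes unlabeled public audio into structured, higher-quality training data, reducing the need for manual annotation.

Supports both ASR and AST: 650,000 hours for speech recognition and 350,000 hours for speech translation.

Open access: the dataset is openly available to developers and researchers.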

By leveraging clean, high-quality data, Granary enables significantly faster model convergence. Research demonstrates that developers need half as much Granary data to reach target accuracies compared to competing datasets, making it especially valuable for resource-constrained languages and rapid prototyping.
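The article does not describe how Granary's pseudo-labeling pipeline works internally. As a generic illustration of the idea (a model transcribes unlabeled audio, and only transcripts that pass quality checks are kept for training), here is a minimal sketch. The `PseudoLabel` class, the `confidence` field, the thresholds, and the sample clips are all hypothetical and are not taken from Nvidia's pipeline.

```python
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    audio_id: str
    text: str
    confidence: float  # hypothetical score in [0, 1] from the labeling model


def filter_pseudo_labels(labels, min_confidence=0.9, min_words=1):
    """Keep machine-generated transcripts that pass simple quality checks."""
    kept = []
    for label in labels:
        text = " ".join(label.text.split())  # collapse repeated whitespace
        if label.confidence < min_confidence:
            continue  # the labeling model was not confident enough
        if len(text.split()) < min_words:
            continue  # empty or near-empty transcript
        kept.append(PseudoLabel(label.audio_id, text, label.confidence))
    return kept


# Made-up sample data for illustration only.
candidates = [
    PseudoLabel("clip-001", "dobar dan  svima", 0.97),
    PseudoLabel("clip-002", "tere hommikust", 0.55),
    PseudoLabel("clip-003", "   ", 0.99),
]
kept = filter_pseudo_labels(candidates)
print([label.audio_id for label in kept])
```

Only the first clip survives: the second falls below the confidence threshold and the third has no words. A production pipeline would likely add further checks (language identification, duplicate removal, and so on), which are left out here.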

Canary-1b-v2: Multilingual ASR + Translation (En ↔ 24 Languages)

Canary-1b-v2 is a billion-parameter Encoder-Decoder model trained on Granary, delivering high-quality transcription and translation between English and 24 supported European languages.

It’s architected for accuracy and multitask capabilities:

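Languages supported: English plus 24 European languages.

State-of-the-art performance: quality comparable to models three times larger, with up to 10× faster inference.

Multitask capability: speech recognition and speech translation in a single model.

Features: automatic punctuation, capitalization, and word- and segment-level timestamps, including timestamps for translated output.

Architecture: FastConformer encoder and Transformer decoder, with a unified SentencePiece tokenizer for all languages.

Robustness: stable performance in noisy conditions.

Evaluation highlights:

ASR Word Error Rate (WER): 7.15% on AMI, 10.82% on LibriSpeech Clean.

AST COMET Scores: 79.3 for X→English, 84.56 for English→X.

Deployment: released under CC BY 4.0 and optimized for Nvidia GPU-accelerated systems, for fast training and inference in large-scale production environments.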
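Word error rate, the ASR metric named in the evaluation highlights, is conventionally defined as the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below shows that standard calculation in plain Python. It is not the evaluation script behind the reported numbers, and it skips the text normalization (casing, punctuation) that real benchmarks usually apply first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.2%}")
```

In this example the hypothesis drops one of six reference words, so the WER is one sixth. Lower is better, and the value can exceed 100% when the output contains many inserted words.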

Parakeet-tdt-0.6b-v3: Real-Time Multilingual ASR

Parakeet-tdt-0.6b-v3 is a 600-million-parameter multilingual ASR model designed for high-throughput or large-volume transcription in all 25 supported languages. It extends the Parakeet family (previously English-centric) to full European coverage.

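Automatic language detection: transcribes input audio without extra prompting.

Real-time capability: transcribes up to 24 minutes of audio in a single inference pass.

Fast, scalable, and commercial-ready: prioritizes low latency, batch processing, and accuracy, with word-level timestamps, punctuation, and capitalization.

Robustness: reliable on complex content such as numbers and song lyrics, and on challenging audio.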
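Parakeet-tdt-0.6b-v3 is described as transcribing up to 24 minutes of audio in a single inference pass, so longer recordings have to be split before transcription. The sketch below plans fixed-length windows under that limit. The `plan_segments` helper is hypothetical and independent of any Nvidia tooling, and the hard cuts are a simplification: a real pipeline would more likely split at silences or overlap adjacent windows so that words are not cut in half.

```python
MAX_SEGMENT_SECONDS = 24 * 60  # the 24-minute single-pass figure quoted in the article


def plan_segments(duration_seconds, max_segment_seconds=MAX_SEGMENT_SECONDS):
    """Split [0, duration) into consecutive (start, end) windows within the limit."""
    if duration_seconds < 0 or max_segment_seconds <= 0:
        raise ValueError("duration must be >= 0 and the segment limit must be > 0")
    segments = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + max_segment_seconds, duration_seconds)
        segments.append((start, end))
        start = end
    return segments


segments = plan_segments(60 * 60)  # a one-hour recording
for start, end in segments:
    print(f"{start / 60:.0f} to {end / 60:.0f} min")
```

A one-hour recording yields three windows: two full 24-minute segments and a final 12-minute remainder.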

Impact on Speech AI Development

Nvidia’s Granary dataset and model suite accelerate the democratization of speech AI for Europe, enabling scalable development of:

Multilingual chatbots

Customer service voice agents

Near-real-time translation services

Developers, researchers, and businesses can now build inclusive, high-quality applications supporting linguistic diversity, with open access to these models and datasets.

The post NVIDIA AI Just Released the Largest Open-Source Speech AI Dataset and State-of-the-Art Models for European Languages appeared first on MarkTechPost.
