NVIDIA Blog
Now We’re Talking: NVIDIA Releases Open Dataset, Models for Multilingual Speech AI

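To address the lack of AI model support for most of the world's languages, NVIDIA has released Granary, a new dataset, along with new AI models that extend speech recognition and translation to 25 European languages, including low-resource languages such as Croatian, Estonian and Maltese. Granary contains around a million hours of multilingual speech data, processed with the NVIDIA NeMo toolkit so that it could be turned into high-quality training data without extensive human annotation. The newly released Canary-1b-v2 and Parakeet-tdt-0.6b-v3 models are optimized for accuracy and for high-speed, low-latency transcription, respectively, and are meant to help developers build production-scale applications such as multilingual chatbots, customer service voice agents and near-real-time translation.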

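🎯 **Closing the language support gap for multilingual AI in Europe**: Of the roughly 7,000 languages in the world, only a tiny fraction are supported by AI language models. NVIDIA's Granary dataset and accompanying models target this gap for 25 European languages, with particular attention to languages with limited data such as Croatian, Estonian and Maltese, to support the development of high-quality speech recognition and translation AI.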

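📚 **Granary dataset: large-scale data with efficient processing**: Granary is a large open-source multilingual speech dataset containing around a million hours of audio, including about 650,000 hours for speech recognition and more than 350,000 hours for speech translation. It was produced with a processing pipeline built on the NVIDIA NeMo Speech Data Processor toolkit, which turns unlabeled audio into structured, high-quality AI training data. This reduces reliance on costly, time-consuming human annotation and gives developers ready-to-use data.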

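🚀 **NVIDIA Canary and Parakeet models: performance and efficiency**: NVIDIA released two models, Canary-1b-v2 and Parakeet-tdt-0.6b-v3. Canary-1b-v2 is a billion-parameter model trained on Granary that provides high-quality transcription of European languages plus translation between English and the other supported languages; its quality is comparable to models three times larger while running inference up to 10 times faster. Parakeet-tdt-0.6b-v3 is a streamlined, 600-million-parameter model designed for real-time or large-volume transcription; it automatically detects the audio language and outputs accurate punctuation, capitalization and word-level timestamps.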

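💡 **Open sharing to accelerate innovation**: Beyond the dataset and models, NVIDIA has also shared its data processing methodology. Developers can adapt this workflow to other ASR or AST models or extend it to additional languages, accelerating speech AI innovation and helping the community build more inclusive speech technology that better reflects the world's linguistic diversity.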

Of around 7,000 languages in the world, a tiny fraction are supported by AI language models. NVIDIA is tackling the problem with a new dataset and models that support the development of high-quality speech recognition and translation AI for 25 European languages — including languages with limited available data like Croatian, Estonian and Maltese.

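These tools will enable developers to more easily scale AI applications to support global users with fast, accurate speech technology for production-scale use cases such as multilingual chatbots, customer service voice agents and near-real-time translation services. They include:

- Granary, an open-source multilingual speech dataset with around a million hours of audio, including about 650,000 hours for speech recognition and more than 350,000 hours for speech translation.
- NVIDIA Canary-1b-v2, a billion-parameter model trained on Granary for high-quality transcription and translation of European languages.
- NVIDIA Parakeet-tdt-0.6b-v3, a streamlined, 600-million-parameter model designed for real-time or large-volume transcription.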

The paper behind Granary will be presented at Interspeech, a language processing conference taking place in the Netherlands, Aug. 17-21. The dataset and the new Canary and Parakeet models are now available on Hugging Face.

How Granary Addresses Data Scarcity

To develop the Granary dataset, the NVIDIA speech AI team collaborated with researchers from Carnegie Mellon University and Fondazione Bruno Kessler. The team passed unlabeled audio through a processing pipeline, powered by the NVIDIA NeMo Speech Data Processor toolkit, that turned it into structured, high-quality data.

This pipeline allowed the researchers to enhance public speech data into a usable format for AI training, without the need for resource-intensive human annotation. It’s available in open source on GitHub.

With Granary’s clean, ready-to-use data, developers can get a head start building models that tackle transcription and translation tasks in nearly all of the European Union’s 24 official languages, plus Russian and Ukrainian.

For European languages underrepresented in human-annotated datasets, Granary provides a critical resource to develop more inclusive speech technologies that better reflect the linguistic diversity of the continent — all while using less training data.

The team demonstrated in their Interspeech paper that, compared to other popular datasets, it takes around half as much Granary training data to achieve a target accuracy level for automatic speech recognition (ASR) and automatic speech translation (AST).

Tapping NVIDIA NeMo to Turbocharge Transcription

The new Canary and Parakeet models offer examples of the kinds of models developers can build with Granary, customized to their target applications. Canary-1b-v2 is optimized for accuracy on complex tasks, while Parakeet-tdt-0.6b-v3 is designed for high-speed, low-latency tasks.

By sharing the methodology behind the Granary dataset and these two models, NVIDIA is enabling the global speech AI developer community to adapt this data processing workflow to other ASR or AST models or additional languages, accelerating speech AI innovation.

Canary-1b-v2, available under a permissive license, expands the Canary family’s supported languages from four to 25. It offers transcription and translation quality comparable to models 3x larger while running inference up to 10x faster.

NVIDIA NeMo, a modular software suite for managing the AI agent lifecycle, accelerated speech AI model development. NeMo Curator, part of the software suite, enabled the team to filter out synthetic examples from the source data so that only high-quality samples were used for model training. The team also harnessed the NeMo Speech Data Processor toolkit for tasks like aligning transcripts with audio files and converting data into the required formats.
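
To make the filtering and format-conversion step more concrete, here is a minimal Python sketch. It is not the team's actual pipeline and does not call NeMo Curator or the Speech Data Processor. It assumes a JSON-lines manifest in which each record carries `audio_filepath`, `duration` and `text` fields, a layout commonly used for NeMo ASR manifests; the `is_synthetic` flag, the duration thresholds and all function names are hypothetical and exist only for illustration.

```python
import json

def keep_sample(record, min_sec=1.0, max_sec=40.0):
    """Return True if a manifest record looks usable for training.

    The `is_synthetic` flag and the duration bounds are hypothetical;
    a real pipeline would rely on its own quality signals.
    """
    if record.get("is_synthetic", False):
        return False
    if not record.get("text", "").strip():
        return False
    return min_sec <= record.get("duration", 0.0) <= max_sec

def filter_manifest(lines):
    """Parse JSON-lines manifest text and keep only usable records."""
    records = (json.loads(line) for line in lines if line.strip())
    return [r for r in records if keep_sample(r)]

def to_jsonl(records):
    """Serialize records back to JSON-lines, one sample per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Illustrative records only; the file names and transcripts are made up.
raw = [
    '{"audio_filepath": "a.wav", "duration": 5.2, "text": "Tere hommikust"}',
    '{"audio_filepath": "b.wav", "duration": 3.0, "text": "Bongu", "is_synthetic": true}',
    '{"audio_filepath": "c.wav", "duration": 0.3, "text": "Da"}',
    '{"audio_filepath": "d.wav", "duration": 7.5, "text": "   "}',
]
kept = filter_manifest(raw)
print(to_jsonl(kept))
```

Only the first record survives here: the others are dropped for being flagged synthetic, being too short, or having an empty transcript.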

Parakeet-tdt-0.6b-v3 prioritizes high throughput and is capable of transcribing 24-minute audio segments in a single inference pass. The model automatically detects the input audio language and transcribes without additional prompting steps.
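
For recordings longer than a single pass can handle, an application would need to split the audio before transcription. The sketch below shows one simple way to plan such windows, assuming the 24-minute figure above is treated as a hard upper bound. The function name and the hard-cut strategy are assumptions for illustration; a production system would more likely cut at silences or overlap adjacent windows so that words are not split.

```python
MAX_SEGMENT_SEC = 24 * 60  # 24 minutes, the single-pass length cited above

def plan_segments(total_sec, max_sec=MAX_SEGMENT_SEC):
    """Split [0, total_sec) into consecutive (start, end) windows.

    Each window is at most max_sec long; the last one may be shorter.
    """
    if total_sec < 0 or max_sec <= 0:
        raise ValueError("total_sec must be >= 0 and max_sec must be > 0")
    total_sec = float(total_sec)
    segments = []
    start = 0.0
    while start < total_sec:
        end = min(start + max_sec, total_sec)
        segments.append((start, end))
        start = end
    return segments

# A one-hour recording needs three windows: 24 + 24 + 12 minutes.
for start, end in plan_segments(3600):
    print(f"{start / 60:.0f} min -> {end / 60:.0f} min")
```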

Both Canary and Parakeet models provide accurate punctuation, capitalization and word-level timestamps in their outputs.
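
Word-level timestamps are what make downstream features such as subtitles possible. As a rough sketch, the Python below groups timestamped words into SubRip (SRT) cues. The input shape, a list of dicts with `word`, `start` and `end` keys measured in seconds, is an assumption about how an application might hold the model output, not a documented guarantee; the function names and the seven-word grouping rule are likewise hypothetical.

```python
def srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words, max_words=7):
    """Group word-level timestamps into numbered SRT cues.

    Each cue holds up to max_words words and spans from the first
    word's start time to the last word's end time.
    """
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"] for w in group)
        cues.append(
            f"{len(cues) + 1}\n"
            f"{srt_time(group[0]['start'])} --> {srt_time(group[-1]['end'])}\n"
            f"{text}"
        )
    return "\n\n".join(cues)

# Made-up timestamps for a short Croatian phrase.
words = [
    {"word": "Dobar", "start": 0.00, "end": 0.40},
    {"word": "dan,", "start": 0.45, "end": 0.80},
    {"word": "kako", "start": 1.10, "end": 1.35},
    {"word": "ste?", "start": 1.40, "end": 1.90},
]
print(words_to_srt(words, max_words=2))
```

Because punctuation and capitalization already come from the model, the cue text needs no further cleanup in this sketch.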

Read more on GitHub and get started with Granary on Hugging Face.
