MarkTechPost@AI · March 21, 02:10
NVIDIA AI Just Open Sourced Canary 1B and 180M Flash – Multilingual Speech Recognition and Translation Models

NVIDIA AI has open-sourced the Canary 1B Flash and Canary 180M Flash models to advance multilingual speech recognition and translation. The models support English, German, French, and Spanish, use an encoder-decoder architecture, and offer high accuracy, low latency, and efficient deployment. Both models deliver strong performance with real-time processing and achieve excellent results on automatic speech recognition and automatic speech translation tasks across multiple languages. The open-source license permits commercial use, encouraging innovation across the AI community.

🗣️ **Architecture and technical details:** Canary 1B Flash and Canary 180M Flash both use an encoder-decoder architecture: the encoder is based on FastConformer for efficient processing of audio features, while a Transformer decoder handles text generation. Task-specific tokens such as <target language>, <task>, <toggle timestamps>, and <toggle PnC> steer the model's output.

🚀 **Performance:** Canary 1B Flash reaches an inference speed of more than 1000 RTFx on open ASR leaderboard datasets, enabling real-time processing. For English ASR it achieves a WER of 1.48% on LibriSpeech Clean and 2.87% on LibriSpeech Other. For multilingual ASR, the WERs are 4.36% for German, 2.69% for Spanish, and 4.47% for French. For AST, the BLEU scores for English to German, Spanish, and French are 32.27, 22.6, and 41.22, respectively.

💡 **Key features and applications:** Both models support word-level and segment-level timestamps, which makes them more useful in applications that need precise alignment between audio and text. Their compact size suits on-device deployment and offline processing, reducing dependence on cloud services. Their robustness also leads to fewer hallucinations in translation tasks, ensuring more reliable output.

🌍 **Open source and community impact:** The models are released under the CC-BY-4.0 license, encouraging commercial use and further development by the community. This not only advances AI research but also empowers developers and organizations to build more inclusive and efficient communication tools.

In the realm of artificial intelligence, multilingual speech recognition and translation have become essential tools for facilitating global communication. However, developing models that can accurately transcribe and translate multiple languages in real-time presents significant challenges. These challenges include managing diverse linguistic nuances, maintaining high accuracy, ensuring low latency, and deploying models efficiently across various devices.​

To address these challenges, NVIDIA AI has open-sourced two models: Canary 1B Flash and Canary 180M Flash. These models are designed for multilingual speech recognition and translation, supporting languages such as English, German, French, and Spanish. Released under the permissive CC-BY-4.0 license, these models are available for commercial use, encouraging innovation within the AI community.​

Technically, both models utilize an encoder-decoder architecture. The encoder is based on FastConformer, which efficiently processes audio features, while the Transformer Decoder handles text generation. Task-specific tokens, including <target language>, <task>, <toggle timestamps>, and <toggle PnC> (punctuation and capitalization), guide the model’s output. The Canary 1B Flash model comprises 32 encoder layers and 4 decoder layers, totaling 883 million parameters, whereas the Canary 180M Flash model consists of 17 encoder layers and 4 decoder layers, amounting to 182 million parameters. This design ensures scalability and adaptability to various languages and tasks. ​
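To make the task-token mechanism concrete, here is a minimal inference sketch using NVIDIA NeMo's `EncDecMultiTaskModel` class. The checkpoint id `nvidia/canary-1b-flash`, the audio file name, and the keyword arguments (`source_lang`, `target_lang`, `task`, `pnc`) are assumptions drawn from the published Canary model cards and may differ across NeMo releases.

```python
# Minimal sketch: running Canary 1B Flash through NVIDIA NeMo (pip install "nemo_toolkit[asr]").
# Checkpoint id, file names, and keyword arguments are assumptions based on the model cards;
# exact argument names can vary between NeMo versions.
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")

# English ASR on a 16 kHz mono WAV file, with punctuation and capitalization enabled.
asr_hyps = model.transcribe(
    ["sample_en.wav"],      # hypothetical local audio file
    batch_size=4,
    source_lang="en",
    target_lang="en",       # same language -> transcription (<task> = asr)
    task="asr",
    pnc="yes",              # <toggle PnC>
)
print(asr_hyps[0])

# English -> German speech translation: only the prompt arguments change.
ast_hyps = model.transcribe(
    ["sample_en.wav"],
    batch_size=4,
    source_lang="en",
    target_lang="de",       # different language -> translation (<task> = ast)
    task="ast",
    pnc="yes",
)
print(ast_hyps[0])
```

The prompt tokens described above are assembled internally from these keyword arguments, so the same call pattern covers ASR and AST across all four supported languages.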

Performance metrics indicate that the Canary 1B Flash model achieves an inference speed exceeding 1000 RTFx on open ASR leaderboard datasets, enabling real-time processing. In English automatic speech recognition (ASR) tasks, it attains a word error rate (WER) of 1.48% on the Librispeech Clean dataset and 2.87% on the Librispeech Other dataset. For multilingual ASR, the model achieves WERs of 4.36% for German, 2.69% for Spanish, and 4.47% for French on the MLS test set. In automatic speech translation (AST) tasks, the model demonstrates robust performance with BLEU scores of 32.27 for English to German, 22.6 for English to Spanish, and 41.22 for English to French on the FLEURS test set. ​
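For readers less familiar with these metrics, the short sketch below spells out how RTFx (inverse real-time factor) and WER are conventionally defined; the numbers plugged in are illustrative placeholders, not measurements reported for Canary.

```python
# Illustrative metric definitions; the inputs are made-up placeholder numbers.

def rtfx(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / wall_clock_seconds

def wer(substitutions: int, deletions: int, insertions: int, reference_words: int) -> float:
    """Word error rate: (S + D + I) / N, expressed as a percentage."""
    return 100.0 * (substitutions + deletions + insertions) / reference_words

# One hour of audio decoded in 3.4 s of wall-clock time -> RTFx ≈ 1059, comfortably real-time.
print(f"RTFx = {rtfx(3600.0, 3.4):.0f}")

# 30 word errors against a 2000-word reference -> WER = 1.50%.
print(f"WER  = {wer(20, 5, 5, 2000):.2f}%")
```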

Data as of March 20, 2025.

The smaller Canary 180M Flash model also delivers impressive results, with an inference speed surpassing 1200 RTFx. It achieves a WER of 1.87% on the Librispeech Clean dataset and 3.83% on the Librispeech Other dataset for English ASR. For multilingual ASR, the model records WERs of 4.81% for German, 3.17% for Spanish, and 4.75% for French on the MLS test set. In AST tasks, it achieves BLEU scores of 28.18 for English to German, 20.47 for English to Spanish, and 36.66 for English to French on the FLEURS test set. ​

Both models support word-level and segment-level timestamping, enhancing their utility in applications requiring precise alignment between audio and text. Their compact sizes make them suitable for on-device deployment, enabling offline processing and reducing dependency on cloud services. Moreover, their robustness leads to fewer hallucinations during translation tasks, ensuring more reliable outputs. The open-source release under the CC-BY-4.0 license encourages commercial utilization and further development by the community.​
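Timestamped output can be requested through the same transcribe call. The sketch below assumes a `timestamps` flag and `timestamp["word"]` / `timestamp["segment"]` fields on NeMo's hypothesis objects, which should be verified against your installed NeMo version; the checkpoint id and file name are likewise assumptions.

```python
# Sketch of word- and segment-level timestamp retrieval (flag and field names are
# assumptions based on NeMo hypothesis objects; verify against your NeMo version).
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-180m-flash")

hyps = model.transcribe(
    ["sample_en.wav"],   # hypothetical local audio file
    source_lang="en",
    target_lang="en",
    task="asr",
    timestamps="yes",    # <toggle timestamps>
)

hyp = hyps[0]
for word in hyp.timestamp["word"]:        # per-word start/end offsets
    print(word)
for segment in hyp.timestamp["segment"]:  # per-segment start/end offsets
    print(segment)
```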

In conclusion, NVIDIA’s open-sourcing of the Canary 1B and 180M Flash models represents a significant advancement in multilingual speech recognition and translation. Their high accuracy, real-time processing capabilities, and adaptability for on-device deployment address many existing challenges in the field. By making these models publicly available, NVIDIA not only demonstrates its commitment to advancing AI research but also empowers developers and organizations to build more inclusive and efficient communication tools.


Check out the Canary 1B Flash and Canary 180M Flash models. All credit for this research goes to the researchers of this project.

