MarkTechPost@AI, May 6, 13:55
NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition ASR and Transcribes an Hour of Audio in One Second

 

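NVIDIA has open-sourced Parakeet TDT 0.6B, an advanced automatic speech recognition (ASR) model with 600 million parameters, released under the CC-BY-4.0 license and fully open on Hugging Face. The model is remarkably fast, with a real-time factor (RTF) of 3386, and sets a new standard for performance and accessibility in speech AI. It can transcribe 60 minutes of audio in one second, more than 50 times faster than many existing open-source ASR models. On Hugging Face's Open ASR Leaderboard, Parakeet V2 achieves a 6.05% word error rate (WER), the best among open-source models, which marks a significant leap for enterprise-grade speech applications.

🚀 Parakeet TDT 0.6B's core strengths are its speed and transcription quality: it can transcribe 60 minutes of audio in just one second, more than 50 times faster than many existing open-source ASR models. On Hugging Face's Open ASR Leaderboard, Parakeet V2 achieves a 6.05% word error rate (WER), placing it at the top among open-source models.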

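⚙️ Parakeet TDT 0.6B is built on a transformer architecture, fine-tuned with high-quality transcription data, and optimized for inference on NVIDIA hardware. Key highlights include a 600-million-parameter encoder-decoder model, quantized and fused kernels for maximum inference efficiency, optimization for the TDT (Token-and-Duration Transducer) architecture, and support for accurate timestamp formatting, numerical formatting, and punctuation restoration.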

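🎶 Parakeet is about more than speed and word error rate. NVIDIA has also built distinctive capabilities into the model: song-to-lyrics transcription, which makes sung content transcribable and extends use cases in music indexing and media platforms; numerical and timestamp formatting, which improves readability and usability in structured contexts such as meeting notes, legal transcripts, and health records; and punctuation restoration, which makes output read more naturally for downstream NLP applications.

🤝 Parakeet TDT 0.6B is available on Hugging Face with model weights, tokenizer, and inference scripts. It runs best on NVIDIA GPUs with TensorRT, but CPU environments are also supported at reduced throughput. Whether you are building transcription services, annotating massive audio datasets, or integrating voice into a product, Parakeet TDT 0.6B offers a compelling open-source alternative.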

NVIDIA has unveiled Parakeet TDT 0.6B, a state-of-the-art automatic speech recognition (ASR) model that is now fully open-sourced on Hugging Face. With 600 million parameters, a commercially permissive CC-BY-4.0 license, and a staggering real-time factor (RTF) of 3386, this model sets a new benchmark for performance and accessibility in speech AI.

Blazing Speed and Accuracy

At the heart of Parakeet TDT 0.6B’s appeal is its speed and transcription quality. The model can transcribe 60 minutes of audio in just one second, over 50x faster than many existing open ASR models. On Hugging Face’s Open ASR Leaderboard, Parakeet V2 achieves a 6.05% word error rate (WER), the lowest among open models.
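For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, not the leaderboard's scoring code; the function name `word_error_rate` and the sample sentences are illustrative, and real benchmarks typically normalize casing and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.1%}")  # 33.3%
```

On this scale, a 6.05% WER corresponds to roughly six word errors per hundred reference words.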

This performance represents a significant leap forward for enterprise-grade speech applications, including real-time transcription, voice-based analytics, call center intelligence, and audio content indexing.

Technical Overview

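Parakeet TDT 0.6B builds on a transformer-based architecture fine-tuned with high-quality transcription data and optimized for inference on NVIDIA hardware. The key highlights are a 600-million-parameter encoder-decoder model, quantized and fused kernels for maximum inference efficiency, optimization for the TDT (Token-and-Duration Transducer) architecture, and support for accurate timestamp formatting, numerical formatting, and punctuation restoration.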

The model’s high-speed inference is powered by NVIDIA’s TensorRT and FP8 quantization, enabling it to reach a real-time factor (RTF) of 3386, meaning it processes audio 3386 times faster than real time.
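The "one hour in one second" headline follows directly from that figure. As a quick check, the sketch below divides audio duration by the throughput factor; the helper name `seconds_to_transcribe` is illustrative, and the calculation assumes the factor holds steadily for the whole file, which real workloads may not match.

```python
def seconds_to_transcribe(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to process audio at a given speedup over real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


one_hour = 60 * 60  # 3600 seconds of audio
elapsed = seconds_to_transcribe(one_hour, 3386)
print(f"{elapsed:.2f} s")  # 1.06 s
```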

Benchmark Leadership

On the Hugging Face Open ASR Leaderboard, a standardized benchmark for evaluating speech models across public datasets, Parakeet TDT 0.6B leads with the lowest WER recorded among open-source models. This places it ahead of comparable models like Whisper from OpenAI and other community-driven efforts.

Leaderboard data as of May 5, 2025.

This performance makes Parakeet V2 not only a leader in quality but also in deployment readiness for latency-sensitive applications.

Beyond Conventional Transcription

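Parakeet is not just about speed and word error rate. NVIDIA has embedded distinctive capabilities into the model: song-to-lyrics transcription, which extends use cases to music indexing and media platforms; numerical and timestamp formatting, which improves readability in structured contexts such as meeting notes, legal transcripts, and health records; and punctuation restoration, which makes output read more naturally for downstream NLP applications.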

These features elevate the quality of transcripts and reduce the burden on post-processing or human editing, especially in enterprise-grade deployments.

Strategic Implications

The release of Parakeet TDT 0.6B represents another step in NVIDIA’s strategic investment in AI infrastructure and open ecosystem leadership. With strong momentum in foundational models (e.g., Nemotron for language and BioNeMo for protein design), NVIDIA is positioning itself as a full-stack AI company—from GPUs to state-of-the-art models.

For the AI developer community, this open release could become the new foundation for building speech interfaces in everything from smart devices and virtual assistants to multimodal AI agents.

Getting Started

Parakeet TDT 0.6B is available now on Hugging Face, complete with model weights, tokenizer, and inference scripts. It runs optimally on NVIDIA GPUs with TensorRT, but support is also available for CPU environments with reduced throughput.
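A minimal sketch of what loading the model might look like, assuming the checkpoint is loaded through NVIDIA's NeMo toolkit (which the article does not mention) and that its Hugging Face id is `nvidia/parakeet-tdt-0.6b-v2` (also not stated in the article). The wrapper `transcribe_files` is a hypothetical helper; check the model card for the currently recommended usage.

```python
from typing import List

# Assumed Hugging Face model id; the article does not give it.
MODEL_NAME = "nvidia/parakeet-tdt-0.6b-v2"


def transcribe_files(paths: List[str], model_name: str = MODEL_NAME) -> List[str]:
    """Transcribe audio files and return one text string per file."""
    # Imported lazily so this module still loads where NeMo is not installed.
    import nemo.collections.asr as nemo_asr

    # Downloads the checkpoint on first use, then loads it.
    model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    results = model.transcribe(paths)

    # Depending on the NeMo version, results may be plain strings or
    # objects carrying a .text attribute, so handle both.
    return [r.text if hasattr(r, "text") else str(r) for r in results]
```

Calling `transcribe_files(["meeting.wav"])`, where `meeting.wav` is a placeholder for a local audio file, would return a list with a single transcript string.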

Whether you’re building transcription services, annotating massive audio datasets, or integrating voice into your product, Parakeet TDT 0.6B offers a compelling open-source alternative to commercial APIs.



The post NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition ASR and Transcribes an Hour of Audio in One Second appeared first on MarkTechPost.
