3Play Media Releases Annual Study, Finds ASR Technology Showing Signs of Plateau

CDSA, May 28, 19:03

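3Play Media’s latest report shows that although automatic speech recognition (ASR) technology is maturing, human review remains essential for meeting accessibility standards. The study evaluated 205 hours of diverse audio content spanning multiple industries and use cases, and measured the transcription performance of eight ASR engines along with Gemini. The report found that Whisper X behaves markedly differently from the original Whisper models and showed no hallucination behavior. AssemblyAI’s Universal-2 model and Whisper X slightly outperformed Speechmatics. The report also notes that ASR accuracy varies by industry, that LLMs cannot yet replace dedicated ASR engines, and that sports content remains the greatest challenge for ASR technology.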

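📊 The 3Play Media report finds that although ASR technology has made remarkable strides, accuracy improvements for English pre-recorded content are hitting a plateau. The gap between the leading engines and the rest of the field has widened, yet the error rates of all engines still fall short of accessibility requirements.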

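🔬 The study evaluated 205 hours of diverse audio content spanning multiple industries and use cases, and measured the transcription performance of eight ASR engines along with Gemini, a multimodal large language model (LLM). The dataset contains over 1.7 million words, providing insight into real-world ASR performance.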

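🥇 Whisper X behaves markedly differently from the original Whisper models, with no hallucination behavior observed. AssemblyAI’s Universal-2 model and Whisper X slightly outperformed Speechmatics, though all three stood substantially ahead of the other engines tested.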

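🎯 ASR accuracy varies by industry. Sports content is the greatest challenge for ASR technology, with error rates three times higher than the best-performing industries, due to complicated noise environments, unscripted speech, player and coach names, and numerical information with unique phrasing conventions.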

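💡 The report indicates that future ASR innovations are likely to focus more on real-time applications and non-English language capabilities than on incremental improvements to English pre-recorded content accuracy.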

While Automatic Speech Recognition technologies are maturing and becoming more sophisticated, human review remains essential for meeting accessibility standards, according to the latest State of ASR report by 3Play Media.

“Our research continues to show that while ASR technology has made remarkable strides, we’re witnessing an increasing plateau in accuracy improvements for English pre-recorded content,” Josh Miller, co-CEO and co-Founder, 3Play Media, said. “The gulf between the leading engines and the rest of the field has widened. However, the error rates across all engines still fall short of meeting accessibility requirements, reaffirming that human-in-the-loop workflows remain critical for captioning and transcription use cases.”

The study evaluated speech-to-text technology as it applies to captioning and transcription across 205 hours of diverse audio content, representing a 30% increase in testing volume from the previous year.

The expanded dataset of over 1.7 million words spans multiple industries and use cases, providing unparalleled insight into real-world ASR performance. The research evaluated eight ASR engines along with Gemini, a multimodal large language model (LLM) prompted to perform transcription.

A key finding from this year’s report is that Whisper X behaves markedly differently from the original Whisper models: it showed no signs of the hallucination behavior observed with Whisper Large V2 and V3, both of which hallucinated at significantly higher rates than other engines. Meanwhile, AssemblyAI’s Universal-2 model and Whisper X slightly outperformed Speechmatics based on error rates, though all three stood substantially ahead of the other engines tested.
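The article does not say how the report computes its error rates. Assuming the comparison rests on word error rate (WER), the metric most commonly used for ASR, a minimal Python sketch looks like the following. The function name and the lowercase, whitespace-based tokenization are illustrative assumptions, not 3Play Media’s methodology.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: word-level edit distance divided by reference length.

    Assumption: words are compared case-insensitively after splitting on
    whitespace. Real evaluations usually apply heavier text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
print(word_error_rate("the coach called a timeout", "the coach called timeout"))
```

Because the denominator is the reference length, inserted words count as errors too, so long hallucinated passages of the kind attributed to Whisper Large V2 and V3 would push this figure up sharply.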

As observed in previous years, ASR accuracy varies significantly across different industries, reinforcing the need for specialized approaches depending on content type and use case. The study also found that LLMs are not yet viable replacements for dedicated ASR engines in transcription tasks.

The greatest challenge for ASR technology remains sports content, with error rates 3x higher than the best-performing industries due to complicated noise environments, unscripted speech, player and coach names, and numerical information with unique phrasing conventions.

Given the plateau in improvements, the report indicates that future ASR innovations are likely to focus less on incremental improvements to English pre-recorded content accuracy and more on real-time applications and non-English language capabilities.

To obtain a free copy of The 2025 State of ASR report, please visit: https://go.3playmedia.com/rs-2025-asr
