MarkTechPost@AI 04月04日 09:26
Researchers from Dataocean AI and Tsinghua University Introduces Dolphin: A Multilingual Automatic Speech Recognition ASR Model Optimized for Eastern Languages and Dialects
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

Dataocean AI 和清华大学的研究人员推出了 Dolphin,一个基于扩展 Whisper 架构构建的综合多语言自动语音识别 (ASR) 模型。Dolphin 专注于解决现有 ASR 模型在识别东方语言方面的局限性,通过整合专有和公开数据集,有效支持 40 种东方语言和 22 种汉语方言。该模型采用混合 ASR 方法,结合 CTC 和基于注意力的机制,并利用双层语言标记系统,显著提升了在多语言环境下的识别准确性。Dolphin 的发布,为多语言 ASR 技术带来了新的突破,尤其是在提升东方语言和方言的识别能力方面。

🗣️ Dolphin 基于扩展的 Whisper 架构构建,专注于提升对东方语言和方言的识别能力,解决了现有 ASR 模型在处理这些语言时存在的不足。

🌍 Dolphin 整合了专有数据集和公开数据集,支持来自东亚、南亚、东南亚和中东地区的 40 种东方语言,以及 22 种汉语方言。

⚙️ Dolphin 采用混合 ASR 方法,结合了 CTC 和基于注意力的机制,并使用双层语言标记系统,以提高识别准确性和分辨率,特别是在方言丰富的语言(如汉语)中。

📈 实验评估表明,Dolphin 在多语言语音识别准确性方面相对于 Whisper 模型有显著提升,例如,Dolphin small 模型将 WER 降低了约 24.5%。

🔓 研究团队以 Apache 2.0 许可证公开了 Dolphin base 和 small 模型以及相关的推理代码,并使用了包含 2120 万小时录音的广泛数据集进行训练。

Automatic speech recognition (ASR) technologies have advanced significantly, yet notable disparities remain in their ability to accurately recognize diverse languages. Prominent ASR systems, such as OpenAI’s Whisper, exhibit pronounced performance gaps when processing Eastern languages compared to Western counterparts. This discrepancy presents tangible challenges in multilingual regions, particularly those characterized by numerous dialects and linguistic variations, underscoring the necessity for sophisticated multilingual ASR systems tailored specifically to Eastern languages.

Researchers from Dataocean AI and Tsinghua University have introduced Dolphin, a comprehensive multilingual automatic speech recognition model built upon an extended Whisper architecture, optimized to accommodate a broader spectrum of Eastern languages and dialects. Dolphin effectively addresses key limitations identified in current multilingual ASR models by integrating both proprietary datasets and publicly accessible datasets. The model proficiently supports 40 Eastern languages from East Asia, South Asia, Southeast Asia, and the Middle East, as well as 22 distinct dialects of Chinese.

Dolphin employs a hybrid ASR approach combining Connectionist Temporal Classification (CTC) with attention-based mechanisms. Its architecture incorporates an E-Branchformer encoder and a Transformer decoder, substantially enhancing the model’s capability to interpret complex linguistic patterns across diverse languages. Dolphin also utilizes a dual-level language tokenization system, distinguishing general language codes from region-specific dialect tokens. This mechanism improves recognition accuracy and resolution, particularly for dialect-intensive languages such as Chinese. Additionally, Dolphin incorporates a 4× subsampling layer to efficiently reduce input sequence lengths, enhancing computational speed and training effectiveness without compromising recognition accuracy.

Experimental evaluations demonstrate Dolphin’s marked improvements in multilingual speech recognition accuracy relative to Whisper models. For instance, the Dolphin small model reduced the Word Error Rate (WER) by approximately 24.5% compared to the base model, with further incremental improvements observed in medium and large variants. Specifically, the Dolphin base model attained an average WER of 31.8%, notably outperforming Whisper’s large-v3 model, which recorded an average WER of 52.3% across the same evaluation benchmarks. Assessments conducted on dialect-focused datasets, including KeSpeech, confirmed Dolphin’s capability to consistently handle intricate linguistic variations, with performance enhancements correlating positively with increased model size.

The research team released the Dolphin base and small models publicly under the Apache 2.0 license, along with associated inference code. Dolphin’s training utilized an extensive dataset encompassing 21.2 million hours of audio recordings, incorporating 7.4 million hours derived from open datasets such as Common Voice, ReazonSpeech, and GigaSpeech2, thereby ensuring robustness and replicability.

In summary, Dolphin constitutes a significant advancement in multilingual ASR technology, systematically addressing prevailing limitations in Eastern language and dialect recognition through methodological data integration, refined architectural frameworks, and commitment to open-source dissemination. This work sets an influential benchmark for future developments in multilingual ASR research, advancing linguistic inclusivity and system generalization.


Check out the Paper, Dolphin-small-model and Dolphin-base-model. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on OPEN SOURCE AI: FREE REGISTRATION + Certificate of Attendance + 3 Hour Short Event (April 12, 9 am- 12 pm PST) + Hands on Workshop [Sponsored]

The post Researchers from Dataocean AI and Tsinghua University Introduces Dolphin: A Multilingual Automatic Speech Recognition ASR Model Optimized for Eastern Languages and Dialects appeared first on MarkTechPost.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Dolphin 语音识别 多语言 人工智能
相关文章