MarkTechPost@AI 2024-07-05
CMU Researchers Propose XEUS: A Cross-lingual Encoder for Universal Speech Trained on 4,000+ Languages

Researchers from Carnegie Mellon University, Shanghai Jiao Tong University, and the Toyota Technological Institute at Chicago have developed XEUS, a Cross-lingual Encoder for Universal Speech. XEUS is trained on more than 1 million hours of data from 4,057 languages, greatly expanding the language coverage of SSL models. The model incorporates a novel dereverberation objective to improve robustness and outperforms state-of-the-art models on a range of benchmarks, including ML-SUPERB. To support further research, the team will release XEUS together with its code, training configurations, checkpoints, and training logs.

🚀 **XEUS's language coverage**: XEUS is a breakthrough cross-lingual speech encoder trained on 1 million hours of data spanning 4,057 languages, far exceeding previous models and bringing support to a much wider range of languages. This expanded coverage matters most for low-resource languages, which typically lack sufficient training data.

🎤 **XEUS's robustness**: XEUS introduces a novel dereverberation objective during training to improve robustness to noise and varied speaking conditions, allowing it to recognize and understand speech accurately even in reverberant or noisy environments.

🔓 **XEUS's openness**: Unlike many models built on closed datasets, XEUS is fully open: its data, training code, and detailed documentation are all publicly available. This openness fosters further research into large-scale multilingual SSL.

🏆 **XEUS's performance**: XEUS excels across downstream tasks. On multilingual speech benchmarks such as ML-SUPERB and FLEURS it outperforms state-of-the-art models like XLS-R, MMS, and w2v-BERT, especially in low-resource settings. It also generalizes well beyond multilingual tasks: on English-only tasks such as emotion recognition and speaker diarization it matches or exceeds leading models, and in acoustic representation it surpasses models like WavLM and w2v-BERT at generating high-quality speech, as reflected in metrics such as MOS and WER.

⚠️ **XEUS's ethical considerations**: While the openness of XEUS helps democratize speech model development, its ethical implications deserve attention. Speech data from indigenous communities in particular must be handled carefully to prevent misuse, such as the generation of audio deepfakes.

Self-supervised learning (SSL) has expanded the reach of speech technologies to many languages by minimizing the need for labeled data. However, current models support only 100-150 of the world's 7,000+ languages. This limitation stems largely from the scarcity of transcribed speech: only about half of these languages have formal writing systems, and even fewer have the resources to produce the extensive annotated data needed for training. While SSL models can learn from unlabeled data, they typically cover a narrow range of languages. Projects like MMS have extended coverage to over 1,000 languages but struggle with noisy data and a lack of diverse recording conditions.

Researchers from Carnegie Mellon University, Shanghai Jiao Tong University, and the Toyota Technological Institute at Chicago have developed XEUS, a Cross-lingual Encoder for Universal Speech. XEUS is trained on over 1 million hours of data from 4,057 languages, significantly increasing the language coverage of SSL models. This includes a new corpus of 7,413 hours from 4,057 languages, which will be publicly released. XEUS incorporates a novel dereverberation objective for enhanced robustness and outperforms state-of-the-art models on various benchmarks, including ML-SUPERB. To support further research, the researchers will release XEUS, its code, training configurations, checkpoints, and training logs.

SSL has advanced speech processing by enabling neural networks to learn from large amounts of unlabeled data and then be fine-tuned for specific tasks. Multilingual SSL models can exploit cross-lingual transfer learning, but existing ones scale to only a small fraction of the world's languages. XEUS, by contrast, scales to 4,057 languages, surpassing models like Meta's MMS. It also includes a novel dereverberation objective during training to handle noisy and diverse speech. Unlike state-of-the-art models that often use closed datasets and lack transparency, XEUS is fully open, with publicly available data, training code, and extensive documentation, facilitating further research into large-scale multilingual SSL.
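To make the pretrain-then-fine-tune recipe concrete, here is a minimal PyTorch sketch of the usual workflow: freeze a pretrained SSL encoder and train only a small task head on limited labeled data. `TinyEncoder` is a hypothetical stand-in; a real experiment would load the released XEUS checkpoint instead of a randomly initialized module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained SSL speech encoder like XEUS;
    a real run would load released checkpoint weights, not random ones."""
    def __init__(self, dim: int = 64):
        super().__init__()
        # ~25 ms windows with a 20 ms hop at 16 kHz
        self.conv = nn.Conv1d(1, dim, kernel_size=400, stride=320)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        return self.conv(wave.unsqueeze(1)).transpose(1, 2)  # (batch, frames, dim)

encoder = TinyEncoder()
for p in encoder.parameters():
    p.requires_grad = False             # freeze the pretrained encoder

head = nn.Linear(64, 50)                # e.g. a 50-class phoneme or language head
opt = torch.optim.Adam(head.parameters(), lr=1e-4)

wave = torch.randn(2, 16000)            # dummy batch: two 1-second utterances
labels = torch.randint(0, 50, (2,))     # dummy utterance-level labels
logits = head(encoder(wave).mean(dim=1))  # average over frames, then classify
loss = F.cross_entropy(logits, labels)
loss.backward()
opt.step()
```

Because only the head's parameters receive gradients, even a few hours of labeled speech can adapt such an encoder to a new task, which is what makes the SSL paradigm attractive for low-resource languages.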

XEUS is pre-trained on a vast dataset of 1.081 million hours across 4,057 languages, compiled from 37 public speech datasets and additional sources such as Global Recordings Network, WikiTongues, and Jesus Dramas. Unusual data types, such as accented speech and code-switching, enhance its robustness. XEUS adds new training objectives, including dereverberation and noise reduction. The model architecture is based on HuBERT but incorporates enhancements such as E-Branchformer layers and a simplified loss function. Training on 64 NVIDIA A100 GPUs uses advanced augmentation techniques and spans significantly more data than previous models.
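The dereverberation objective can be read as HuBERT-style masked prediction with a twist: the encoder hears an artificially reverberated or noised waveform but must predict discrete units derived from the clean recording, so any trace of the distortion in its representations is penalized. The sketch below illustrates that idea under stated assumptions; `encoder`, `classifier`, and the exact shapes are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def reverberate(clean: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Simulate a room by convolving clean speech with an impulse response.
    Shapes: clean (batch, time), rir (rir_len,)."""
    kernel = rir.div(rir.norm()).flip(0).view(1, 1, -1)  # conv1d is correlation
    wet = F.conv1d(clean.unsqueeze(1), kernel, padding=rir.numel() - 1)
    return wet.squeeze(1)[:, : clean.shape[-1]]          # trim back to input length

def masked_prediction_loss(encoder, classifier, clean, rir, targets, mask):
    """HuBERT-style loss with a dereverberation twist: the encoder sees the
    *augmented* audio but must predict discrete units derived from the
    *clean* audio, pushing it to strip reverberation and noise.
    targets: (batch, frames) long; mask: (batch, frames) bool."""
    noisy = reverberate(clean, rir)
    feats = encoder(noisy)               # (batch, frames, dim)
    logits = classifier(feats[mask])     # score only the masked frames
    return F.cross_entropy(logits, targets[mask])
```

In HuBERT-style training the discrete targets come from clustering features of the unaugmented audio, so this objective needs no transcripts at all.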

The XEUS model is evaluated across various downstream tasks to assess its multilingual and acoustic representation capabilities. It excels in multilingual speech tasks, outperforming state-of-the-art models like XLS-R, MMS, and w2v-BERT on benchmarks such as ML-SUPERB and FLEURS, especially in low-resource language settings. XEUS also demonstrates task universality, matching or exceeding leading models on English-only tasks like emotion recognition and speaker diarization. In acoustic representation, it surpasses models like WavLM and w2v-BERT at generating high-quality speech, as reflected in metrics like MOS and WER.
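Benchmarks in the SUPERB family typically keep the SSL encoder frozen and train only a lightweight probe on a learned, softmax-weighted combination of its hidden layers. Below is a generic, self-contained sketch of such a probe in PyTorch; it follows the standard protocol in spirit and is not code from the XEUS release.

```python
import torch
import torch.nn as nn

class WeightedLayerProbe(nn.Module):
    """SUPERB-style downstream probe: the SSL encoder stays frozen, and a
    learned softmax-weighted sum of its layer outputs feeds a small
    classification head."""
    def __init__(self, num_layers: int, feat_dim: int, num_classes: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one (batch, frames, feat_dim) tensor per encoder layer
        w = torch.softmax(self.layer_weights, dim=0)
        fused = sum(wi * h for wi, h in zip(w, hidden_states))
        return self.head(fused.mean(dim=1))  # pool over frames, then classify

# Dummy usage: 12 layers of (batch=2, frames=49, dim=64) features
probe = WeightedLayerProbe(num_layers=12, feat_dim=64, num_classes=10)
states = [torch.randn(2, 49, 64) for _ in range(12)]
print(probe(states).shape)                   # torch.Size([2, 10])
```

Because the probe is tiny and the encoder is never updated, scores on such benchmarks mostly reflect the quality of the pretrained representations themselves.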

XEUS is a robust SSL speech encoder trained on over 1 million hours of data spanning 4,057 languages, demonstrating superior performance across a wide range of multilingual and low-resource tasks. XEUS’s dereverberation task enhances its robustness, and despite the limited data for many languages, it still provides valuable results. XEUS advances multilingual research by offering open access to its data and model. However, ethical considerations are crucial, especially in handling speech data from indigenous communities and preventing misuse, such as generating audio deepfakes. XEUS’s integration with accessible platforms aims to democratize speech model development.


Check out the Paper, Dataset, and Model. All credit for this research goes to the researchers of this project.

