MarkTechPost@AI 2024年09月12日
Apple Researchers Propose a Novel AI Algorithm to Optimize a Byte-Level Representation for Automatic Speech Recognition (ASR) and Compare it with UTF-8 Representation


🎯 To address the problem of oversized output layers in multilingual automatic speech recognition, Apple researchers propose a method for optimizing byte-level representations. The approach aims to preserve model efficiency and performance while accommodating many languages and their character sets.

💡 Earlier attempts to address the challenges of multilingual ASR had drawbacks: byte-level representations lengthen output sequences and can increase error rates. The Apple researchers' method uses a vector-quantized auto-encoder to optimize the byte-level representation specifically for E2E ASR tasks.

🔧 The framework is data-driven and comprises four key components, optimized through a composite loss function, using a Residual VQ-VAE with two or three codebooks. The system also incorporates an error-correction mechanism via a label decoder.

📊 The researchers evaluated the method on bilingual English and Mandarin Chinese dictation tasks. It outperformed UTF-8 subword outputs across different subword sizes, demonstrating its effectiveness and flexibility in multilingual ASR systems.

End-to-end (E2E) neural networks have emerged as flexible and accurate models for multilingual automatic speech recognition (ASR). However, as the number of supported languages increases, particularly those with large character sets like Chinese, Japanese, and Korean (CJK), the output layer size grows substantially. This expansion negatively impacts compute resources, memory usage, and asset size. The challenge becomes more pronounced in multilingual systems, where the output often consists of unions of characters or subwords from various languages. Researchers are thus grappling with the need to maintain model efficiency and performance while accommodating a diverse range of languages and their associated character sets in E2E ASR systems.

Previous attempts to address these challenges in multilingual ASR have focused on byte-level representations, particularly using UTF-8 codewords as base tokens. This approach allows for a fixed output vocabulary size of 256, providing compactness and universality across languages. However, byte-level representations often result in longer sequences, especially for CJK languages, potentially increasing error rates as multiple predictions are required for single characters. Researchers proposed byte-level subwords using byte pair encoding (BPE) on UTF-8 codeword sequences to mitigate this. While this reduced the number of decoding steps, it didn’t guarantee valid UTF-8 outputs. A dynamic programming algorithm was later introduced to recover valid characters from potentially invalid byte sequences, though this method optimized for character validity rather than ASR quality.
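The two drawbacks described above, longer sequences for CJK text and the possibility of invalid byte output, can be seen directly with Python's built-in UTF-8 codec (the example strings are illustrative, not from the paper):

```python
# UTF-8 encodes English letters as 1 byte each but most CJK characters
# as 3 bytes each, so byte-level output sequences for Chinese grow ~3x.
text_en = "hello"
text_zh = "语音识别"  # "speech recognition"

bytes_en = text_en.encode("utf-8")
bytes_zh = text_zh.encode("utf-8")

print(len(text_en), len(bytes_en))  # 5 characters -> 5 bytes
print(len(text_zh), len(bytes_zh))  # 4 characters -> 12 bytes

# A model that predicts bytes one at a time can also emit prefixes
# that are not valid UTF-8, e.g. cutting a 3-byte character in half:
truncated = bytes_zh[:4]
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8 sequence")
```

This is why a recovery step (the dynamic programming algorithm mentioned above, or the label decoder in the proposed method) is needed when decoding byte-level outputs.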

The method proposed by the Apple researchers takes a representation-learning approach built on a vector quantized auto-encoder. It aims to optimize the byte-level representation specifically for E2E ASR tasks, addressing the limitations of previous approaches. The framework is designed to be data-driven, incorporating information from both text and audio to enhance accuracy. It offers flexibility to include additional side information, such as lexicons or phonemes, making it adaptable to various ASR scenarios. Importantly, the method includes an error correction mechanism to handle invalid sequences, with recovery optimized for accuracy rather than other metrics. This approach aligns with the researchers’ criteria for an ideal byte-level representation: task-specific optimization, comprehensive information utilization, and effective error correction.

The proposed method formulates the representation problem as an optimization task with latent variables, using a vector quantized auto-encoder (VQ-AE) architecture. This auto-encoder consists of four key components: a label encoder, an acoustic encoder, a label decoder, and a vector quantizer. The system uses vector quantization as its bottleneck, with the indices of quantized embeddings serving as latent variables.
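The core idea of the vector-quantization bottleneck can be sketched in a few lines of NumPy. This is a minimal illustration, not Apple's implementation: the codebook is random, the embedding dimension is arbitrary, and the "label encoder" output is a stand-in vector. A codebook of 256 entries is used so that each quantized index fits in one byte:

```python
import numpy as np

rng = np.random.default_rng(0)

num_codes, dim = 256, 8          # 256 codes -> each index fits in one byte
codebook = rng.normal(size=(num_codes, dim))  # toy codebook (random stand-in)

def quantize(embedding):
    """Return the index of the nearest codebook entry (the discrete latent)."""
    dists = np.linalg.norm(codebook - embedding, axis=1)
    return int(np.argmin(dists))

# A stand-in label-encoder output for one token:
z = rng.normal(size=dim)
index = quantize(z)              # this byte-valued index is what the ASR model predicts
assert 0 <= index < 256
```

In the full system, these quantized indices play the role that raw UTF-8 bytes play in the baseline, but the mapping from tokens to bytes is learned rather than fixed by the Unicode standard.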

The auto-encoder is optimized using a loss function comprising four terms: cross-entropy losses for label and acoustic encoders, a CTC loss for the acoustic encoder, and a quantization loss. The method employs a Residual VQ-VAE (RVQ-VAE) with two or three codebooks, each containing 256 embeddings, allowing each label token to be represented by 2-3 bytes.
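Residual VQ, as described above, quantizes an embedding in stages: each codebook quantizes what the previous stage missed, so two 256-entry codebooks yield a 2-byte code per token. The sketch below shows only this encode/decode mechanic with toy random codebooks; the actual model learns the codebooks jointly with the encoders under the four-term loss:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8
# Two codebooks of 256 entries each -> every token maps to 2 byte indices.
codebooks = [rng.normal(size=(256, dim)) for _ in range(2)]

def rvq_encode(embedding):
    residual = embedding.copy()
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]   # next stage quantizes the leftover error
    return indices

def rvq_decode(indices):
    # Reconstruction is the sum of the selected entries from each codebook.
    return sum(cb[i] for cb, i in zip(codebooks, indices))

z = rng.normal(size=dim)            # stand-in token embedding
ids = rvq_encode(z)                 # a 2-byte code for the token
recon = rvq_decode(ids)
```

Adding a third codebook extends each code to 3 bytes, trading sequence length for reconstruction fidelity.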

To handle potential errors in byte sequences, the system incorporates an error correction mechanism through the label decoder. This decoder estimates the most likely label sequence, optimizing for accuracy even when faced with invalid byte sequences. The proposed VQ-based representation offers advantages over UTF-8, including fixed-length coding, task-specific optimization, and improved error recovery.

The researchers evaluated their proposed VQ-based representation approach on bilingual English and Mandarin dictation tasks, comparing it with character-based and UTF-8 subword outputs. Using a CTC-AED model with approximately 120M parameters, they tested various output representations on datasets comprising 10k hours of English and 14k hours of Mandarin training data.

Results showed that the VQ-based representation consistently outperformed UTF-8 subword outputs across different subword sizes. With 8000 subwords, the VQ-based approach achieved a 5.8% relative reduction in Word Error Rate (WER) for English and a 3.7% relative reduction in Character Error Rate (CER) for Mandarin compared to UTF-8. When compared to character-based output, both VQ and UTF-8 representations performed better on English, while maintaining similar accuracy for Mandarin. Notably, the VQ-based method with 8000 subwords demonstrated a 14.8% relative error rate reduction for English and a 2.3% reduction for Mandarin compared to character-based output, highlighting its effectiveness and flexibility in multilingual ASR systems.

This study presents a robust algorithm for optimizing byte-level representation in ASR, offering an alternative to UTF-8 representation. This approach can be optimized using audio and text data, with an error correction mechanism designed to enhance accuracy. Testing on English and Mandarin dictation datasets demonstrated a 5% relative reduction in Token Error Rate (TER) compared to UTF-8-based methods. While the current study focused on bilingual ASR, the researchers acknowledge challenges in developing a universal representation for all languages, such as the index collapse issue. 


Check out the Paper. All credit for this research goes to the researchers of this project.

