MarkTechPost@AI July 17, 2024
MELLE: A Novel Continuous-Valued Tokens-based Language Modeling Approach for Text-to-Speech Synthesis (TTS)

🤔 MELLE is a language model based on continuous-valued tokens for zero-shot speech synthesis. It predicts mel-spectrograms directly from text and speech prompts, without discrete vector quantization or a two-pass procedure, overcoming the limitations of neural codec language models such as VALL-E. By introducing latent sampling and a spectrogram flux loss, MELLE generates more diverse and robust predictions. The model's efficiency can be further improved by adjusting the reduction factor, enabling faster decoding. Notably, MELLE achieves results comparable to humans in subjective evaluations, marking major progress in speech synthesis.

💡 MELLE's architecture consists of an embedding layer, an autoregressive Transformer decoder, and a distinctive latent sampling module that increases output diversity. The model also includes a stop prediction layer and a convolutional post-net for spectrogram refinement. Unlike neural codec models, MELLE requires no separate non-autoregressive model, which improves efficiency. It can generate multiple mel-spectrogram frames per step, further boosting performance. The architecture ends with a vocoder that converts the mel-spectrogram into a waveform, providing a streamlined single-pass approach that may surpass previous methods in both quality and efficiency.

🚀 MELLE outperforms VALL-E and its variants on zero-shot speech synthesis tasks. It clearly surpasses the original VALL-E in robustness and speaker similarity, achieving a 47.9% relative WER-H reduction on the continuation task and a 64.4% reduction on the cross-sentence task. Although VALL-E 2 achieves comparable results, MELLE shows better robustness and speaker similarity on the continuation task, highlighting its superior in-context learning ability.

💪 MELLE's performance remains consistently high even as the reduction factor increases, enabling faster training and inference. The model outperforms most recent works in both robustness and speaker similarity, even at larger reduction factors. MELLE-limited, trained on a smaller corpus, still surpasses VALL-E and its variants, with the exception of VALL-E 2. Multiple sampling with a larger reduction factor can improve performance while reducing inference time, as shown by five-time sampling results that maintain high robustness across different reduction-factor settings.

🌟 The study presents MELLE, a significant advance in zero-shot text-to-speech synthesis that introduces a language modeling approach based on continuous acoustic representations. By predicting mel-spectrograms directly from text content and speech prompts, it removes the need for the discrete vector quantization and two-pass procedure of neural codec language models such as VALL-E. The addition of latent sampling and a spectrogram flux loss lets MELLE produce more diverse and robust predictions, and its efficiency can be further improved by adjusting the reduction factor for faster decoding. Notably, MELLE achieves results comparable to humans in subjective evaluations, marking major progress in speech synthesis.

In the realm of large language models (LLMs), there has been a significant transformation in text generation, prompting researchers to explore their potential in audio synthesis. The challenge lies in adapting these models for text-to-speech (TTS) tasks while maintaining high-quality output. Current methodologies, such as neural codec language models like VALL-E, face several limitations. These include the lower fidelity of discrete codec codes compared to mel-spectrograms, robustness issues stemming from random sampling strategies, and the need for complex two-pass decoding processes. These challenges hinder the efficiency and quality of audio synthesis, particularly in zero-shot TTS tasks that require multi-lingual, multi-speaker, and multi-domain capabilities.

Researchers have attempted to tackle the challenges in text-to-speech (TTS) synthesis. Traditional methods include concatenative systems, which reassemble audio segments, and parametric systems, which use acoustic parameters to synthesize speech. End-to-end neural TTS systems, such as Tacotron, TransformerTTS, and FastSpeech, simplified the process by generating mel-spectrograms directly from text.

Recent advancements focus on zero-shot TTS capabilities. Models like VALL-E treat TTS as a conditional language task, using neural codec codes as intermediate representations. VALL-E X extended this approach to multi-lingual scenarios. Mega-TTS proposed disentangling speech attributes for more efficient modeling. Other models like ELLA-V, RALL-E, and VALL-E R aimed to improve robustness and stability.

Some researchers explored non-autoregressive approaches for faster inference, such as SoundStorm’s parallel decoding scheme and StyleTTS 2’s diffusion model. However, these methods often struggle to maintain audio quality or efficiently handle multi-speaker, multi-lingual scenarios.

Researchers from The Chinese University of Hong Kong and Microsoft Corporation present MELLE, a unique approach to text-to-speech synthesis, utilizing continuous-valued tokens based on mel-spectrograms. This method aims to overcome the limitations of discrete codec codes by directly generating continuous mel-spectrogram frames from text input. The approach addresses two key challenges: setting an appropriate training objective for continuous representations and enabling sampling mechanisms in continuous space.

To tackle these challenges, MELLE employs a regression loss paired with a spectrogram flux loss function in place of cross-entropy loss. This new loss function helps model the probability distribution of continuous-valued tokens more effectively. MELLE also incorporates variational inference to facilitate sampling mechanisms, enhancing output diversity and model robustness.
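To make these two ideas concrete, below is a minimal PyTorch sketch of reparameterized latent sampling plus a flux-style regularizer combined with a regression loss. All function names, shapes, and loss weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def latent_sample(hidden: torch.Tensor, proj: torch.nn.Linear):
    # `proj` maps decoder hidden states to the mean and log-variance of a
    # Gaussian over each mel frame; sampling from it provides the diversity
    # that discrete-token models get from top-k sampling.
    mu, logvar = proj(hidden).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # differentiable sample
    # A standard-normal KL term is one common regularizer for such a latent;
    # it is an assumption here, not necessarily the paper's exact objective.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
    return z, kl

def spectrogram_flux_loss(pred: torch.Tensor) -> torch.Tensor:
    # Reward frame-to-frame variation (negative flux) so the model does not
    # collapse onto overly smooth, repetitive spectrograms; one plausible
    # reading of the flux objective described above.
    return -(pred[:, 1:] - pred[:, :-1]).abs().mean()

def training_loss(pred, target, kl, flux_w=0.5, kl_w=0.1):
    # Regression (L1 + L2) replaces the cross-entropy used over discrete codes.
    reg = F.l1_loss(pred, target) + F.mse_loss(pred, target)
    return reg + flux_w * spectrogram_flux_loss(pred) + kl_w * kl
```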

The model operates as a single-pass zero-shot TTS system, autoregressively predicting mel-spectrogram frames based on previous mel-spectrogram and text tokens. This approach aims to eliminate the robustness issues associated with sampling discrete codec codes, potentially offering improved fidelity and efficiency in speech synthesis.
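Under those assumptions, the single-pass decoding loop could look like the sketch below. The `model` interface (`decode`, `latent_proj`, `stop_head`, `post_net`) is hypothetical, matching the architecture sketch after the next paragraph, and `latent_sample` is the helper from the sketch above.

```python
import torch

@torch.no_grad()
def synthesize(model, text_tokens, prompt_mels, max_frames=2000):
    # Autoregressive loop: each step conditions on the text tokens plus the
    # speech prompt and all previously generated mel frames. Recomputing the
    # full forward pass per step is naive but keeps the sketch short.
    mels = prompt_mels                            # (1, T0, n_mels) acoustic prefix
    for _ in range(max_frames):
        hidden = model.decode(text_tokens, mels)  # causally masked Transformer
        # latent_sample is defined in the earlier sketch.
        frame, _ = latent_sample(hidden[:, -1:], model.latent_proj)
        mels = torch.cat([mels, frame], dim=1)    # feed the new frame back in
        if model.stop_head(hidden[:, -1]).sigmoid().item() > 0.5:
            break                                 # stop prediction fired
    # Residual post-net refinement; Conv1d layers expect (batch, n_mels, time).
    mels = mels + model.post_net(mels.transpose(1, 2)).transpose(1, 2)
    return mels[:, prompt_mels.size(1):]          # new frames only; a vocoder
                                                  # then renders the waveform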

MELLE’s architecture integrates several innovative components for efficient text-to-speech synthesis. It employs an embedding layer, an autoregressive Transformer decoder, and a unique latent sampling module that enhances output diversity. The model includes a stop prediction layer and a convolutional post-net for spectrogram refinement. Unlike neural codec models, MELLE doesn’t require a separate non-autoregressive model, improving efficiency. It can generate multiple mel-spectrogram frames per step, further enhancing performance. The architecture concludes with a vocoder to convert the mel-spectrogram into a waveform, offering a streamlined, single-pass approach that potentially surpasses previous methods in both quality and efficiency.
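One possible PyTorch wiring of those components is sketched below; every layer size, name, and hyperparameter is an assumption for illustration, and the configuration in the paper may differ (the vocoder, e.g. HiFi-GAN, is treated as an external component).

```python
import torch
import torch.nn as nn

class MELLESketch(nn.Module):
    def __init__(self, vocab_size=256, d_model=1024, n_mels=80,
                 n_layers=12, n_heads=16):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)  # text token embedding
        self.mel_prenet = nn.Linear(n_mels, d_model)       # mel frame -> model dim
        # Decoder-only Transformer, built here as an encoder stack that is
        # always run with a causal mask.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.latent_proj = nn.Linear(d_model, 2 * n_mels)  # Gaussian mean + log-var
        self.stop_head = nn.Linear(d_model, 1)             # stop prediction layer
        self.post_net = nn.Sequential(                     # conv spectrogram refiner
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def decode(self, text_tokens, mels):
        # Embed text and mel prefixes, concatenate them into one sequence,
        # and run the causally masked decoder; return the hidden states of
        # the mel positions, from which the next frame is sampled.
        x = torch.cat([self.text_emb(text_tokens), self.mel_prenet(mels)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.decoder(x, mask=mask)[:, text_tokens.size(1):]
```

An instance with these attributes is what the `synthesize` loop sketched earlier assumes; because one autoregressive stack plus a lightweight post-net covers both prediction and refinement, no second non-autoregressive model is needed.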

MELLE demonstrates superior performance in zero-shot speech synthesis tasks compared to VALL-E and its variants. It significantly outperforms vanilla VALL-E in robustness and speaker similarity, achieving a 47.9% relative reduction in WER-H on the continuation task and a 64.4% reduction on the cross-sentence task. While VALL-E 2 shows comparable results, MELLE exhibits better robustness and speaker similarity in the continuation task, highlighting its superior in-context learning ability.

MELLE’s performance remains consistently high even with increased reduction factors, allowing for faster training and inference. The model outperforms most recent works in both robustness and speaker similarity, even with larger reduction factors. MELLE-limited, trained on a smaller corpus, still surpasses VALL-E and its variants, except VALL-E 2. Using multiple sampling with a larger reduction factor can enhance performance while reducing inference time, as demonstrated by the five-time sampling results, which show consistent high robustness across different reduction factor settings.
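As a rough illustration of why a larger reduction factor speeds up decoding, the snippet below (with assumed shapes and a hypothetical output projection) shows the decoder emitting r frames per autoregressive step, cutting the number of steps by roughly a factor of r.

```python
import torch

n_mels, d_model, r = 80, 1024, 4     # r is the reduction factor
T = 800                              # mel frames in the target utterance
steps = T // r                       # 200 decoding steps instead of 800

# A hypothetical projection emits r frames from one hidden state at once.
out_proj = torch.nn.Linear(d_model, r * n_mels)
hidden = torch.randn(1, 1, d_model)            # last decoder position
frames = out_proj(hidden).view(1, r, n_mels)   # r frames in a single step
print(f"{steps} steps at r={r} for {T} frames")
```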

This study introduces MELLE, a significant advancement in zero-shot text-to-speech synthesis built on a continuous acoustic representation-based language modeling approach. By directly predicting mel-spectrograms from text content and speech prompts, it eliminates the need for the discrete vector quantization and two-pass procedures typical of neural codec language models like VALL-E. The incorporation of latent sampling and spectrogram flux loss enables MELLE to produce more diverse and robust predictions. The model's efficiency can be further enhanced by adjusting the reduction factor for faster decoding. Notably, MELLE achieves results comparable to human performance in subjective evaluations, marking a substantial step forward in the field of speech synthesis.


Check out the Paper. All credit for this research goes to the researchers of this project.

