MarkTechPost@AI · October 14, 2024
F5-TTS: A Fully Non-Autoregressive Text-to-Speech System based on Flow Matching with Diffusion Transformer (DiT)

F5-TTS is a non-autoregressive text-to-speech system built on flow matching with a Diffusion Transformer. It addresses many problems of conventional models and delivers strong synthesis quality and fast inference, making it a notable contribution.

🎤 F5-TTS is a non-autoregressive text-to-speech system from Shanghai Jiao Tong University and collaborating institutions. It uses flow matching with a Diffusion Transformer and does not need complex components such as duration modeling.

💻 The system uses a ConvNeXt architecture to refine the text representation and introduces a novel Sway Sampling strategy that improves performance at inference time without any retraining.

🎉 F5-TTS outperforms other state-of-the-art systems in synthesis quality and inference speed, achieving a low word error rate and a favorable real-time factor on the LibriSpeech-PC test set while improving naturalness and intelligibility.

🚀 F5-TTS simplifies the text-to-speech pipeline by removing components such as duration predictors and phoneme alignment, keeps the architecture lightweight, and releases an open-source framework to advance the field, while also highlighting ethical concerns about potential misuse.

The current challenges in text-to-speech (TTS) systems revolve around the inherent limitations of autoregressive models and their complexity in aligning text and speech accurately. Many conventional TTS models require complex elements such as duration modeling, phoneme alignment, and dedicated text encoders, which add significant overhead and complexity to the synthesis process. Furthermore, previous models like E2 TTS have faced issues with slow convergence, robustness, and maintaining accurate alignment between the input text and generated speech, making them challenging to optimize and deploy efficiently in real-world scenarios.

Researchers from Shanghai Jiao Tong University, the University of Cambridge, and Geely Automobile Research Institute introduced F5-TTS, a non-autoregressive text-to-speech (TTS) system that utilizes flow matching with a Diffusion Transformer (DiT). Unlike many conventional TTS models, F5-TTS does not require complex elements like duration modeling, phoneme alignment, or a dedicated text encoder. Instead, it introduces a simplified approach where text inputs are padded to match the length of the speech input, leveraging flow matching for effective synthesis. F5-TTS is designed to address the shortcomings of its predecessor, E2 TTS, which faced slow convergence and alignment issues between speech and text. Notable improvements include a ConvNeXt architecture to refine text representation and a novel Sway Sampling strategy during inference, significantly enhancing performance without retraining.
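As a rough illustration of this padding scheme, the sketch below pads a sequence of character-token IDs with a filler token so that it matches the number of mel-spectrogram frames. The token values, filler ID, and function name are hypothetical and not taken from the released code.

```python
import torch

FILLER_ID = 0  # hypothetical ID for the filler token used to pad the text

def pad_text_to_speech_length(char_ids: torch.Tensor, num_mel_frames: int) -> torch.Tensor:
    """Pad a 1-D tensor of character token IDs with filler tokens so its
    length matches the number of mel-spectrogram frames."""
    num_chars = char_ids.shape[0]
    if num_chars >= num_mel_frames:
        return char_ids[:num_mel_frames]  # truncate if the text is longer than the speech
    filler = torch.full((num_mel_frames - num_chars,), FILLER_ID, dtype=char_ids.dtype)
    return torch.cat([char_ids, filler], dim=0)

# Example: 12 characters padded to 80 mel frames.
chars = torch.randint(1, 100, (12,))
padded = pad_text_to_speech_length(chars, num_mel_frames=80)
print(padded.shape)  # torch.Size([80])
```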

Structurally, F5-TTS leverages ConvNeXt and DiT to overcome alignment challenges between the text and generated speech. The input text is first processed by ConvNeXt blocks to prepare it for in-context learning with speech, allowing smoother alignment. The character sequence, padded with filler tokens, is fed into the model alongside a noisy version of the input speech. The Diffusion Transformer (DiT) backbone is used for training, employing flow matching to map a simple initial distribution to the data distribution effectively. Additionally, F5-TTS includes an innovative inference-time Sway Sampling technique that helps control flow steps, prioritizing early-stage inference to improve the alignment of generated speech with the input text.
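The flow-matching objective itself is compact. The sketch below shows one training step of standard conditional flow matching on mel-spectrogram frames: noise and data are linearly interpolated at a random flow time t, and a model (here a placeholder for the ConvNeXt + DiT backbone) is trained to predict the velocity that carries noise toward data. The tensor shapes, model signature, and loss weighting are illustrative assumptions, not the exact recipe from the paper.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, mel, text_emb):
    """One conditional flow-matching training step (illustrative sketch).

    mel:      (batch, frames, n_mels) target mel-spectrogram (x1)
    text_emb: (batch, frames, dim) ConvNeXt-processed, length-padded text
    model:    DiT-style network predicting a velocity field shaped like mel
    """
    x1 = mel
    x0 = torch.randn_like(x1)                             # simple prior: Gaussian noise
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)   # flow time in [0, 1]
    xt = (1 - t) * x0 + t * x1                            # point on the straight path
    target_velocity = x1 - x0                             # velocity of that path
    pred_velocity = model(xt, text_emb, t.squeeze(-1).squeeze(-1))
    return F.mse_loss(pred_velocity, target_velocity)
```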

The results presented in the paper demonstrate that F5-TTS outperforms other state-of-the-art TTS systems in terms of synthesis quality and inference speed. The model achieved a word error rate (WER) of 2.42 on the LibriSpeech-PC dataset using 32 function evaluations (NFE) and demonstrated a real-time factor (RTF) of 0.15 for inference. This performance is a significant improvement over diffusion-based models like E2 TTS, which required a longer convergence time and had difficulties with maintaining robustness across different input scenarios. The Sway Sampling strategy notably enhances naturalness and intelligibility, allowing the model to achieve smooth and expressive zero-shot generation. Evaluation metrics such as WER and speaker similarity scores confirm the competitive quality of the generated speech.
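Sway Sampling reshapes the flow-step schedule used at inference so that more of the step budget is spent early in the trajectory, where alignment between text and speech is established. The sketch below shows one plausible way to build such an early-biased schedule; the mapping function, its coefficient, and the default value are assumptions for illustration and may differ from the exact formulation in the paper.

```python
import torch

def sway_schedule(num_steps: int, s: float = -1.0) -> torch.Tensor:
    """Monotone flow-step schedule over [0, 1] skewed toward early steps
    (illustrative; a negative coefficient s increases early-step density)."""
    u = torch.linspace(0, 1, num_steps + 1)
    # Assumed sway mapping: t = u + s * (cos(pi/2 * u) - 1 + u); identity when s = 0.
    return u + s * (torch.cos(torch.pi / 2 * u) - 1 + u)

steps = sway_schedule(num_steps=32)  # 32 NFE, as reported for F5-TTS
print(steps[:5])                     # the first few steps cluster near t = 0
```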

In conclusion, F5-TTS successfully introduces a simpler, highly efficient pipeline for TTS synthesis by eliminating the need for duration predictors, phoneme alignments, and explicit text encoders. The use of ConvNeXt for text processing and Sway Sampling for optimized flow control collectively improves alignment robustness, training efficiency, and speech quality. By maintaining a lightweight architecture and providing an open-source framework, F5-TTS aims to advance community-driven development in text-to-speech technologies. The researchers also highlight the ethical considerations for the potential misuse of such models, emphasizing the need for watermarking and detection systems to prevent fraudulent use.


Check out the paper, the model on Hugging Face, and the GitHub repository. All credit for this research goes to the researchers of this project.

