arXiv:2507.22612v1 Announce Type: cross Abstract: Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments online. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeated words. Most non-autoregressive end-to-end TTS models instead rely on durations extracted from external sources, using additional duration models for alignment. In this paper, we propose a novel duration prediction framework that produces a phoneme-level duration distribution for a given text. In our experiments, the proposed duration model yields more precise predictions and better condition adaptation than previous baseline models. Numerically, it achieves roughly an 11.3% improvement in alignment accuracy and makes zero-shot TTS models more robust to the mismatch between the prompt audio and the input audio.