As speech synthesis technology continues to evolve, earlier versions of IndexTTS have performed well across a range of scenarios, but they still left room for improvement in the nuance of emotional expression and the precision of duration control. To address these issues and further advance the practical deployment of zero-shot speech synthesis, the Bilibili speech team has deeply optimized the model architecture and training strategy and released a new-generation speech synthesis model: IndexTTS2.

Compared with autoregressive (AR) systems, a major advantage of non-autoregressive (NAR) systems is that the generated duration is controllable, which makes it easy to precisely edit speaking rate and rhythm. In IndexTTS2, we propose a novel "time encoding" mechanism that is generally applicable to AR systems, solving for the first time the difficulty traditional AR models have in precisely controlling speech duration. This design retains the advantages of the AR architecture in prosodic naturalness, style transfer, and multimodal extensibility, while also enabling the synthesis of speech with a fixed length. The technique has already been applied to Bilibili's "original-voice video translation" feature, which is currently in closed beta and available to some users.

In practice, it is often difficult to obtain a single reference audio clip that simultaneously matches the desired timbre and carries the desired emotion. IndexTTS2 therefore introduces a disentangled modeling mechanism for timbre and emotion: in addition to supporting a single audio reference, it supports specifying a timbre reference and an emotion reference separately, enabling more flexible and fine-grained control over synthesis. The model can also control emotion from text, precisely adjusting the emotional coloring of the synthesized speech through natural-language descriptions, usage-scenario descriptions, contextual cues, and the like.

IndexTTS2 strikes a better balance between flexibility and controllability: it not only supports high-quality zero-shot speech synthesis, but also significantly improves the realism and expressiveness of the synthesized speech along the emotional dimension. The resulting speech sounds emotionally rich and natural, close to a real human voice, and is well suited to AI dubbing, audiobooks, motion comics, video translation, voice conversation, podcast creation, and other scenarios, marking an important milestone in moving zero-shot TTS toward practical use.

To further promote open innovation in speech synthesis and its adoption in industry, we have written up the research behind IndexTTS2 as a paper, "IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech", now available on arXiv: 👉 https://arxiv.org/abs/2506.21619

We will also fully open-source the IndexTTS2 inference code and model weights. We invite you to follow the project's progress and star ⭐ the repository for the latest open-source updates: 👉 https://github.com/index-tts/index-tts Going forward, we will continue to optimize model performance, release more resources, and work with the developer community to build an open and thriving technical ecosystem.
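To make the controls described above concrete, here is a minimal usage sketch. Because the inference code and weights are only to be open-sourced later, every name in it (the module path `indextts.infer_v2`, the `IndexTTS2` class, and parameters such as `spk_audio_prompt`, `emo_audio_prompt`, `emo_text`, and `target_duration`) is an illustrative assumption rather than a confirmed API.

```python
# Hypothetical usage sketch of the controls described above; all names are
# assumptions for illustration and may differ from the eventual open-source API.
from indextts.infer_v2 import IndexTTS2  # assumed module path and class name

tts = IndexTTS2(cfg_path="checkpoints/config.yaml", model_dir="checkpoints")

# 1) Plain zero-shot synthesis: one reference clip supplies both timbre and emotion.
tts.infer(
    spk_audio_prompt="examples/voice_ref.wav",
    text="大家好,欢迎体验 IndexTTS2。",
    output_path="out_basic.wav",
)

# 2) Disentangled control: timbre from one clip, emotion from another.
tts.infer(
    spk_audio_prompt="examples/voice_ref.wav",   # who is speaking (timbre reference)
    emo_audio_prompt="examples/angry_ref.wav",   # how it should sound (emotion reference)
    text="你到底想怎么样?",
    output_path="out_emotion.wav",
)

# 3) Text-described emotion plus a fixed duration budget, relying on the
#    "time encoding" mechanism that lets the AR backbone hit a target length.
tts.infer(
    spk_audio_prompt="examples/voice_ref.wav",
    emo_text="sad, on the verge of tears",   # natural-language emotion description
    target_duration=3.2,                     # seconds; assumed duration-control knob
    text="对不起,我真的尽力了。",
    output_path="out_fixed_length.wav",
)
```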
-End-

Author丨Index Speech Team

Developer Q&A: Have you built anything interesting with IndexTTS? We would also love to hear your impressions and what you hope to see as the model evolves: richer features, more languages? Suggestions are welcome, or come to GitHub, star the project, and build it with us.