TTS-1 Technical Report

cs.AI updates on arXiv.org, July 30, 12:46

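This article introduces Inworld TTS-1, a set of Transformer-based autoregressive text-to-speech (TTS) models comprising the efficient TTS-1 and the high-quality TTS-1-Max, both of which show strong speech synthesis performance.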

arXiv:2507.21138v1 (announce type: cross)

Abstract: We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. Our largest model, TTS-1-Max, has 8.8B parameters and is designed for utmost quality and expressiveness in demanding applications. TTS-1 is our most efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment of the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, demonstrating exceptional quality relying purely on in-context learning of the speaker's voice. Inworld TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency, and support 11 languages with fine-grained emotional control and non-verbal vocalizations through audio markups. We additionally open-source our training and modeling code under an MIT license.
