MarkTechPost@AI 07月05日 16:30
Kyutai Releases 2B Parameter Streaming Text-to-Speech TTS with 220ms Latency and 2.5M Hours of Training

Kyutai has released a 2-billion-parameter streaming text-to-speech (TTS) model designed for real-time responsiveness. Trained on 2.5 million hours of audio and released under the CC-BY-4.0 license, it emphasizes openness and reproducibility. The model achieves ultra-low-latency audio generation (220 ms) while maintaining high fidelity, making it especially suitable for edge deployment and intelligent agents. Its key technique, Delayed Streams Modeling, lets the model begin synthesizing speech before the full text input has arrived, enabling fast responses. The model supports English and French, and the weights and inference scripts are available on Hugging Face for developers and researchers.

🗣️ Kyutai has introduced a streaming text-to-speech (TTS) model with 2 billion parameters, trained on 2.5 million hours of speech data and licensed under CC-BY-4.0.

⏱️ On a single NVIDIA L40 GPU, the model serves 32 concurrent users with latency below 350 ms, and a single user sees just 220 ms, enabling near-real-time applications such as conversational agents and voice assistants.

💡 Kyutai's core innovation is Delayed Streams Modeling, which allows speech synthesis to begin before the complete text input arrives, delivering fast responses while preserving prediction quality.

🌍 The model supports English and French, and the weights and inference scripts have been published on Hugging Face for researchers, developers, and commercial teams.

🚀 This low-latency speech generation suits a wide range of real-time AI applications, including conversational AI, assistive technology, media production, and edge devices, and offers an efficient way to scale voice services in cloud environments.

Kyutai, an open AI research lab, has released a groundbreaking streaming Text-to-Speech (TTS) model with ~2 billion parameters. Designed for real-time responsiveness, this model delivers ultra-low latency audio generation (220 milliseconds) while maintaining high fidelity. It’s trained on an unprecedented 2.5 million hours of audio and is licensed under the permissive CC-BY-4.0, reinforcing Kyutai’s commitment to openness and reproducibility. This advancement redefines the efficiency and accessibility of large-scale speech generation models, particularly for edge deployment and agentic AI.

Unpacking the Performance: Sub-350ms Latency for 32 Concurrent Users on a Single L40 GPU

The model’s streaming capability is its most distinctive feature. On a single NVIDIA L40 GPU, the system can serve up to 32 concurrent users while keeping the latency under 350ms. For individual use, the model maintains a generation latency as low as 220ms, enabling nearly real-time applications such as conversational agents, voice assistants, and live narration systems. This performance is enabled through Kyutai’s novel Delayed Streams Modeling approach, which allows the model to generate speech incrementally as text arrives.

Key Technical Metrics:

- Parameters: ~2 billion
- Training data: 2.5 million hours of audio
- Single-user latency: 220 ms
- Concurrency: 32 users under 350 ms latency on one NVIDIA L40 GPU
- Languages: English and French
- License: CC-BY-4.0

Delayed Streams Modeling: Architecting Real-Time Responsiveness

Kyutai’s innovation is anchored in Delayed Streams Modeling, a technique that allows speech synthesis to begin before the full input text is available. This approach is specifically designed to balance prediction quality with response speed, enabling high-throughput streaming TTS. Unlike conventional autoregressive models that suffer from response lag, this architecture maintains temporal coherence while achieving faster-than-real-time synthesis.
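To make the idea concrete, here is a toy simulation of the delayed-streams concept, not Kyutai's actual implementation: text and audio are treated as two time-aligned token streams, with the audio stream shifted by a fixed delay so that synthesis starts after only a few text tokens rather than waiting for the full sentence. The frame strings and the `delay` value are illustrative placeholders.

```python
# Toy sketch of delayed-streams scheduling (illustrative only, not
# Kyutai's model): the audio stream lags the text stream by a fixed
# number of steps, so speech frames start flowing before all text
# has arrived.

def delayed_stream_synthesis(text_tokens, delay=3):
    """Return (step, audio_frame) pairs; audio lags text by `delay` steps."""
    buffer = []   # text received so far
    emitted = []  # audio frames produced so far
    for step, tok in enumerate(text_tokens):
        buffer.append(tok)
        # Once the delay budget is filled, emit one audio frame per step,
        # conditioned on everything received so far (here: a placeholder).
        if step >= delay:
            emitted.append((step, f"audio[{step - delay}|ctx={len(buffer)}]"))
    # Flush: after the text ends, emit the remaining delayed frames.
    n = len(text_tokens)
    for step in range(n, n + delay):
        emitted.append((step, f"audio[{step - delay}|ctx={n}]"))
    return emitted

frames = delayed_stream_synthesis(list("hello world"), delay=3)
# The first audio frame appears after only `delay` text tokens,
# i.e. before the full sentence is available.
print(frames[0])   # → (3, 'audio[0|ctx=4]')
print(len(frames)) # → 11 (one frame per text token)
```

The fixed text-to-audio offset is what trades a small, bounded delay for the ability to stream: a larger `delay` gives the model more context per frame, a smaller one lowers latency.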

The codebase and training recipe for this architecture are available at Kyutai’s GitHub repository, supporting full reproducibility and community contributions.

Model Availability and Open Research Commitment

Kyutai has released the model weights and inference scripts on Hugging Face, making it accessible for researchers, developers, and commercial teams. The permissive CC-BY-4.0 license encourages unrestricted adaptation and integration into applications, provided proper attribution is maintained.

This release supports both batch and streaming inference, making it a versatile foundation for voice cloning, real-time chatbots, accessibility tools, and more. With pretrained models in both English and French, Kyutai sets the stage for multilingual TTS pipelines.

Implications for Real-Time AI Applications

By reducing the speech generation latency to the 200ms range, Kyutai’s model narrows the human-perceptible delay between intent and speech, making it viable for:

- Conversational AI and voice assistants
- Accessibility and assistive technologies
- Media production and live narration
- Voice interfaces on edge devices

The ability to serve 32 users on a single L40 GPU without quality degradation also makes it attractive for scaling speech services efficiently in cloud environments.
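A quick back-of-envelope calculation shows what serving 32 real-time streams implies for a batched decoder. The 12.5 Hz frame rate below is an assumption for illustration (the article does not state the codec's frame rate); the published figures are only the 220 ms single-user and sub-350 ms 32-user latencies.

```python
# Back-of-envelope sizing for batched real-time TTS serving.
# Assumption (not from the article): audio is emitted in fixed-size
# frames at 12.5 Hz, i.e. one 80 ms frame per stream per step.

def required_frames_per_second(users, frame_rate_hz):
    """Total frames/s the model must generate to keep all streams real-time."""
    return users * frame_rate_hz

def max_step_time_ms(frame_rate_hz):
    """A batched decoding step must finish within one frame interval."""
    return 1000.0 / frame_rate_hz

print(required_frames_per_second(32, 12.5))  # → 400.0 frames/s in total
print(max_step_time_ms(12.5))                # → 80.0 ms budget per batched step
```

In other words, concurrency costs throughput rather than per-step latency: as long as one batched step over all 32 streams fits inside a single frame interval, every user stays real-time.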

Conclusion: Open, Fast, and Ready for Deployment

Kyutai’s streaming TTS release is a milestone in speech AI. With high-quality synthesis, real-time latency, and generous licensing, it addresses critical needs for both researchers and real-world product teams. The model’s reproducibility, multilingual support, and scalable performance make it a standout alternative to proprietary solutions.

For more details, you can explore the official model card on Hugging Face, technical explanation on Kyutai’s site, and implementation specifics on GitHub.

