MarkTechPost@AI · February 9
Kyutai Releases Hibiki: A 2.7B Real-Time Speech-to-Speech and Speech-to-Text Translation with Near-Human Quality and Voice Transfer

 

Kyutai has released Hibiki, a 2.7-billion-parameter decoder-only model designed for real-time speech-to-speech (S2ST) and speech-to-text (S2TT) translation. Hibiki runs at a 12.5 Hz frame rate and a 2.2 kbps bitrate, currently supports French-to-English translation, and aims to preserve the speaker's voice characteristics in the translated output. A distilled version, Hibiki-M (1.7B parameters), is optimized for real-time performance on smartphones, making on-device translation more accessible. Hibiki uses a multistream language model that predicts both text and audio tokens, together with a neural audio codec (Mimi) that compresses audio while preserving fidelity. Released as open source, Hibiki stands to contribute significantly to advances in multilingual communication.

🗣️ Hibiki is a 2.7-billion-parameter decoder-only model developed by Kyutai for high-quality real-time speech-to-speech (S2ST) and speech-to-text (S2TT) translation, currently supporting French-to-English.

⏱️ Hibiki relies on a key technique called contextual alignment, which uses a text translation model's perplexity to determine the optimal timing for generating speech, dynamically adjusting translation delay while maintaining coherence. The model also supports batched inference, processing up to 320 sequences in parallel on an H100 GPU, making it suitable for large-scale applications.

📊 Hibiki performs strongly on both translation quality and speaker fidelity, achieving an ASR-BLEU score of 30.5 and surpassing existing baselines, including offline models. Human evaluations rate its naturalness at 3.73/5, approaching the 4.12/5 of professional human interpreters. On speaker similarity, Hibiki scores 0.52, above Seamless's 0.43.

📱 Hibiki-M is a distilled version of Hibiki with 1.7 billion parameters, optimized for real-time performance on smartphones and making on-device translation more practical. Although its speaker similarity is slightly lower than Hibiki's, Hibiki-M remains effective for real-time use.

Real-time speech translation presents a complex challenge, requiring seamless integration of speech recognition, machine translation, and text-to-speech synthesis. Traditional cascaded approaches often introduce compounding errors, fail to retain speaker identity, and suffer from slow processing, making them less suitable for real-time applications like live interpretation. Additionally, existing simultaneous translation models struggle to balance accuracy and latency, relying on complex inference mechanisms that are difficult to scale. A significant barrier remains the lack of large-scale, well-aligned speech datasets, limiting the ability to train models that can generate contextually accurate and natural translations with minimal delay.

Kyutai has developed Hibiki, a 2.7 billion-parameter decoder-only model designed for real-time speech-to-speech (S2ST) and speech-to-text (S2TT) translation. Operating at 12.5Hz framerate with a 2.2kbps bitrate, Hibiki currently supports French-to-English translation and is designed to preserve voice characteristics in the translated output. A distilled version, Hibiki-M (1.7B parameters), is optimized for real-time performance on smartphones, making it more accessible for on-device translation.
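The 12.5 Hz frame rate and 2.2 kbps bitrate together fix the per-frame token budget. A minimal back-of-envelope sketch, assuming an 11-bit codebook (2048 entries) purely for illustration; the Hibiki paper specifies Mimi's actual codebook configuration:

```python
# Token budget implied by a 12.5 Hz, 2.2 kbps neural audio codec.
# BITS_PER_TOKEN assumes 2048-entry codebooks (log2(2048) = 11);
# this is an illustrative assumption, not Mimi's documented setup.
FRAME_RATE_HZ = 12.5
BITRATE_BPS = 2200
BITS_PER_TOKEN = 11

bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ        # bits available per frame
tokens_per_frame = bits_per_frame / BITS_PER_TOKEN  # codec tokens per frame

print(f"{bits_per_frame:.0f} bits/frame, {tokens_per_frame:.0f} tokens/frame")
```

Under these assumptions each 80 ms frame carries 176 bits, i.e. 16 codec tokens, which is the kind of compact audio representation a decoder-only language model can predict autoregressively in real time.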

Technical Approach and Benefits

Hibiki’s decoder-only architecture enables simultaneous speech processing using a multistream language model that predicts both text and audio tokens. It employs a neural audio codec (Mimi) to compress audio while maintaining fidelity, ensuring efficient translation generation. A key aspect of its design is contextual alignment, a method that leverages a text translation model’s perplexity to determine optimal timing for generating speech, allowing Hibiki to adjust translation delays dynamically while maintaining coherence. Additionally, Hibiki supports batch inference, processing up to 320 sequences in parallel on H100 GPUs, making it viable for large-scale applications. The model is trained on 7M hours of English audio, 450K hours of French, and 40K hours of synthetic parallel data, contributing to its robustness across varied speech patterns.
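The wait-versus-emit decision behind contextual alignment can be illustrated with a toy policy. This is a hedged sketch in the spirit of the idea, not Kyutai's implementation: `stepwise_logprobs` and the `max_ppl` threshold are hypothetical stand-ins for the log-probabilities a text translation model would assign to the next target token as more source audio arrives.

```python
import math

def contextual_alignment_schedule(stepwise_logprobs, max_ppl=5.0):
    """Toy wait/emit policy illustrating contextual alignment.

    stepwise_logprobs[t] is the (hypothetical) log-probability a text
    translation model assigns to the next target token after hearing
    t+1 source frames. Emit once the implied per-token perplexity is
    low enough, i.e. once enough source context has accumulated.
    """
    actions = []
    for lp in stepwise_logprobs:
        ppl = math.exp(-lp)  # per-token perplexity from the log-prob
        actions.append("EMIT" if ppl <= max_ppl else "WAIT")
    return actions

# Confidence rises as more source frames arrive, so the policy
# waits early and emits once the translation is well determined.
print(contextual_alignment_schedule([-3.0, -2.0, -1.2, -0.4]))
# → ['WAIT', 'WAIT', 'EMIT', 'EMIT']
```

The appeal of this formulation is that the delay is not a fixed lag: for phrases whose translation is predictable early, the model can speak almost immediately, while ambiguous spans automatically incur more waiting.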

Performance and Evaluation

Hibiki has demonstrated strong performance in translation quality and speaker fidelity. It achieves an ASR-BLEU score of 30.5, surpassing existing baselines, including offline models. Human evaluations rate its naturalness at 3.73/5, approaching the 4.12/5 score of professional human interpreters. The model also performs well in speaker similarity, with a 0.52 similarity score compared to 0.43 for Seamless. Compared to Seamless and StreamSpeech, Hibiki consistently delivers higher translation quality and better voice transfer, while maintaining a competitive latency. The distilled Hibiki-M variant, though slightly lower in speaker similarity, remains effective for real-time on-device use.

Conclusion

Hibiki provides a practical approach to real-time speech translation, integrating contextual alignment, efficient compression, and real-time inference to improve translation quality while preserving natural speech characteristics. By offering an open-source release under a permissive CC-BY license, Hibiki has the potential to contribute significantly to advancements in multilingual communication.


Check out the Paper and the Models on Hugging Face and the GitHub Page. All credit for this research goes to the researchers of this project.


