MarkTechPost@AI July 4, 2024
Kyutai Open Sources Moshi: A Real-Time Native Multimodal Foundation AI Model that can Listen and Speak

Kyutai has released Moshi, a revolutionary real-time native multimodal foundation model that mirrors and surpasses some of the capabilities of GPT-4o, which OpenAI demonstrated in May. Moshi is designed to understand and express emotions, and offers features such as speaking with different accents, including French. It can listen while generating audio and speech, all while maintaining a seamless flow of textual thoughts. One of Moshi's standout features is its ability to handle two audio streams at once, allowing it to listen and speak simultaneously. This real-time interaction is enabled by joint pre-training on a mix of text and audio, leveraging synthetic text data from Helium, the 7-billion-parameter language model developed by Kyutai.

🚀 **Moshi's standout features:** Moshi can process two audio streams simultaneously, allowing it to listen and speak at the same time. This is enabled by joint pre-training on a mix of text and audio, leveraging synthetic text data from Helium, Kyutai's 7-billion-parameter language model. Moshi's training also included 100,000 "oral-style" synthetic conversations converted using Text-to-Speech (TTS) technology. Its voice was trained on synthetic data generated by a separate TTS model, achieving an impressive end-to-end latency of just 200 milliseconds.

💡 **Moshi's applications:** Moshi's use cases are broad, including research assistance, brainstorming, and language learning. Its open-source nature encourages collaboration and innovation, ensuring the benefits of this groundbreaking technology are accessible to all.

🌐 **Moshi's open-source release:** By open-sourcing Moshi, Kyutai demonstrates its commitment to transparency and collaborative development within the AI community. Kyutai also emphasizes responsible AI use through watermarking to detect AI-generated audio, a feature that is still a work in progress.

💻 **Moshi's technical details:** At its core, Moshi is a 7-billion-parameter multimodal language model that processes speech input and output. The model uses a two-channel I/O system, generating text tokens and audio codec tokens concurrently. The base text language model, Helium 7B, was trained from scratch and then jointly trained with text and audio codecs. Built on Kyutai's in-house Mimi model, the speech codec achieves a 300x compression factor while capturing both semantic and acoustic information.

📈 **Moshi's roadmap:** Kyutai plans to release a technical report and open model versions, including the inference codebase, the 7B model, the audio codec, and the full optimized stack. Future iterations, such as Moshi 1.1, 1.2, and 2.0, will refine the model based on user feedback. Moshi's license aims to be as permissive as possible to foster widespread adoption and innovation.

In a stunning announcement reverberating through the tech world, Kyutai introduced Moshi, a revolutionary real-time native multimodal foundation model. This innovative model mirrors and surpasses some of the functionalities showcased by OpenAI’s GPT-4o in May.

Moshi is designed to understand and express emotions, offering capabilities like speaking with different accents, including French. It can listen and generate audio and speech while maintaining a seamless flow of textual thoughts as it speaks. One of Moshi's standout features is its ability to handle two audio streams at once, allowing it to listen and talk simultaneously. This real-time interaction is underpinned by joint pre-training on a mix of text and audio, leveraging synthetic text data from Helium, a 7-billion-parameter language model developed by Kyutai.
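
To make the dual-stream idea concrete, here is a minimal sketch of how two concurrent audio streams and a text stream could be laid out as tokens for an autoregressive model. The class and function names, and the eight-codebooks-per-frame figure, are illustrative assumptions; Kyutai had not published Moshi's actual stream layout at the time of writing.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-timestep layout: one text token plus one codec frame for
# each of the two audio channels (incoming user speech and outgoing model
# speech). This is an illustration, not Moshi's published format.

@dataclass
class Step:
    text_token: int               # token from the Helium-style text stream
    user_audio_codes: List[int]   # codec codebook indices for the incoming stream
    model_audio_codes: List[int]  # codec codebook indices for the outgoing stream

def interleave(steps: List[Step]) -> List[int]:
    """Flatten the per-timestep streams into a single autoregressive sequence."""
    seq: List[int] = []
    for s in steps:
        seq.append(s.text_token)
        seq.extend(s.user_audio_codes)
        seq.extend(s.model_audio_codes)
    return seq

# Two timesteps with an assumed 8 codebooks per audio frame.
steps = [
    Step(text_token=101, user_audio_codes=[3] * 8, model_audio_codes=[7] * 8),
    Step(text_token=102, user_audio_codes=[4] * 8, model_audio_codes=[9] * 8),
]
print(len(interleave(steps)))  # 34 tokens: (1 text + 8 + 8) per step
```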

The fine-tuning process of Moshi involved 100,000 “oral-style” synthetic conversations, converted using Text-to-Speech (TTS) technology. The model’s voice was trained on synthetic data generated by a separate TTS model, achieving an impressive end-to-end latency of 200 milliseconds. Remarkably, Kyutai has also developed a smaller variant of Moshi that can run on a MacBook or a consumer-sized GPU, making it accessible to a broader range of users.
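
For a sense of what a 200-millisecond end-to-end budget means in practice, the sketch below splits it across the stages of a speech-to-speech loop. The stage names and per-stage numbers are assumptions for illustration; the article reports only the 200 ms total.

```python
# Illustrative only: a rough breakdown of where a ~200 ms end-to-end latency
# budget could go in a real-time speech-to-speech loop. These stages and
# numbers are assumptions, not published Moshi measurements.

budget_ms = {
    "audio frame accumulation": 80,   # wait for enough samples to form a codec frame
    "encode incoming audio":    20,
    "model forward pass":       60,
    "decode outgoing audio":    20,
    "playback buffering":       20,
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:<28} {ms:>4} ms")
print(f"{'total':<28} {total:>4} ms")  # 200 ms
```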

Kyutai has emphasized the importance of responsible AI use by incorporating watermarking to detect AI-generated audio, a feature that is currently a work in progress. The decision to release Moshi as an open-source project highlights Kyutai’s commitment to transparency and collaborative development within the AI community.

At its core, Moshi is powered by a 7-billion-parameter multimodal language model that processes speech input and output. The model operates with a two-channel I/O system, generating text tokens and audio codec tokens concurrently. The base text language model, Helium 7B, was trained from scratch and then jointly trained with text and audio codecs. Based on Kyutai's in-house Mimi model, the speech codec boasts a 300x compression factor, capturing both semantic and acoustic information.
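
A quick back-of-the-envelope calculation shows what a 300x compression factor implies for bitrate. The 24 kHz, 16-bit mono input format is assumed for illustration; the article states only the compression factor itself.

```python
# Rough check of what a 300x compression factor implies for the audio codec.
# The raw input format below is an assumption, not a published Moshi detail.

sample_rate_hz = 24_000
bits_per_sample = 16
raw_bitrate_bps = sample_rate_hz * bits_per_sample   # 384,000 bit/s of raw PCM
compressed_bps = raw_bitrate_bps / 300                # ~1,280 bit/s after compression

print(f"raw PCM:    {raw_bitrate_bps / 1000:.0f} kbit/s")
print(f"compressed: {compressed_bps / 1000:.2f} kbit/s")
```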

Training Moshi involved a rigorous process: the model was fine-tuned on 100,000 highly detailed transcripts annotated with emotion and style. The Text-to-Speech engine, which supports 70 different emotions and speaking styles, was fine-tuned on 20 hours of audio recorded by a licensed voice talent named Alice. The model is designed for adaptability and can be fine-tuned with less than 30 minutes of audio.

Moshi’s deployment showcases its efficiency. The demo model, hosted on Scaleway and Hugging Face, can handle a batch size of two within 24 GB of VRAM. It supports multiple backends, including CUDA, Metal, and CPU, and benefits from inference-code optimizations written in Rust. Enhanced KV caching and prompt caching are anticipated to improve performance further.
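
As background on why KV caching matters for decoding speed, here is a minimal, generic sketch of single-head attention with a growing key/value cache: keys and values for past positions are computed once and reused, so each new step only projects one fresh token. This is a generic illustration, not code taken from Moshi's Rust inference stack.

```python
import numpy as np

d = 16                                # head dimension (arbitrary for the sketch)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []             # grows by one entry per generated token

def attend(x_t: np.ndarray) -> np.ndarray:
    """One decoding step: project the new token, append its K/V, attend over the cache."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)          # computed once, reused at every later step
    v_cache.append(x_t @ Wv)
    K = np.stack(k_cache)             # (t, d) keys for all steps so far
    V = np.stack(v_cache)             # (t, d) values for all steps so far
    scores = K @ q / np.sqrt(d)       # (t,) attention logits for the new query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                # (d,) attention output for this step

for step in range(4):
    out = attend(rng.standard_normal(d))
print(f"cache holds {len(k_cache)} key/value pairs after 4 steps")
```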

Looking ahead, Kyutai has ambitious plans for Moshi. The team intends to release a technical report and open model versions, including the inference codebase, the 7B model, the audio codec, and the full optimized stack. Future iterations, such as Moshi 1.1, 1.2, and 2.0, will refine the model based on user feedback. Moshi’s licensing aims to be as permissive as possible, fostering widespread adoption and innovation.

In conclusion, Moshi exemplifies the potential of small, focused teams to achieve extraordinary advancements in AI technology. This model opens up new avenues for research assistance, brainstorming, language learning, and more, demonstrating the transformative power of AI when deployed on-device with unparalleled flexibility. As an open-source model, it invites collaboration and innovation, ensuring that the benefits of this groundbreaking technology are accessible to all.


Check out the Announcement, Keynote, and Demo Chat. All credit for this research goes to the researchers of this project.

Paper, Code, and Model are coming…


