MarkTechPost@AI, October 19, 2024
Meta AI Releases Meta Spirit LM: An Open Source Multimodal Language Model Mixing Text and Speech

Meta AI recently released Meta Spirit LM, an open-source multimodal language model that can freely mix text and speech, addressing the lack of expressivity in existing text-to-speech (TTS) systems.

🤖 Meta Spirit LM overcomes the limitations of existing TTS systems by integrating text and speech at the word level, allowing it to cross modalities more seamlessly. The model is trained on speech and text datasets with a word-level interleaving method, effectively capturing the expressive characteristics of spoken language while retaining the strong semantic capabilities of text-based models.

🗣️ Meta Spirit LM comes in two versions: Spirit LM Base and Spirit LM Expressive. Spirit LM Base uses phonetic tokens to encode speech, allowing words to be represented efficiently, while Spirit LM Expressive goes a step further by adding pitch and style tokens that capture nuances of tone, such as excitement or anger, and generating expressive speech that reflects those emotions.

🧠 Meta Spirit LM is capable of few-shot learning across modalities, for tasks such as automatic speech recognition (ASR), text-to-speech (TTS), and speech classification. This versatility makes it a significant improvement over traditional multimodal AI models, which typically operate in isolated domains. By learning representations that span text and speech, the model can also be used in complex applications, including expressive storytelling, emotion-driven virtual assistants, and enhanced interactive dialogue systems.

💡 The significance of Meta Spirit LM lies in its ability to transition freely between speech and text, substantially enhancing the multimodal AI experience. The Expressive version (Spirit LM Expressive) goes beyond standard speech models by preserving sentiment and tone across modalities. Evaluation on the Speech-Text Sentiment Preservation (STSP) benchmark shows that Spirit LM Expressive effectively retains emotional intent, delivering more natural and emotive outputs than standard LLMs that rely on ASR and TTS cascades.

🚀 Another key aspect of Meta Spirit LM is its few-shot learning capability across modalities. The model has demonstrated the ability to handle cross-modal tasks, such as converting text to expressive speech, with competitive accuracy that showcases its generalized understanding across modalities. This makes Meta Spirit LM a significant step forward for conversational agents, accessible communication tools for people with disabilities, and educational technologies that require natural, expressive dialogue. The model's open-source nature also invites the broader research community to explore and improve its multimodal capabilities.

One of the primary challenges in developing advanced text-to-speech (TTS) systems is the lack of expressivity when transcribing and generating speech. Traditionally, speech pipelines built around large language models (LLMs) convert speech to text using automatic speech recognition (ASR), process the text with the LLM, and then convert the output back to speech via TTS. However, this approach often loses expressive quality, as nuances such as tone, emotion, and pitch are stripped away during the ASR step. As a result, the synthesized speech tends to sound monotonic or unnatural, unable to adequately convey emotions like excitement, anger, or surprise.
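
To make the failure mode concrete, the sketch below mimics such a cascade in Python. The three stage functions are hypothetical placeholders, not any specific ASR, LLM, or TTS library; the point is that every interface carries plain text, so prosodic information cannot survive the pipeline.

```python
# Hypothetical ASR -> LLM -> TTS cascade. Every stage exchanges plain text,
# so prosody (pitch, emphasis, emotion) carried by the input audio is
# discarded at the ASR step and never reaches the synthesis step.

def transcribe(audio: bytes) -> str:
    """Placeholder ASR: returns a transcript, dropping all prosodic cues."""
    return "i can't believe we won"

def generate_reply(text: str) -> str:
    """Placeholder LLM: operates on text only."""
    return f"Echoing: {text}"

def synthesize(text: str) -> bytes:
    """Placeholder TTS: must re-invent prosody from bare text."""
    return text.encode("utf-8")

def cascade_respond(input_audio: bytes) -> bytes:
    return synthesize(generate_reply(transcribe(input_audio)))

if __name__ == "__main__":
    print(cascade_respond(b"<excited speech waveform>"))
```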

Meta AI recently released Meta Spirit LM, an open-source multimodal language model capable of freely mixing text and speech. It addresses the limitations of existing TTS systems by integrating text and speech at the word level, allowing the model to cross modalities more seamlessly. The model was trained on both speech and text datasets using a word-level interleaving method, effectively capturing the expressive characteristics of spoken language while maintaining the strong semantic capabilities of text-based models.

Meta Spirit LM comes in two versions: Spirit LM Base and Spirit LM Expressive. Spirit LM Base uses phonetic tokens to encode speech, allowing for efficient representation of words, while Spirit LM Expressive goes a step further by incorporating pitch and style tokens to capture details of tone, such as excitement or anger, and generate expressive speech that reflects these emotions. This makes Meta Spirit LM a powerful tool for integrating text and speech modalities to produce coherent and natural-sounding speech.
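
As a rough illustration of the difference between the two variants, the snippet below contrasts what the token streams for the same excited utterance might look like; the token names are placeholders based on the description above, not the model's actual vocabulary.

```python
# Illustrative contrast between the two variants' speech token streams.
# "huNN" stands in for phonetic (speech) tokens; "pi_*" and "st_*" stand in
# for the pitch and style tokens that only the Expressive variant adds.

base_stream = ["[SPEECH]", "hu12", "hu88", "hu3", "hu41"]            # phonetic tokens only
expressive_stream = ["[SPEECH]", "st_excited", "pi_high",            # style and pitch tokens...
                     "hu12", "hu88", "pi_high", "hu3", "hu41"]       # ...interleaved with phonetic ones

print("Base:      ", " ".join(base_stream))
print("Expressive:", " ".join(expressive_stream))
```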

Meta Spirit LM employs a unique word-level interleaving method to train on a mix of text and speech datasets. The model’s architecture is designed to freely transition between text and speech by encoding both modalities into a single set of tokens. Spirit LM Base utilizes phonetic tokens derived from speech representations, whereas Spirit LM Expressive incorporates pitch and style tokens that add layers of expressivity, such as tone or emotional nuances.
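
The word-level interleaving idea can be sketched as follows. This is a minimal illustration under assumed token names ([TEXT], [SPEECH], huNN) and an assumed random switching rule; it is not the released training code.

```python
import random

# Illustrative word-level interleaving: a word-aligned sentence is emitted as
# one token stream that switches between text spans and speech-token spans at
# word boundaries, so a single model sees both modalities in one sequence.

def interleave(words, speech_tokens, p_switch=0.3, seed=0):
    rng = random.Random(seed)
    modality = rng.choice(["text", "speech"])
    stream = ["[TEXT]" if modality == "text" else "[SPEECH]"]
    for word in words:
        if rng.random() < p_switch:                      # switch modality at a word boundary
            modality = "speech" if modality == "text" else "text"
            stream.append("[SPEECH]" if modality == "speech" else "[TEXT]")
        stream += [word] if modality == "text" else speech_tokens[word]
    return stream

# Toy aligned data: each word mapped to placeholder phonetic tokens.
words = ["the", "cat", "sat", "down"]
speech_tokens = {"the": ["hu41", "hu7"], "cat": ["hu88", "hu12"],
                 "sat": ["hu55", "hu9"], "down": ["hu3", "hu60", "hu14"]}
print(interleave(words, speech_tokens))
```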

This architecture enables Meta Spirit LM to generate more natural and contextually rich speech. The model is capable of few-shot learning for tasks across modalities, such as automatic speech recognition (ASR), text-to-speech (TTS), and speech classification. This versatility positions Meta Spirit LM as a significant improvement over traditional multimodal AI models that typically operate in isolated domains. By learning representations that span text and speech, the model can also be used for complex applications, including expressive storytelling, emotion-driven virtual assistants, and enhanced interactive dialogue systems.
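
As a sketch of how few-shot prompting could look in a single interleaved token space (here for an ASR-style task), the example below concatenates a few speech-to-text pairs followed by a query; the token names and the commented-out model call are hypothetical assumptions, not the released inference API.

```python
# Illustrative few-shot ASR prompt: a few (speech tokens -> text) example
# pairs are concatenated, followed by the query speech, and the model is
# expected to continue the sequence with the transcript.

examples = [
    (["hu12", "hu88", "hu3"], "hello there"),
    (["hu41", "hu7", "hu55"], "good morning"),
]
query_speech = ["hu9", "hu60", "hu14"]

prompt: list[str] = []
for speech, text in examples:
    prompt += ["[SPEECH]", *speech, "[TEXT]", text]
prompt += ["[SPEECH]", *query_speech, "[TEXT]"]   # the model should continue with text

print(" ".join(prompt))
# A real run would pass this sequence to the model, e.g.:
# transcript = model_generate(prompt)   # hypothetical call, not an actual API
```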

The importance of Meta Spirit LM lies in its ability to freely transition between speech and text, significantly enhancing the multimodal AI experience. The Expressive version of the model (Spirit LM Expressive) goes beyond standard speech models by allowing for the preservation of sentiment and tone across different modalities. Evaluation results on the Speech-Text Sentiment Preservation (STSP) benchmark indicate that Spirit LM Expressive effectively retains emotional intent, delivering more natural and emotive outputs than standard LLMs using ASR and TTS cascades.
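
The benchmark's exact protocol is not described here, but the general idea of measuring sentiment preservation can be sketched as follows: classify the sentiment of each prompt and of the model's continuation in the other modality, then report how often the two agree. The classifier and the (input, output) pairs below are toy stand-ins, not the actual STSP setup.

```python
import re

# Sketch of a sentiment-preservation measurement in the spirit of STSP.

def classify_sentiment(utterance: str) -> str:
    """Toy keyword classifier standing in for a real sentiment model."""
    positive, negative = {"great", "happy", "congratulations"}, {"terrible", "sad"}
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    if words & positive:
        return "positive"
    if words & negative:
        return "negative"
    return "neutral"

pairs = [  # (expressive speech prompt transcript, model's text continuation)
    ("we won the game, this is great", "What a happy day, congratulations!"),
    ("i lost my keys, terrible morning", "That sounds sad, I hope you find them."),
]

preserved = sum(classify_sentiment(src) == classify_sentiment(out) for src, out in pairs)
print(f"sentiment preserved on {preserved}/{len(pairs)} pairs")
```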

Another key aspect of Meta Spirit LM’s contribution is its few-shot learning capability across different modalities. The model has demonstrated the ability to handle cross-modal tasks, such as converting text to expressive speech, with competitive accuracy, showcasing its generalized understanding across modalities. This makes Meta Spirit LM a significant leap forward in the development of conversational agents, accessible communication tools for people with disabilities, and educational technologies that require natural, expressive dialogue. The open-source nature of the model also invites the broader research community to explore and improve upon its multimodal capabilities.

Meta Spirit LM represents a groundbreaking step towards integrating speech and text modalities in AI systems without sacrificing expressivity. Meta Spirit LM Base and Spirit LM Expressive demonstrate a powerful combination of semantic understanding and expressive speech generation by using an interleaving approach to train on speech and text datasets. Whether it’s generating emotive virtual assistants or improving conversational AI, Meta Spirit LM’s open-source approach opens the door for more innovative and expressive uses of multimodal AI technology. Meta AI’s contributions to this model are expected to inspire further research and development at the intersection of text and speech, ultimately leading to more natural and capable AI communication systems.


Check out the GitHub and Details. All credit for this research goes to the researchers of this project.


The post Meta AI Releases Meta Spirit LM: An Open Source Multimodal Language Model Mixing Text and Speech appeared first on MarkTechPost.


Related tags

Meta Spirit LM, multimodal language model, text-to-speech, speech recognition, open source