MarkTechPost@AI · March 20
Speech-to-Speech Foundation Models Pave the Way for Seamless Multilingual Interactions

At NVIDIA GTC25, Gnani.ai experts showcased breakthrough progress in voice AI, centered on the development and deployment of Speech-to-Speech foundation models. This approach aims to overcome the limitations of traditional cascaded voice AI architectures and usher in an era of seamless, multilingual, emotionally aware voice interaction. The model processes and generates audio directly, with no intermediate text representation; by training a large-scale audio encoder that captures nuances of emotion, empathy, and tonality, it significantly reduces latency, improves accuracy, and enables emotional awareness. The technology stands to transform industries such as customer service and global communication.

🗣️ Traditional voice agents use a three-stage cascaded architecture (STT, LLM, TTS) that suffers from high latency, error propagation, and loss of emotional information, degrading the user experience.

🚀 Gnani.ai's Speech-to-Speech foundation model processes and generates audio directly, with no intermediate text representation, cutting latency from about 2 seconds to 850-900 milliseconds and improving accuracy by fusing ASR with the LLM layer, especially for short and long utterances.

🌐 The model achieves emotional awareness by capturing and modeling tonality, stress, and speaking rate; contextual awareness improves interruption handling for more natural interactions, and the model copes well with low-bandwidth audio, which is critical on telephony networks.

🛠️ Development relied on the NVIDIA stack, including NVIDIA NeMo for training the encoder-decoder models, NeMo Curator for generating synthetic text data, and NVIDIA EVA for generating audio pairs.

🤝 Primary use cases include real-time language translation and customer support, demonstrating the model's ability to handle cross-lingual conversations, interruptions, and emotional nuance, and pointing to broad applications of voice AI in customer service and global communication.

At NVIDIA GTC25, Gnani.ai experts unveiled groundbreaking advancements in voice AI, focusing on the development and deployment of Speech-to-Speech Foundation Models. This innovative approach promises to overcome the limitations of traditional cascaded voice AI architectures, ushering in an era of seamless, multilingual, and emotionally aware voice interactions.

The Limitations of Cascaded Architectures

The current state-of-the-art architecture powering voice agents is a three-stage pipeline: Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). While effective, this cascaded architecture suffers from significant drawbacks, chiefly latency and error propagation. Because the pipeline chains multiple blocks, each block adds its own latency, and the cumulative delay across the stages can reach 2.5 to 3 seconds, leading to a poor user experience. Moreover, errors introduced in the STT stage propagate through the pipeline, compounding inaccuracies. The traditional architecture also discards critical paralinguistic features such as sentiment, emotion, and tone, resulting in monotonous and emotionally flat responses.
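To make the arithmetic concrete, here is a minimal Python sketch of how per-stage delays accumulate in a sequential pipeline. The per-stage numbers are illustrative assumptions chosen to be consistent with the 2.5-3 second total cited above, not measured figures from Gnani.ai:

```python
# Hypothetical latency budget for a cascaded voice pipeline.
# Stage values are illustrative assumptions, not measurements.
STAGE_LATENCY_MS = {
    "STT (speech-to-text)": 800,
    "LLM (response generation)": 1200,
    "TTS (text-to-speech)": 700,
}

def cascaded_latency_ms(stages: dict[str, int]) -> int:
    # Stages run one after another, so per-block latencies simply add up.
    return sum(stages.values())

total = cascaded_latency_ms(STAGE_LATENCY_MS)
for name, ms in STAGE_LATENCY_MS.items():
    print(f"{name:28s} {ms:5d} ms")
print(f"{'Total (user-perceived)':28s} {total:5d} ms")  # 2700 ms
```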

Introducing Speech-to-Speech Foundation Models

To address these limitations, Gnani.ai presents a novel Speech-to-Speech Foundation Model. This model directly processes and generates audio, eliminating the need for intermediate text representations. The key innovation lies in training a massive audio encoder with 1.5 million hours of labeled data across 14 languages, capturing nuances of emotion, empathy, and tonality. This model employs a nested XL encoder, retrained with comprehensive data, and an input audio projector layer to map audio features into textual embeddings. For real-time streaming, audio and text features are interleaved, while non-streaming use cases utilize an embedding merge layer. The LLM layer, initially based on Llama 8B, was expanded to include 14 languages, necessitating the rebuilding of tokenizers. An output projector model generates mel spectrograms, enabling the creation of hyper-personalized voices.
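As a rough illustration of the projector-and-interleaving idea described above, the PyTorch sketch below maps audio-encoder frames into an LLM embedding space and alternates fixed-size chunks of audio and text embeddings along the time axis for streaming. All dimensions, module names, and the chunking scheme are assumptions made for illustration; Gnani.ai's actual implementation is not public:

```python
import torch
import torch.nn as nn

AUDIO_DIM, LLM_DIM = 1024, 4096  # assumed feature sizes, for illustration only

class AudioInputProjector(nn.Module):
    """Projects audio-encoder frames into the textual-embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, audio_frames, AUDIO_DIM)
        return self.proj(audio_feats)

def interleave(audio_emb: torch.Tensor, text_emb: torch.Tensor,
               chunk: int = 4) -> torch.Tensor:
    """Alternate chunks of audio and text embeddings along the time axis,
    approximating the streaming interleaving described in the article."""
    pieces, a, t = [], 0, 0
    while a < audio_emb.size(1) or t < text_emb.size(1):
        pieces.append(audio_emb[:, a:a + chunk]); a += chunk
        pieces.append(text_emb[:, t:t + chunk]); t += chunk
    return torch.cat([p for p in pieces if p.size(1) > 0], dim=1)

audio = AudioInputProjector(AUDIO_DIM, LLM_DIM)(torch.randn(1, 50, AUDIO_DIM))
text = torch.randn(1, 20, LLM_DIM)
print(interleave(audio, text).shape)  # (1, 70, 4096)
```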

Key Benefits and Technical Hurdles

The Speech-to-Speech model offers several significant benefits. First, it sharply reduces latency, from roughly 2 seconds to approximately 850-900 milliseconds for the first token output. Second, it improves accuracy by fusing ASR with the LLM layer, especially for short and long utterances. Third, it achieves emotional awareness by capturing and modeling tonality, stress, and rate of speech. Fourth, contextual awareness enables better interruption handling, making interactions feel more natural. Finally, the model is designed to handle low-bandwidth audio effectively, which is crucial for telephony networks.

Building the model presented several challenges, most notably its massive data requirements. The team built a crowd-sourced system with 4 million users to collect emotionally rich conversational data, leveraged foundation models for synthetic data generation, and trained on 13.5 million hours of publicly available audio. The final model comprises 9 billion parameters: 636 million for the audio input, 8 billion for the LLM, and 300 million for the TTS system.
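The paralinguistic cues the model is said to capture (tonality, stress, rate of speech) can be approximated with classical signal features. The Python sketch below uses librosa to compute crude proxies for each; it is an illustration of the concepts, not Gnani.ai's feature pipeline:

```python
import librosa
import numpy as np

def paralinguistic_summary(wav_path: str) -> dict:
    """Crude proxies for tonality, stress, and speaking rate."""
    y, sr = librosa.load(wav_path, sr=16000)  # telephony-friendly rate
    f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)  # pitch track
    rms = librosa.feature.rms(y=y)[0]  # frame-level energy
    # Onset density as a rough proxy for rate of speech.
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    duration = len(y) / sr
    return {
        "mean_pitch_hz": float(np.nanmean(f0)),     # tonality
        "pitch_variability": float(np.nanstd(f0)),  # intonation range
        "energy_std": float(np.std(rms)),           # stress/emphasis proxy
        "events_per_sec": len(onsets) / duration,   # speaking-rate proxy
    }

# Example usage (path is a placeholder):
# print(paralinguistic_summary("call_snippet.wav"))
```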

NVIDIA’s Role in Development

The development of this model relied heavily on the NVIDIA stack. NVIDIA NeMo was used to train the encoder-decoder models, NeMo Curator facilitated synthetic text data generation, and NVIDIA EVA was employed to generate audio pairs, combining proprietary information with synthetic data.
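For readers new to the toolkit, here is a minimal example of loading and running a public NeMo speech model. The checkpoint name is a publicly available NVIDIA model chosen purely for illustration and has no connection to Gnani.ai's encoder:

```python
# Requires: pip install nemo_toolkit[asr]
import nemo.collections.asr as nemo_asr

# Load a public Conformer-CTC checkpoint from NVIDIA's model catalog.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_conformer_ctc_large"
)

# Transcribe a local audio file (path is a placeholder).
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```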

Use Cases

Gnani.ai showcased two primary use cases: real-time language translation and customer support. The real-time language translation demo featured an AI engine facilitating a conversation between an English-speaking agent and a French-speaking customer. The customer support demo highlighted the model’s ability to handle cross-lingual conversations, interruptions, and emotional nuances. 

Speech-to-Speech Foundation Model

The Speech-to-Speech Foundation Model represents a significant leap forward in voice AI. By overcoming the limitations of traditional cascaded architectures, it enables more natural, efficient, and emotionally aware voice interactions. As the technology evolves, it promises to transform industries from customer service to global communication.

