MarkTechPost@AI May 7, 07:15
LLMs Can Now Talk in Real-Time with Minimal Latency: Chinese Researchers Release LLaMA-Omni2, a Scalable Modular Speech Language Model

Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have released LLaMA-Omni2, a family of speech large language models available on Hugging Face. The work introduces a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates as an end-to-end pipeline while retaining modular interpretability and low training cost. The models range from 0.5B to 14B parameters and are all built on the Qwen2.5-Instruct series. Experiments show that LLaMA-Omni2 performs strongly on spoken question answering and speech instruction following, demonstrating that high-quality, low-latency speech interaction is achievable without pretraining on massive speech corpora.

🗣️ LLaMA-Omni2 is a family of speech large language models released by researchers at the Institute of Computing Technology, Chinese Academy of Sciences. The models are available on Hugging Face, range from 0.5B to 14B parameters, and are all built on the Qwen2.5-Instruct series.

🧩 LLaMA-Omni2 uses a modular architecture consisting of a speech encoder (Whisper-large-v3), a speech adapter, a core LLM (Qwen2.5), and a streaming TTS decoder (inspired by CosyVoice2). A gating mechanism fuses the LLM's hidden states with text embeddings, improving the contextual fidelity of the generated audio.

⏱️ LLaMA-Omni2 streams its output with a read-write policy: for every R text tokens the LLM produces, W speech tokens are generated, so text and audio are synthesized in sync. This strikes a good balance between latency, alignment, and perceptual quality; with R = 3 and W = 10, latency is about 583 ms.

📊 Experiments show that LLaMA-Omni2-14B outperforms all baselines across tasks, even though it is trained on far less data than native speech LLMs such as GLM-4-Voice, and its performance improves consistently with model size.

Researchers at the Institute of Computing Technology, Chinese Academy of Sciences, have introduced LLaMA-Omni2, a family of speech-capable large language models (SpeechLMs) now available on Hugging Face. This research introduces a modular framework that enables real-time spoken dialogue by integrating speech perception and synthesis with language understanding. Unlike earlier cascaded systems, LLaMA-Omni2 operates in an end-to-end pipeline while retaining modular interpretability and low training cost.

Overview of the LLaMA-Omni2 Architecture

LLaMA-Omni2 encompasses models ranging from 0.5B to 14B parameters, each built atop the Qwen2.5-Instruct series. The architecture consists of:

A speech encoder based on Whisper-large-v3 that converts input audio into acoustic representations.

A speech adapter that projects those representations into the LLM's embedding space.

A core LLM from the Qwen2.5-Instruct series that handles language understanding and response generation.

A streaming autoregressive TTS decoder, inspired by CosyVoice2, that turns the LLM's output into speech.

A gating mechanism fuses LLM hidden states with textual embeddings before speech synthesis, enhancing contextual fidelity in the generated audio.
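The article does not spell out the exact form of this gate; as a rough illustration, a common formulation (assumed here, not taken from the source) computes a sigmoid gate over the concatenation of the LLM hidden state and the text embedding and blends the two before passing the result to the TTS decoder. The module and tensor names below are hypothetical:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical gated fusion of LLM hidden states and text embeddings.

    fused = g * hidden + (1 - g) * text_emb, with g = sigmoid(W [hidden; text_emb]).
    Illustrative sketch only, not the exact module used in LLaMA-Omni2.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * dim, dim)

    def forward(self, hidden: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # hidden, text_emb: (batch, seq_len, dim)
        gate = torch.sigmoid(self.gate_proj(torch.cat([hidden, text_emb], dim=-1)))
        return gate * hidden + (1.0 - gate) * text_emb

if __name__ == "__main__":
    fusion = GatedFusion(dim=896)   # 896 = Qwen2.5-0.5B hidden size, chosen for illustration
    h = torch.randn(1, 12, 896)     # LLM hidden states
    e = torch.randn(1, 12, 896)     # embeddings of the generated text tokens
    print(fusion(h, e).shape)       # torch.Size([1, 12, 896]) -> input to the TTS decoder
```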

Streaming Generation with Read-Write Scheduling

The model adopts a read-write strategy to facilitate streaming output. Specifically, for every R tokens produced by the LLM, W speech tokens are generated. This enables synchronized textual and acoustic generation, minimizing latency without compromising fluency.

Empirical findings suggest that setting R = 3 and W = 10 provides a favorable trade-off between latency (~583 ms), alignment (ASR-WER: 3.26), and perceptual quality (UTMOS: 4.19).
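As a rough sketch of how such a read-write schedule can be wired up, the loop below interleaves text and speech generation in chunks of R and W. The callables `generate_text_tokens` and `generate_speech_tokens` are hypothetical stand-ins for the LLM and the streaming TTS decoder; only the scheduling logic reflects the policy described above.

```python
from typing import Callable, Iterator, List

def read_write_stream(
    generate_text_tokens: Callable[[int], List[int]],               # hypothetical: next R text tokens from the LLM
    generate_speech_tokens: Callable[[List[int], int], List[int]],  # hypothetical: W speech tokens for the text so far
    R: int = 3,
    W: int = 10,
    eos_id: int = -1,
) -> Iterator[List[int]]:
    """Interleave text and speech generation: after every R text tokens,
    emit W speech tokens so audio playback can start before the full
    response is written. Illustrative sketch of a read-write policy."""
    text_so_far: List[int] = []
    while True:
        chunk = generate_text_tokens(R)               # "read": advance the LLM by R text tokens
        if not chunk:
            break
        text_so_far.extend(chunk)
        yield generate_speech_tokens(text_so_far, W)  # "write": emit W speech tokens for playback
        if eos_id in chunk:
            break
```

With R = 3 and W = 10, the first audio chunk can be emitted after only three text tokens, which is how the pipeline keeps the reported ~583 ms response latency.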

Training Approach

Despite achieving competitive performance, LLaMA-Omni2 is trained on a relatively compact corpus—200K multi-turn speech-to-speech dialogue samples. These samples are synthesized from instruction-following text datasets (Alpaca, UltraChat), with diverse input voices and a consistent output voice generated using FishSpeech and CosyVoice2 models.
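The article does not describe the synthesis pipeline in detail, so the sketch below is only a hypothetical illustration of the recipe: user turns from a text dialogue are voiced with randomly varied input voices, while assistant turns always use one consistent output voice. The `tts_user` and `tts_assistant` callables stand in for TTS systems such as FishSpeech and CosyVoice2.

```python
import random

def synthesize_dialogue(dialogue, tts_user, tts_assistant, user_voices, assistant_voice):
    """Turn a multi-turn text dialogue into a speech-to-speech training sample.

    Hypothetical sketch: input turns get a randomly chosen voice, output turns
    always use the same voice, mirroring the data recipe described above.
    """
    voice = random.choice(user_voices)  # diverse input voices across samples
    sample = []
    for turn in dialogue:               # turn = {"role": "user" | "assistant", "text": ...}
        if turn["role"] == "user":
            audio = tts_user(turn["text"], voice=voice)
        else:
            audio = tts_assistant(turn["text"], voice=assistant_voice)  # consistent output voice
        sample.append({"role": turn["role"], "text": turn["text"], "audio": audio})
    return sample
```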

Training is executed in two stages.

Benchmark Results

The models are evaluated on spoken question answering and speech instruction following tasks using both speech-to-text (S2T) and speech-to-speech (S2S) modes.

| Model | Llama Q (S2S) | Web Q (S2S) | GPT-4o Score | ASR-WER | Latency (ms) |
| --- | --- | --- | --- | --- | --- |
| GLM-4-Voice (9B) | 50.7 | 15.9 | 4.09 | 3.48 | 1562.8 |
| LLaMA-Omni (8B) | 49.0 | 23.7 | 3.52 | 3.67 | 346.7 |
| LLaMA-Omni2-7B | 60.7 | 31.3 | 4.15 | 3.26 | 582.9 |

The performance scales consistently with model size. Notably, LLaMA-Omni2-14B outperforms all baselines across tasks, even with substantially less training data than native SpeechLMs such as GLM-4-Voice.

Component Analyses

The component analyses further show that multi-turn dialogue data is more effective than single-turn data for training speech interaction capabilities, and that performance plateaus at around 200K samples.

Conclusion

LLaMA-Omni2 demonstrates that high-quality, low-latency spoken interaction with LLMs is feasible without the need for extensive pretraining on massive speech corpora. By combining modular architecture with autoregressive streaming synthesis, the system offers a practical pathway for real-time speech applications.


Check out the Paper, Model on Hugging Face, and GitHub Page.

