MarkTechPost@AI · February 18
Ola: A State-of-the-Art Omni-Modal Understanding Model with Advanced Progressive Modality Alignment Strategy

Ola is an omni-modal model proposed by researchers from Tsinghua University, Tencent Hunyuan Research, and S-Lab at Nanyang Technological University, Singapore, designed to understand and generate multiple data modalities, including text, speech, images, video, and audio. The model uses a modular architecture in which each modality has a dedicated encoder that maps its data into a unified representation space, allowing a central large language model (LLM) to interpret and generate responses across different modalities. Ola performs strongly on image, video, and audio understanding benchmarks, outperforming existing omni-modal LLMs and approaching the performance of specialized audio models.

💡 Ola uses a modular architecture with a dedicated encoder for each modality (text, image, video, and audio) that maps its data into a unified representation space, enabling the large language model to understand and generate responses across modalities. For audio, a dual-encoder approach processes speech and music features separately.

🖼️ For visual inputs, Ola uses OryxViT to preserve the original aspect ratio and reduce distortion. The model also integrates a Local-Global Attention Pooling layer that compresses token length without losing key features, improving efficiency and maximizing computational performance.

🎤 Ola performs strongly on audio benchmarks, achieving a 1.9% word error rate (WER) on the clean subset of the LibriSpeech test set and an average score of 6.41 on AIR-Bench, surpassing previous omni-modal models and approaching the performance of specialized audio models.

📚 Ola’s training strategy pre-trains the MLP adapters at a high learning rate, then lowers the learning rate for text-image and video-audio training. Experiments show that this strategy yields gains in omni-modal learning, cross-modal video-audio alignment, and progressive modality learning.

Understanding different data types such as text, images, videos, and audio within a single model is a major challenge. Large language models that handle all of these together struggle to match the performance of models designed for just one type. Training such models is difficult because different data types follow different patterns, making it hard to balance accuracy across tasks. Many models also fail to properly align information from various inputs, slowing responses and requiring large amounts of data. These issues make it difficult to create a truly effective model that understands all data types equally well.

Currently, most models focus on specific tasks such as recognizing images, analyzing videos, or processing audio separately. Some models try to combine these tasks, but their performance remains much weaker than that of specialized models. Vision-language models are improving and now process videos, 3D content, and mixed inputs, but integrating audio properly remains a major issue. Large audio-text models attempt to connect speech with language models, but understanding of complex audio, such as music and acoustic events, remains underdeveloped. Newer omni-modal models try to handle multiple data types but struggle with poor performance, unbalanced learning, and inefficient data handling.

To solve this, researchers from Tsinghua University, Tencent Hunyuan Research, and S-Lab, NTU proposed Ola, an omni-modal model designed to understand and generate multiple data modalities, including text, speech, images, videos, and audio. The framework is built on a modular architecture in which each modality (text, images, videos, and audio) has a dedicated encoder responsible for processing its respective input. These encoders map their data into a unified representational space, allowing a central Large Language Model (LLM) to interpret and generate responses across different modalities. For audio, Ola employs a dual-encoder approach that processes speech and music features separately before integrating them into the shared representation. Vision inputs retain their original aspect ratios via OryxViT, ensuring minimal distortion during processing. To increase efficiency, the model incorporates a Local-Global Attention Pooling layer, which compresses token length without losing critical features, reducing computation without sacrificing performance. Lastly, speech synthesis is handled by an external text-to-speech decoder, supporting real-time streaming output.
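
To make the data flow concrete, here is a minimal PyTorch sketch of this modular design under stated assumptions: per-modality features are projected into a shared embedding space, a local-global attention pooling step compresses the vision tokens, and the concatenated sequence is what the central LLM would consume. The module names, feature dimensions, pooling stride, and residual connection are illustrative assumptions, not the released Ola implementation.

```python
# Illustrative sketch of a modular omni-modal front end with local-global
# attention pooling; all names, dimensions, and the stride are assumptions.
import torch
import torch.nn as nn


class LocalGlobalAttnPool(nn.Module):
    """Compress token length by attending from a downsampled (global)
    query sequence over the full (local) token sequence."""

    def __init__(self, dim: int, stride: int = 2, num_heads: int = 8):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> (batch, seq_len // stride, dim)
        queries = tokens[:, ::self.stride]       # coarse global queries
        pooled, _ = self.attn(queries, tokens, tokens)
        return pooled + queries                  # residual keeps local detail


class OmniModalFusion(nn.Module):
    """Project each modality into the LLM's embedding space and concatenate."""

    def __init__(self, llm_dim: int = 3584, vis_dim: int = 1024,
                 speech_dim: int = 1280, music_dim: int = 768):
        super().__init__()
        self.vision_proj = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                         nn.Linear(llm_dim, llm_dim))
        self.speech_proj = nn.Linear(speech_dim, llm_dim)   # speech branch
        self.music_proj = nn.Linear(music_dim, llm_dim)     # music branch
        self.pool = LocalGlobalAttnPool(llm_dim)

    def forward(self, vision_feats, speech_feats, music_feats):
        vis = self.pool(self.vision_proj(vision_feats))     # compressed vision tokens
        aud = torch.cat([self.speech_proj(speech_feats),    # dual audio encoders
                         self.music_proj(music_feats)], dim=1)
        return torch.cat([vis, aud], dim=1)                 # unified sequence for the LLM


# Example: fuse dummy features standing in for the three encoder outputs.
fused = OmniModalFusion()(torch.randn(1, 576, 1024),
                          torch.randn(1, 100, 1280),
                          torch.randn(1, 96, 768))
print(fused.shape)  # torch.Size([1, 484, 3584])
```

In the full system, the LLM backbone would attend over this fused sequence together with the text tokens, which is what lets a single model answer questions about any combination of the inputs.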

To evaluate the framework, the researchers ran comprehensive benchmarks across image, video, and audio understanding tasks. Ola builds upon Qwen-2.5-7B, integrating OryxViT as the vision encoder, Whisper-V3-Large as the speech encoder, and BEATs-AS2M(cpt2) as the music encoder. Training used a high learning rate of 1e-3 for MLP adapter pre-training, reduced to 2e-5 for text-image training and 1e-5 for video-audio training, with a batch size of 256 over 64 NVIDIA A800 GPUs. Extensive evaluations demonstrated Ola’s capabilities across multiple benchmarks, including MMBench-1.1, MMStar, VideoMME, and AIR-Bench, where it outperformed existing omni-modal LLMs. In audio benchmarks, Ola achieved a 1.9% WER on the test-clean subset of LibriSpeech and a 6.41 average score on AIR-Bench, surpassing previous omni-modal models and approaching the performance of specialized audio models. Further analysis highlighted Ola’s cross-modal learning benefits, showing that joint training with video-audio data improved speech recognition performance. Ablations of Ola’s training strategy showed performance gains from omni-modal learning, cross-modal video-audio alignment, and progressive modality learning.
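
The reported schedule can be summarized as a simple staged configuration. The sketch below is a minimal illustration in which only the learning rates, batch size, and stage ordering come from the article; the stage names and the choice of which parameters are trainable at each stage are assumptions.

```python
# Hedged sketch of the progressive modality-alignment schedule described above.
from dataclasses import dataclass


@dataclass
class TrainStage:
    name: str
    modalities: tuple
    learning_rate: float
    trainable: str          # assumed parameter grouping, for illustration only


PROGRESSIVE_SCHEDULE = [
    TrainStage("adapter_pretrain", ("text", "image"), 1e-3, "MLP adapters only"),
    TrainStage("text_image",       ("text", "image"), 2e-5, "adapters + LLM"),
    TrainStage("video_audio",      ("text", "image", "video", "audio"), 1e-5,
               "adapters + LLM + audio projections"),
]

GLOBAL_BATCH_SIZE = 256     # reported batch size, run over 64 NVIDIA A800 GPUs

for stage in PROGRESSIVE_SCHEDULE:
    print(f"{stage.name}: lr={stage.learning_rate}, "
          f"modalities={stage.modalities}, trainable={stage.trainable}")
```

The key idea of the progressive schedule is to anchor the model on text-image alignment first, then fold in video and audio at successively lower learning rates so that later modalities align to the already-established representation space.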

Ultimately, the proposed model successfully combines text, image, video, and audio information through a progressive modality alignment approach, with remarkable performance on various benchmarks. Its architectural innovations, effective training methods, and high-quality cross-modal data preparation overcome the weaknesses of earlier models and demonstrate the capabilities of omni-modal learning. Ola’s structure and training process can serve as a baseline for future studies, influencing the development of more general AI models. Future work can build on Ola’s foundation to improve omni-modal understanding and application by refining cross-modal alignment and expanding data diversity.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.



