MarkTechPost@AI October 11, 2024
Rhymes AI Released Aria: An Open Multimodal Native MoE Model Offering State-of-the-Art Performance Across Diverse Language, Vision, and Coding Tasks

Rhymes AI has introduced Aria, an open multimodal AI model that handles multiple input types, addresses several shortcomings of existing models, and performs strongly across many domains.

🎯 Aria is an open multimodal AI model that seamlessly processes text, images, video, and code without a separate setup for each input type, delivering state-of-the-art performance on multimodal tasks with understanding that rivals specialized models.

💪 Aria performs strongly across multimodal, language, and coding tasks, excelling in particular at video and document understanding, where it processes complex multimodal data efficiently and surpasses other models.

🚀 Aria uses a fine-grained mixture-of-experts architecture that activates only a subset of parameters per token, ensuring parameter efficiency and low computational cost and giving it an edge over models such as Pixtral-12B and Llama3.2-11B.

📜 Aria has a 64,000-token multimodal context window, enabling it to handle long, complex data sequences such as lengthy documents or extended subtitled videos, and it outperforms competing models at long-form text and video understanding.

🏆 Aria achieves top-tier results on multimodal, language, and coding benchmarks, competes with leading proprietary models, and is a strong choice for document understanding, chart reading, and visual question answering.

🌐 Aria is open source and developer friendly: it is released under the Apache 2.0 license with open model weights and an accessible codebase, making it easy for developers to fine-tune it on a variety of datasets.

The field of multimodal artificial intelligence (AI) revolves around creating models capable of processing and understanding diverse input types such as text, images, and videos. Integrating these modalities allows for a more holistic understanding of data, making it possible for the models to provide more accurate and contextually relevant information. With growing applications in areas like autonomous systems and advanced analytics, powerful multimodal models have become essential. Although proprietary models currently dominate the space, there is a pressing need for open models that offer competitive performance and accessibility for wider research and development.

A major issue in this domain is the lack of open-source models that combine high efficiency with strong performance across varied multimodal tasks. Most open-source models are limited in capability, excelling in one modality while underperforming in others. Proprietary models like GPT-4o or Gemini-1.5, on the other hand, have demonstrated success across diverse tasks but are closed to the public, hindering further innovation and application. This creates a significant gap in the AI research landscape, as researchers need open models that can adequately serve as benchmarks or tools for further advances in multimodal research.

The AI research community has explored various methods for building multimodal models, but most of these approaches have struggled with the complexity of integrating different data types. Existing open models are often designed to handle only a single kind of input at a time, such as text or images, making them difficult to adapt for tasks that require combined understanding. While proprietary models have shown how multimodal understanding can be achieved, they often rely on undisclosed training techniques and data resources, making them inaccessible for broader use. This limitation has left the research community looking for an open model that delivers strong performance on language and visual tasks without access barriers.

A team of researchers from Rhymes AI introduced Aria, an open multimodal AI model designed from scratch to handle various tasks, seamlessly integrating text, image, and video inputs. Aria uses a fine-grained mixture-of-experts (MoE) architecture, ensuring efficient use of computational resources and superior performance. The model activates 3.9 billion parameters per visual token and 3.5 billion per text token, making it a powerful tool for multimodal tasks. Aria has 24.9 billion parameters in total but activates only a fraction of them at a time, resulting in lower computation costs than fully dense models.
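To make the parameter-efficiency argument concrete, here is a minimal sketch of a fine-grained MoE feed-forward layer with top-k routing, written in PyTorch. The dimensions, expert count, and routing depth below are illustrative placeholders rather than Aria's published configuration; the point is only that each token touches a small subset of the layer's weights.

```python
# Minimal sketch of a fine-grained mixture-of-experts (MoE) feed-forward
# layer with top-k routing. All hyperparameters below are illustrative
# placeholders, not Aria's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=1024, d_hidden=2048, n_experts=64, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        scores = self.router(x)                         # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # route each token to its top_k experts
        weights = F.softmax(weights, dim=-1)            # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only top_k of n_experts expert blocks run for any given token, the activated parameter count per token is roughly top_k / n_experts of the layer's total, which is how a 24.9-billion-parameter model can activate only a few billion parameters per token.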

The technical backbone of Aria is its mixture-of-experts decoder, complemented by a specialized visual encoder. The visual encoder converts visual inputs such as images and video frames into visual tokens with the same feature dimensions as word embeddings, enabling the model to integrate them seamlessly. The model also employs a 64,000-token context window, allowing it to process long-form multimodal data efficiently. This extended context window sets Aria apart from other models, making it highly effective in tasks that require a deep understanding of long and complex sequences, such as video comprehension and document analysis.
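The toy example below illustrates that alignment under assumed shapes: a linear patch embedding maps flattened image patches into the same feature width as the word embeddings, so visual and text tokens can be concatenated into one sequence for the decoder. Every name and dimension here is hypothetical, chosen only to show the mechanism; this is not Aria's actual encoder.

```python
# Toy illustration of projecting image patches into the word-embedding
# space so visual and text tokens form one sequence. Shapes and sizes
# are assumptions for illustration only.
import torch
import torch.nn as nn

d_model = 1024                                    # shared embedding width (assumed)
patch_embed = nn.Linear(3 * 14 * 14, d_model)     # projects flattened 14x14 RGB patches
word_embed = nn.Embedding(32000, d_model)         # toy vocabulary size

image_patches = torch.randn(1, 256, 3 * 14 * 14)  # 256 patches from one image
text_ids = torch.randint(0, 32000, (1, 16))       # 16 text token ids

visual_tokens = patch_embed(image_patches)        # (1, 256, d_model)
text_tokens = word_embed(text_ids)                # (1, 16, d_model)

# Because both streams share d_model, they concatenate into a single
# sequence that a long-context decoder can attend over jointly.
sequence = torch.cat([visual_tokens, text_tokens], dim=1)
print(sequence.shape)                             # torch.Size([1, 272, 1024])
```

With a 64,000-token budget, sequences like this can interleave many images or video frames with long stretches of text before the window is exhausted.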

Key Features of Aria:

- Multimodal native: processes text, images, video, and code in a single model, with no separate setup per input type.
- Fine-grained mixture-of-experts architecture: 24.9 billion total parameters, with 3.9 billion activated per visual token and 3.5 billion per text token.
- Lightweight visual encoder that maps images and video frames into tokens sharing the word-embedding feature space.
- 64,000-token multimodal context window for long documents and extended subtitled videos.
- State-of-the-art results on multimodal, language, and coding benchmarks, competitive with top proprietary models.
- Open weights and codebase under the Apache 2.0 license, with a training framework for fine-tuning.

When evaluated against competing models, Aria achieved remarkable results across several benchmarks. It consistently outperformed open-source models like Pixtral-12B and Llama3.2-11B in multimodal understanding tasks. For instance, Aria scored 92.6% on the TextVQA validation set and 81.8% on the MATH benchmark, highlighting its superior capability in visual question-answering and mathematical reasoning. In addition, Aria demonstrated state-of-the-art performance in long-context video understanding, achieving over 90% accuracy on the VideoMME benchmark with subtitles, surpassing many proprietary models. The model’s efficient architecture also results in lower computational costs, making it a feasible option for real-world applications where both performance and cost-efficiency are crucial.

Aria is released under the Apache 2.0 license, making it accessible for academic and commercial use. The research team also provides a robust training framework for fine-tuning Aria on various data sources, allowing users to leverage the model for specific use cases. This open access to a high-performance multimodal model will catalyze further research and development, driving innovation in virtual assistants, automated content generation, and multimodal search engines.
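As a rough sketch of what getting started could look like, the snippet below loads open weights with Hugging Face transformers. The repo id "rhymes-ai/Aria" and the trust_remote_code requirement are assumptions based on the open release described above; consult the official model card for the exact identifiers and recommended settings.

```python
# Hypothetical quick-start for loading the open weights via Hugging Face
# transformers. The repo id and loading flags are assumptions; verify
# them against the official model card.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",       # let transformers pick an appropriate precision
    trust_remote_code=True,   # a custom MoE architecture ships its own modeling code
)
```

From there, the released training framework can be used to fine-tune the model on domain-specific datasets, as the Apache 2.0 license permits both academic and commercial adaptation.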

In conclusion, Aria fills a critical gap in the AI research community by offering a powerful open-source alternative to proprietary multimodal models. Its fine-grained mixture-of-experts architecture, lightweight visual encoder, and extended context window enable it to perform exceptionally well on complex tasks that require comprehensive understanding across multiple modalities. By achieving competitive performance on various benchmarks while keeping computation costs low, Aria serves as a versatile tool for a wide range of multimodal applications.


Check out the Paper, Model, and Details. All credit for this research goes to the researchers of this project.
