MarkTechPost@AI, September 3, 2024
LLaVaOLMoBitnet1B: The First Ternary Multimodal LLM Capable of Accepting Image(s) and Text Inputs to Produce Coherent Textual Response

This article surveys the development of ternary multimodal language models, covering progress in both multimodal capabilities and model compression. It introduces the first model able to accept image and text inputs and produce coherent textual responses, and discusses the model's remaining limitations and directions for future research.

🎯 Multimodal language models have evolved rapidly: Flamingo marked a key milestone, and LLaVa, as an open-source framework, innovated by using text-only GPT models to expand multimodal datasets, spawning numerous variants and applications.

💻 Intel researchers have built LLaVaOLMoBitNet1B, the first ternary multimodal language model able to process image and text inputs. It integrates three key components, is trained in two phases, and shows promising results on image-and-text reasoning tasks, though some inaccuracies remain.

🚀 Ternary models bring both challenges and opportunities to AI: effective post-training quantization methods are needed, issues such as response bias must be addressed, and ternary operations should be optimized in hardware, laying the groundwork for widespread adoption of efficient, high-performance AI.

Large Language Models (LLMs) have made remarkable strides in multimodal capabilities, with closed-source models like GPT-4, Claude, and Gemini leading the field. However, the challenge lies in democratizing AI by making these powerful models accessible to a broader audience. The main limitation is the substantial computational resources required to run state-of-the-art models effectively, which creates a significant barrier for developers and researchers without access to high-end hardware. The need for efficient models that can operate on smaller compute footprints has therefore become increasingly apparent, as it would enable wider adoption and application of AI technologies across domains and devices.

Multimodal Large Language Models (MM-LLMs) have rapidly evolved since the introduction of Flamingo, which marked a significant milestone in the field. LLaVa emerged as a prominent open-source framework, innovating by using text-only GPT models to expand multimodal datasets. Its architecture, featuring a pre-trained image encoder connected to a pre-trained LLM via an MLP, inspired numerous variants and applications across different domains. Small MM-LLMs like TinyLLaVa and LLaVa-Gemma were developed using this framework, addressing the need for more efficient models.

Concurrently, research into model compression led to major leaps like BitNet b1.58, which introduced ternary weight quantization. This method, which pre-trains models with low-precision weights, demonstrated significant latency improvements with minimal accuracy loss. NousResearch's OLMoBitNet1B further validated the approach by open-sourcing a ternary version of OLMo, although it remains undertrained compared to its peers. These advances in both multimodal capabilities and model compression set the stage for further innovations in efficient, high-performance AI models.
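To make the ternary idea concrete, here is a minimal PyTorch sketch of absmean weight quantization in the spirit of BitNet b1.58; the function name and the per-tensor scale are illustrative assumptions, not the released implementation.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantization in the spirit of BitNet b1.58:
    scale weights by their mean absolute value, then round to {-1, 0, +1}.
    Returns the ternary tensor and the scale needed to dequantize."""
    scale = w.abs().mean().clamp(min=eps)          # per-tensor absmean scale
    w_ternary = (w / scale).round().clamp(-1, 1)   # values in {-1, 0, +1}
    return w_ternary, scale

# Example: quantize a random weight matrix and inspect the result
w = torch.randn(2048, 1024)
w_q, s = ternary_quantize(w)
print(w_q.unique())                 # tensor([-1., 0., 1.])
print((w - w_q * s).abs().mean())   # coarse per-weight approximation error
```

With weights restricted to {-1, 0, +1} plus a single scale per tensor, matrix multiplications reduce largely to additions and subtractions, which is where the reported latency gains come from.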

Building upon NousResearch’s pioneering work, Intel researchers have developed the first Ternary Multimodal Large Language Model (TM-LLM) capable of processing both image and text inputs to generate coherent textual responses. This unique approach extends the capabilities of ternary models beyond text-only applications, opening new avenues for efficient multimodal AI. The team has open-sourced the model, including weights and training scripts, to facilitate further research and development in ternary models. By addressing the challenges associated with ternary quantization in multimodal contexts and highlighting potential opportunities, this work aims to pave the way for the mainstream adoption of highly efficient, compact AI models that can handle complex multimodal tasks with minimal computational resources.

The proposed model, LLaVaOLMoBitNet1B, integrates three key components: a CLIP ViT-L/14 vision encoder, an MLP connector, and a ternary LLM. The vision encoder processes input images by dividing them into non-overlapping 14×14-pixel patches and passing them through 24 transformer layers with a hidden dimension of 1024. This yields an output of shape (N, 1024) per image, where N is the number of patches. The MLP connector then re-projects these image features into the LLM's embedding space using two linear layers with a GELU activation, producing a tensor of shape (N, 2048).
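A minimal sketch of such a connector, assuming 1024-dimensional CLIP patch features and the 2048-dimensional LLM embedding space described above; the class name and the intermediate width are assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Two-layer MLP with GELU that re-projects vision-encoder patch features
    (hidden size 1024) into the ternary LLM's 2048-dim embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (N, 1024) -> (N, 2048), one row per image patch
        return self.proj(image_features)

# Example: 576 patches, assuming a 336x336 input with 14-pixel patches
patch_features = torch.randn(576, 1024)
print(MLPConnector()(patch_features).shape)   # torch.Size([576, 2048])
```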

The core LLM is the ternary OLMoBitNet1B, featuring 16 transformer decoder layers with BitLinear158 layers replacing standard linear layers. This 1.1 billion parameter model was trained on 60B tokens of the Dolma dataset. The input text is tokenized and embedded, then concatenated with the image-projected tensor, creating an (m+n, 2048) tensor for LLM processing. The model generates responses autoregressively based on this combined input context.
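The multimodal prompt assembly reduces to a concatenation along the sequence dimension, roughly as sketched below; the tensor sizes and the image-before-text ordering are assumptions for illustration.

```python
import torch

# Hypothetical shapes following the description above: n projected image-patch
# embeddings and m text-token embeddings, both 2048-dimensional.
image_embeds = torch.randn(576, 2048)   # output of the MLP connector
text_embeds = torch.randn(32, 2048)     # embedded prompt tokens

# The combined context fed to the ternary LLM is the concatenation of the two,
# giving an (m + n, 2048) sequence for autoregressive decoding.
llm_input = torch.cat([image_embeds, text_embeds], dim=0)
print(llm_input.shape)                  # torch.Size([608, 2048])
```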

The training approach for LLaVaOLMoBitNet1B follows a two-phase process similar to LLaVa1.5. The first phase, pre-training for feature alignment, utilizes a filtered subset of 595K Conceptual Captions. Only the projection layer weights are updated during this single-epoch training on an A100 cluster. The batch size is set to 32 per device, with gradients accumulated every 4 steps. A learning rate of 1e-3 is used with cosine decay and a 0.03 warmup ratio.
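The phase-1 recipe can be summarized as a small configuration sketch; the key names below mirror common Hugging Face-style training arguments and are illustrative assumptions, not the released training script.

```python
# Phase 1: feature alignment (sketch of the hyperparameters stated above)
pretrain_config = dict(
    dataset="filtered 595K Conceptual Captions subset",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    learning_rate=1e-3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    trainable_modules=["mlp_connector"],   # only the projection layer is updated
)
```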

The second phase, end-to-end instruction fine-tuning, employs the LLaVa-Instruct-150K dataset for one epoch. Both the projection layer and LLM weights are updated during this phase. The batch size is reduced to 8, with gradient accumulation every 2 steps, and the learning rate is lowered to 2e-5. Adam optimizer is used with momentum parameters of 0.9 and 0.98. DeepSpeed library facilitates multi-GPU training throughout both phases.
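Phase 2 differs mainly in which modules are trainable and in the optimizer settings; a comparable sketch follows, again with hypothetical key names and an assumed DeepSpeed config path.

```python
# Phase 2: end-to-end instruction fine-tuning (sketch, not the released script)
finetune_config = dict(
    dataset="LLaVa-Instruct-150K",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    adam_beta1=0.9,
    adam_beta2=0.98,
    trainable_modules=["mlp_connector", "llm"],  # projection layer and LLM both update
    deepspeed="ds_config.json",                  # multi-GPU training via DeepSpeed (assumed path)
)
```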

LLaVaOLMoBitNet1B demonstrates promising results in image-and-text inference tasks. Qualitative evaluations show that the model generates coherent and mostly accurate responses to image-based questions. However, some inaccuracies are observed, such as miscounting objects or misjudging relative positions. For instance, the model correctly identifies the stools and their color in one image but miscounts them; in another case, it provides an accurate description but errs in positioning details.

Quantitative comparisons show that the base LLM, OLMoBitNet1B, underperforms compared to peers due to its limited pre-training on only 60B tokens. This trend extends to LLaVaOLMoBitNet1B when compared to full-precision multimodal models. As the first ternary multimodal LLM, it remains one of the smallest models with the least pre-training exposure. While not currently the strongest performer, LLaVaOLMoBitNet1B establishes a valuable baseline for future development of more capable ternary multimodal models, balancing efficiency with performance.

Ternary models present unique challenges and opportunities in the AI landscape. While leading models are often closed-source or open-weight, the current ternarization approach requires training from scratch, restricting it to organizations with substantial compute resources. A critical research direction is therefore developing effective methods for post-training quantization of open-weight pre-trained models to ternary precision. Ternary models also face challenges similar to those of regular LLMs, including response biases, uncertainty, and hallucinations. On the hardware front, ternary operations need dedicated optimization to realize their full performance gains. Future research will focus on addressing these challenges and advancing ternary model capabilities, aiming to democratize efficient, high-performance AI technologies.


Check out the Paper. All credit for this research goes to the researchers of this project.

