Hello Paperspace, November 27, 2024
Text Labeling and Image Resolution with the Monkey Chat Vision Model and DigitalOcean+Paperspace GPUs

 

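Monkey is a vision-language model designed to address the difficulties that large multimodal models (LMMs) have with high-resolution images and scene understanding. It splits the input image into uniform patches and combines LoRA with a trainable visual resampler to process high-resolution images efficiently. Monkey also uses a multi-level description generation method that enriches scene-object associations, giving a more complete understanding of the visual data and more effective descriptive text generation. The model performs well on tasks such as image captioning and visual question answering, and on dense-text question answering in particular it outperforms models such as GPT-4V.

🤔 **Image processing with a sliding window:** Monkey uses a sliding window to split a high-resolution image into smaller patches, each matching the size used when the original vision encoder was trained (for example, 448×448 pixels). This lets it handle high-resolution images and focus on specific parts of an image.

💡 **LoRA integration:** LoRA (low-rank adaptation) is applied within each shared encoder to handle the varied visual elements in different parts of the image, capturing detail-sensitive features without significantly increasing the parameter count or computational load.

🖼️ **Advantage of trainable patches:** Monkey uses multiple trainable patches, which raise resolution more effectively than conventional interpolation, and it resizes the global image to preserve the overall structural information of the input.

💬 **Automatic multi-level description generation:** Monkey integrates several advanced systems (for example, BLIP2, PPOCR, GRIT, SAM, ChatGPT) to generate high-quality captions that capture a wide range of visual detail through layered, contextual understanding.

🚀 **Performance gains:** Monkey performs well on tasks such as image captioning and visual question answering, and on dense-text question answering in particular it outperforms models such as GPT-4V. It supports resolutions up to 1344×896, which helps it identify small or densely packed objects and text.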

Vision-language models are AI systems designed to understand and process visual and textual data together. They combine the capabilities of computer vision and natural language processing: they are trained to interpret images and generate descriptions of them, which enables applications such as image captioning, visual question answering, and text-to-image synthesis. Training on large datasets with powerful neural network architectures lets these models learn the complex relationships between images and text that such tasks depend on. This opens up new possibilities for human-computer interaction and for intelligent systems that communicate in a more human-like way.

Large Multimodal Models (LMMs) are powerful, but they struggle with high-resolution input and scene understanding. Monkey was recently introduced to address these challenges. Monkey, a vision-language model, divides each input image into uniform patches, with each patch matching the size used in its original vision encoder training (e.g., 448×448 pixels).

This design allows the model to handle high-resolution images. Monkey employs a two-part strategy: first, it enhances visual capture through higher resolution; second, it uses a multi-level description generation method to enrich scene-object associations, creating a more comprehensive understanding of the visual data. Capturing finer visual detail in this way improves learning from the data and makes descriptive text generation more effective.
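To make the patch division concrete, here is a minimal sketch of the tiling arithmetic. The function name `sliding_window_boxes` is hypothetical and this is not Monkey's actual preprocessing code; it assumes the image has already been resized so that both sides are multiples of the 448-pixel window, and it uses the 1344×896 maximum resolution mentioned in the summary above as the example.

```python
def sliding_window_boxes(width, height, window=448):
    """Return (left, top, right, bottom) boxes that tile a width x height image.

    Assumption: the image was resized beforehand so that both sides are
    exact multiples of `window`; otherwise a ValueError is raised.
    """
    if width % window or height % window:
        raise ValueError("resize the image to a multiple of the window size first")
    boxes = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            boxes.append((left, top, left + window, top + window))
    return boxes


boxes = sliding_window_boxes(1344, 896)
print(len(boxes))   # 6 patches: 3 across, 2 down
print(boxes[0])     # (0, 0, 448, 448)
print(boxes[-1])    # (896, 448, 1344, 896)
```

Each box could be passed to an image library's crop function to obtain one encoder-sized patch. A 1344×896 input yields six such patches.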


Monkey Architecture Overview

The Overall Monkey Architecture (Image Source)

Let's break down this approach step by step.

Image Processing with Sliding Window

LoRA Integration

Maintaining Structural Information

Processing with Visual Encoder and Resampler

Cross-Attention Module

Balancing Detail and Holistic Understanding

This approach improves the model's ability to understand complex images by combining local detail analysis with a global overview, leveraging advanced techniques like LoRA and cross-attention.
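As a rough illustration of why LoRA adds few parameters: low-rank adaptation leaves a weight matrix W frozen and learns an update of the form B·A, where B and A share a small inner dimension (the rank). The sketch below only counts parameters. The layer size and rank are illustrative assumptions, not Monkey's actual configuration, and the function names are hypothetical.

```python
def full_update_params(d_out, d_in):
    """Parameters needed to learn a full update for a d_out x d_in weight matrix."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Parameters in a LoRA update B @ A, with B: d_out x rank and A: rank x d_in."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed, not taken from the Monkey paper)
d_out, d_in, rank = 1024, 1024, 16
full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.2%}")   # 1048576 32768 3.12%
```

With these assumed sizes, the low-rank update needs only a few percent of the parameters of a full update, which is why per-patch adapters can be added without a large increase in model size.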

A Few Key Points

Overall, Monkey offers a sophisticated way to improve resolution and description generation in LMMs by using existing models more efficiently.

How can I do visual Q&A with Monkey?

To run the Monkey model and experiment with it, we first log in to Paperspace and start a notebook; alternatively, you can start up a terminal. We highly recommend using an A4000 GPU to run the model.

The NVIDIA A4000 GPU is a powerful graphics card that performs well in a range of AI and machine learning applications, including visual question answering (VQA). With its memory capacity and Ampere architecture, the A4000 offers high throughput and efficiency, making it well suited to the complex computations that VQA tasks require.

!nvidia-smi
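If you would rather check the GPU from Python than read the `nvidia-smi` table, a small wrapper along these lines may help. This is a sketch, not part of the Monkey repository: the helper names are hypothetical, and it assumes `nvidia-smi` is on the PATH and supports the `--query-gpu` and `--format=csv,noheader` options. It returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def parse_gpu_line(line):
    """Split one CSV line such as 'GPU NAME, 16384 MiB' into (name, memory_mib)."""
    name, memory = [field.strip() for field in line.rsplit(",", 1)]
    return name, int(memory.split()[0])


def list_gpus():
    """Return [(name, memory_mib), ...], or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
    except (subprocess.CalledProcessError, OSError, ValueError):
        # Treat any failure to run or parse nvidia-smi as "no GPU information"
        return []


print(list_gpus())
```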

Setup


Run the code cells below. They clone the repository and install the packages listed in requirements.txt.

git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey
pip install -r requirements.txt

We can run the Gradio demo, which is fast and easy to use:

python demo.py

Alternatively, follow along with the code below.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "echo840/Monkey-Chat"

# Load the model onto the GPU and switch it to evaluation mode
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map='cuda', trust_remote_code=True).eval()

# Load the matching tokenizer; pad on the left and use the end-of-document token for padding
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
tokenizer.padding_side = 'left'
tokenizer.pad_token_id = tokenizer.eod_id

The code above uses the Hugging Face Transformers library to load the pre-trained model and tokenizer.

"echo840/Monkey-Chat" is the name of the model checkpoint we load. The from_pretrained call loads the model weights and configuration, and device_map='cuda' places the model on a CUDA-enabled GPU for faster computation. Setting trust_remote_code=True allows the custom model code shipped with the checkpoint to run.

img_path = '/notebooks/quick_start_pytorch_images/image 2.png'
question = "provide a detailed caption for the image"
query = f'<img>{img_path}</img> {question} Answer: '

# Tokenize the query; the result holds the input IDs and the attention mask
inputs = tokenizer(query, return_tensors='pt', padding='longest')
attention_mask = inputs.attention_mask
input_ids = inputs.input_ids

pred = model.generate(
    input_ids=input_ids.cuda(),
    attention_mask=attention_mask.cuda(),
    do_sample=False,
    num_beams=1,
    max_new_tokens=512,
    min_new_tokens=1,
    length_penalty=1,
    num_return_sequences=1,
    output_hidden_states=True,
    use_cache=True,
    pad_token_id=tokenizer.eod_id,
    eos_token_id=tokenizer.eod_id,
)

# Drop the prompt tokens and decode only the newly generated ones
response = tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()
print(response)

This code generates a detailed caption, a description, or any other output requested by the prompt, using Monkey. We specify the path where the image is stored and build a query string that contains the image reference and the question asking for a caption. The query is then tokenized with the tokenizer, which converts the input text into token IDs.

Setting do_sample=False and num_beams=1 selects greedy decoding, so the output is deterministic. Other parameters, such as max_new_tokens, min_new_tokens, and length_penalty, control the length and nature of the generated sequence. After generation, the prompt tokens are sliced off and the remaining output tokens are decoded back into human-readable text, skipping any special tokens, to form the final response: a caption describing the image. Finally, we print the generated caption.
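To ask several questions about the same image, the query format used above can be wrapped in a small helper. The sketch below only reproduces the string format and the prompt-stripping slice from the walkthrough. The helper names `build_query` and `strip_prompt` are hypothetical and not part of the Monkey repository, and the image path is a placeholder.

```python
def build_query(img_path, question):
    """Format an image question the same way as the query string in the walkthrough."""
    return f"<img>{img_path}</img> {question} Answer: "


def strip_prompt(generated_ids, prompt_length):
    """Keep only the tokens produced after the prompt, as the decode step does."""
    return generated_ids[prompt_length:]


questions = [
    "provide a detailed caption for the image",
    "What text is visible in the image?",
]
for q in questions:
    print(build_query("example.png", q))

# Plain-list illustration of the slice applied to the generated token IDs
print(strip_prompt([11, 12, 13, 21, 22], 3))   # [21, 22]
```

Each query string can then be tokenized and passed to model.generate exactly as in the code above.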

Results

We first tried the model on a very simple image, a screenshot, and it does fairly well at recognizing what the image shows.

Monkey Output:- The image displays a computer interface with a blue background. The main focus is on a machine with the label "A4000" and the price "$0.76/hr". The machine is equipped with 45 GB RAM and 8 CPUs. There are also several buttons and icons on the interface, including a blue "STOP MACHINE" button and a white "Auto-shutdown in 11 hours" button. The text "45 GB RAM" and "8 CPUs" are also visible. The overall layout suggests a user-friendly interface for managing virtual machines.

Monkey Output:- In the image, a golden retriever is captured in mid-air, leaping towards a yellow tennis ball that is suspended in the air. The dog's body is stretched out, with its front paws reaching for the ball, and its tail is raised high in excitement. The background is a lush green lawn, providing a stark contrast to the dog's white fur. The image is taken from a low angle, making the dog appear larger and more imposing. The dog's focused expression and the dynamic pose suggest a moment of intense playfulness.

Prompt:- Provide a detailed caption for the image

Monkey Output:- This image is a painting of a city skyline at night. The sky is filled with swirling yellow and blue colors, creating a starry effect. In the foreground, there is a tall building with a clock on top. The painting also features a large tree with a yellow light shining on it. The overall effect is one of tranquility and beauty, reminiscent of the famous "Starry Night" painting by Vincent van Gogh.

We are quite impressed by the detailed descriptions and captions that provide even the minutest details of the image. The AI-generated caption is truly remarkable!

The image below highlights Monkey's capabilities across various VQA tasks. Monkey analyzes questions, identifies key image elements, perceives minute text, reasons about objects, and understands visual charts. The figure also demonstrates Monkey's captioning ability: it accurately describes objects and provides summaries.

Monkey's results on various tasks (Image Source)

Comparison Results

In a qualitative analysis, Monkey was compared with GPT-4V and other LMMs on the task of generating detailed captions.

Further experiments showed that, in many cases, Monkey performs impressively compared with GPT-4V at understanding complex text-based inquiries.

The VQA comparison results in the figure below show that, by scaling up the model size, Monkey achieves significant performance advantages in tasks involving dense text. It not only outperforms QwenVL-Chat [3], LLaVA-1.5 [29], and mPLUG-Owl2 [56] but also achieves promising results compared to GPT-4V [42]. This demonstrates the importance of scaling up model size for performance in large multimodal models and supports the effectiveness of the authors' method.

Monkey’s comparison with GPT-4V, QwenVL-Chat, LLaVA-1.5, and mPLUG-Owl2 on VQA task.

Practical Application

Conclusion

In this article, we discussed the Monkey chat vision model. The model achieved good results when we tried it on different images, both for generating captions and for understanding what an image contains. The research claims that the model outperforms various LMMs, including GPT-4V, and that its enhanced input resolution significantly improves performance on document images with dense text. Techniques such as sliding windows and cross-attention help it balance local and global views of the image. However, the method can process an input image as at most six patches because of the language model's input length constraints, which restricts further expansion of the input resolution.

Despite this limitation, the model shows significant promise in capturing fine details and providing insightful descriptions, particularly for document images with dense text.

We hope you enjoyed reading the article!

