AWS Machine Learning Blog, November 16, 2024
Fine-tune multimodal models for vision and text use cases on Amazon SageMaker JumpStart

 

This post describes how to fine-tune the Meta Llama 3.2 Vision models with Amazon SageMaker JumpStart to improve their performance on visual question answering tasks. The Meta Llama 3.2 Vision models already perform well on the DocVQA benchmark, but fine-tuning with SageMaker JumpStart further strengthens their ability to understand and answer natural language questions about images, with a notable increase in ANLS scores. The post covers the fine-tuning process, a model overview, dataset preparation, and how to use SageMaker JumpStart, and presents two fine-tuning approaches: the SageMaker Studio UI and the SageMaker Python SDK.

🤔 **Meta Llama 3.2 Vision models**: The Meta Llama 3.2 collection supports text in/text out and text+image in/text out. These are the first Llama models to support vision tasks, integrating image encoder representations into the language model for improved performance and efficiency.

🖼️ **DocVQA dataset**: DocVQA is a benchmark dataset for evaluating the visual question answering capabilities of multimodal AI models. It contains a variety of document images paired with natural language questions that require both visual and textual understanding to answer correctly.

🚀 **SageMaker JumpStart**: SageMaker JumpStart provides a hub of publicly available and proprietary foundation models, giving users quick access to a wide range of models and accelerating the development and deployment of machine learning applications.

💻 **Fine-tuning approaches**: The post describes two ways to fine-tune the Meta Llama 3.2 Vision models: a no-code approach through the SageMaker Studio UI and a code-based approach using the SageMaker Python SDK, so users can choose what fits their needs.

📈 **Performance gains**: After fine-tuning on the DocVQA dataset, the Meta Llama 3.2 Vision models achieve significantly higher ANLS scores, showing that the fine-tuning process effectively improves performance on visual question answering tasks.

In the rapidly evolving landscape of AI, generative models have emerged as a transformative technology, empowering users to explore new frontiers of creativity and problem-solving. These advanced AI systems have transcended their traditional text-based capabilities, seamlessly integrating multimodal functionalities that expand their reach into diverse applications. These models can now create striking images, generate engaging summaries, answer complex questions, and even produce code, all while maintaining a high level of accuracy and coherence. The integration of these multimodal capabilities has unlocked new possibilities for businesses and individuals, revolutionizing fields such as content creation, visual analytics, and software development.

In this post, we showcase how to fine-tune a text and vision model, such as Meta Llama 3.2, to better perform at visual question answering tasks. The Meta Llama 3.2 Vision Instruct models demonstrated impressive performance on the challenging DocVQA benchmark for visual question answering. The non-fine-tuned 11B and 90B models achieved strong ANLS (Average Normalized Levenshtein Similarity) scores of 88.4 and 90.1, respectively, on the DocVQA test set. ANLS is a metric used to evaluate the performance of models on visual question answering tasks, which measures the similarity between the model’s predicted answer and the ground truth answer. However, by using the power of Amazon SageMaker JumpStart, we demonstrate the process of adapting these generative AI models to excel at understanding and responding to natural language questions about images. By fine-tuning these models using SageMaker JumpStart, we were able to further enhance their abilities, boosting the ANLS scores to 91 and 92.4. This significant improvement showcases how the fine-tuning process can equip these powerful multimodal AI systems with specialized skills for excelling at understanding and answering natural language questions about complex, document-based visual information.
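As a quick illustration of the metric, the following is a minimal sketch of ANLS as defined for DocVQA: each prediction is scored by normalized Levenshtein similarity against the valid ground truth answers, per-answer scores are zeroed out when the normalized edit distance exceeds a 0.5 threshold, and the best score per question is averaged over the test set. The helper functions here are illustrative and not part of any SageMaker or DocVQA library.

```python
# Minimal ANLS sketch (illustrative; not part of the SageMaker SDK or the DocVQA toolkit).
# Per question: score = 1 - NL(pred, gt) if the normalized edit distance NL < 0.5, else 0;
# take the best score over the valid ground-truth answers, then average over all questions.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 cost if characters match)
            ))
        prev = curr
    return prev[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """predictions: list[str]; ground_truths: list[list[str]], one list of valid answers per question."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for gt in answers:
            p, g = pred.strip().lower(), gt.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / max(len(scores), 1)


print(anls(["1/8/93"], [["1/8/93"]]))  # 1.0
```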

For a detailed walkthrough on fine-tuning the Meta Llama 3.2 Vision models, refer to the accompanying notebook.

Overview of Meta Llama 3.2 11B and 90B Vision models

The Meta Llama 3.2 collection of multimodal and multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in a variety of sizes. The 11B and 90B models are multimodal—they support text in/text out, and text+image in/text out.

Meta Llama 3.2 11B and 90B are the first Llama models to support vision tasks, with a new model architecture that integrates image encoder representations into the language model. The new models are designed to be more efficient for AI workloads, with reduced latency and improved performance, making them suitable for a wide range of applications. All Meta Llama 3.2 models support a 128,000-token context length, maintaining the expanded token capacity introduced in Meta Llama 3.1. Additionally, the models offer improved multilingual support across eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

DocVQA dataset

The DocVQA (Document Visual Question Answering) dataset is a widely used benchmark for evaluating the performance of multimodal AI models on visual question answering tasks involving document-style images. This dataset consists of a diverse collection of document images paired with a series of natural language questions that require both visual and textual understanding to answer correctly. By fine-tuning a generative AI model like Meta Llama 3.2 on the DocVQA dataset using Amazon SageMaker, you can equip the model with the specialized skills needed to excel at answering questions about the content and structure of complex, document-based visual information.

For more information on the dataset used in this post, see DocVQA – Datasets.

Dataset preparation for visual question answering tasks

The Meta Llama 3.2 Vision models can be fine-tuned on image-text datasets for vision and language tasks such as visual question answering (VQA). The training data should be structured with the image, the question about the image, and the expected answer. This data format allows the fine-tuning process to adapt the model’s multimodal understanding and reasoning abilities to excel at answering natural language questions about visual content.

The input includes the following:

{"file_name": "images/img_0.jpg", "prompt": "what is the date mentioned in this letter?", "completion": "1/8/93"}{"file_name": "images/img_1.jpg", "prompt": "what is the contact person name mentioned in letter?", "completion": "P. Carter"}{"file_name": "images/img_2.jpg", "prompt": "Which part of Virginia is this letter sent from", "completion": "Richmond"}

SageMaker JumpStart

SageMaker JumpStart is a powerful feature within the SageMaker machine learning (ML) environment that provides ML practitioners a comprehensive hub of publicly available and proprietary foundation models (FMs). With this managed service, ML practitioners get access to a growing list of cutting-edge models from leading model hubs and providers that they can deploy to dedicated SageMaker instances within a network-isolated environment, and customize using SageMaker for model training and deployment.

Solution overview

In the following sections, we discuss the steps to fine-tune Meta Llama 3.2 Vision models. We cover two approaches: using the Amazon SageMaker Studio UI for a no-code solution, and using the SageMaker Python SDK.

Prerequisites

To try out this solution using SageMaker JumpStart, you need the following prerequisites:

No-code fine-tuning through the SageMaker Studio UI

SageMaker JumpStart provides access to publicly available and proprietary FMs from third-party and proprietary providers. Data scientists and developers can quickly prototype and experiment with various ML use cases, accelerating the development and deployment of ML applications. It helps reduce the time and effort required to build ML models from scratch, allowing teams to focus on fine-tuning and customizing the models for their specific use cases. These models are released under different licenses designated by their respective sources. It’s essential to review and adhere to the applicable license terms before downloading or using these models to make sure they’re suitable for your intended use case.

You can access the Meta Llama 3.2 FMs through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we cover how to discover these models in SageMaker Studio.

SageMaker Studio is an IDE that offers a web-based visual interface for performing the ML development steps, from data preparation to model building, training, and deployment. For instructions on getting started and setting up SageMaker Studio, refer to Amazon SageMaker Studio.

When you’re in SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

In the JumpStart view, you’re presented with the list of public models offered by SageMaker. You can explore other models from other providers in this view. To start using the Meta Llama 3.2 models, under Providers, choose Meta.

You’re presented with a list of the models available. Choose one of the Vision Instruct models, for example the Meta Llama 3.2 90B Vision Instruct model.

Here you can view the model details, as well as train, deploy, optimize, and evaluate the model. For this demonstration, we choose Train.

On this page, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning. In addition, you can configure deployment configuration, hyperparameters, and security settings for fine-tuning. Choose Submit to start the training job on a SageMaker ML instance.

Deploy the model

After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart. The option to deploy the fine-tuned model will appear when fine-tuning is finished, as shown in the following screenshot.

You can also deploy the model from this view. You can configure endpoint settings such as the instance type, number of instances, and endpoint name. You will need to accept the End User License Agreement (EULA) before you can deploy the model.

Fine-tune using the SageMaker Python SDK

You can also fine-tune Meta Llama 3.2 Vision Instruct models using the SageMaker Python SDK. A sample notebook with the full instructions can be found on GitHub. The following code example demonstrates how to fine-tune the Meta Llama 3.2 11B Vision Instruct model:

import os
import boto3
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id, model_version = "meta-vlm-llama-3-2-11b-vision-instruct", "*"

from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
my_hyperparameters["epoch"] = "1"

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},  # set to "true" to accept the EULA
    disable_output_compression=True,
    instance_type="ml.p5.48xlarge",
    hyperparameters=my_hyperparameters,
)

# train_data_location is the S3 URI of the prepared training dataset
estimator.fit({"training": train_data_location})

The code sets up a SageMaker JumpStart estimator for fine-tuning the Meta Llama 3.2 Vision Instruct model on a custom training dataset. It configures the estimator with the desired model ID, accepts the EULA, sets the number of training epochs as a hyperparameter, and initiates the fine-tuning process.

When the fine-tuning process is complete, you can review the evaluation metrics for the model. These metrics will provide insights into the performance of the fine-tuned model on the validation dataset, allowing you to assess how well the model has adapted. We discuss these metrics more in the following sections.
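For example, if the training job emits metric definitions, a minimal sketch for pulling the recorded metrics into a DataFrame is:

```python
# Sketch: retrieve the metrics (such as training and validation loss) recorded by the
# completed fine-tuning job as a pandas DataFrame. Assumes the job defines metric
# definitions and has finished.
metrics_df = estimator.training_job_analytics.dataframe()
print(metrics_df.head())
```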

You can then deploy the fine-tuned model directly from the estimator, as shown in the following code:

# attached_estimator is the estimator attached to the completed fine-tuning job (see the accompanying notebook)
estimator = attached_estimator
finetuned_predictor = estimator.deploy()

As part of the deploy settings, you can define the instance type you want to deploy the model on. For the full list of deployment parameters, refer to the deploy parameters in the SageMaker SDK documentation.
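For example, instead of the default deploy() call shown earlier, you could pass deployment settings explicitly. The instance type, count, and endpoint name below are illustrative choices, not values prescribed by this post:

```python
# Illustrative deployment settings; adjust for your model size and traffic.
finetuned_predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",             # assumption: a GPU instance with enough memory
    endpoint_name="llama-3-2-vision-finetuned",  # hypothetical endpoint name
)
```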

After the endpoint is up and running, you can perform an inference request against it using the predictor object as follows:

# prompt, ground-truth answer, and base64-encoded image for one validation example;
# get_image_decode_64base and formulate_payload are helper functions defined in the accompanying notebook
q, a, image = (
    each["prompt"],
    each["completion"],
    get_image_decode_64base(image_path=f"./docvqa/validation/{each['file_name']}"),
)
payload = formulate_payload(q=q, image=image, instruct=is_chat_template)
ft_response = finetuned_predictor.predict(JumpStartSerializablePayload(payload))

For the full list of predictor parameters, refer to the predictor object in the SageMaker SDK documentation.

Fine-tuning quantitative metrics

SageMaker JumpStart automatically outputs various training and validation metrics, such as loss, during the fine-tuning process to help evaluate the model’s performance.

The DocVQA dataset is a widely used benchmark for evaluating the performance of multimodal AI models on visual question answering tasks involving document-style images. As shown in the following table, the non-fine-tuned Meta Llama 3.2 11B and 90B models achieved ANLS scores of 88.4 and 90.1 respectively on the DocVQA test set, as reported in the post Llama 3.2: Revolutionizing edge AI and vision with open, customizable models on the Meta AI website. After fine-tuning the 11B and 90B Vision Instruct models using SageMaker JumpStart, the fine-tuned models achieved improved ANLS scores of 91 and 92.4, demonstrating that the fine-tuning process significantly enhanced the models’ ability to understand and answer natural language questions about complex document-based visual information.

| DocVQA test set (5,138 examples, metric: ANLS) | 11B-Instruct | 90B-Instruct |
|---|---|---|
| Non-fine-tuned | 88.4 | 90.1 |
| SageMaker JumpStart fine-tuned | 91 | 92.4 |

For the fine-tuning results shown in the table, the models were trained using the DeepSpeed framework on a single ml.p5.48xlarge instance with multi-GPU distributed training. The fine-tuning process used Low-Rank Adaptation (LoRA) on all linear layers, with a LoRA alpha of 8, LoRA dropout of 0.05, and a LoRA rank of 16. The 90B Instruct model was trained for 6 epochs, while the 11B Instruct model was trained for 4 epochs. Both models used a learning rate of 5e-5 with a linear learning rate schedule. Importantly, the Instruct models were fine-tuned using the built-in chat template format, where the loss was computed on the last turn of the conversation (the assistant’s response).

For base model fine-tuning, you can choose between the chat completion format and the text completion format, controlled by the chat_template hyperparameter. For text completion, the input is simply a concatenation of the image token, the prompt, and the completion, where the prompt and the completion are connected by the response key ### Response:\n\n, and the loss is computed on the completion part only.
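As an illustration, these settings could be applied by overriding the defaults retrieved from SageMaker JumpStart. The LoRA, learning rate, and chat template key names below are assumptions based on common JumpStart hyperparameter naming; print the retrieved dictionary to confirm the exact keys and accepted values for this model.

```python
from sagemaker import hyperparameters

model_id, model_version = "meta-vlm-llama-3-2-11b-vision-instruct", "*"

# Start from the model's default hyperparameters, then override selected values.
my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
print(my_hyperparameters)  # inspect the exact key names and defaults supported by this model

# Assumed key names; values mirror the configuration described above (11B Instruct run).
my_hyperparameters["epoch"] = "4"              # 4 epochs for 11B, 6 for 90B
my_hyperparameters["learning_rate"] = "5e-5"   # with a linear learning rate schedule
my_hyperparameters["lora_r"] = "16"            # LoRA rank
my_hyperparameters["lora_alpha"] = "8"
my_hyperparameters["lora_dropout"] = "0.05"
my_hyperparameters["chat_template"] = "True"   # chat completion vs. text completion format
```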

Fine-tuning qualitative results

In addition to the quantitative evaluation metrics, you can observe qualitative differences in the model’s outputs after the fine-tuning process.

For the non-Instruct models, the fine-tuning was performed using a specific prompt template that doesn’t use the chat format. The prompt template was structured as follows:

prompt = f"![]({image})<|image|><|begin_of_text|>Read the text in the image carefully and answer the question with the text as seen exactly in the image. For yes/no questions, just respond Yes or No. If the answer is numeric, just respond with the number and nothing else. If the answer has multiple words, just respond with the words and absolutely nothing else. Never respond in a sentence or a phrase.\n Question: {q}### Response:\n\n"

This prompt template required the model to generate a direct, concise response based on the visual information in the image, without producing additional context or commentary. The results of fine-tuning an 11B Vision non-Instruct base model using this prompt template are shown in the following qualitative examples, demonstrating how the fine-tuning process improved the model’s ability to accurately extract and reproduce the relevant information from the document images.

| Image | Input prompt | Pre-trained response | Fine-tuned response | Ground truth |
|---|---|---|---|---|
| (document image omitted) | What is the name of the company? | ### Response: ### Response: ### Response: ### Response: … (repeated) | ITC Limited | itc limited |
| (document image omitted) | Where is the company located? | 1) Opening Stock : a) Cigarette Filter Rods Current Year Previous year b) Poly Propelene | CHENNAI | chennai |
| (document image omitted) | What the location address of NSDA? | Source: https://www.industrydocuments.ucsf.edu/docs/qqvf0227. <OCR/> The best thing between | 1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036 | 1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036 |
| (document image omitted) | What is the ‘no. of persons present’ for the sustainability committee meeting held on 5th April, 2012? | 1 2 3 4 5 6 7 8 9 10 11 12 13 | 6 | 6 |

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so that billing stops:

# Delete resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

Conclusion

In this post, we discussed fine-tuning Meta Llama 3.2 Vision Instruct models using SageMaker JumpStart. We showed that you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. We also discussed the fine-tuning technique, instance types, and supported hyperparameters. Finally, we showcased both the quantitative metrics and qualitative results of fine-tuning the Meta Llama 3.2 Vision model on the DocVQA dataset, highlighting the model’s improved performance on visual question answering tasks involving complex document-style images.

As a next step, you can try fine-tuning these models on your own dataset using the code provided in the notebook to test and benchmark the results for your use cases.


About the Authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.


Appendix

Language models such as Meta Llama are more than 10 GB or even 100 GB in size. Fine-tuning such large models requires instances with significantly higher CUDA memory. Furthermore, training these models can be very slow due to their size. Therefore, for efficient fine-tuning, we use the following optimizations:

The default hyperparameters are as follows:
