AWS Machine Learning Blog, November 16, 2024
Fine-tune multimodal models for vision and text use cases on Amazon SageMaker JumpStart

 

This post describes how to fine-tune the Meta Llama 3.2 Vision models with Amazon SageMaker JumpStart to improve their performance on visual question answering tasks. The Meta Llama 3.2 Vision models already perform well on the DocVQA benchmark, but fine-tuning with SageMaker JumpStart further strengthens their ability to understand and answer natural language questions about images, with a notable increase in ANLS scores. The post covers the fine-tuning process, a model overview, dataset preparation, and how to use SageMaker JumpStart, and presents two fine-tuning approaches: the SageMaker Studio UI and the SageMaker Python SDK.

🤔 **Meta Llama 3.2 Vision models**: The Meta Llama 3.2 collection supports text in/text out and text+image in/text out. These are the first Llama models to support vision tasks, integrating image encoder representations into the language model for improved performance and efficiency.

🖼️ **DocVQA dataset**: DocVQA is a benchmark dataset for evaluating the visual question answering capabilities of multimodal AI models. It contains a variety of document images paired with natural language questions that require both visual and textual understanding to answer correctly.

🚀 **SageMaker JumpStart**: SageMaker JumpStart provides a hub of publicly available and proprietary foundation models, giving users quick access to a wide range of models and accelerating the development and deployment of machine learning applications.

💻 **Fine-tuning approaches**: The post describes two ways to fine-tune the Meta Llama 3.2 Vision models: a no-code approach through the SageMaker Studio UI and a code-based approach using the SageMaker Python SDK, so users can choose what fits their needs.

📈 **Performance gains**: After fine-tuning on the DocVQA dataset, the Meta Llama 3.2 Vision models achieve significantly higher ANLS scores, showing that the fine-tuning process effectively improves performance on visual question answering tasks.

In the rapidly evolving landscape of AI, generative models have emerged as a transformative technology, empowering users to explore new frontiers of creativity and problem-solving. These advanced AI systems have transcended their traditional text-based capabilities, seamlessly integrating multimodal functionalities that expand their reach into diverse applications. These models can now create striking images, generate engaging summaries, answer complex questions, and even produce code, all while maintaining a high level of accuracy and coherence. The integration of these multimodal capabilities has unlocked new possibilities for businesses and individuals, revolutionizing fields such as content creation, visual analytics, and software development.

In this post, we showcase how to fine-tune a text and vision model, such as Meta Llama 3.2, to better perform at visual question answering tasks. The Meta Llama 3.2 Vision Instruct models demonstrated impressive performance on the challenging DocVQA benchmark for visual question answering. The non-fine-tuned 11B and 90B models achieved strong ANLS (Average Normalized Levenshtein Similarity) scores of 88.4 and 90.1, respectively, on the DocVQA test set. ANLS is a metric used to evaluate the performance of models on visual question answering tasks, which measures the similarity between the model’s predicted answer and the ground truth answer. However, by using the power of Amazon SageMaker JumpStart, we demonstrate the process of adapting these generative AI models to excel at understanding and responding to natural language questions about images. By fine-tuning these models using SageMaker JumpStart, we were able to further enhance their abilities, boosting the ANLS scores to 91 and 92.4. This significant improvement showcases how the fine-tuning process can equip these powerful multimodal AI systems with specialized skills for excelling at understanding and answering natural language questions about complex, document-based visual information.
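As a quick illustration of the metric, the following is a minimal sketch of ANLS as defined for DocVQA: each prediction is scored by normalized Levenshtein similarity against the valid ground truth answers, per-answer scores are zeroed out when the normalized edit distance exceeds a 0.5 threshold, and the best score per question is averaged over the test set. The helper functions here are illustrative and not part of any SageMaker or DocVQA library.

```python
# Minimal ANLS sketch (illustrative; not part of the SageMaker SDK or the DocVQA toolkit).
# Per question: score = 1 - NL(pred, gt) if the normalized edit distance NL < 0.5, else 0;
# take the best score over the valid ground-truth answers, then average over all questions.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 cost if characters match)
            ))
        prev = curr
    return prev[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """predictions: list[str]; ground_truths: list[list[str]], one list of valid answers per question."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for gt in answers:
            p, g = pred.strip().lower(), gt.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / max(len(scores), 1)


print(anls(["1/8/93"], [["1/8/93"]]))  # 1.0
```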

For a detailed walkthrough on fine-tuning the Meta Llama 3.2 Vision models, refer to the accompanying notebook.

Overview of Meta Llama 3.2 11B and 90B Vision models

The Meta Llama 3.2 collection of multimodal and multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in a variety of sizes. The 11B and 90B models are multimodal—they support text in/text out, and text+image in/text out.

Meta Llama 3.2 11B and 90B are the first Llama models to support vision tasks, with a new model architecture that integrates image encoder representations into the language model. The new models are designed to be more efficient for AI workloads, with reduced latency and improved performance, making them suitable for a wide range of applications. All Meta Llama 3.2 models support a 128,000-token context length, maintaining the expanded token capacity introduced in Meta Llama 3.1. Additionally, the models offer improved multilingual support across eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

DocVQA dataset

The DocVQA (Document Visual Question Answering) dataset is a widely used benchmark for evaluating the performance of multimodal AI models on visual question answering tasks involving document-style images. This dataset consists of a diverse collection of document images paired with a series of natural language questions that require both visual and textual understanding to answer correctly. By fine-tuning a generative AI model like Meta Llama 3.2 on the DocVQA dataset using Amazon SageMaker, you can equip the model with the specialized skills needed to excel at answering questions about the content and structure of complex, document-based visual information.

For more information on the dataset used in this post, see DocVQA – Datasets.

Dataset preparation for visual question answering tasks

The Meta Llama 3.2 Vision models can be fine-tuned on image-text datasets for vision and language tasks such as visual question answering (VQA). The training data should be structured with the image, the question about the image, and the expected answer. This data format allows the fine-tuning process to adapt the model’s multimodal understanding and reasoning abilities to excel at answering natural language questions about visual content.

The input includes the following:

{"file_name": "images/img_0.jpg", "prompt": "what is the date mentioned in this letter?", "completion": "1/8/93"}{"file_name": "images/img_1.jpg", "prompt": "what is the contact person name mentioned in letter?", "completion": "P. Carter"}{"file_name": "images/img_2.jpg", "prompt": "Which part of Virginia is this letter sent from", "completion": "Richmond"}

SageMaker JumpStart

SageMaker JumpStart is a powerful feature within the SageMaker machine learning (ML) environment that provides ML practitioners a comprehensive hub of publicly available and proprietary foundation models (FMs). With this managed service, ML practitioners get access to a growing list of cutting-edge models from leading model hubs and providers that they can deploy to dedicated SageMaker instances within a network-isolated environment, and customize using SageMaker for model training and deployment.

Solution overview

In the following sections, we discuss the steps to fine-tune Meta Llama 3.2 Vision models. We cover two approaches: using the Amazon SageMaker Studio UI for a no-code solution, and using the SageMaker Python SDK.

Prerequisites

To try out this solution using SageMaker JumpStart, you need the following prerequisites:

No-code fine-tuning through the SageMaker Studio UI

SageMaker JumpStart provides access to publicly available and proprietary FMs from third-party and proprietary providers. Data scientists and developers can quickly prototype and experiment with various ML use cases, accelerating the development and deployment of ML applications. It helps reduce the time and effort required to build ML models from scratch, allowing teams to focus on fine-tuning and customizing the models for their specific use cases. These models are released under different licenses designated by their respective sources. It’s essential to review and adhere to the applicable license terms before downloading or using these models to make sure they’re suitable for your intended use case.

You can access the Meta Llama 3.2 FMs through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we cover how to discover these models in SageMaker Studio.

SageMaker Studio is an IDE that offers a web-based visual interface for performing the ML development steps, from data preparation to model building, training, and deployment. For instructions on getting started and setting up SageMaker Studio, refer to Amazon SageMaker Studio.

When you’re in SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane.

In the JumpStart view, you’re presented with the list of public models offered by SageMaker. You can explore other models from other providers in this view. To start using the Meta Llama 3.2 models, under Providers, choose Meta.

You’re presented with a list of the models available. Choose one of the Vision Instruct models, for example the Meta Llama 3.2 90B Vision Instruct model.

Here you can view the model details, as well as train, deploy, optimize, and evaluate the model. For this demonstration, we choose Train.

On this page, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning. In addition, you can configure deployment configuration, hyperparameters, and security settings for fine-tuning. Choose Submit to start the training job on a SageMaker ML instance.

Deploy the model

After the model is fine-tuned, you can deploy it using the model page on SageMaker JumpStart. The option to deploy the fine-tuned model will appear when fine-tuning is finished, as shown in the following screenshot.

You can also deploy the model from this view. You can configure endpoint settings such as the instance type, number of instances, and endpoint name. You will need to accept the End User License Agreement (EULA) before you can deploy the model.

Fine-tune using the SageMaker Python SDK

You can also fine-tune Meta Llama 3.2 Vision Instruct models using the SageMaker Python SDK. A sample notebook with the full instructions can be found on GitHub. The following code example demonstrates how to fine-tune the Meta Llama 3.2 11B Vision Instruct model:

import os
import boto3
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id, model_version = "meta-vlm-llama-3-2-11b-vision-instruct", "*"

from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
my_hyperparameters["epoch"] = "1"

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},  # set to "true" to accept the EULA
    disable_output_compression=True,
    instance_type="ml.p5.48xlarge",
    hyperparameters=my_hyperparameters,
)

# train_data_location is the S3 URI of the prepared training dataset
estimator.fit({"training": train_data_location})

The code sets up a SageMaker JumpStart estimator for fine-tuning the Meta Llama 3.2 Vision Instruct model on a custom training dataset. It configures the estimator with the desired model ID, accepts the EULA, sets the number of training epochs as a hyperparameter, and initiates the fine-tuning process.

When the fine-tuning process is complete, you can review the evaluation metrics for the model. These metrics will provide insights into the performance of the fine-tuned model on the validation dataset, allowing you to assess how well the model has adapted. We discuss these metrics more in the following sections.
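For example, if the training job emits metric definitions, a minimal sketch for pulling the recorded metrics into a DataFrame is:

```python
# Sketch: retrieve the metrics (such as training and validation loss) recorded by the
# completed fine-tuning job as a pandas DataFrame. Assumes the job defines metric
# definitions and has finished.
metrics_df = estimator.training_job_analytics.dataframe()
print(metrics_df.head())
```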

You can then deploy the fine-tuned model directly from the estimator, as shown in the following code:

# attached_estimator is the estimator attached to the completed fine-tuning job (see the accompanying notebook)
estimator = attached_estimator
finetuned_predictor = estimator.deploy()

As part of the deploy settings, you can define the instance type you want to deploy the model on. For the full list of deployment parameters, refer to the deploy parameters in the SageMaker SDK documentation.
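For example, instead of the default deploy() call shown earlier, you could pass deployment settings explicitly. The instance type, count, and endpoint name below are illustrative choices, not values prescribed by this post:

```python
# Illustrative deployment settings; adjust for your model size and traffic.
finetuned_predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",             # assumption: a GPU instance with enough memory
    endpoint_name="llama-3-2-vision-finetuned",  # hypothetical endpoint name
)
```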

After the endpoint is up and running, you can perform an inference request against it using the predictor object as follows:

# prompt, ground-truth answer, and base64-encoded image for one validation example;
# get_image_decode_64base and formulate_payload are helper functions defined in the accompanying notebook
q, a, image = (
    each["prompt"],
    each["completion"],
    get_image_decode_64base(image_path=f"./docvqa/validation/{each['file_name']}"),
)
payload = formulate_payload(q=q, image=image, instruct=is_chat_template)
ft_response = finetuned_predictor.predict(JumpStartSerializablePayload(payload))

For the full list of predictor parameters, refer to the predictor object in the SageMaker SDK documentation.

Fine-tuning quantitative metrics

SageMaker JumpStart automatically outputs various training and validation metrics, such as loss, during the fine-tuning process to help evaluate the model’s performance.

The DocVQA dataset is a widely used benchmark for evaluating the performance of multimodal AI models on visual question answering tasks involving document-style images. As shown in the following table, the non-fine-tuned Meta Llama 3.2 11B and 90B models achieved ANLS scores of 88.4 and 90.1 respectively on the DocVQA test set, as reported in the post Llama 3.2: Revolutionizing edge AI and vision with open, customizable models on the Meta AI website. After fine-tuning the 11B and 90B Vision Instruct models using SageMaker JumpStart, the fine-tuned models achieved improved ANLS scores of 91 and 92.4, demonstrating that the fine-tuning process significantly enhanced the models’ ability to understand and answer natural language questions about complex document-based visual information.

| DocVQA test set (5,138 examples, metric: ANLS) | 11B-Instruct | 90B-Instruct |
|---|---|---|
| Non-fine-tuned | 88.4 | 90.1 |
| SageMaker JumpStart fine-tuned | 91 | 92.4 |

For the fine-tuning results shown in the table, the models were trained using the DeepSpeed framework on a single ml.p5.48xlarge instance with multi-GPU distributed training. The fine-tuning process used Low-Rank Adaptation (LoRA) on all linear layers, with a LoRA alpha of 8, LoRA dropout of 0.05, and a LoRA rank of 16. The 90B Instruct model was trained for 6 epochs, while the 11B Instruct model was trained for 4 epochs. Both models used a learning rate of 5e-5 with a linear learning rate schedule. Importantly, the Instruct models were fine-tuned using the built-in chat template format, where the loss was computed on the last turn of the conversation (the assistant’s response).

For base model fine-tuning, you can choose between the chat completion format and the text completion format, controlled by the chat_template hyperparameter. For text completion, the input is simply a concatenation of the image token, the prompt, and the completion, where the prompt and the completion are connected by the response key ### Response:\n\n, and the loss is computed on the completion part only.
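As an illustration, these settings could be applied by overriding the defaults retrieved from SageMaker JumpStart. The LoRA, learning rate, and chat template key names below are assumptions based on common JumpStart hyperparameter naming; print the retrieved dictionary to confirm the exact keys and accepted values for this model.

```python
from sagemaker import hyperparameters

model_id, model_version = "meta-vlm-llama-3-2-11b-vision-instruct", "*"

# Start from the model's default hyperparameters, then override selected values.
my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
print(my_hyperparameters)  # inspect the exact key names and defaults supported by this model

# Assumed key names; values mirror the configuration described above (11B Instruct run).
my_hyperparameters["epoch"] = "4"              # 4 epochs for 11B, 6 for 90B
my_hyperparameters["learning_rate"] = "5e-5"   # with a linear learning rate schedule
my_hyperparameters["lora_r"] = "16"            # LoRA rank
my_hyperparameters["lora_alpha"] = "8"
my_hyperparameters["lora_dropout"] = "0.05"
my_hyperparameters["chat_template"] = "True"   # chat completion vs. text completion format
```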

Fine-tuning qualitative results

In addition to the quantitative evaluation metrics, you can observe qualitative differences in the model’s outputs after the fine-tuning process.

For the non-Instruct models, the fine-tuning was performed using a specific prompt template that doesn’t use the chat format. The prompt template was structured as follows:

prompt = f"![]({image})<|image|><|begin_of_text|>Read the text in the image carefully and answer the question with the text as seen exactly in the image. For yes/no questions, just respond Yes or No. If the answer is numeric, just respond with the number and nothing else. If the answer has multiple words, just respond with the words and absolutely nothing else. Never respond in a sentence or a phrase.\n Question: {q}### Response:\n\n"

This prompt template required the model to generate a direct, concise response based on the visual information in the image, without producing additional context or commentary. The results of fine-tuning an 11B Vision non-Instruct base model using this prompt template are shown in the following qualitative examples, demonstrating how the fine-tuning process improved the model’s ability to accurately extract and reproduce the relevant information from the document images.

| Image | Input prompt | Pre-trained response | Fine-tuned response | Ground truth |
|---|---|---|---|---|
| (document image omitted) | What is the name of the company? | ### Response: ### Response: ### Response: ### Response: … (repeated) | ITC Limited | itc limited |
| (document image omitted) | Where is the company located? | 1) Opening Stock : a) Cigarette Filter Rods Current Year Previous year b) Poly Propelene | CHENNAI | chennai |
| (document image omitted) | What the location address of NSDA? | Source: https://www.industrydocuments.ucsf.edu/docs/qqvf0227. <OCR/> The best thing between | 1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036 | 1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036 |
| (document image omitted) | What is the ‘no. of persons present’ for the sustainability committee meeting held on 5th April, 2012? | 1 2 3 4 5 6 7 8 9 10 11 12 13 | 6 | 6 |

Clean up

After you’re done running the notebook, make sure to delete all the resources that you created in the process so that billing stops:

# Delete resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

Conclusion

In this post, we discussed fine-tuning Meta Llama 3.2 Vision Instruct models using SageMaker JumpStart. We showed that you can use the SageMaker JumpStart console in SageMaker Studio or the SageMaker Python SDK to fine-tune and deploy these models. We also discussed the fine-tuning technique, instance types, and supported hyperparameters. Finally, we showcased both the quantitative metrics and qualitative results of fine-tuning the Meta Llama 3.2 Vision model on the DocVQA dataset, highlighting the model’s improved performance on visual question answering tasks involving complex document-style images.

As a next step, you can try fine-tuning these models on your own dataset using the code provided in the notebook to test and benchmark the results for your use cases.


About the Authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.


Appendix

Language models such as Meta Llama are more than 10 GB or even 100 GB in size. Fine-tuning such large models requires instances with significantly higher CUDA memory. Furthermore, training these models can be very slow due to their size. Therefore, for efficient fine-tuning, we use the following optimizations:

The default hyperparameters are as follows:
