AWS Machine Learning Blog
Best practices for Meta Llama 3.2 multimodal fine-tuning on Amazon Bedrock

This post presents best practices and scientific insights for fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock. Fine-tuning can improve accuracy on specific visual tasks by up to 74%. The recommendations are based on extensive experiments with public benchmark datasets for tasks such as visual question answering, image captioning, and chart understanding, and cover data preparation, model selection, and training strategy, with the goal of helping users obtain performance that rivals or even surpasses much larger models at lower cost, while also reducing inference cost and latency.

🖼️ Multimodal fine-tuning enables a model to understand visual information and generate appropriate textual responses, and it performs especially well in scenarios such as visual question answering, chart understanding, image captioning, and document understanding. It is particularly effective at extracting structured information from document images, for example form field extraction, table data retrieval, and identifying key elements in invoices, receipts, or technical diagrams.

📊 Data preparation is key to successful fine-tuning. Use a single image per example rather than multiple images; this helps the model form clearer associations between a specific visual input and the corresponding textual output. High-quality annotations are also essential: create accurate, detailed annotations that reflect the desired output format.

🚀 Even a small amount of data can deliver significant performance gains. Experiments show that a small dataset of roughly 100 high-quality examples can noticeably improve model performance. Start with a small number of high-quality examples and scale up as resources allow to improve performance further.

Multimodal fine-tuning represents a powerful approach for customizing foundation models (FMs) to excel at specific tasks that involve both visual and textual information. Although base multimodal models offer impressive general capabilities, they often fall short when faced with specialized visual tasks, domain-specific content, or particular output formatting requirements. Fine-tuning addresses these limitations by adapting models to your specific data and use cases, dramatically improving performance on tasks that matter to your business. Our experiments show that fine-tuned Meta Llama 3.2 models can achieve up to 74% improvements in accuracy scores compared to their base versions with prompt optimization on specialized visual understanding tasks. Amazon Bedrock now offers fine-tuning capabilities for Meta Llama 3.2 multimodal models, so you can adapt these sophisticated models to your unique use case.

In this post, we share comprehensive best practices and scientific insights for fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock. Our recommendations are based on extensive experiments using public benchmark datasets across various vision-language tasks, including visual question answering, image captioning, and chart interpretation and understanding. By following these guidelines, you can fine-tune smaller, more cost-effective models to achieve performance that rivals or even surpasses much larger models—potentially reducing both inference costs and latency, while maintaining high accuracy for your specific use case.

Recommended use cases for fine-tuning

Meta Llama 3.2 multimodal fine-tuning excels in scenarios where the model needs to understand visual information and generate appropriate textual responses. Based on our experimental findings, use cases such as visual question answering, chart interpretation and understanding, image captioning, document understanding, and extracting structured information from document images (for example, form field extraction, table data retrieval, and identifying key elements in invoices, receipts, or technical diagrams) demonstrate substantial performance improvements through fine-tuning.

One notable advantage of multimodal fine-tuning is its effectiveness with mixed datasets that contain both text-only examples and image-and-text examples. This versatility allows organizations to improve performance across a range of input types with a single fine-tuned model.

Prerequisites

To use this feature, make sure that you have satisfied the following requirements:

To create a model customization job using Amazon Bedrock, you need to create an AWS Identity and Access Management (IAM) role with permissions that allow the customization job to read your training data from Amazon S3 and write output data back to Amazon S3 (for more details, see Create a service role for model customization).

The following code is the trust relationship, which allows Amazon Bedrock to assume the IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "<account-id>"
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:<region>:<account-id>:model-customization-job/*"
                }
            }
        }
    ]
}
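If you prefer to set up this role programmatically, the following is a minimal boto3 sketch that creates the service role with the trust relationship shown above and attaches an inline S3 policy. The role name, bucket names, and the exact scope of the permissions policy are illustrative assumptions, not the canonical policy from the Amazon Bedrock documentation.

import json

import boto3

iam = boto3.client("iam")

ACCOUNT_ID = "111122223333"  # placeholder AWS account ID
REGION = "us-west-2"         # placeholder AWS Region

# Trust relationship from above, with the placeholders filled in.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "bedrock.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {
            "StringEquals": {"aws:SourceAccount": ACCOUNT_ID},
            "ArnEquals": {
                "aws:SourceArn": f"arn:aws:bedrock:{REGION}:{ACCOUNT_ID}:model-customization-job/*"
            },
        },
    }],
}

# Assumed inline policy: read access to the training data bucket and write
# access to the output bucket (bucket names are placeholders).
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-training-bucket",
                "arn:aws:s3:::my-training-bucket/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::my-output-bucket/*"],
        },
    ],
}

role = iam.create_role(
    RoleName="BedrockModelCustomizationRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="BedrockModelCustomizationRole",
    PolicyName="BedrockCustomizationS3Access",
    PolicyDocument=json.dumps(s3_policy),
)
print(role["Role"]["Arn"])  # use this ARN as the roleArn for the customization job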

Key multimodal datasets and experiment setup

To develop our best practices, we conducted extensive experiments using three representative multimodal datasets: LLaVA, ChartQA, and Cut-VQAv2, which cover visual instruction following, chart question answering, and general visual question answering, respectively.

Our experimental approach involved systematic testing with different sample sizes (ranging from 100 to 10,000 samples) from each dataset to understand how performance scales with data quantity. We fine-tuned both Meta Llama 3.2 11B and Meta Llama 3.2 90B models, using Amazon Bedrock Model Customization, to compare the impact of model size on performance gains. The models were evaluated using the SQuAD F1 score metric, which measures the word-level overlap between generated responses and reference answers.
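For reference, the SQuAD F1 metric mentioned above measures word-level overlap between a generated answer and a reference answer. The following is a minimal sketch of that computation; the normalization steps mirror the common SQuAD evaluation script, and the exact tokenization details may differ slightly from the evaluation harness we used.

import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, and split into word tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def squad_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a generated answer and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial overlap yields a score between 0 and 1.
print(squad_f1("the revenue increased by 12 percent", "revenue increased 12 percent"))  # ~0.89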

Best practices for data preparation

The quality and structure of your training data fundamentally determine the success of fine-tuning. Our experiments revealed several critical insights for preparing effective multimodal datasets: use a single image per example rather than multiple images, which helps the model form clearer associations between a specific visual input and the corresponding textual output; create accurate, detailed annotations that reflect the output format you want the model to produce; and start with a small set of high-quality examples (even roughly 100 can yield meaningful gains), then scale up as resources allow. A sketch of what a training record could look like is shown below.
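To make the formatting guidance concrete, here is a rough sketch of training records, written as Python that emits a JSONL file (one JSON object per line). The field names and the S3 image reference are illustrative assumptions, not the authoritative Amazon Bedrock schema, so check the Amazon Bedrock documentation for the exact format expected by the fine-tuning job.

import json

# Hypothetical single-image record: one image, one question, one target answer.
# Field names are assumptions for illustration only.
image_record = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"image": {"source": {"s3Uri": "s3://my-training-bucket/charts/q1_revenue.png"}}},
                {"text": "What was the revenue in Q1, according to the chart?"},
            ],
        },
        {"role": "assistant", "content": [{"text": "Q1 revenue was $4.2M."}]},
    ]
}

# A mixed dataset can also include text-only records alongside image records.
text_record = {
    "messages": [
        {"role": "user", "content": [{"text": "Summarize the key trend in one sentence."}]},
        {"role": "assistant", "content": [{"text": "Revenue grew steadily each quarter."}]},
    ]
}

# Training data is supplied as a JSONL file: one record per line.
with open("train.jsonl", "w") as f:
    for record in (image_record, text_record):
        f.write(json.dumps(record) + "\n")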

Configuring fine-tuning parameters

When fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock, you can configure key training parameters, such as the number of epochs, the batch size, and the learning rate multiplier, to optimize performance for your specific use case, as illustrated in the sketch that follows.
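As a rough illustration of how these parameters are passed, the boto3 sketch below submits a fine-tuning job. The base model identifier, hyperparameter key names, bucket paths, and role ARN are placeholders or assumptions; consult the Amazon Bedrock console or API reference for the exact values supported for Meta Llama 3.2 multimodal models.

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")  # Region is a placeholder

response = bedrock.create_model_customization_job(
    jobName="llama32-11b-chartqa-ft",
    customModelName="llama32-11b-chartqa",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelCustomizationRole",
    baseModelIdentifier="meta.llama3-2-11b-instruct-v1:0",  # assumed model ID
    trainingDataConfig={"s3Uri": "s3://my-training-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-output-bucket/ft-output/"},
    # Hyperparameter key names are assumptions based on typical Bedrock customization jobs.
    hyperParameters={
        "epochCount": "3",
        "batchSize": "1",
        "learningRateMultiplier": "1.0",
    },
)
print(response["jobArn"])  # track progress with get_model_customization_job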

Model size selection and performance comparison

Choosing between Meta Llama 3.2 11B and Meta Llama 3.2 90B for fine-tuning presents an important decision that balances performance against cost and latency considerations. Our experiments reveal that fine-tuning dramatically enhances performance regardless of model size. Looking at ChartQA as an example, the 11B base model improved from 64.1 with prompt optimization to 69.5 F1 score with fine-tuning, an 8.4% increase, whereas the 90B model improved from 64.0 to 71.9 F1 score (12.3% increase). For Cut-VQAv2, the 11B model improved from 42.17 to 73.2 F1 score (74% increase) and the 90B model improved from 67.4 to 76.5 (13.5% increase). These substantial gains highlight the transformative impact of multimodal fine-tuning even before considering model size differences.
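The percentage gains quoted above are relative improvements over the base-model score. A quick sketch of that calculation, using the ChartQA and Cut-VQAv2 numbers from this section:

def relative_gain(base: float, fine_tuned: float) -> float:
    """Relative improvement of the fine-tuned score over the base score, in percent."""
    return (fine_tuned - base) / base * 100

print(round(relative_gain(64.1, 69.5), 1))   # ChartQA, 11B: 8.4
print(round(relative_gain(64.0, 71.9), 1))   # ChartQA, 90B: 12.3
print(round(relative_gain(42.17, 73.2), 1))  # Cut-VQAv2, 11B: 73.6 (~74)
print(round(relative_gain(67.4, 76.5), 1))   # Cut-VQAv2, 90B: 13.5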

The following visualization demonstrates how these fine-tuned models perform across different datasets and training data volumes.

The visualization demonstrates that the 90B model (orange bars) consistently outperforms the 11B model (blue bars) across all three datasets and training sizes. This advantage is most pronounced in complex visual reasoning tasks such as ChartQA, where the 90B model achieves a 71.9 F1 score compared to 69.5 for the 11B model at 10,000 samples. Both models show improved performance as training data increases, with the most dramatic gains observed in the LLaVA dataset, where the 11B model improves from 76.2 to 82.4 F1 score and the 90B model improves from 76.6 to 83.1 F1 score when scaling from 100 to 10,000 samples.

An interesting efficiency pattern emerges when comparing across sample sizes: in several cases, the 90B model with fewer training samples outperforms the 11B model with significantly more data. For instance, in the Cut-VQAv2 dataset, the 90B model trained on just 100 samples (72.9 F1 score) exceeds the performance of the 11B model trained on 1,000 samples (68.6 F1 score).

For optimal results, we recommend selecting the 90B model for applications demanding maximum accuracy, particularly with complex visual reasoning tasks or limited training data. The 11B model remains an excellent choice for balanced applications where resource efficiency is important, because it still delivers substantial improvements over base models while requiring fewer computational resources.

Conclusion

Fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock offers organizations a powerful way to create customized AI solutions that understand both visual and textual information. Our experiments demonstrate that following best practices—using high-quality data with consistent formatting, selecting appropriate parameters, and validating results—can yield dramatic performance improvements across various vision-language tasks. Even with modest datasets, fine-tuned models can achieve remarkable enhancements over base models, making this technology accessible to organizations of all sizes.

Ready to start fine-tuning your own multimodal models? Explore our comprehensive code samples and implementation examples in our GitHub repository. Happy fine-tuning!


About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Sovik Kumar Nath is an AI/ML and Generative AI senior solutions architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double master's degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor's degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Karel Mundnich is a Sr. Applied Scientist in AWS Agentic AI. He previously worked in AWS Lex and AWS Bedrock, focusing on speech recognition, speech LLMs, and LLM fine-tuning. He holds a PhD in Electrical Engineering from the University of Southern California. In his free time, he enjoys skiing, hiking, and cycling.

Marcelo Aberle is a Sr. Research Engineer at AWS Bedrock. In recent years, he has been working at the intersection of science and engineering to enable new AWS service launches. This includes various LLM projects across Titan, Bedrock, and other AWS organizations. Outside of work, he keeps himself busy staying up-to-date on the latest GenAI startups in his adopted home city of San Francisco, California.

Jiayu Li is an Applied Scientist at AWS Bedrock, where he contributes to the development and scaling of generative AI applications using foundation models. He holds a Ph.D. and a Master’s degree in computer science from Syracuse University. Outside of work, Jiayu enjoys reading and cooking.

Fang Liu is a principal machine learning engineer at Amazon Web Services, where he has extensive experience in building AI/ML products using cutting-edge technologies. He has worked on notable projects such as Amazon Transcribe and Amazon Bedrock. Fang Liu holds a master’s degree in computer science from Tsinghua University.

Jennifer Zhu is a Senior Applied Scientist at AWS Bedrock, where she helps build and scale generative AI applications with foundation models. Jennifer holds a PhD from Cornell University and a master's degree from the University of San Francisco. Outside of work, she enjoys reading books and watching tennis games.
