AWS Machine Learning Blog
Best practices for Meta Llama 3.2 multimodal fine-tuning on Amazon Bedrock

This post presents best practices and scientific insights for fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock. Fine-tuning can improve accuracy on specific visual tasks by up to 74%. The recommendations are based on extensive experiments with public benchmark datasets for tasks such as visual question answering, image captioning, and chart understanding, and cover data preparation, model selection, and training strategy, with the goal of helping users obtain performance that rivals or even surpasses much larger models at lower cost, while also reducing inference cost and latency.

🖼️ Multimodal fine-tuning enables a model to understand visual information and generate appropriate textual responses, and it performs especially well in scenarios such as visual question answering, chart understanding, image captioning, and document understanding. It is particularly effective at extracting structured information from document images, for example form field extraction, table data retrieval, and identifying key elements in invoices, receipts, or technical diagrams.

📊 Data preparation is key to successful fine-tuning. Use a single image per example rather than multiple images; this helps the model form clearer associations between a specific visual input and the corresponding textual output. High-quality annotations are also essential: create accurate, detailed annotations that reflect the desired output format.

🚀 Even a small amount of data can deliver significant performance gains. Experiments show that a small dataset of roughly 100 high-quality examples can noticeably improve model performance. Start with a small number of high-quality examples and scale up as resources allow to improve performance further.

Multimodal fine-tuning represents a powerful approach for customizing foundation models (FMs) to excel at specific tasks that involve both visual and textual information. Although base multimodal models offer impressive general capabilities, they often fall short when faced with specialized visual tasks, domain-specific content, or particular output formatting requirements. Fine-tuning addresses these limitations by adapting models to your specific data and use cases, dramatically improving performance on tasks that matter to your business. Our experiments show that fine-tuned Meta Llama 3.2 models can achieve up to 74% improvements in accuracy scores compared to their base versions with prompt optimization on specialized visual understanding tasks. Amazon Bedrock now offers fine-tuning capabilities for Meta Llama 3.2 multimodal models, so you can adapt these sophisticated models to your unique use case.

In this post, we share comprehensive best practices and scientific insights for fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock. Our recommendations are based on extensive experiments using public benchmark datasets across various vision-language tasks, including visual question answering, image captioning, and chart interpretation and understanding. By following these guidelines, you can fine-tune smaller, more cost-effective models to achieve performance that rivals or even surpasses much larger models—potentially reducing both inference costs and latency, while maintaining high accuracy for your specific use case.

Recommended use cases for fine-tuning

Meta Llama 3.2 multimodal fine-tuning excels in scenarios where the model needs to understand visual information and generate appropriate textual responses. Based on our experimental findings, use cases such as visual question answering, chart interpretation and understanding, image captioning, document understanding, and extracting structured information from document images (for example, form field extraction, table data retrieval, and identifying key elements in invoices, receipts, or technical diagrams) demonstrate substantial performance improvements through fine-tuning.

One notable advantage of multimodal fine-tuning is its effectiveness with mixed datasets that contain both text-only examples and image-and-text examples. This versatility allows organizations to improve performance across a range of input types with a single fine-tuned model.

Prerequisites

To use this feature, make sure that you have satisfied the following requirements:

To create a model customization job using Amazon Bedrock, you need to create an AWS Identity and Access Management (IAM) role with permissions that allow the customization job to read your training data from Amazon S3 and write output data back to Amazon S3 (for more details, see Create a service role for model customization).

The following code is the trust relationship, which allows Amazon Bedrock to assume the IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "<account-id>"
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:<region>:<account-id>:model-customization-job/*"
                }
            }
        }
    ]
}
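If you prefer to set up this role programmatically, the following is a minimal boto3 sketch that creates the service role with the trust relationship shown above and attaches an inline S3 policy. The role name, bucket names, and the exact scope of the permissions policy are illustrative assumptions, not the canonical policy from the Amazon Bedrock documentation.

import json

import boto3

iam = boto3.client("iam")

ACCOUNT_ID = "111122223333"  # placeholder AWS account ID
REGION = "us-west-2"         # placeholder AWS Region

# Trust relationship from above, with the placeholders filled in.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "bedrock.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {
            "StringEquals": {"aws:SourceAccount": ACCOUNT_ID},
            "ArnEquals": {
                "aws:SourceArn": f"arn:aws:bedrock:{REGION}:{ACCOUNT_ID}:model-customization-job/*"
            },
        },
    }],
}

# Assumed inline policy: read access to the training data bucket and write
# access to the output bucket (bucket names are placeholders).
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-training-bucket",
                "arn:aws:s3:::my-training-bucket/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::my-output-bucket/*"],
        },
    ],
}

role = iam.create_role(
    RoleName="BedrockModelCustomizationRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="BedrockModelCustomizationRole",
    PolicyName="BedrockCustomizationS3Access",
    PolicyDocument=json.dumps(s3_policy),
)
print(role["Role"]["Arn"])  # use this ARN as the roleArn for the customization job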

Key multimodal datasets and experiment setup

To develop our best practices, we conducted extensive experiments using three representative multimodal datasets: LLaVA, ChartQA, and Cut-VQAv2, which cover visual instruction following, chart question answering, and general visual question answering, respectively.

Our experimental approach involved systematic testing with different sample sizes (ranging from 100 to 10,000 samples) from each dataset to understand how performance scales with data quantity. We fine-tuned both Meta Llama 3.2 11B and Meta Llama 3.2 90B models, using Amazon Bedrock Model Customization, to compare the impact of model size on performance gains. The models were evaluated using the SQuAD F1 score metric, which measures the word-level overlap between generated responses and reference answers.
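For reference, the SQuAD F1 metric mentioned above measures word-level overlap between a generated answer and a reference answer. The following is a minimal sketch of that computation; the normalization steps mirror the common SQuAD evaluation script, and the exact tokenization details may differ slightly from the evaluation harness we used.

import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, and split into word tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def squad_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between a generated answer and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial overlap yields a score between 0 and 1.
print(squad_f1("the revenue increased by 12 percent", "revenue increased 12 percent"))  # ~0.89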

Best practices for data preparation

The quality and structure of your training data fundamentally determine the success of fine-tuning. Our experiments revealed several critical insights for preparing effective multimodal datasets: use a single image per example rather than multiple images, which helps the model form clearer associations between a specific visual input and the corresponding textual output; create accurate, detailed annotations that reflect the output format you want the model to produce; and start with a small set of high-quality examples (even roughly 100 can yield meaningful gains), then scale up as resources allow. A sketch of what a training record could look like is shown below.
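To make the formatting guidance concrete, here is a rough sketch of training records, written as Python that emits a JSONL file (one JSON object per line). The field names and the S3 image reference are illustrative assumptions, not the authoritative Amazon Bedrock schema, so check the Amazon Bedrock documentation for the exact format expected by the fine-tuning job.

import json

# Hypothetical single-image record: one image, one question, one target answer.
# Field names are assumptions for illustration only.
image_record = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"image": {"source": {"s3Uri": "s3://my-training-bucket/charts/q1_revenue.png"}}},
                {"text": "What was the revenue in Q1, according to the chart?"},
            ],
        },
        {"role": "assistant", "content": [{"text": "Q1 revenue was $4.2M."}]},
    ]
}

# A mixed dataset can also include text-only records alongside image records.
text_record = {
    "messages": [
        {"role": "user", "content": [{"text": "Summarize the key trend in one sentence."}]},
        {"role": "assistant", "content": [{"text": "Revenue grew steadily each quarter."}]},
    ]
}

# Training data is supplied as a JSONL file: one record per line.
with open("train.jsonl", "w") as f:
    for record in (image_record, text_record):
        f.write(json.dumps(record) + "\n")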

Configuring fine-tuning parameters

When fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock, you can configure key training parameters, such as the number of epochs, the batch size, and the learning rate multiplier, to optimize performance for your specific use case, as illustrated in the sketch that follows.
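As a rough illustration of how these parameters are passed, the boto3 sketch below submits a fine-tuning job. The base model identifier, hyperparameter key names, bucket paths, and role ARN are placeholders or assumptions; consult the Amazon Bedrock console or API reference for the exact values supported for Meta Llama 3.2 multimodal models.

import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")  # Region is a placeholder

response = bedrock.create_model_customization_job(
    jobName="llama32-11b-chartqa-ft",
    customModelName="llama32-11b-chartqa",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelCustomizationRole",
    baseModelIdentifier="meta.llama3-2-11b-instruct-v1:0",  # assumed model ID
    trainingDataConfig={"s3Uri": "s3://my-training-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-output-bucket/ft-output/"},
    # Hyperparameter key names are assumptions based on typical Bedrock customization jobs.
    hyperParameters={
        "epochCount": "3",
        "batchSize": "1",
        "learningRateMultiplier": "1.0",
    },
)
print(response["jobArn"])  # track progress with get_model_customization_job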

Model size selection and performance comparison

Choosing between Meta Llama 3.2 11B and Meta Llama 3.2 90B for fine-tuning presents an important decision that balances performance against cost and latency considerations. Our experiments reveal that fine-tuning dramatically enhances performance regardless of model size. Looking at ChartQA as an example, the 11B base model improved from 64.1 with prompt optimization to 69.5 F1 score with fine-tuning, an 8.4% increase, whereas the 90B model improved from 64.0 to 71.9 F1 score (12.3% increase). For Cut-VQAv2, the 11B model improved from 42.17 to 73.2 F1 score (74% increase) and the 90B model improved from 67.4 to 76.5 (13.5% increase). These substantial gains highlight the transformative impact of multimodal fine-tuning even before considering model size differences.
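The percentage gains quoted above are relative improvements over the base-model score. A quick sketch of that calculation, using the ChartQA and Cut-VQAv2 numbers from this section:

def relative_gain(base: float, fine_tuned: float) -> float:
    """Relative improvement of the fine-tuned score over the base score, in percent."""
    return (fine_tuned - base) / base * 100

print(round(relative_gain(64.1, 69.5), 1))   # ChartQA, 11B: 8.4
print(round(relative_gain(64.0, 71.9), 1))   # ChartQA, 90B: 12.3
print(round(relative_gain(42.17, 73.2), 1))  # Cut-VQAv2, 11B: 73.6 (~74)
print(round(relative_gain(67.4, 76.5), 1))   # Cut-VQAv2, 90B: 13.5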

The following visualization demonstrates how these fine-tuned models perform across different datasets and training data volumes.

The visualization demonstrates that the 90B model (orange bars) consistently outperforms the 11B model (blue bars) across all three datasets and training sizes. This advantage is most pronounced in complex visual reasoning tasks such as ChartQA, where the 90B model achieves a 71.9 F1 score compared to 69.5 for the 11B model at 10,000 samples. Both models show improved performance as training data increases, with the most dramatic gains observed in the LLaVA dataset, where the 11B model improves from 76.2 to 82.4 F1 score and the 90B model improves from 76.6 to 83.1 F1 score when scaling from 100 to 10,000 samples.

An interesting efficiency pattern emerges when comparing across sample sizes: in several cases, the 90B model with fewer training samples outperforms the 11B model with significantly more data. For instance, in the Cut-VQAv2 dataset, the 90B model trained on just 100 samples (72.9 F1 score) exceeds the performance of the 11B model trained on 1,000 samples (68.6 F1 score).

For optimal results, we recommend selecting the 90B model for applications demanding maximum accuracy, particularly with complex visual reasoning tasks or limited training data. The 11B model remains an excellent choice for balanced applications where resource efficiency is important, because it still delivers substantial improvements over base models while requiring fewer computational resources.

Conclusion

Fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock offers organizations a powerful way to create customized AI solutions that understand both visual and textual information. Our experiments demonstrate that following best practices—using high-quality data with consistent formatting, selecting appropriate parameters, and validating results—can yield dramatic performance improvements across various vision-language tasks. Even with modest datasets, fine-tuned models can achieve remarkable enhancements over base models, making this technology accessible to organizations of all sizes.

Ready to start fine-tuning your own multimodal models? Explore our comprehensive code samples and implementation examples in our GitHub repository. Happy fine-tuning!


About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Sovik Kumar Nath is an AI/ML and Generative AI senior solutions architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double master's degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor's degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Karel Mundnich is a Sr. Applied Scientist in AWS Agentic AI. He previously worked in AWS Lex and AWS Bedrock, focusing on speech recognition, speech LLMs, and LLM fine-tuning. He holds a PhD in Electrical Engineering from the University of Southern California. In his free time, he enjoys skiing, hiking, and cycling.

Marcelo Aberle is a Sr. Research Engineer at AWS Bedrock. In recent years, he has been working at the intersection of science and engineering to enable new AWS service launches. This includes various LLM projects across Titan, Bedrock, and other AWS organizations. Outside of work, he keeps himself busy staying up-to-date on the latest GenAI startups in his adopted home city of San Francisco, California.

Jiayu Li is an Applied Scientist at AWS Bedrock, where he contributes to the development and scaling of generative AI applications using foundation models. He holds a Ph.D. and a Master’s degree in computer science from Syracuse University. Outside of work, Jiayu enjoys reading and cooking.

Fang Liu is a principal machine learning engineer at Amazon Web Services, where he has extensive experience in building AI/ML products using cutting-edge technologies. He has worked on notable projects such as Amazon Transcribe and Amazon Bedrock. Fang Liu holds a master’s degree in computer science from Tsinghua University.

Jennifer Zhu is a Senior Applied Scientist at AWS Bedrock, where she helps build and scale generative AI applications with foundation models. Jennifer holds a PhD from Cornell University and a master's degree from the University of San Francisco. Outside of work, she enjoys reading books and watching tennis games.
