AWS Machine Learning Blog · April 30, 00:20
Improve Amazon Nova migration performance with data-aware prompt optimization

 

This article describes an approach for migrating generative AI workloads to Amazon Nova models. Through model evaluation, prompt optimization, and data-aware optimization, it ensures that post-migration performance is no worse than, and often better than, pre-migration performance. The article presents a four-step workflow: evaluate the source model, automatically update prompts, evaluate the optimized prompts, and run A/B tests. It also introduces the Amazon Bedrock prompt optimization feature and a data-aware optimization approach that further improves Amazon Nova model performance, and shares successful migration examples for text summarization, multi-class text classification, and question answering.

🚀 Migrating to Amazon Nova requires a structured approach, including evaluating and benchmarking the old and new models, optimizing prompts for the new model, and testing and deploying the new model in production.

✍️ Amazon Bedrock provides a prompt optimization feature that analyzes the task, system prompt, and instructions in a prompt and automatically creates a prompt suited to Amazon Nova, optimizing its structure, instructions, and language to improve performance.

📊 Data-aware optimization is a technique that fine-tunes prompts against user-defined objective metrics (such as ROUGE, BERT-F1, or an LLM judge score), using a user-provided dataset to maximize LLM performance.

In the era of generative AI, new large language models (LLMs) are continually emerging, each with unique capabilities, architectures, and optimizations. Among these, Amazon Nova foundation models (FMs) deliver frontier intelligence and industry-leading cost-performance, available exclusively on Amazon Bedrock. Since their launch in 2024, generative AI practitioners, including teams within Amazon, have started transitioning their workloads from existing FMs to Amazon Nova models.

However, when transitioning between foundation models, the prompts created for your original model might not be as performant on Amazon Nova models without prompt engineering and optimization. Amazon Bedrock prompt optimization offers a tool that automatically optimizes prompts for your specified target models (in this case, Amazon Nova models), converting your original prompts into Amazon Nova-style prompts. Additionally, a key challenge during migration to Amazon Nova is making sure that performance after the migration is at least as good as, or better than, before it. To address this challenge, thorough model evaluation, benchmarking, and data-aware optimization are essential: compare the Amazon Nova model's performance against the model used before the migration, and optimize the prompts on Amazon Nova so that performance matches or improves upon the previous workload.

In this post, we present an LLM migration paradigm and architecture, including a continuous process of model evaluation, prompt generation using Amazon Bedrock, and data-aware optimization. The solution evaluates model performance before migration and iteratively optimizes the Amazon Nova model prompts using a user-provided dataset and objective metrics. We demonstrate successful migration to Amazon Nova for three LLM tasks: text summarization, multi-class text classification, and question answering implemented with Retrieval Augmented Generation (RAG). We also discuss the lessons learned and best practices for implementing the solution in your real-world use cases.

Migrating your generative AI workloads to Amazon Nova

Migrating the model in your generative AI workload to Amazon Nova requires a structured approach to achieve consistent or improved performance. It includes evaluating and benchmarking the old and new models, optimizing prompts for the new model, and testing and deploying the new model in production. In this section, we present a four-step workflow and a solution architecture, as shown in the following architecture diagram.

The workflow includes the following steps:

1. Evaluate the source model and collect key performance metrics based on your business use case, such as response accuracy, response format correctness, latency, and cost, to set a performance baseline as the model migration target.
2. Automatically update the structure, instructions, and language of your prompts to adapt to the Amazon Nova model for accurate, relevant, and faithful outputs. We discuss this more in the next section.
3. Evaluate the optimized prompts on the migrated Amazon Nova model to meet the performance target defined in Step 1. You can conduct the optimization in Step 2 as an iterative process until the optimized prompts meet your business criteria.
4. Conduct A/B testing to validate the Amazon Nova model performance in your testing and production environment. When you're satisfied, you can deploy the Amazon Nova model, settings, and prompts in production.

This four-step workflow needs to run continuously, to adapt to variations in both the model and the data, driven by the changes in business use cases. The continuous adaptation provides ongoing optimization and helps maximize overall model performance.

Data-aware prompt optimization on Amazon Nova

In this section, we present a comprehensive optimization methodology consisting of two steps. The first step is to use Amazon Bedrock prompt optimization to refine your prompt structure; the second is to use an innovative data-aware prompt optimization approach to further improve the prompt and the Amazon Nova model's performance.

Amazon Bedrock prompt optimization

Amazon Bedrock provides a prompt optimization feature that rewrites prompts to improve performance for your use cases. Prompt optimization streamlines the way that AWS developers interact with FMs on Amazon Bedrock, automatically adapting prompts to the selected models and rewriting them for better performance.
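Prompt optimization can be run from the Amazon Bedrock console or called programmatically. The following is a minimal sketch, assuming the OptimizePrompt API exposed through the bedrock-agent-runtime client in boto3; the exact request and response field names, and the target model ID, should be verified against your SDK version:

import boto3

# Minimal sketch (assumption: OptimizePrompt is available on the bedrock-agent-runtime client).
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

source_prompt = "Act like you are an intelligent AI assistant. ..."  # your original prompt

response = client.optimize_prompt(
    input={"textPrompt": {"text": source_prompt}},
    targetModelId="amazon.nova-lite-v1:0",  # example target Amazon Nova model ID
)

# The optimized prompt is returned as an event stream; print the rewrite events.
for event in response["optimizedPrompt"]:
    if "optimizedPromptEvent" in event:
        print(event["optimizedPromptEvent"])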

As the first step, you can use prompt optimization to adapt your prompt to Amazon Nova. By analyzing the prompt you provide, the feature interprets the task, system prompt, and instructions within the prompt, and automatically rewrites the prompt in an Amazon Nova-specific format with appropriate words, phrases, and sentences. The following example shows how prompt optimization converts a typical prompt for a summarization task on Anthropic's Claude Haiku into a well-structured prompt for an Amazon Nova model, with sections that begin with markdown tags such as ### Task, ### Summarization Instructions, and ### Document to Summarize.

Model: Anthropic's Claude 3 Haiku
Prompt:

Human:
Act like you are an intelligent AI assistant. You are required to provide a summarization based on given document. Please use below instructions when generating the response.
The document is provided in <doc></doc> XML tags.
Please be brief and concise in your answer. Do not add any information that is not mentioned in the document.
Do not provide any preamble and directly start with the summarization.
Do not make up the answer, If you don't know the answer, just say that I don't know.
<doc>{document}</doc>
Assistant:

Model: Amazon Nova Lite with Amazon Bedrock prompt optimization
Prompt:

### Task
Your task is to summarize the given document enclosed in <doc></doc> tags in a brief and concise manner. Follow these guidelines:

### Summarization Instructions
- Read the document carefully to understand its main points and key information.
- Identify the core ideas, arguments, and supporting details presented in the document.
- Synthesize the essential information into a clear and succinct summary.
- Use your own words to paraphrase the key points – do not copy verbatim from the original text.
- Omit any extraneous or redundant information not central to the main ideas.
- Do not introduce new information or make up content not present in the original document.
- If you cannot summarize the document due to lack of understanding, simply respond "I don't know."

### Document to Summarize
<doc>{document}</doc>

Please provide your concise summary immediately without any preamble.

We applied the preceding prompts to the Anthropic Claude 3 Haiku and Amazon Nova Lite models, respectively, using the public xsum dataset. Because the summarization task doesn't have a predefined ground truth, we designed an LLM judge, shown in the following prompt, to evaluate the summarization quality:

You are an AI assistant, your task is to compare the following LLM-generated summary with the original document, rate how well it captures the key points and conveys the most critical information, on a scale of 1-5.

The score should be based on the following performance criteria:
- Consistency: characterizes the summary's factual and logical correctness. It should stay true to the original text, not introduce additional information, and use the same terminology.
- Relevance: captures whether the summary is limited to the most pertinent information in the original text. A relevant summary focuses on the essential facts and key messages, omitting unnecessary details or trivial information.
- Fluency: describes the readability of the summary. A fluent summary is well-written and uses proper syntax, vocabulary, and grammar.
- Coherence: measures the logical flow and connectivity of ideas. A coherent summary presents the information in a structured, logical, and easily understandable manner.

Score 5 means the LLM-generated summary is the best summary fully aligned with the original document.
Score 1 means the LLM-generated summary is the worst summary completely irrelevant to the original document.

Please also provide an explanation on why you provide the score. Keep the explanation as concise as possible.
The LLM-generated summary is provided within the <summary> XML tag.
The original document is provided within the <document> XML tag.
In your response, present the score within the <score> XML tag, and the explanation within the <thinking> XML tag.
DO NOT nest <score> and <thinking> element.
DO NOT put any extra attribute in the <score> and <thinking> tag.

<document>
{document}
</document>

LLM generated summary:
<summary>
{summary}
</summary>
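To apply this judge at scale, you can send the filled-in judge prompt to a Bedrock model and parse the <score> tag from the response. The following is a minimal sketch using the Bedrock Converse API; JUDGE_PROMPT, generate_summary, and documents are hypothetical placeholders for the prompt above, your summarization call, and your evaluation set, and the judge model ID is only an example (the post doesn't specify which judge model was used):

import re
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def judge_summary(document: str, summary: str, judge_model_id: str = "amazon.nova-pro-v1:0") -> int:
    # Fill the judge prompt and ask the judge model for a 1-5 score.
    prompt = JUDGE_PROMPT.format(document=document, summary=summary)
    response = bedrock.converse(
        modelId=judge_model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 512},
    )
    text = response["output"]["message"]["content"][0]["text"]
    match = re.search(r"<score>\s*(\d)\s*</score>", text)
    return int(match.group(1)) if match else 1  # fall back to the lowest score if parsing fails

# One way to aggregate: average the 1-5 scores over the DevSet, expressed as a percentage of the maximum.
scores = [judge_summary(doc, generate_summary(doc)) for doc in documents]
judge_accuracy = 100 * sum(scores) / (5 * len(scores))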

The experiment, using 80 data samples, shows that accuracy improves from 77.75% to 83.25% on the Amazon Nova Lite model using prompt optimization.

Data-aware optimization

Although Amazon Bedrock prompt optimization supports the basic needs of prompt engineering, other prompt optimization techniques are available to maximize LLM performance, such as Multi-Aspect Critique, Self-Reflection, Gradient Descent and Beam Search, and Meta Prompting. Specifically, we observed that users need to fine-tune their prompts against optimization objective metrics they define, such as ROUGE, BERT-F1, or an LLM judge score, using a dataset they provide. To meet these needs, we designed a data-aware optimization architecture as shown in the following diagram.

The data-aware optimization takes two inputs. The first is the user-defined optimization objective metric; for the summarization task discussed in the previous section, you can use the BERT-F1 score or create your own LLM judge. The second is a training dataset (DevSet) provided by the user to validate response quality, for example, a summarization data sample with the following format.

Source Document | Summarization
Officers searched properties in the Waterfront Park and Colonsay View areas of the city on Wednesday. Detectives said three firearms, ammunition and a five-figure sum of money were recovered. A 26-year-old man who was arrested and charged appeared at Edinburgh Sheriff Court on Thursday. | A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.
<another document ...> | <another summarization ...>

The data-aware optimization uses these two inputs to improve the prompt for better Amazon Nova response quality. In this work, we use the DSPy (Declarative Self-improving Python) optimizer for the data-aware optimization. DSPy is a widely used framework for programming language models. It offers algorithms for optimizing the prompts for multiple LLM tasks, from simple classifiers and summarizers to sophisticated RAG pipelines. The dspy.MIPROv2 optimizer intelligently explores better natural language instructions for every prompt using the DevSet, to maximize the metrics you define.
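To make this concrete, the DevSet and the objective metric can be expressed directly as DSPy objects. The following is a minimal sketch, assuming DSPy's LiteLLM-style Bedrock model string and a hypothetical devset_records iterable and llm_judge_score helper; the field names (document, summary) are illustrative:

import dspy

# Point DSPy at the migrated model (assumption: LiteLLM-style Bedrock model identifier).
dspy.configure(lm=dspy.LM("bedrock/amazon.nova-lite-v1:0"))

# DevSet: user-provided (document, reference summary) pairs, for example loaded from xsum.
trainset = [
    dspy.Example(document=doc, summary=ref).with_inputs("document")
    for doc, ref in devset_records  # hypothetical iterable of (document, reference) pairs
]

# Optimization objective: any callable of (example, prediction) returning a score,
# for example BERT-F1 or an LLM judge score scaled to [0, 1].
def metric(example, pred, trace=None):
    return llm_judge_score(example.document, pred.summary) / 5.0  # hypothetical judge helper

This trainset and metric are the objects consumed by the MIPROv2 optimizer initialized next.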

We applied the MIPROv2 optimizer on top of the results optimized by Amazon Bedrock in the previous section for better Amazon Nova performance. In the optimizer, we specify the number of instruction candidates in the generation space, use Bayesian optimization to search the space effectively, and run it iteratively to generate instructions and few-shot examples for the prompt at each step:

# Initialize optimizer
teleprompter = MIPROv2(
    metric=metric,
    num_candidates=5,
    auto="light",
    verbose=False,
)

With the setting of num_candidates=5, the optimizer generates five candidate instructions:

0: Given the fields `question`, produce the fields `answer`.
1: Given a complex question that requires a detailed reasoning process, produce a structured response that includes a step-by-step reasoning and a final answer. Ensure the reasoning clearly outlines each logical step taken to arrive at the answer, maintaining clarity and neutrality throughout.
2: Given the fields `question` and `document`, produce the fields `answer`. Read the document carefully to understand its main points and key information. Identify the core ideas, arguments, and supporting details presented in the document. Synthesize the essential information into a clear and succinct summary. Use your own words to paraphrase the key points without copying verbatim from the original text. Omit any extraneous or redundant information not central to the main ideas. Do not introduce new information or make up content not present in the original document. If you cannot summarize the document due to lack of understanding, simply respond "I don't know."
3: In a high-stakes scenario where you must summarize critical documents for an international legal case, use the Chain of Thought approach to process the question. Carefully read and understand the document enclosed in <doc></doc> tags, identify the core ideas and key information, and synthesize this into a clear and concise summary. Ensure that the summary is neutral, precise, and omits any extraneous details. If the document is too complex or unclear, respond with "I don't know."
4: Given the fields `question` and `document`, produce the fields `answer`. The `document` field contains the text to be summarized. The `answer` field should include a concise summary of the document, following the guidelines provided. Ensure the summary is clear, accurate, and captures the core ideas without introducing new information.

We set other parameters for the optimization iteration, including the number of trials, the number of few-shot examples, and the batch size for the optimization process:

# Optimize program
optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    num_trials=7,
    minibatch_size=20,
    minibatch_full_eval_steps=7,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    requires_permission_to_run=False,
)

When the optimization starts, MIPROv2 uses each instruction candidate, along with the mini-batch of the testing dataset we provided, to run inference on the LLM and calculate the metrics we defined. After the loop is complete, the optimizer evaluates the best instruction using the full testing dataset and calculates the full evaluation score. Based on these iterations, the optimizer provides the improved instruction for the prompt:

Given the fields `question` and `document`, produce the fields `answer`.
The `document` field contains the text to be summarized.
The `answer` field should include a concise summary of the document, following the guidelines provided.
Ensure the summary is clear, accurate, and captures the core ideas without introducing new information.

Applying the optimized prompt, the summarization accuracy reported by the LLM judge on the Amazon Nova Lite model is further improved from 83.25% to 87.75%.

We also applied the optimization process to other LLM tasks, including a multi-class text classification task and a question-answering task using RAG. In all the tasks, our approach optimized the migrated Amazon Nova model to outperform the Anthropic Claude Haiku and Meta Llama models used before migration. The following table illustrates the optimization results.

Task | DevSet | Evaluation | Before Migration | After Migration (Amazon Bedrock Prompt Optimization) | After Migration (DSPy with Amazon Bedrock Prompt Optimization)
Summarization (Anthropic Claude 3 Haiku to Amazon Nova Lite) | 80 samples | LLM Judge | 77.75 | 83.25 | 87.75
Classification (Meta Llama 3.2 3B to Amazon Nova Micro) | 80 samples | Accuracy | 81.25 | 81.25 | 87.5
QA-RAG (Anthropic Claude 3 Haiku to Amazon Nova Lite) | 50 samples | Semantic Similarity | 52.71 | 51.6 | 57.15

For the text classification use case, we optimized the Amazon Nova Micro model using 80 samples, with accuracy as the metric to evaluate optimization performance at each step. After seven iterations, the optimized prompt reaches 87.5% accuracy, improved from 81.25% on the Meta Llama 3.2 3B model.
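For reference, the accuracy objective for this task can be as simple as an exact-match check between the predicted and ground-truth labels. The following is a minimal sketch; the answer field name mirrors the classification signature's output field and should be adjusted to your own program:

def classification_metric(example, pred, trace=None):
    # Exact-match accuracy on the class label, ignoring case and surrounding whitespace.
    return example.answer.strip().lower() == pred.answer.strip().lower()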

For the question-answering use case, we used 50 samples to optimize the prompt for an Amazon Nova Lite model in the RAG pipeline, and evaluated performance using a semantic similarity score based on the cosine distance between the model's answer and the ground truth answer. Compared with the testing data run on Anthropic's Claude 3 Haiku, the optimizer improved the score from 52.71 to 57.15 after migrating to the Amazon Nova Lite model and optimizing the prompt.
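The post doesn't specify which embedding model backs the semantic similarity score. The following is a minimal sketch of one way to compute it, using Amazon Titan Text Embeddings on Amazon Bedrock as an illustrative choice and cosine similarity between the answer and ground-truth embeddings:

import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str, model_id: str = "amazon.titan-embed-text-v2:0") -> np.ndarray:
    # Titan Text Embeddings request/response shape; verify against your embedding model's schema.
    response = bedrock.invoke_model(modelId=model_id, body=json.dumps({"inputText": text}))
    return np.array(json.loads(response["body"].read())["embedding"])

def semantic_similarity(answer: str, ground_truth: str) -> float:
    # Cosine similarity between the embeddings of the model answer and the reference answer.
    a, b = embed(answer), embed(ground_truth)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))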

You can find more details of these examples in the GitHub repository.

Lessons learned and best practices

Through the solution design, we have identified best practices that can help you properly configure your prompt optimization to maximize the metrics you specify for your use case:

In your use case, if your prompt structure is complex, with chain-of-thought or tree-of-thought reasoning, long instructions in the system prompt, and multiple inputs in the user prompt, you can use a task-specific class to abstract the DSPy optimizer. The class helps encapsulate the optimization logic, standardize the prompt structure and optimization parameters, and allows straightforward implementation of different optimization strategies. The following is an example of the class created for the text classification task:

class Classification(dspy.Signature):
    """You are a product search expert evaluating the quality of specific search results and deciding will that lead to a buying decision or not.
    You will be given a search query and the resulting product information and will classify the result against a provided classification class.
    Follow the given instructions to classify the search query using the classification scheme.

    Class Categories:
    Class Label:
    Category Label: Positive Search
    The class is chosen when the search query and the product are a full match and hence the customer experience is positive
    Category Label: Negative Search
    The class is chosen when the search query and the product are fully misaligned, meaning you searched for something but the output is completely different
    Category Label: Moderate Search
    The class is chosen when the search query and the product may not be fully same, but still are complementing each other and maybe of similar category
    """

    search_query = dspy.InputField(desc="Search Query consisting of keywords")
    result_product_title = dspy.InputField(desc="This is part of Product Description and indicates the Title of the product")
    result_product_description = dspy.InputField(desc="This is part of Product Description and indicates the description of the product")
    ...
    thinking = dspy.OutputField(desc="justification in the scratchpad, explaining the reasoning behind the classification choice and highlighting key factors that led to the decision")
    answer = dspy.OutputField(desc="final classification label for the product result: positive_search/negative_search/moderate_search.")

    """Instructions:
    Begin by creating a scratchpad where you can jot down your initial thoughts, observations, and any pertinent information related to the search query and product. This section is for your personal use and doesn't require a formal structure.
    Proceed to examine and dissect the search query. Pinpoint essential terms, brand names, model numbers, and specifications. Assess the user's probable objective based on the query.
    Subsequently, juxtapose the query with the product. Seek out precise correspondences in brand, model, and specifications. Recognize commonalities in functionality, purpose, or features. Reflect on how the product connects to or augments the item being queried.
    Afterwards, employ a methodical classification approach, contemplating each step carefully.
    Conclude by verifying the classification. Scrutinize the selected category in relation to its description to confirm its precision. Take into account any exceptional circumstances or possible uncertainties.
    """
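A brief usage sketch for this class: wrap the signature in a DSPy module, compile it with MIPROv2 against your labeled DevSet and metric, and then call the optimized program. The classification_trainset name and the example field values below are hypothetical, and you would pass values for every input field your signature actually defines:

# Wrap the signature in a chain-of-thought module (assumption: the elided fields above are filled in).
classifier = dspy.ChainOfThought(Classification)

# Optimize instructions and few-shot demos against the labeled DevSet and an accuracy metric
# such as the classification_metric sketched earlier (MIPROv2 imported as in the code above).
optimizer = MIPROv2(metric=classification_metric, auto="light")
optimized_classifier = optimizer.compile(classifier, trainset=classification_trainset)

# Run the optimized program on a new search result (hypothetical example values).
prediction = optimized_classifier(
    search_query="wireless noise cancelling headphones",
    result_product_title="Over-Ear Bluetooth Headphones with Active Noise Cancellation",
    result_product_description="Wireless over-ear headphones with 30-hour battery life and ANC.",
)
print(prediction.answer)  # positive_search / negative_search / moderate_search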

Conclusion

In this post, we introduced the workflow and architecture for migrating your current generative AI workload to Amazon Nova models, and presented a comprehensive prompt optimization approach using Amazon Bedrock prompt optimization and a data-aware prompt optimization methodology with DSPy. The results on three LLM tasks demonstrate the optimized performance of Amazon Nova within its intelligence classes: model performance improves with Amazon Bedrock prompt optimization after the migration and is further enhanced by the data-aware prompt optimization methodology presented in this post.

The Python library and code examples are publicly available on GitHub. You can use this LLM migration method and the prompt optimization solution to migrate your workloads into Amazon Nova, or in other model migration processes.


About the Authors

Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Anupam Dewan is a Senior Solutions Architect with a passion for generative AI and its applications in real life. He and his team enable Amazon builders who build customer-facing applications using generative AI. He lives in the Seattle area, and outside of work he loves hiking and enjoying nature.

Shuai Wang is a Senior Applied Scientist and Manager at Amazon Bedrock, specializing in natural language processing, machine learning, large language modeling, and other related AI areas. Outside work, he enjoys sports, particularly basketball, and family activities.

Kashif Imran is a seasoned engineering and product leader with deep expertise in AI/ML, cloud architecture, and large-scale distributed systems. Currently a Senior Manager at AWS, Kashif leads teams driving innovation in generative AI and Cloud, partnering with strategic cloud customers to transform their businesses. Kashif holds dual master’s degrees in Computer Science and Telecommunications, and specializes in translating complex technical capabilities into measurable business value for enterprises.
