AWS Machine Learning Blog, March 13
Benchmarking customized models on Amazon Bedrock using LLMPerf and LiteLLM

 

This article describes how Amazon Bedrock Custom Model Import simplifies model deployment, covering its benefits, performance evaluation methods, and related tools. Deployments require optimized configurations, performance should be evaluated before production use, and the article walks through concrete test steps and metrics.

🎯 Amazon Bedrock Custom Model Import simplifies model deployment and automatically optimizes serving configurations

📈 Performance benchmarking with LLMPerf and LiteLLM, including how to set test parameters

⏱️ Watch out for scale to zero: send a request first to make sure an active model copy is available

🌰 A DeepSeek-R1-Distill-Llama-8B model is used as an example to show the test process and results

Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning for their specific domains or tasks, while retaining control over costs and deployments. However, deployment can be a significant portion of the effort, often requiring 30% of project time because engineers must carefully optimize instance types and configure serving parameters through careful testing. This process can be both complex and time-consuming, requiring specialized knowledge and iterative testing to achieve the desired performance.

Amazon Bedrock Custom Model Import simplifies deployments of custom models by offering a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle an optimal, fully managed deployment, which makes sure that deployments are performant and cost effective. Amazon Bedrock Custom Model Import also handles automatic scaling, including scale to zero: when the model receives no invocations for 5 minutes, it scales down to zero, and you pay only for what you use in 5-minute increments. It also scales up automatically, increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive option for organizations that want to use custom models on Amazon Bedrock with simplicity and cost efficiency.

Before deploying these models in production, it’s crucial to evaluate their performance using benchmarking tools. These tools help to proactively detect potential production issues such as throttling and verify that deployments can handle expected production loads.

This post begins a blog series exploring DeepSeek and open FMs on Amazon Bedrock Custom Model Import. It covers the process of performance benchmarking of custom models in Amazon Bedrock using the popular open source tools LLMPerf and LiteLLM. It includes a notebook with step-by-step instructions to deploy a DeepSeek-R1-Distill-Llama-8B model, but the same steps apply to any other model supported by Amazon Bedrock Custom Model Import.

Prerequisites

This post requires an Amazon Bedrock custom model. If you don’t have one in your AWS account yet, follow the instructions from Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.

Using open source tools LLMPerf and LiteLLM for performance benchmarking

To conduct performance benchmarking, you will use LLMPerf, a popular open source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray clients and analyzing their responses. A key advantage of LLMPerf is its wide support for foundation model APIs, including LiteLLM, which supports all models available on Amazon Bedrock.

Setting up your custom model invocation with LiteLLM

LiteLLM is a versatile open source tool that can be used both as a Python SDK and a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider’s specific endpoint requirements. It supports Amazon Bedrock APIs, including InvokeModel and Converse APIs, and FMs available on Amazon Bedrock, including imported custom models.

To invoke a custom model with LiteLLM, you use the model parameter (see Amazon Bedrock documentation on LiteLLM). This is a string that follows the bedrock/provider_route/model_arn format.

The provider_route indicates which LiteLLM request/response specification to use. DeepSeek R1 models can be invoked with their custom chat template using the DeepSeek R1 provider route, or with the Llama chat template using the Llama provider route.

The model_arn is the model Amazon Resource Name (ARN) of the imported model. You can get the model ARN of your imported model in the console or by sending a ListImportedModels request.
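If you prefer to look up the ARN programmatically, the following is a minimal sketch using the boto3 bedrock client; the nameContains filter value and the AWS Region are assumptions you should adjust to your own import.

import boto3

# Assumption: the model was imported in us-east-1 with a name containing the string below.
bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.list_imported_models(nameContains="DeepSeek-R1-Distill-Llama-8B")
for summary in response["modelSummaries"]:
    # Print each imported model's name and ARN; use the ARN in the LiteLLM model string.
    print(summary["modelName"], summary["modelArn"])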

For example, the following script invokes the custom model using the DeepSeek R1 chat template.

import time

from litellm import completion

while True:
    try:
        response = completion(
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[
                {
                    "role": "user",
                    "content": """Given the following financial data:
        - Company A's revenue grew from $10M to $15M in 2023
        - Operating costs increased by 20%
        - Initial operating costs were $7M

        Calculate the company's operating margin for 2023. Please reason step by step.""",
                },
                {"role": "assistant", "content": "<think>"},
            ],
            max_tokens=4096,
        )
        print(response['choices'][0]['message']['content'])
        break
    except Exception:
        # The imported model may still be bringing up a model copy (scale to zero);
        # wait and retry.
        time.sleep(60)

After the invocation parameters for the imported model have been verified, you can configure LLMPerf for benchmarking.

Configuring a token benchmark test with LLMPerf

To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to model invocation APIs. These clients are implemented as actors that execute in parallel. llmperf.requests_launcher manages the distribution of requests across the Ray Clients, and allows for simulation of various load scenarios and concurrent request patterns. At the same time, each client will collect performance metrics during the requests, including latency, throughput, and error rates.
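The pattern is straightforward to illustrate. The following is a minimal sketch of concurrent load generation with Ray and LiteLLM; it is not LLMPerf's actual implementation, and the model string placeholder is an assumption you replace with your imported model ARN.

import time

import ray
from litellm import completion

ray.init(ignore_reinit_error=True)

MODEL = "bedrock/llama/<your imported model ARN>"  # placeholder

@ray.remote
def timed_request(model: str, prompt: str) -> float:
    # Each Ray task acts as one concurrent client and returns its end-to-end latency.
    start = time.perf_counter()
    completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return time.perf_counter() - start

# Launch two concurrent clients and collect their latencies.
latencies = ray.get(
    [timed_request.remote(MODEL, "Briefly explain operating margin.") for _ in range(2)]
)
print(latencies)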

Two critical metrics for performance are latency and throughput:

Latency: how long the model takes to respond, including the time to first token (TTFT), the inter-token latency, and the end-to-end time to return the full completion.

Throughput: how many output tokens the model generates per second, both for an individual request and aggregated across concurrent requests.
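As a rough illustration of both metrics outside of LLMPerf, the following sketch measures TTFT and an approximate output rate for a single streamed LiteLLM request; the model string placeholder and the use of streamed chunk counts as a proxy for tokens are assumptions.

import time

from litellm import completion

MODEL = "bedrock/llama/<your imported model ARN>"  # placeholder

start = time.perf_counter()
first_token_time = None
chunks = 0

response = completion(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize why benchmarking matters."}],
    max_tokens=256,
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_time is None:
            # Latency: time until the first streamed token arrives (TTFT).
            first_token_time = time.perf_counter() - start
        chunks += 1

total = time.perf_counter() - start
# Throughput: streamed chunks per second, a rough proxy for output tokens per second.
print(f"TTFT: {first_token_time:.2f}s, ~{chunks / total:.1f} chunks/s over {total:.1f}s")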

Selecting the right configuration to serve FMs typically involves experimenting with different batch sizes while closely monitoring GPU utilization and considering factors such as available memory, model size, and specific requirements of the workload. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. Although Amazon Bedrock Custom Model Import simplifies this by offering pre-optimized serving configurations, it’s still crucial to verify your deployment’s latency and throughput.

Start by configuring token_benchmark_ray.py, a sample script that facilitates the configuration of a benchmarking test. In the script, you can define parameters such as the LiteLLM model string, the mean and standard deviation of input and output tokens, the maximum number of completed requests, the test timeout, the number of concurrent requests, and the results output directory.

The following example shows how to run the benchmarking test against the imported model using the Llama provider route. See this notebook for step-by-step instructions on importing a custom model and running a benchmarking test.

python3 ${{LLM_PERF_SCRIPT_DIR}}/token_benchmark_ray.py \\
    --model "bedrock/llama/{model_id}" \\
    --mean-input-tokens {mean_input_tokens} \\
    --stddev-input-tokens {stddev_input_tokens} \\
    --mean-output-tokens {mean_output_tokens} \\
    --stddev-output-tokens {stddev_output_tokens} \\
    --max-num-completed-requests ${{LLM_PERF_MAX_REQUESTS}} \\
    --timeout 1800 \\
    --num-concurrent-requests ${{LLM_PERF_CONCURRENT}} \\
    --results-dir "${{LLM_PERF_OUTPUT}}" \\
    --llm-api litellm \\
    --additional-sampling-params '{{}}'

At the end of the test, LLMPerf will output two JSON files: one with aggregate metrics, and one with separate entries for every invocation.

Scale to zero and cold-start latency

One thing to remember is that because Amazon Bedrock Custom Model Import scales down to zero when the model is unused, you first need to make a request to make sure there is at least one active model copy. If you receive an error indicating that the model isn't ready, wait anywhere from roughly 10 seconds up to 1 minute for Amazon Bedrock to prepare at least one active model copy. When it's ready, run a test invocation again and proceed with benchmarking.
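A minimal warm-up sketch is shown below, assuming the Converse API and that a scaled-to-zero model surfaces a ModelNotReadyException until a copy becomes active; adjust the Region, the ARN placeholder, and the retry timing to your deployment.

import time

import boto3
from botocore.exceptions import ClientError

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
model_arn = "<your imported model ARN>"  # placeholder

for attempt in range(20):
    try:
        # Send a tiny request just to trigger scale-up from zero.
        bedrock_runtime.converse(
            modelId=model_arn,
            messages=[{"role": "user", "content": [{"text": "Hello"}]}],
            inferenceConfig={"maxTokens": 16},
        )
        print("Model copy is active; proceed with benchmarking.")
        break
    except ClientError as err:
        # Assumption: a cold model returns ModelNotReadyException while a copy starts.
        if err.response["Error"]["Code"] == "ModelNotReadyException":
            time.sleep(30)
        else:
            raise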

Example scenario for DeepSeek-R1-Distill-Llama-8B

Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust the token count parameters for prompts and completions. For example, the results below reflect a run with two concurrent clients and 100 completed requests, with prompts averaging roughly 500 input tokens and completions averaging a little over 1,000 output tokens.

This illustrative test takes approximately 8 minutes. At the end of the test, you will obtain a summary of aggregate metrics:

inter_token_latency_s
    p25 = 0.010615988283217918
    p50 = 0.010694698716183695
    p75 = 0.010779359342088015
    p90 = 0.010945443657517748
    p95 = 0.01100556307365132
    p99 = 0.011071086908721675
    mean = 0.010710014800224604
    min = 0.010364670612635254
    max = 0.011485444453299149
    stddev = 0.0001658793389904756
ttft_s
    p25 = 0.3356793452499005
    p50 = 0.3783651359990472
    p75 = 0.41098671700046907
    p90 = 0.46655246950049334
    p95 = 0.4846706690498647
    p99 = 0.6790834719300077
    mean = 0.3837810468001226
    min = 0.1878921090010408
    max = 0.7590946710006392
    stddev = 0.0828713133225014
end_to_end_latency_s
    p25 = 9.885957818500174
    p50 = 10.561580732000039
    p75 = 11.271923759749825
    p90 = 11.87688222009965
    p95 = 12.139972019549713
    p99 = 12.6071144856102
    mean = 10.406450886010116
    min = 2.6196457750011177
    max = 12.626598834998731
    stddev = 1.4681851822617253
request_output_throughput_token_per_s
    p25 = 104.68609252502657
    p50 = 107.24619111072519
    p75 = 108.62997591951486
    p90 = 110.90675007239598
    p95 = 113.3896235445618
    p99 = 116.6688412475626
    mean = 107.12082450567561
    min = 97.0053466021563
    max = 129.40680882698936
    stddev = 3.9748004356837137
number_input_tokens
    p25 = 484.0
    p50 = 500.0
    p75 = 514.0
    p90 = 531.2
    p95 = 543.1
    p99 = 569.1200000000001
    mean = 499.06
    min = 433
    max = 581
    stddev = 26.549294727074212
number_output_tokens
    p25 = 1050.75
    p50 = 1128.5
    p75 = 1214.25
    p90 = 1276.1000000000001
    p95 = 1323.75
    p99 = 1372.2
    mean = 1113.51
    min = 339
    max = 1392
    stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034

In addition to the summary, you will receive metrics for individual requests that can be used to prepare detailed reports like the following histograms for time to first token and token throughput.
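As a sketch of how you might build such reports, the following assumes the per-request file name in the results directory ends in _individual_responses.json and that its fields match the metric names in the summary above; verify both against your LLMPerf output.

import glob
import json

import matplotlib.pyplot as plt

# Load the per-request results written by LLMPerf (file name pattern is an assumption).
path = glob.glob("results/*_individual_responses.json")[0]
with open(path) as f:
    requests = json.load(f)

ttft = [r["ttft_s"] for r in requests]
throughput = [r["request_output_throughput_token_per_s"] for r in requests]

# Plot histograms for time to first token and per-request output throughput.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ttft, bins=20)
ax1.set_xlabel("Time to first token (s)")
ax2.hist(throughput, bins=20)
ax2.set_xlabel("Output throughput (tokens/s)")
fig.tight_layout()
plt.show()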

Analyzing performance results from LLMPerf and estimating costs using Amazon CloudWatch

LLMPerf gives you the ability to benchmark the performance of custom models served in Amazon Bedrock without having to inspect the specifics of the serving properties and configuration of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end user experience of your application.

In addition, the benchmarking exercise can serve as a valuable tool for cost estimation. By using Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. The plot for the ModelCopy metric is shown in the figure below. This data will assist in estimating costs, because billing is based on the number of active model copies at a given time.
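As a sketch, you can pull the ModelCopy metric with boto3; the ModelId dimension name is an assumption to verify against the metric's dimensions in the CloudWatch console for your account.

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
model_arn = "<your imported model ARN>"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="ModelCopy",
    # Assumption: the metric is dimensioned by ModelId set to the imported model ARN.
    Dimensions=[{"Name": "ModelId", "Value": model_arn}],
    StartTime=start,
    EndTime=end,
    Period=60,
    Statistics=["Maximum"],
)
# Print the maximum number of active model copies per minute over the last hour.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])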

Conclusion

While Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance, and compare models across key metrics such as cost, latency, and throughput.

To learn more, try the example notebook with your custom model.



About the Authors

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, Paras enjoys spending time with his family and biking around the Bay Area.

Prashant Patel is a Senior Software Development Engineer in AWS Bedrock. He’s passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master’s degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.
