AWS Machine Learning Blog, July 26, 2024
Amazon SageMaker inference launches faster auto scaling for generative AI models

Amazon SageMaker introduces new sub-minute metrics designed to speed up auto scaling for generative AI models. These metrics help reduce the time required for generative AI models to scale automatically and improve the responsiveness of generative AI applications. The new metrics, ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy, reflect system load more accurately and allow you to scale out faster. In addition, SageMaker supports streaming responses for models deployed on SageMaker, which helps you load balance more effectively and avoid hotspots.

🤔 **More precise load monitoring:** SageMaker introduces two new sub-minute metrics: ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy. These metrics reflect system load more accurately than traditional metrics because they track the number of concurrent requests being handled by the containers, including requests queued inside them, helping you make auto scaling decisions faster.

🚀 **Faster auto scaling:** The new metrics let you scale out more quickly because they detect changes in system load sooner. When the number of concurrent requests increases, the auto scaling mechanism can respond faster, reducing the time needed to scale. This is especially important for generative AI models, which typically take longer to process each request.

💡 **Streaming responses:** SageMaker also supports streaming responses for models deployed on SageMaker. Instead of waiting for the entire response to complete before sending the first token to the client, you can stream tokens as they are generated, improving responsiveness. This is especially important for applications that require real-time interaction, such as conversational AI assistants.

⚖️ **More effective load balancing:** Streaming responses help you load balance more effectively because requests can be sent to instances with lower load, avoiding hotspots and keeping all instances well utilized.

💰 **Lower cost:** By scaling faster, you can use resources more efficiently and reduce cost. Streaming responses can also help you run fewer instances, lowering cost further.

📊 **Simpler configuration:** The new metrics can be monitored through CloudWatch and configured with Application Auto Scaling. You can easily integrate them into your auto scaling policies and get started quickly.

📈 **Better performance:** Faster auto scaling and more effective load balancing help improve the performance of your generative AI models, so your applications can respond to user requests more quickly and deliver a smoother user experience.

Today, we are excited to announce a new capability in Amazon SageMaker inference that can help you reduce the time it takes for your generative artificial intelligence (AI) models to scale automatically. You can now use sub-minute metrics and significantly reduce overall scaling latency for generative AI models. With this enhancement, you can improve the responsiveness of your generative AI applications as demand fluctuates.

The rise of foundation models (FMs) and large language models (LLMs) has brought new challenges to generative AI inference deployment. These advanced models often take seconds to process each request, while sometimes handling only a limited number of concurrent requests. This creates a critical need for rapid detection and auto scaling to maintain business continuity. Organizations implementing generative AI seek comprehensive solutions that address multiple concerns: reducing infrastructure costs, minimizing latency, and maximizing throughput to meet the demands of these sophisticated models. However, they prefer to focus on solving business problems rather than doing the undifferentiated heavy lifting to build complex inference platforms from the ground up.

SageMaker provides industry-leading capabilities to address these inference challenges. It offers endpoints for generative AI inference that reduce FM deployment costs by 50% on average and latency by 20% on average by optimizing the use of accelerators. The SageMaker inference optimization toolkit, a fully managed model optimization feature in SageMaker, can deliver up to two times higher throughput while reducing costs by approximately 50% for generative AI performance on SageMaker. Besides optimization, SageMaker inference also provides streaming support for LLMs, enabling you to stream tokens in real time rather than waiting for the entire response. This allows for lower perceived latency and more responsive generative AI experiences, which are crucial for use cases like conversational AI assistants. Lastly, SageMaker inference provides the ability to deploy a single model or multiple models using SageMaker inference components on the same endpoint using advanced routing strategies to effectively load balance to the underlying instances backing an endpoint.

Faster auto scaling metrics

To optimize real-time inference workloads, SageMaker employs Application Auto Scaling. This feature dynamically adjusts the number of instances in use and the quantity of model copies deployed, responding to real-time changes in demand. When in-flight requests surpass a predefined threshold, auto scaling increases the available instances and deploys additional model copies to meet the heightened demand. Similarly, as the number of in-flight requests decreases, the system automatically removes unnecessary instances and model copies, effectively reducing costs. This adaptive scaling makes sure resources are optimally utilized, balancing performance needs with cost considerations in real time.

With today’s launch, SageMaker real-time endpoints now emit two new sub-minute Amazon CloudWatch metrics: ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy. ConcurrentRequestsPerModel is the metric used for SageMaker real-time endpoints; ConcurrentRequestsPerCopy is used when SageMaker real-time inference components are used.

These metrics provide a more direct and accurate representation of the load on the system by tracking the actual concurrency or the number of simultaneous requests being handled by the containers (in-flight requests), including the requests queued inside the containers. The concurrency-based target tracking and step scaling policies focus on monitoring these new metrics. When the concurrency levels increase, the auto scaling mechanism can respond by scaling out the deployment, adding more container copies or instances to handle the increased workload. By taking advantage of these high-resolution metrics, you can now achieve significantly faster auto scaling, reducing detection time and improving the overall scale-out time of generative AI models. You can use these new metrics for endpoints created with accelerator instances like AWS Trainium, AWS Inferentia, and NVIDIA GPUs.
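As a quick way to inspect the new metrics, you can query them directly from CloudWatch. The following is a minimal sketch using boto3; the endpoint and variant names are placeholders, and the namespace, dimension names, and 10-second period are assumptions based on the metrics being published at high resolution for the endpoint, so adjust them to match your deployment.

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Query the last 15 minutes of ConcurrentRequestsPerModel at 10-second resolution.
    # Namespace, dimensions, and names below are assumptions; adjust for your endpoint.
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ConcurrentRequestsPerModel",
        Dimensions=[
            {"Name": "EndpointName", "Value": "my-llm-endpoint"},  # hypothetical endpoint name
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(minutes=15),
        EndTime=datetime.now(timezone.utc),
        Period=10,             # sub-minute sampling, assuming high-resolution publication
        Statistics=["Maximum"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Maximum"])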

In addition, you can enable streaming responses back to the client on models deployed on SageMaker. Many current solutions track a session or concurrency metric only until the first token is sent to the client and then mark the target instance as available. SageMaker can track a request until the last token is streamed to the client instead of until the first token. This way, clients can be directed to instances or GPUs that are less busy, avoiding hotspots. Tracking concurrency also helps you make sure requests that are in flight and queued are treated alike for alerting on the need for auto scaling. With this capability, you can make sure your model deployment scales proactively, accommodating fluctuations in request volumes and maintaining optimal performance by minimizing queuing delays.
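To illustrate streaming, the following is a minimal sketch that invokes an endpoint with a streamed response using the SageMaker runtime invoke_endpoint_with_response_stream API. The endpoint name and payload shape are placeholders and depend on the model container you deploy.

    import json
    import boto3

    smr_client = boto3.client("sagemaker-runtime", region_name="us-east-1")

    # Hypothetical endpoint name and payload; adjust to your container's input schema.
    response = smr_client.invoke_endpoint_with_response_stream(
        EndpointName="my-llm-endpoint",
        ContentType="application/json",
        Body=json.dumps({"inputs": "What is Amazon SageMaker?", "parameters": {"max_new_tokens": 256}}),
    )

    # Tokens arrive as PayloadPart events; the request counts as in-flight until the last token.
    for event in response["Body"]:
        chunk = event.get("PayloadPart", {}).get("Bytes")
        if chunk:
            print(chunk.decode("utf-8"), end="", flush=True)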

In this post, we detail how the new ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy CloudWatch metrics work, explain why you should use them, and walk you through the process of implementing them for your workloads. These new metrics allow you to scale your LLM deployments more effectively, providing optimal performance and cost-efficiency as the demand for your models fluctuates.

Components of auto scaling

The following figure illustrates a typical scenario of how a SageMaker real-time inference endpoint scales out to handle an increase in concurrent requests. This demonstrates the automated and responsive nature of scaling in SageMaker. In this example, we walk through the key steps that occur when the inference traffic to a SageMaker real-time endpoint starts to increase and concurrency to the model deployed on every instance goes up. We show how the system monitors the traffic, invokes an auto scaling action, provisions new instances, and ultimately load balances the requests across the scaled-out resources. Understanding this scaling process is crucial for making sure your generative AI models can handle fluctuations in demand and provide a seamless experience for your customers. By the end of this walkthrough, you’ll have a clear picture of how SageMaker real-time inference endpoints can automatically scale to meet your application’s needs.

Let’s dive into the details of this scaling scenario using the provided figure.

The key steps are as follows:

    Increased inference traffic (t0) – At some point, the traffic to the SageMaker real-time inference endpoint starts to increase, indicating a potential need for additional resources. The increase in traffic leads to a higher number of concurrent requests required for each model copy or instance.
    CloudWatch alarm monitoring (t0 → t1) – An auto scaling policy uses CloudWatch to monitor metrics, sampling them over a few data points within a predefined time frame. This makes sure the increased traffic is a sustained change in demand, not a temporary spike.
    Auto scaling trigger (t1) – If the metric crosses the predefined threshold, the CloudWatch alarm goes into an InAlarm state, invoking an auto scaling action to scale up the resources.
    New instance provisioning and container startup (t1 → t2) – During the scale-up action, new instances are provisioned if required. The model server and container are started on the new instances. When the instance provisioning is complete, the model container initialization process begins. After the server successfully starts and passes the health checks, the instances are registered with the endpoint, enabling them to serve incoming traffic requests.
    Load balancing (t2) – After the container health checks pass and the container reports as healthy, the new instances are ready to serve inference requests. All requests are now automatically load balanced between the two instances using the pre-built routing strategies in SageMaker.

This approach allows the SageMaker real-time inference endpoint to react quickly and handle the increased traffic with minimal impact to the clients.
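If you want to observe these steps as they occur, you can list the recent Application Auto Scaling activities for your endpoint. The following is a minimal sketch; the endpoint and variant names are placeholders, and the resource ID format shown is the one Application Auto Scaling uses for endpoint variants.

    import boto3

    autoscaling_client = boto3.client("application-autoscaling", region_name="us-east-1")

    # Resource ID for an endpoint variant: "endpoint/<endpoint-name>/variant/<variant-name>".
    resource_id = "endpoint/my-llm-endpoint/variant/AllTraffic"  # hypothetical names

    activities = autoscaling_client.describe_scaling_activities(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        MaxResults=10,
    )

    # Each activity records when a scale-out or scale-in started, why, and its current status.
    for activity in activities["ScalingActivities"]:
        print(activity["StartTime"], activity["StatusCode"], activity["Cause"])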

Application Auto Scaling supports target tracking and step scaling policies. Each has its own logic for scale-in and scale-out: target tracking keeps the chosen metric close to a target value and creates and manages the required CloudWatch alarms for you, whereas step scaling lets you define explicit capacity adjustments that are applied in steps based on how far the alarm metric breaches the threshold.

By using these new metrics, auto scaling can now be invoked and scale out significantly faster compared to the older SageMakerVariantInvocationsPerInstance predefined metric type. This decrease in the time to measure and invoke a scale-out allows you to react to increased demand significantly faster than before (under 1 minute). This works especially well for generative AI models, which are typically concurrency-bound and can take many seconds to complete each inference request.

Using the new high-resolution metrics allows you to greatly decrease the time it takes to scale up an endpoint using Application Auto Scaling. These high-resolution metrics are emitted at 10-second intervals, allowing scale-out procedures to be invoked sooner. For models with fewer than 10 billion parameters, this can be a significant percentage of the time it takes for an end-to-end scaling event. For larger model deployments, this can shave up to 5 minutes off the time before a new copy of your FM or LLM is ready to service traffic.

Get started with faster auto scaling

Getting started with the new metrics is straightforward. You can use the following steps to create a new scaling policy and benefit from faster auto scaling. In this example, we deploy a Meta Llama 3 model that has 8 billion parameters on a G5 instance type, which uses NVIDIA A10G GPUs. The model fits entirely on a single GPU, and we can use auto scaling to scale up the number of inference components and G5 instances based on our traffic. The full notebooks can be found on GitHub for SageMaker single model endpoints and for SageMaker with inference components.
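The snippets that follow assume a few variables are already defined in the notebook. The following is a minimal sketch of that setup; the Region, endpoint name, and capacity values are placeholders, and the resource ID format shown is the one Application Auto Scaling uses for endpoint variants.

    import boto3

    region = "us-east-1"                   # placeholder Region
    endpoint_name = "llama-3-8b-endpoint"  # hypothetical endpoint name
    variant_name = "AllTraffic"            # default variant name on a SageMaker endpoint

    # Application Auto Scaling identifies an endpoint variant with this resource ID format.
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

    as_min_capacity = 1  # minimum number of instances
    as_max_capacity = 2  # maximum number of instances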

    After you create your SageMaker endpoint, you define a new auto scaling target for Application Auto Scaling. In the following code block, you set as_min_capacity and as_max_capacity to the minimum and maximum number of instances you want for your endpoint, respectively. If you’re using inference components (shown later), you can use instance auto scaling and skip this step.
    autoscaling_client = boto3.client("application-autoscaling", region_name=region)

    # Register scalable target
    scalable_target = autoscaling_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=as_min_capacity,
        MaxCapacity=as_max_capacity,  # Replace with your desired maximum instances
    )
    After you create your new scalable target, you can define your policy. You can choose between using a target tracking policy or step scaling policy. In the following target tracking policy, we have set TargetValue to 5. This means we’re asking auto scaling to scale up if the number of concurrent requests per model is equal to or greater than five.
    # Create Target Tracking Scaling Policy
    target_tracking_policy_response = autoscaling_client.put_scaling_policy(
        PolicyName="SageMakerEndpointScalingPolicy",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 5.0,  # Scaling triggers when endpoint receives 5 ConcurrentRequestsPerModel
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution"
            },
            "ScaleInCooldown": 180,  # Cooldown period after scale-in activity
            "ScaleOutCooldown": 180,  # Cooldown period after scale-out activity
        },
    )

If you would like to configure a step scaling policy instead, refer to the step scaling notebook on GitHub; a minimal sketch follows.
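The following sketch shows what a step scaling setup can look like: it creates a StepScaling policy and attaches a high-resolution CloudWatch alarm on ConcurrentRequestsPerModel that invokes the policy. The thresholds, step adjustments, alarm settings, and the AWS/SageMaker namespace are illustrative assumptions, not recommended values, and the snippet reuses the variables defined in the earlier setup sketch.

    # Create a step scaling policy (assumes autoscaling_client, resource_id, region,
    # endpoint_name, and variant_name are defined as in the earlier snippets).
    step_policy_response = autoscaling_client.put_scaling_policy(
        PolicyName="SageMakerEndpointStepScalingPolicy",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="StepScaling",
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Maximum",
            "Cooldown": 180,
            "StepAdjustments": [
                # Add 1 instance when the metric is up to 5 above the alarm threshold,
                # and 2 instances for larger breaches (illustrative steps).
                {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 5, "ScalingAdjustment": 1},
                {"MetricIntervalLowerBound": 5, "ScalingAdjustment": 2},
            ],
        },
    )

    # Attach a high-resolution CloudWatch alarm on ConcurrentRequestsPerModel to the policy.
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(
        AlarmName="ConcurrentRequestsPerModel-step-scaling-alarm",
        Namespace="AWS/SageMaker",                 # assumed namespace for the new metric
        MetricName="ConcurrentRequestsPerModel",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        Statistic="Maximum",
        Period=10,              # sub-minute evaluation
        EvaluationPeriods=3,
        Threshold=5.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=[step_policy_response["PolicyARN"]],
    )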

That’s it! Traffic invoking your endpoint is now monitored, with concurrency tracked and evaluated against the policy you specified. Your endpoint will scale up and down between the minimum and maximum values you provided. In the target tracking example, we set the cooldown period for scaling in and out to 180 seconds, but you can change this based on what works best for your workload.
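To see the policy in action, you can drive concurrent traffic at the endpoint and watch the ConcurrentRequestsPerModel metric and the scaling activities. The following is a minimal sketch that sends requests in parallel; the payload is a placeholder for whatever your model container expects, and it reuses the region and endpoint_name variables from the earlier setup sketch.

    import json
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    smr_client = boto3.client("sagemaker-runtime", region_name=region)

    def invoke_once(_):
        # Hypothetical payload; adjust to your container's input schema.
        response = smr_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": "Tell me about auto scaling.", "parameters": {"max_new_tokens": 128}}),
        )
        return response["Body"].read()

    # Keep roughly 20 requests in flight to push concurrency past the TargetValue of 5.
    with ThreadPoolExecutor(max_workers=20) as executor:
        results = list(executor.map(invoke_once, range(200)))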

SageMaker inference components

If you’re using inference components to deploy multiple generative AI models on a SageMaker endpoint, you can complete the following steps:

    After you create your SageMaker endpoint and inference components, you define a new auto scaling target for Application Auto Scaling:
    autoscaling_client = boto3.client("application-autoscaling", region_name=region)

    # Register scalable target
    scalable_target = autoscaling_client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        MinCapacity=as_min_capacity,
        MaxCapacity=as_max_capacity,  # Replace with your desired maximum instances
    )
    After you create your new scalable target, you can define your policy. In the following code, we set TargetValue to 5. By doing so, we’re asking auto scaling to scale up if the number of concurrent requests per model is equal to or greater than five.
    # Create Target Tracking Scaling Policy
    target_tracking_policy_response = autoscaling_client.put_scaling_policy(
        PolicyName="SageMakerInferenceComponentScalingPolicy",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 5.0,  # Scaling triggers when endpoint receives 5 ConcurrentRequestsPerCopy
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution"
            },
            "ScaleInCooldown": 180,  # Cooldown period after scale-in activity
            "ScaleOutCooldown": 180,  # Cooldown period after scale-out activity
        },
    )

You can use the new concurrency-based target tracking auto scaling policies in tandem with existing invocation-based target tracking policies. When a container experiences a crash or failure, the resulting requests are typically short-lived and may be responded to with error messages. In such scenarios, the concurrency-based auto scaling policy can detect the sudden drop in concurrent requests, potentially causing an unintentional scale-in of the container fleet. However, the invocation-based policy can act as a safeguard, avoiding the scale-in if there is still sufficient traffic being directed to the remaining containers. With this hybrid approach, container-based applications can achieve a more efficient and adaptive scaling behavior. The balance between concurrency-based and invocation-based policies allows the system to respond appropriately to various operational conditions, such as container failures, sudden spikes in traffic, or gradual changes in workload patterns. This enables the container infrastructure to scale up and down more effectively, optimizing resource utilization and providing reliable application performance.
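As a sketch of this hybrid approach for inference components, you can register a second target tracking policy on the same scalable target that uses an invocation-based predefined metric alongside the concurrency-based one. The predefined metric type name and target value below are assumptions to verify and tune for your workload, and the snippet reuses the autoscaling_client and resource_id from the previous steps.

    # Invocation-based target tracking policy on the same inference component scalable target,
    # complementing the concurrency-based policy defined earlier.
    invocation_policy_response = autoscaling_client.put_scaling_policy(
        PolicyName="SageMakerInferenceComponentInvocationScalingPolicy",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 100.0,  # illustrative invocations per copy; tune for your workload
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"  # assumed metric type
            },
            "ScaleInCooldown": 180,
            "ScaleOutCooldown": 180,
        },
    )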

Sample runs and results

With the new metrics, we have observed improvements in the time required to invoke scale-out events. To test the effectiveness of this solution, we completed some sample runs with Meta Llama models (Llama 2 7B and Llama 3 8B). Prior to this feature, detecting the need for auto scaling could take over 6 minutes, but with this new feature, we were able to reduce that time to less than 45 seconds. For generative AI models such as Meta Llama 2 7B and Llama 3 8B, we have been able to reduce the overall end-to-end scale-out time by approximately 40%.

The following figures illustrate the results of sample runs for Meta Llama 3 8B.

The following figures illustrate the results of sample runs for Meta Llama 2 7B.

As a best practice, it’s important to optimize your container, model artifacts, and bootstrapping processes to be as efficient as possible. Doing so can help minimize deployment times and improve the responsiveness of AI services.

Conclusion

In this post, we detailed how the ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy metrics work, explained why you should use them, and walked you through the process of implementing them for your workloads. We encourage you to try out these new metrics and evaluate whether they improve your FM and LLM workloads on SageMaker endpoints. You can find the notebooks on GitHub.

Special thanks to our partners from Application Auto Scaling for making this launch happen: Ankur Sethi, Vasanth Kumararajan, Jaysinh Parmar, Mona Zhao, Miranda Liu, Fatih Tekin, and Martin Wang.


About the Authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Praveen Chamarthi is a Senior AI/ML Specialist with Amazon Web Services. He is passionate about AI/ML and all things AWS. He helps customers across the Americas scale, innovate, and operate ML workloads efficiently on AWS. In his spare time, Praveen loves to read and enjoys sci-fi movies.

Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, and spending time with friends and families.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.

Kunal Shah is a software development engineer at Amazon Web Services (AWS) with 7+ years of industry experience. His passion lies in deploying machine learning (ML) models for inference, and he is driven by a strong desire to learn and contribute to the development of AI-powered tools that can create real-world impact. Beyond his professional pursuits, he enjoys watching historical movies, traveling and adventure sports.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
