AWS Machine Learning Blog, March 26
Enhance deployment guardrails with inference component rolling updates for Amazon SageMaker AI inference

Amazon SageMaker AI has introduced rolling updates, a capability designed to simplify and streamline the process of updating machine learning models in production, reducing deployment costs while improving resource utilization and model availability. By updating models in configurable batches, dynamically scaling the underlying infrastructure, and integrating real-time safety checks, the feature keeps deployments cost-effective, reliable, and adaptable, even for GPU-heavy workloads. This article examines the advantages of rolling updates and walks through practical examples to help readers understand and apply the feature.

🚀 Traditional blue/green deployments face challenges during updates, including low resource utilization, limited compute capacity, and high transition risk, especially when working with large language models or high-throughput models.

💡 SageMaker AI introduces rolling updates, which update inference components in batches, dynamically scale instances, and integrate automated safety checks, removing the limitations of blue/green deployment and making updates more efficient.

⚙️ Rolling updates let users configure the batch size for each rolling step: smaller models can use larger batches for rapid updates, while larger models can use smaller batches to limit GPU contention.

🛡️ Rolling updates integrate with Amazon CloudWatch alarms to monitor inference component metrics. If a CloudWatch alarm is triggered, SageMaker AI initiates an automatic rollback to keep the deployment stable.

🛠️ Rolling updates are implemented through extensions to the SageMaker AI API, primarily new parameters in the UpdateInferenceComponent API such as MaximumBatchSize, MaximumExecutionTimeoutInSeconds, RollbackMaximumBatchSize, and WaitIntervalInSeconds, which let users flexibly configure their deployment strategy.

Deploying models efficiently, reliably, and cost-effectively is a critical challenge for organizations of all sizes. As organizations increasingly deploy foundation models (FMs) and other machine learning (ML) models to production, they face challenges related to resource utilization, cost-efficiency, and maintaining high availability during updates. Amazon SageMaker AI introduced inference component functionality that can help organizations reduce model deployment costs by optimizing resource utilization through intelligent model packing and scaling. Inference components abstract ML models and enable assigning dedicated resources and specific scaling policies per model.
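As a quick illustration of the inference component model, the following sketch shows how a model might be registered as an inference component with its own dedicated accelerator count and copy count. The endpoint, variant, and model names here are hypothetical placeholders, not values from this post:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Hypothetical names for illustration; replace with your own resources
sagemaker_client.create_inference_component(
    InferenceComponentName="llama-8b-ic",
    EndpointName="my-ic-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "llama-8b-model",  # a model already created in SageMaker AI
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,  # dedicated GPUs per copy
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 3},  # number of copies to run across the endpoint
)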

However, updating these models—especially in production environments with strict latency SLAs—has historically risked downtime or resource bottlenecks. Traditional blue/green deployments often struggle with capacity constraints, making updates unpredictable for GPU-heavy models. To address this, we’re excited to announce another powerful enhancement to SageMaker AI: rolling updates for inference component endpoints, a feature designed to streamline updates for models of different sizes while minimizing operational overhead.

In this post, we discuss the challenges faced by organizations when updating models in production. Then we deep dive into the new rolling update feature for inference components and provide practical examples using DeepSeek distilled models to demonstrate this feature. Finally, we explore how to set up rolling updates in different scenarios.

Challenges with blue/green deployment

Traditionally, SageMaker AI inference has supported the blue/green deployment pattern for updating inference components in production. Though effective for many scenarios, this approach comes with specific challenges:

- Resource inefficiency – Both the current (blue) and new (green) environments must run simultaneously, temporarily doubling the compute footprint.
- Capacity constraints – For premium GPU instances, acquiring enough additional capacity for a full parallel fleet can be difficult or impossible.
- Transition risk – Traffic shifts to the new environment all at once, so any issue with the new model version affects the entire workload.

Although blue/green deployment has been a reliable strategy for zero-downtime updates, its limitations become glaring when deploying large-scale large language models (LLMs) or high-throughput models on premium GPU instances. These challenges demand a more nuanced approach—one that incrementally validates updates while optimizing resource usage. Rolling updates for inference components are designed to eliminate the rigidity of blue/green deployments. By updating models in controlled batches, dynamically scaling infrastructure, and integrating real-time safety checks, this strategy makes sure deployments remain cost-effective, reliable, and adaptable—even for GPU-heavy workloads.

Rolling deployment for inference component updates

As mentioned earlier, inference components were introduced as a SageMaker AI feature to optimize costs: they allow you to define and deploy the specific resources needed for your model inference workload. By right-sizing compute resources to match your model's requirements, you can save costs during updates compared to traditional deployment approaches.

With rolling updates, SageMaker AI deploys new model versions in configurable batches of inference components while dynamically scaling instances. This is particularly impactful for LLMs:

- Smaller models can use larger batches for rapid updates, whereas larger models can use smaller batches to limit GPU contention.
- New capacity is provisioned only when needed and temporary instances are released as batches complete, avoiding the full duplication required by blue/green deployment.
- CloudWatch alarm integration provides real-time safety checks, triggering an automatic rollback if the new version misbehaves.

The new functionality is implemented through extensions to the SageMaker AI API, primarily with new parameters in the UpdateInferenceComponent API:

sagemaker_client.update_inference_component(
    InferenceComponentName=inference_component_name,
    RuntimeConfig={"CopyCount": number},
    Specification={ ... },
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {  # Value must be between 5% and 50% of the IC's total copy count
                "Type": "COPY_COUNT",  # COPY_COUNT | CAPACITY_PERCENT
                "Value": 1  # Minimum value of 1
            },
            "MaximumExecutionTimeoutInSeconds": 600,  # Minimum value of 600. Maximum value of 28800.
            "RollbackMaximumBatchSize": {
                "Type": "COPY_COUNT",  # COPY_COUNT | CAPACITY_PERCENT
                "Value": 1
            },
            "WaitIntervalInSeconds": 120  # Minimum value of 0. Maximum value of 3600.
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {
                    "AlarmName": "string"  # Optional
                }
            ]
        }
    },
)

The preceding code uses the following parameters:

- MaximumBatchSize – The number of inference component copies (COPY_COUNT) or percentage of capacity (CAPACITY_PERCENT) to update in each rolling step. The value must be between 5% and 50% of the inference component's total copy count.
- MaximumExecutionTimeoutInSeconds – The maximum amount of time the update is allowed to run (600-28,800 seconds).
- RollbackMaximumBatchSize – The batch size used when rolling back to the old inference component version.
- WaitIntervalInSeconds – The time SageMaker AI waits between batches (0-3,600 seconds).
- AutoRollbackConfiguration – The CloudWatch alarms that, when they move into the alarm state, trigger an automatic rollback.

For more information about the SageMaker AI API, refer to the SageMaker AI API Reference.

Customer experience

Let’s explore how rolling updates work in practice with several common scenarios, using different-sized LLMs. You can find the example notebook in the GitHub repo.

Scenario 1: Multiple single GPU cluster

In this scenario, assume you’re running an endpoint with three ml.g5.2xlarge instances, each with a single GPU. The endpoint hosts an inference component that requires one GPU accelerator, which means each instance holds one copy. When you want to update the inference component to use a new inference component version, you can use rolling updates to minimize disruption.

You can configure a rolling update with a batch size of one, meaning SageMaker AI will update one copy at a time. During the update process, SageMaker AI first identifies available capacity in the existing instances. Because none of the existing instances has room for an additional temporary workload, SageMaker AI launches new ml.g5.2xlarge instances one at a time, deploying one copy of the new inference component version to each new GPU instance. After the specified wait interval, and once the new inference component's container passes its health checks, SageMaker AI removes one copy of the old version (because each copy is hosted on its own instance, that instance is torn down as well), completing the update for the first batch.

This process repeats for each remaining copy of the inference component, providing a smooth transition with zero downtime. The gradual nature of the update minimizes risk and allows you to maintain consistent availability throughout the deployment process. The following diagram shows this process.
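To watch the rolling update progress batch by batch, you can poll the inference component status while the update runs. The following is a minimal sketch, assuming sagemaker_client and inference_component_name are defined as in the earlier examples:

import time

# Poll until the rolling update finishes (InService) or fails
while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    status = desc["InferenceComponentStatus"]
    print(f"Inference component status: {status}")
    if status in ("InService", "Failed"):
        break
    time.sleep(30)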

Scenario 2: Update with automatic rollback

In another scenario, you might be updating your inference component from Llama-3.1-8B-Instruct to DeepSeek-R1-Distill-Llama-8B, but the new model version has different API expectations. In this use case, you have configured a CloudWatch alarm to monitor for 4xx errors, which would indicate API compatibility issues.

You can initiate a rolling update with a batch size of one copy. SageMaker AI deploys the first copy of the new version on a new GPU instance. When the new instance is ready to serve traffic, SageMaker AI will forward a proportion of the invocation requests to this new model. However, in this example, the new model version, which is missing the “MESSAGES_API_ENABLED” environment variable configuration, will begin to return 4xx errors when receiving requests in the Messages API format.

The configured CloudWatch alarm detects these errors and transitions to the alarm state. SageMaker AI automatically detects this alarm state and initiates a rollback process according to the rollback configuration. Following the specified rollback batch size, SageMaker AI removes the problematic new model version and maintains the original working version, preventing widespread service disruption. The endpoint returns to its original state with traffic being handled by the properly functioning original model version.
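In this example, a corrected update would include the missing environment variable in the new inference component's container specification. The following sketch only illustrates the idea; the image URI, artifact location, and resource sizes are hypothetical placeholders:

new_specification = {
    "Container": {
        "Image": "<your LMI container image URI>",
        "ArtifactUrl": "<S3 location of the DeepSeek-R1-Distill-Llama-8B artifacts>",
        "Environment": {
            "MESSAGES_API_ENABLED": "true",  # accept requests in the Messages API format
        },
    },
    "ComputeResourceRequirements": {
        "NumberOfAcceleratorDevicesRequired": 1,
        "MinMemoryRequiredInMb": 1024,
    },
}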

The following code snippet shows how to set up a CloudWatch alarm to monitor 4xx errors:

# Create alarm
cloudwatch.put_metric_alarm(
    AlarmName=f'SageMaker-{endpoint_name}-4xx-errors',
    ComparisonOperator='GreaterThanThreshold',
    EvaluationPeriods=1,
    MetricName='Invocation4XXErrors',
    Namespace='AWS/SageMaker',
    Period=300,
    Statistic='Sum',
    Threshold=5.0,
    ActionsEnabled=True,
    AlarmDescription='Alarm when greater than 5 4xx errors',
    Dimensions=[
        {
            'Name': 'InferenceComponentName',
            'Value': inference_component_name
        },
    ],
)

Then you can use this CloudWatch alarm in the update request:

sagemaker_client.update_inference_component(
    InferenceComponentName=inference_component_name,
    ... ...
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            },
            "WaitIntervalInSeconds": 120,
            "RollbackMaximumBatchSize": {
                "Type": "COPY_COUNT",
                "Value": 1
            }
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {"AlarmName": f'SageMaker-{endpoint_name}-4xx-errors'}
            ]
        }
    }
)

Scenario 3: Update with sufficient capacity in the existing instances

If an existing endpoint has multiple GPU accelerators and not all of them are in use, the update can use the free accelerators without launching new instances for the endpoint. Consider an endpoint configured with two ml.g5.12xlarge instances, each with four GPU accelerators. The endpoint hosts two inference components: IC-1 requires one accelerator per copy, and IC-2 also requires one accelerator per copy. Four copies of IC-1 have been placed on one ml.g5.12xlarge instance, and two copies of IC-2 on the other, leaving two GPU accelerators free on the second instance.

When you initiate an update for IC-1 with a batch size of two copies, SageMaker AI determines that there is sufficient capacity in the existing instances to host the new versions while maintaining the old ones. It creates two copies of the new IC-1 version on the second instance. When those containers are up and running, SageMaker AI starts routing traffic to the new IC-1 copies and removes two of the old IC-1 copies from the first instance. You are not charged until the new inference components start taking invocations and generating responses.

This frees another two GPU slots. SageMaker AI then updates the second batch using the accelerators that just became available. After the process is complete, the endpoint has four copies of IC-1 running the new version and two unchanged copies of IC-2.

Scenario 4: Update requiring additional instance capacity

Consider an endpoint initially configured with one ml.g5.12xlarge instance (4 GPUs total) and managed instance scaling (MIS) enabled with a maximum instance count of two. The endpoint hosts two inference components: IC-1 (Llama 8B), requiring 1 GPU with two copies, and IC-2 (DeepSeek Distilled Llama 14B model), also requiring 1 GPU with two copies, utilizing all 4 available GPUs.
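Managed instance scaling is configured on the endpoint's production variant. The following is a minimal sketch of what such an endpoint configuration might look like; the config name, role ARN, and routing strategy are hypothetical placeholders, not values prescribed by this post:

sagemaker_client.create_endpoint_config(
    EndpointConfigName="ic-endpoint-config",  # hypothetical name
    ExecutionRoleArn="<your SageMaker execution role ARN>",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 2,  # lets SageMaker AI add a second instance during updates
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)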

When you initiate an update for IC-1 with a batch size of two copies, SageMaker AI determines that there’s insufficient capacity in the existing instances to host the new versions while maintaining the old ones. Instead of failing the update, because you have configured MIS, SageMaker AI will automatically provision a second ml.g5.12xlarge instance to host the new inference components.

During the update process, SageMaker AI deploys two copies of the new IC-1 version onto the newly provisioned instance, as shown in the following diagram. After the new inference components are up and running, SageMaker AI begins removing the old IC-1 copies from the original instances. By the end of the update, the first instance will host IC-2 utilizing 2 GPUs, and the newly provisioned second instance will host the updated IC-1 with two copies using 2 GPUs. There will be new spaces available in the two instances, and you can deploy more inference component copies or new models to the same endpoint using the available GPU resources. If you set up managed instance auto scaling and set inference component auto scaling to zero, you can scale down the inference component copies to zero, which will result in the corresponding instance being scaled down. When the inference component is scaled up, SageMaker AI will launch the inference components in the existing instance with the available GPU accelerators, as mentioned in scenario 3.

Scenario 5: Update facing insufficient capacity

In scenarios where there isn’t enough GPU capacity, SageMaker AI provides clear feedback about capacity constraints. Consider an endpoint running on 30 ml.g6e.16xlarge instances, each already fully utilized by inference components. You want to update an existing inference component using a rolling deployment with a batch size of 4, but after the first four batches are updated, there isn’t enough GPU capacity available to continue. In this case, SageMaker AI will automatically roll back to the previous setup and stop the update process.

This rollback can end in one of two final statuses. In the first case, the rollback succeeds because capacity is available to launch instances for the old model version. In the second, the capacity issue persists during the rollback and the endpoint shows as UPDATE_ROLLBACK_FAILED. The existing instances can still serve traffic, but to move the endpoint out of the failed status, you need to contact your AWS support team.
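You can confirm which of the two outcomes occurred by checking the endpoint status after the update attempt. A minimal sketch, assuming sagemaker_client and endpoint_name are defined as in the earlier examples:

response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
# Expect "InService" after a successful rollback, or the rollback-failed status
# described above if the capacity issue persisted
print(response["EndpointStatus"])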

Additional considerations

As mentioned earlier, when using blue/green deployment to update the inference components on an endpoint, you need to provision resources for both the current (blue) and new (green) environments simultaneously. When you use rolling updates for inference components on the endpoint, you can use the following equation to calculate the account-level service quota required for the instance type. Suppose each GPU instance used by the endpoint has X GPU accelerators, each inference component copy requires Y GPU accelerators, the maximum batch size is Z, and the current endpoint has N instances. The account-level service quota for this instance type must then be greater than or equal to:

ROUNDUP(Z x Y / X) + N

For example, let’s assume the current endpoint has 8 (N) ml.g5.12xlarge instances, each of which has 4 (X) GPU accelerators. You set the maximum batch size to 2 (Z) copies, and each copy needs 1 (Y) GPU accelerator. The minimum AWS service quota value for ml.g5.12xlarge is ROUNDUP(2 x 1 / 4) + 8 = 9. In another scenario, where each inference component copy requires 4 GPU accelerators, the required account-level service quota for the same instance type is ROUNDUP(2 x 4 / 4) + 8 = 10.
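The same calculation can be expressed as a small helper, which is handy when planning quota requests for several endpoints. This is a direct sketch of the formula above:

import math

def required_instance_quota(gpus_per_instance, gpus_per_copy, max_batch_size, current_instances):
    """Minimum account-level instance quota needed for a rolling update."""
    return math.ceil(max_batch_size * gpus_per_copy / gpus_per_instance) + current_instances

# Examples from the text
print(required_instance_quota(4, 1, 2, 8))  # 9
print(required_instance_quota(4, 4, 2, 8))  # 10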

Conclusion

Rolling updates for inference components represent a significant enhancement to the deployment capabilities of SageMaker AI. This feature directly addresses the challenges of updating model deployments in production, particularly for GPU-heavy workloads, and it eliminates capacity guesswork and reduces rollback risk. By combining batch-based updates with automated safeguards, SageMaker AI makes sure deployments are agile and resilient.

Key benefits include:

- Minimized disruption – Models are updated in controlled batches, so most copies keep serving traffic throughout the update.
- Built-in safety – CloudWatch alarm integration triggers automatic rollbacks when the new version misbehaves.
- Better resource utilization – Capacity is added only as needed instead of duplicating the entire fleet, reducing the cost and quota overhead of updates.
- Flexibility – Batch sizes, wait intervals, and timeouts can be tuned to models of different sizes and traffic patterns.

Whether you’re deploying compact models or larger multi-accelerator models, rolling updates provide a more efficient, cost-effective, and safer path to keeping your ML models current in production.

We encourage you to try this new capability with your SageMaker AI endpoints and discover how it can enhance your ML operations. For more information, check out the SageMaker AI documentation or connect with your AWS account team.


About the authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Andrew Smith is a Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.

Dustin Liu is a solutions architect at AWS, focused on supporting financial services and insurance (FSI) startups and SaaS companies. He has a diverse background spanning data engineering, data science, and machine learning, and he is passionate about leveraging AI/ML to drive innovation and business transformation.

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Shikher Mishra is a Software Development Engineer with the SageMaker Inference team with over 9 years of industry experience. He is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. In his spare time, Shikher enjoys outdoor sports, hiking, and traveling.

June Won is a product manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.
