AWS Machine Learning Blog, December 3, 2024
Unlock cost savings with the new scale down to zero feature in SageMaker Inference

 

At re:Invent 2024, AWS announced a new capability for Amazon SageMaker inference endpoints: the ability to scale down to zero instances. This feature lets SageMaker inference endpoints scale in to zero instances when idle, optimizing resource utilization and reducing costs. It applies to endpoints that use SageMaker inference components; you can configure an endpoint based on your actual needs and traffic patterns so that it automatically scales down to zero instances during periods of low or no traffic, keeping costs under control. The article also covers best practices, such as optimizing model download time, shortening model server startup time, using faster auto scaling metrics, and handling failed requests, to ensure that service can resume quickly after scaling to zero.

🚀 **SageMaker inference endpoints can scale to zero instances:** This feature allows SageMaker inference endpoints to scale in to zero instances when idle, effectively reducing costs and optimizing resource utilization. It applies specifically to endpoints that use SageMaker inference components.

📅 **Applicable to three key scenarios:** Predictable traffic patterns, sporadic or variable traffic, and development and testing environments. By deciding whether to enable this feature for your specific use case, you can maximize cost savings while preserving application performance and availability.

⏱️ **Optimizing scale-up time:** The article recommends using uncompressed model formats, optimizing model server startup time, using finer-grained auto scaling metrics, and using message queues or client-side retries to reduce scale-up time and protect the user experience.

🔄 **Implementation steps:** Enable managed instance scaling, set the SageMaker endpoint's MinInstanceCount parameter to 0, and configure auto scaling policies so the endpoint can automatically scale to zero instances.

💡 **Inference components required:** This feature is only supported when using inference components, so you need to be familiar with how inference components work.

Today at AWS re:Invent 2024, we are excited to announce a new feature for Amazon SageMaker inference endpoints: the ability to scale SageMaker inference endpoints to zero instances. This long-awaited capability is a game changer for our customers using the power of AI and machine learning (ML) inference in the cloud. Previously, SageMaker inference endpoints maintained a minimum number of instances to provide continuous availability, even during periods of low or no traffic. With this update, available when using SageMaker inference components, you have more options to align your resource usage with your specific needs and traffic patterns.

Refer to the accompanying notebooks to get started with the new scale down to zero feature.

The new feature expands the possibilities for managing SageMaker inference endpoints. It allows you to configure the endpoints so they can scale to zero instances during periods of inactivity, providing an additional tool for resource management. With this feature, you can closely match your compute resource usage to your actual needs, potentially reducing costs during times of low demand. This enhancement builds upon the existing auto scaling capabilities in SageMaker, offering more granular control over resource allocation. You can now configure your scaling policies to include scaling to zero, allowing for more precise management of your AI inference infrastructure.

The scale down to zero feature presents new opportunities for how businesses can approach their cloud-based ML operations. It provides additional options for managing resources across various scenarios, from development and testing environments to production deployments with variable traffic patterns. As with any new feature, you are encouraged to carefully evaluate how it fits into your overall architecture and operational needs, considering factors such as response times and the specific requirements of your applications.

In this post, we explore the new scale to zero feature for SageMaker inference endpoints, demonstrating how to implement and use this capability to optimize costs and manage resources more effectively. We cover the key scenarios where scaling to zero is beneficial, provide best practices for optimizing scale-up time, and walk through the step-by-step process of implementing this functionality. Additionally, we discuss how to set up scheduled scaling actions for predictable traffic patterns and test the behavior of your scaled-to-zero endpoints.

Determining when to scale to zero

Before we dive into the implementation details of the new scale to zero feature, it’s crucial to understand when and why you should consider using it. Although the ability to scale SageMaker inference endpoints to zero instances offers significant cost-saving potential, not all scenarios benefit equally, and in some cases it may even impact the performance of your applications. Let’s explore how to identify the scenarios where this feature provides the most value.

The ability to scale SageMaker inference endpoints to zero instances is particularly beneficial in three key scenarios: predictable traffic patterns, sporadic or variable traffic, and development and testing environments.

By carefully evaluating your specific use case against these scenarios, you can make informed decisions about implementing scale to zero functionality. This approach makes sure you maximize cost savings without compromising on the performance and availability requirements of your ML applications. It’s important to note that although scaling to zero can provide significant benefits, it also introduces a trade-off in terms of initial response time when scaling back up. Therefore, it’s crucial to assess whether your application can tolerate this potential delay and to implement appropriate strategies to manage it. In the following sections, we dive deeper into each scenario and provide guidance on how to determine if scaling to zero is the right choice for your specific needs. We also discuss best practices for implementation and strategies to mitigate potential drawbacks.

Scale down to zero is only supported when using inference components. For more information on inference components, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

Now that we understand when to use the scale to zero feature, let’s dive into how to optimize its performance and implement it effectively. Scaling up from zero instances to serving traffic introduces a brief delay (cold start), which can impact your application’s responsiveness. To mitigate this, we first explore best practices for minimizing scale-up time. Then we walk through the step-by-step process of implementing the scale to zero functionality for your SageMaker inference endpoints.

Best practices for optimizing scale-up time

When using the scale to zero feature, it’s crucial to minimize the time it takes for your endpoint to scale up and begin serving requests. The following are several best practices you can implement to decrease the scale-out time for your SageMaker inference endpoints:

    Use uncompressed model artifacts to reduce model download time.
    Optimize your model server’s startup time.
    Use higher-resolution (faster) auto scaling metrics so scale-out is triggered sooner.
    Use a message queue or client-side retries to handle requests that arrive before capacity is available.

By implementing these best practices, you can help make sure your SageMaker inference endpoints can scale out quickly and efficiently to meet changes in traffic, providing a responsive and reliable experience for your end-users.

Solution overview

With these best practices in mind, let’s now walk through the process of enabling your SageMaker inference endpoints to scale down to zero instances. This process involves a few key steps that are crucial for optimizing your endpoint’s performance and cost-efficiency: enabling managed instance scaling, setting the endpoint’s MinInstanceCount parameter to 0, and configuring the auto scaling policies that allow the endpoint to scale to zero.

By implementing these scaling policies, you create a flexible and cost-effective infrastructure that can automatically adjust to your workload demands and scale to zero when needed.

Now let’s see how to use this feature step by step.

Set up your endpoint

The first crucial step in enabling your SageMaker endpoint to scale to zero is properly configuring the endpoint and its associated components. This process involves three main steps:

    Create the endpoint configuration and set MinInstanceCount to 0. This allows the endpoint to scale down all the way to zero instances when not in use.
    sagemaker_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ExecutionRoleArn=role,
        ProductionVariants=[
            {
                "VariantName": variant_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
                "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
                "ManagedInstanceScaling": {
                    "Status": "ENABLED",
                    "MinInstanceCount": 0,
                    "MaxInstanceCount": max_instance_count,
                },
                "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
            }
        ],
    )
    Create the SageMaker endpoint:
    sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
    Create the inference component for your endpoint:
    sagemaker_client.create_inference_component(
        InferenceComponentName=inference_component_name,
        EndpointName=endpoint_name,
        VariantName=variant_name,
        Specification={
            "ModelName": model_name,
            "StartupParameters": {
                "ModelDataDownloadTimeoutInSeconds": 3600,
                "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            },
            "ComputeResourceRequirements": {
                "MinMemoryRequiredInMb": 1024,
                "NumberOfAcceleratorDevicesRequired": 1,
            },
        },
        RuntimeConfig={
            "CopyCount": 1,
        },
    )

Add scaling policies

After the endpoint is deployed and InService, you can add the necessary scaling policies:

Scaling policy for inference components model copies

After you create your SageMaker endpoint and inference components, you register a new auto scaling target for Application Auto Scaling. In the following code block, you set MinCapacity to 0, which is required for your endpoint to scale down to zero:

# Register scalable target
resource_id = f"inference-component/{inference_component_name}"
service_namespace = "sagemaker"
scalable_dimension = "sagemaker:inference-component:DesiredCopyCount"

aas_client.register_scalable_target(
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    MinCapacity=0,
    MaxCapacity=max_copy_count,  # Replace with your desired maximum number of model copies
)

After you have registered your new scalable target, the next step is to define your target tracking policy. In the following code example, we set the TargetValue to 5. This setting instructs the auto scaling system to increase capacity when the number of concurrent requests per model reaches or exceeds 5.

# Create target tracking scaling policy
aas_client.put_scaling_policy(
    PolicyName="inference-component-target-tracking-scaling-policy",
    PolicyType="TargetTrackingScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
        },
        "TargetValue": 5,  # Adjust this value based on your use case
        "ScaleInCooldown": 300,  # default
        "ScaleOutCooldown": 300,  # default
    },
)

Application Auto Scaling creates two CloudWatch alarms per scaling target. The first triggers scale-out actions after 1 minute (using one 1-minute data point), and the second triggers scale-in after 15 minutes (using 90 10-second data points). The actual time to trigger a scaling action is usually 1–2 minutes longer than these periods, because the endpoint takes time to publish metrics to CloudWatch, and Application Auto Scaling takes time to react.

Scale out from zero model copies policy

To enable your endpoint to scale out from zero instances, complete the following steps:

    Create a step scaling policy that defines when and how to scale out from zero. This policy adds one model copy when triggered, enabling SageMaker to provision the instances required to handle incoming requests after being idle. The following code shows how to define a step scaling policy; here we have configured it to scale from zero to one model copy ("ScalingAdjustment": 1). You can adjust ScalingAdjustment as required for your use case.
    aas_client.put_scaling_policy(
        PolicyName="inference-component-step-scaling-policy",
        PolicyType="StepScaling",
        ServiceNamespace=service_namespace,
        ResourceId=resource_id,
        ScalableDimension=scalable_dimension,
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Maximum",
            "Cooldown": 60,
            "StepAdjustments": [
                {
                    "MetricIntervalLowerBound": 0,
                    "ScalingAdjustment": 1,  # Adjust this value based on your use case
                }
            ],
        },
    )
    Create a CloudWatch alarm with the metric NoCapacityInvocationFailures.

When triggered, the alarm initiates the previously defined scaling policy. For more information about the NoCapacityInvocationFailures metric, see the SageMaker documentation.

We have also set the following alarm parameters:

    A metric period of 30 seconds
    1 evaluation period and 1 datapoint to alarm
    A threshold of 1, with missing data treated as missing

This results in waiting approximately 1 minute for the step scaling policy to trigger after our endpoint receives a single request.

cw_client.put_metric_alarm(
    AlarmName="ic-step-scaling-policy-alarm",
    AlarmActions=[step_scaling_policy_arn],  # Replace with your actual policy ARN
    MetricName="NoCapacityInvocationFailures",
    Namespace="AWS/SageMaker",
    Statistic="Maximum",
    Dimensions=[
        {
            "Name": "InferenceComponentName",
            "Value": inference_component_name,  # Replace with your actual InferenceComponentName
        }
    ],
    Period=30,
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="missing",
)

For the alarm action, provide the Amazon Resource Name (ARN) of the step scaling policy you created in the previous step.

Notice the "MinInstanceCount": 0 setting in the endpoint configuration, which allows the endpoint to scale down to zero instances. With the scaling policy, CloudWatch alarm, and minimum instances set to zero, your SageMaker inference endpoint will now be able to automatically scale down to zero instances when not in use.

Test the solution

When our SageMaker endpoint doesn’t receive requests for 15 minutes, it automatically scales the number of model copies down to zero:

import sys
import time

time.sleep(500)
while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

desc = sagemaker_client.describe_inference_component(
    InferenceComponentName=inference_component_name
)
print(desc)

After 10 additional minutes of inactivity, SageMaker automatically stops all underlying instances of the endpoint, eliminating all associated instance costs.

If we try to invoke our endpoint while instances are scaled down to zero, we get a validation error:

    sagemaker_client.invoke_endpoint(
        EndpointName=endpoint_name,
        InferenceComponentName=inference_component_name,
        Body=json.dumps(
            {
                "inputs": "The diamondback terrapin was the first reptile to be",
                "parameters": {
                    "do_sample": True,
                    "max_new_tokens": 256,
                    "min_new_tokens": 256,
                    "temperature": 0.3,
                    "watermark": True,
                },
            }
        ),
        ContentType="application/json",
    )["Body"].read().decode("utf8")

    An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.

However, after 1 minute, our step scaling policy should start. SageMaker will then start provisioning a new instance and deploy our inference component model copy to handle requests.

Schedule scaling down to zero

In some scenarios, you might observe consistent weekly traffic patterns: a steady workload Monday through Friday, and no traffic on weekends. You can optimize costs and performance by configuring scheduled actions that align with these patterns.

You can scale your endpoint to zero in two ways. The first method is to set the number of model copies to zero in your inference component using the UpdateInferenceComponentRuntimeConfig API. This approach maintains your endpoint configuration while eliminating compute costs during periods of inactivity.

sagemaker_client.update_inference_component_runtime_config(
    InferenceComponentName=inference_component_name,
    DesiredRuntimeConfig={
        "CopyCount": 0
    },
)

Amazon EventBridge Scheduler can automate SageMaker API calls using cron/rate expressions for recurring schedules or one-time invocations. To function, EventBridge Scheduler requires an execution role with appropriate permissions to invoke the target API operations on your behalf. For more information about how to create this role, see Set up the execution role. The specific permissions needed depend on the target API being called.

The following code creates two scheduled actions for the inference component during 2024–2025. The first schedule scales in the CopyCount to zero every Friday at 18:00 UTC+1, and the second schedule restores model capacity every Monday at 07:00 UTC+1. The schedule will start on November 29, 2024, end on December 31, 2025, and be deleted after completion.

import json

import boto3

scheduler = boto3.client("scheduler")

flex_window = {"Mode": "OFF"}

# Specify the SageMaker target API for the scale-in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({
        "DesiredRuntimeConfig": {"CopyCount": 0},
        "InferenceComponentName": inference_component_name,
    }),
}

# Scale in our endpoint to 0 every Friday at 18:00 UTC+1, starting on November 29, 2024
scheduler.create_schedule(
    Name="scale-to-zero-schedule",
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1",  # Set the correct time zone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59",
)

# Specify the SageMaker target API for the scale-out schedule
scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({
        "DesiredRuntimeConfig": {"CopyCount": 2},
        "InferenceComponentName": inference_component_name,
    }),
}

# Scale out our endpoint every Monday at 07:00 UTC+1
scheduler.create_schedule(
    Name="scale-out-schedule",
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1",  # Set the correct time zone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59",
)

The second method is to delete the inference components by calling the DeleteInferenceComponent API. This approach achieves the same cost-saving benefit while completely removing the components from your configuration. The following code creates a scheduled action that automatically deletes the inference component every Friday at 18:00 UTC+1 during 2024–2025, along with a complementary scheduled action that recreates the inference component every Monday at 07:00 UTC+1.

import json

import boto3

scheduler = boto3.client("scheduler")

flex_window = {"Mode": "OFF"}

# Specify the SageMaker target API for the scale-in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:deleteInferenceComponent",
    "Input": json.dumps({"InferenceComponentName": inference_component_name}),
}

# Scale in our endpoint by deleting the inference component every Friday at 18:00 UTC+1
scheduler.create_schedule(
    Name="scale-to-zero-schedule",
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1",  # Set the correct time zone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59",
)

# Specify the SageMaker target API for the scale-out schedule
input_config = {
    "EndpointName": endpoint_name,
    "InferenceComponentName": inference_component_name,
    "RuntimeConfig": {"CopyCount": 2},
    "Specification": {
        "ModelName": model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": 1024,
            "NumberOfAcceleratorDevicesRequired": 1,
        },
    },
    "VariantName": variant_name,
}
scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:createInferenceComponent",
    "Input": json.dumps(input_config),
}

# Scale out our endpoint by recreating the inference component every Monday at 07:00 UTC+1
scheduler.create_schedule(
    Name="scale-out-schedule",
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1",  # Set the correct time zone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59",
)

To scale to zero on an endpoint with multiple inference components, all components must be either set to 0 or deleted. You can also automate this process by using EventBridge Scheduler to trigger an AWS Lambda function that handles either deletion or zero-setting of all inference components.

Performance evaluation

We evaluated the performance implications of the scale to zero feature by conducting tests with a Llama3-8B Instruct model. These tests used container caching and optimized model loading techniques, and were performed with both target tracking and step scaling policies in place. Our findings for Llama3-8B Instruct show that with the target tracking policy, SageMaker scales the endpoint to zero model copies in approximately 15 minutes, then takes an additional 10 minutes to fully scale down the underlying instances, for a total scale-in time of 25 minutes. Conversely, when scaling back up from zero, the step scaling policy triggers in around 1 minute, provisioning the instance(s) takes approximately 1.748 minutes, and instantiating the model copies takes approximately 2.28 minutes, resulting in a total scale-out time of around 5.028 minutes.

The performance tests on LLaMa3.1 models (8B and 70B variants) demonstrate SageMaker’s Scale to Zero feature’s effectiveness, with intentionally conservative scaling times to prevent endpoint thrashing and accommodate spiky traffic patterns. For both model sizes, scaling in takes a total of 25 minutes, allowing a 15-minute buffer before initiating scale-down and an additional 10 minutes to fully decommission instances. This cautious approach helps avoid premature scaling during temporary lulls in traffic. When scaling out, the 8B model takes about 5 minutes, while the 70B model needs approximately 6 minutes. These times include a 1-minute trigger delay, followed by instance provisioning and model copy instantiation. The slightly longer scale-out times, especially for larger models, provide a balance between responsiveness and stability, ensuring the system can handle sudden traffic increases without constantly scaling up and down. This measured approach to scaling helps maintain consistent performance and cost-efficiency in environments with variable workloads.

LLaMa3.1 8B Instruct

Scale in:

| Time to trigger target tracking (min) | Time to scale in instance count to zero (min) | Total time (min) |
|---|---|---|
| 15 | 10 | 25 |

Scale out:

| Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
|---|---|---|---|
| 1 | 1.748 | 2.28 | 5.028 |

LLaMa3.1 70B

Scale in:

| Time to trigger target tracking (min) | Time to scale in instance count to zero (min) | Total time (min) |
|---|---|---|
| 15 | 10 | 25 |

Scale out:

| Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
|---|---|---|---|
| 1 | 3.018 | 1.984 | 6.002 |

Scale-out trials

LLaMa3.1 8B Instruct

| Trial | Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
|---|---|---|---|---|
| 1 | 1 | 1.96 | 3.1 | 6.06 |
| 2 | 1 | 1.75 | 2.6 | 5.35 |
| 3 | 1 | 1.4 | 2.1 | 4.5 |
| 4 | 1 | 1.96 | 1.9 | 4.86 |
| 5 | 1 | 1.67 | 1.7 | 4.37 |
| Average | 1 | 1.748 | 2.28 | 5.028 |

LLaMa3.1 70B

| Trial | Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
|---|---|---|---|---|
| 1 | 1 | 3.1 | 1.98 | 6.08 |
| 2 | 1 | 2.92 | 1.98 | 5.9 |
| 3 | 1 | 2.82 | 1.98 | 5.8 |
| 4 | 1 | 3.27 | 2 | 6.27 |
| 5 | 1 | 2.98 | 1.98 | 5.96 |
| Average | 1 | 3.018 | 1.984 | 6.002 |

If you want more customization and faster scaling, consider using step scaling to scale model copies instead of target tracking.

Customer testimonials

The new Scale to Zero feature for SageMaker inference endpoints has sparked considerable interest among customers. We gathered initial reactions from companies who have previewed and evaluated this capability, highlighting its potential impact on AI and machine learning operations.

Atlassian, headquartered in Sydney, Australia, is a software company specializing in collaboration tools for software development and project management:

“The new Scale to Zero feature for SageMaker inference strongly aligns with our commitment to efficiency and innovation. We’re enthusiastic about its potential to revolutionize how we manage our machine learning inference resources, and we look forward to integrating it into our operations.”

– Guarav Awadhwal – Senior Engineering Manager at Atlassian

iFood is a Latin American online food delivery firm based in Brazil. It works with over 300,000 restaurants, connecting them with millions of customers every month.

“The Scale to Zero feature for SageMaker Endpoints will be fundamental for iFood’s Machine Learning Operations. Over the years, we’ve collaborated closely with the SageMaker team to enhance our inference capabilities. This feature represents a significant advancement, as it allows us to improve cost efficiency without compromising the performance and quality of our ML services, given that inference constitutes a substantial part of our infrastructure expenses.”

– Daniel Vieira, MLOps Engineering Manager at iFood

VIDA, headquartered in Jakarta, Indonesia, is a leading digital identity provider that enables individuals and businesses to conduct business in a safe and secure digital environment.

“SageMaker’s new Scale to Zero feature for GPU inference endpoints shows immense promise for deep fake detection operations. The potential to efficiently manage our face liveness and document verification inference models while optimizing infrastructure costs aligns perfectly with our goals. We’re excited to leverage this capability to enhance our identity verification solutions.”

– Keshav Sharma, ML Platform Architect at VIDA

APOIDEA Group is a leading AI-focused FinTech ISV company headquartered in Hong Kong. Leveraging cutting-edge generative AI and deep learning technologies, the company develops innovative AI FinTech solutions for multinational banks. APOIDEA’s products automate repetitive human analysis tasks, extracting valuable financial insights from extensive financial documents to accelerate AI-driven transformation across the industry.

“SageMaker’s Scale to Zero feature is a game changer for our AI financial analysis solution in operations. It delivers significant cost savings by scaling down endpoints during quiet periods, while maintaining the flexibility we need for batch inference and model testing. This capability is transforming how we manage our GenAI workloads and evaluate new models. We’re eager to harness its power to further optimize our deep learning and NLP model deployments.”

– Mickey Yip, VP of Product at APOIDEA Group

Fortiro, based in Melbourne, Australia, is a FinTech company specializing in automated document fraud detection and financial verification for trusted financial institutions.

“The new Scale-to-Zero capability in SageMaker is a game-changer for our MLOps and delivers great cost savings. Being able to easily scale inference endpoints and GPUs means we can take advantage of a fast, highly responsive environment, without incurring unnecessary costs. Our R&D teams constantly experiment with new AI-based document fraud detection methods, which involves a lot of testing and repeating. This capability empowers us to do this both faster and more efficiently.”

– Amir Vahid, Chief Technology Officer at Fortiro

These testimonials underscore the anticipation for SageMaker’s Scale to Zero feature. As organizations begin to implement this capability, we expect to see innovative applications that balance cost efficiency with performance in machine learning deployments.

Conclusion

In this post, we introduced the new scale to zero feature in SageMaker, an innovative capability that enables you to optimize costs by automatically scaling in your inference endpoints when they’re not in use. We guided you through the detailed process of implementing this feature, including configuring endpoints, setting up auto scaling policies, and managing inference components for both automatic and scheduled scaling scenarios.

This cost-saving functionality presents new possibilities for how you can approach your ML operations. With this feature, you can closely align your compute resource usage with actual needs, potentially reducing costs during periods of low demand. We encourage you to try this capability and start optimizing your SageMaker inference costs today.

To help you get started quickly, we’ve prepared comprehensive notebooks containing an end-to-end example of how to configure an endpoint to scale to zero.



About the authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Christian Kamwangala is an AI/ML and Generative AI Specialist Solutions Architect at AWS, based in Paris, France. He helps enterprise customers architect and implement cutting-edge AI solutions using AWS’s comprehensive suite of tools, with a focus on production-ready systems that follow industry best practices. In his spare time, Christian enjoys exploring nature and spending time with family and friends.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Raghu Ramesha is a Senior GenAI/ML Solutions Architect on the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in computer science from UT Dallas. In his free time, he enjoys traveling and photography.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Raj Vippagunta is a Principal Engineer on the Amazon SageMaker Machine Learning (ML) platform team in AWS. He uses his 18+ years of experience in large-scale distributed systems and his passion for machine learning to build practical service offerings in the AI and ML space. He has helped build various at-scale solutions for AWS and Amazon. In his spare time, he likes reading books, pursuing long-distance running, and exploring new places with his family.
