AWS Machine Learning Blog
Amazon SageMaker HyperPod launches model deployments to accelerate the generative AI model development lifecycle

 


Today, we’re excited to announce that Amazon SageMaker HyperPod now supports deploying foundation models (FMs) from Amazon SageMaker JumpStart, as well as custom or fine-tuned models from Amazon S3 or Amazon FSx. With this launch, you can train, fine-tune, and deploy models on the same HyperPod compute resources, maximizing resource utilization across the entire model lifecycle.

SageMaker HyperPod offers resilient, high-performance infrastructure optimized for large-scale model training and tuning. Since its launch in 2023, SageMaker HyperPod has been adopted by foundation model builders who are looking to lower costs, minimize downtime, and accelerate time to market. With Amazon EKS support in SageMaker HyperPod, you can orchestrate your HyperPod clusters with EKS. Customers like Perplexity, Hippocratic, Salesforce, and Articul8 use HyperPod to train their foundation models at scale. With the new deployment capabilities, customers can now leverage HyperPod clusters across the full generative AI development lifecycle, from model training and tuning to deployment and scaling.

Many customers use Kubernetes as part of their generative AI strategy, to take advantage of its flexibility, portability, and open source frameworks. You can orchestrate your HyperPod clusters with Amazon EKS support in SageMaker HyperPod, so you can continue working with familiar Kubernetes workflows while gaining access to high-performance infrastructure purpose-built for foundation models. Customers benefit from support for custom containers, compute resource sharing across teams, observability integrations, and fine-grained scaling controls. HyperPod extends the power of Kubernetes by streamlining infrastructure setup and allowing customers to focus more on delivering models, not managing backend complexity.

New Features: Accelerating Foundation Model Deployment with SageMaker HyperPod

Customers prefer Kubernetes for flexibility, granular control over infrastructure, and robust support for open source frameworks. However, running foundation model inference at scale on Kubernetes introduces several challenges. Organizations must securely download models, identify the right containers and frameworks for optimal performance, configure deployments correctly, select appropriate GPU types, provision load balancers, implement observability, and add auto-scaling policies to meet demand spikes. To address these challenges, we’ve launched SageMaker HyperPod capabilities to support the deployment, management, and scaling of generative AI models:

- One-click foundation model deployment from SageMaker JumpStart: You can now deploy over 400 open-weights foundation models from SageMaker JumpStart on HyperPod with just a click, including the latest state-of-the-art models like DeepSeek-R1, Mistral, and Llama4. SageMaker JumpStart models will be deployed on HyperPod clusters orchestrated by EKS and will be made available as SageMaker endpoints or Application Load Balancers (ALB).
- Deploy fine-tuned models from S3 or FSx for Lustre: You can seamlessly deploy your custom models from S3 or FSx. You can also deploy models from Jupyter notebooks with provided code samples.
- Flexible deployment options for different user personas: We're providing multiple ways to deploy models on HyperPod to support teams that have different preferences and expertise levels. Beyond the one-click experience available in the SageMaker JumpStart UI, you can also deploy models using native kubectl commands, the HyperPod CLI, or the SageMaker Python SDK, giving you the flexibility to work within your preferred environment.
- Dynamic scaling based on demand: HyperPod inference now supports automatic scaling of your deployments based on metrics from Amazon CloudWatch and Prometheus with KEDA. With automatic scaling, your models can handle traffic spikes efficiently while optimizing resource usage during periods of lower demand.
- Efficient resource management with HyperPod Task Governance: One of the key benefits of running inference on HyperPod is the ability to efficiently utilize accelerated compute resources by allocating capacity for both inference and training in the same cluster. You can use HyperPod Task Governance for efficient resource allocation, prioritization of inference tasks over lower-priority training tasks to maximize GPU utilization, and dynamic scaling of inference workloads in near real time.
- Integration with SageMaker endpoints: With this launch, you can deploy AI models to HyperPod and register them with SageMaker endpoints. This allows you to use similar invocation patterns as SageMaker endpoints, along with integration with other open-source frameworks.
- Comprehensive observability: We've added the capability to get observability into the inference workloads hosted on HyperPod, including built-in capabilities to scrape metrics and export them to your observability platform. This capability provides visibility into both:
  - Platform-level metrics such as GPU utilization, memory usage, and node health
  - Inference-specific metrics like time to first token, request latency, throughput, and model invocations

“With Amazon SageMaker HyperPod, we built and deployed the foundation models behind our agentic AI platform using the same high-performance compute. This seamless transition from training to inference streamlined our workflow, reduced time to production, and ensured consistent performance in live environments. HyperPod helped us go from experimentation to real-world impact with greater speed and efficiency.”
–Laurent Sifre, Co-founder & CTO, H.AI

Deploying models on HyperPod clusters

In this launch, we are providing new operators that manage the complete lifecycle of your generative AI models in your HyperPod cluster. These operators will provide a simplified way to deploy and invoke your models in your cluster.

Prerequisites: an EKS-orchestrated SageMaker HyperPod cluster with the HyperPod inference operator installed and running. You can install the operator with the Helm chart from the sagemaker-hyperpod-cli repository, for example:

helm install hyperpod-inference-operator ./sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/charts/inference-operator \
    -n kube-system \
    --set region="${REGION}" \
    --set eksClusterName="${EKS_CLUSTER_NAME}" \
    --set hyperpodClusterArn="${HP_CLUSTER_ARN}" \
    --set executionRoleArn="${HYPERPOD_INFERENCE_ROLE_ARN}" \
    --set s3.serviceAccountRoleArn="${S3_CSI_ROLE_ARN}" \
    --set s3.node.serviceAccount.create=false \
    --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::${ACCOUNT_ID}:role/keda-operator-role" \
    --set tlsCertificateS3Bucket="${TLS_BUCKET_NAME}" \
    --set alb.region="${REGION}" \
    --set alb.clusterName="${EKS_CLUSTER_NAME}" \
    --set alb.vpcId="${VPC_ID}" \
    --set jumpstartGatedModelDownloadRoleArn="${JUMPSTART_GATED_ROLE_ARN}"
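Before deploying models, it can help to confirm that the operator installed correctly and its pods are running. The following is a quick check; the release name follows the Helm command above, and the pod name filter is an assumption that may differ in your cluster:

# Confirm the Helm release was installed
helm list -n kube-system

# Confirm the inference operator pods are running (name filter is illustrative)
kubectl get pods -n kube-system | grep -i inference-operator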

Architecture:

Deployment sources

Once you have the operators running in your cluster, you can then deploy AI models from multiple sources using SageMaker JumpStart, S3, or FSx:

SageMaker JumpStart 

Models hosted in SageMaker JumpStart can be deployed to your HyperPod cluster. Navigate to SageMaker Studio, go to SageMaker JumpStart, select the open-weights model you want to deploy, and select SageMaker HyperPod. Once you provide the necessary details, choose Deploy. The inference operator running in the cluster will initiate a deployment in the namespace you provided.

Once deployed, you can monitor deployments in SageMaker Studio.

Alternatively, here is a YAML file that you can use to deploy the JumpStart model using kubectl. For example, the following YAML snippet will deploy DeepSeek-R1 Qwen 1.5b from SageMaker JumpStart on an ml.g5.8xlarge instance:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: JumpStartModel
metadata:
  name: deepseek-llm-r1-distill-qwen-1-5b-july03
  namespace: default
spec:
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
    modelVersion: 2.0.7
  sageMakerEndpoint:
    name: deepseek-llm-r1-distill-qwen-1-5b
  server:
    instanceType: ml.g5.8xlarge
  tlsConfig:
    tlsCertificateOutputS3Uri: s3://<bucket_name>/certificates

Deploying model from S3 

You can deploy model artifacts directly from S3 to your HyperPod cluster using the InferenceEndpointConfig resource. The inference operator uses the S3 CSI driver to make the model files available to the pods in the cluster. With this configuration, the operator downloads the files located under the prefix deepseek15b, as set by the modelLocation parameter. Here is the complete YAML example and documentation:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: InferenceEndpointConfig
metadata:
  name: deepseek15b
  namespace: default
spec:
  endpointName: deepseek15b
  instanceType: ml.g5.8xlarge
  invocationEndpoint: invocations
  modelName: deepseek15b
  modelSourceConfig:
    modelLocation: deepseek15b
    modelSourceType: s3
    s3Storage:
      bucketName: mybucket
      region: us-west-2

Deploying model from FSx

Models can also be deployed from FSx for Lustre volumes, high-performance storage that can be used to save model checkpoints. This lets you launch a model without downloading artifacts from S3, saving the time otherwise spent downloading models during deployment or scale-up. Setup instructions for FSx in a HyperPod cluster are provided in the Set Up an FSx for Lustre File System workshop. Once set up, you can deploy models using InferenceEndpointConfig. Here is the complete YAML file and a sample:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: InferenceEndpointConfig
metadata:
  name: deepseek15b
  namespace: default
spec:
  endpointName: deepseek15b
  instanceType: ml.g5.8xlarge
  invocationEndpoint: invocations
  modelName: deepseek15b
  modelSourceConfig:
    fsxStorage:
      fileSystemId: fs-abcd1234
    modelLocation: deepseek-1-5b
    modelSourceType: fsx

Deployment experiences

We are providing multiple deployment experiences: kubectl, the HyperPod CLI, and the SageMaker Python SDK. All deployment options require the HyperPod inference operator to be installed and running in the cluster.

Deploying with kubectl 

You can deploy models using native kubectl with YAML files as shown in the previous sections.

To deploy, you can run kubectl apply -f <manifest_name>.yaml.

Once deployed, you can monitor the status with:
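For example, a minimal status check might look like the following. The resource names assume the JumpStartModel example shown earlier in the default namespace, and the plural resource name jumpstartmodels is an assumption based on the CRD kind, so adjust these to your manifest:

# Check the status of the JumpStartModel (or InferenceEndpointConfig) custom resource
kubectl get jumpstartmodels -n default
kubectl describe jumpstartmodel deepseek-llm-r1-distill-qwen-1-5b-july03 -n default

# Check the pods created for the deployment
kubectl get pods -n default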

The operator also generates other resources, including deployments, services, pods, and an ingress. Each resource is visible in your cluster.

To control the invocation path on your container, you can modify the invocationEndpoint parameter. Your ALB can route requests sent to alternate paths such as /v1/chat/completions. To change the container's health check path to another path such as /health, you can annotate the generated Ingress object with:

kubectl annotate ingress --overwrite <name> alb.ingress.kubernetes.io/healthcheck-path=/health

Deploying with the HyperPod CLI

The SageMaker HyperPod CLI offers another way to deploy models. Once you set your cluster context, you can deploy a model, for example:

hyp create hyp-jumpstart-endpoint \
  --version 1.0 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --model-version 2.0.4 \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-test-jscli \
  --tls-certificate-output-s3-uri s3://<bucket_name>/

For more information, see Installing the SageMaker HyperPod CLI and SageMaker HyperPod deployment documentation.

Deploying with Python SDK

The SageMaker Python SDK also provides support to deploy models on HyperPod clusters. Using the Model, Server, and SageMakerEndpoint configurations, you can construct a specification to deploy on a cluster. An example notebook to deploy with the Python SDK is provided here, for example:

from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import (
    Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables
)
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

# create configs
model = Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b',
    model_version='2.0.4',
)
server = Server(
    instance_type='ml.g5.8xlarge',
)
endpoint_name = SageMakerEndpoint(name='deepseklr1distill-qwen')
tls_config = TlsConfig(tls_certificate_output_s3_uri='s3://<bucket_name>')

# create spec
js_endpoint = HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name,
    tls_config=tls_config,
)

# use spec to deploy
js_endpoint.create()

Run inference with deployed models

Once the model is deployed, you can access it by invoking the SageMaker endpoint or by invoking the ALB directly.

Invoking the model with a SageMaker endpoint

Once a model has been deployed and the SageMaker endpoint has been created successfully, you can invoke your model with the SageMaker Runtime client. You can check the status of the deployed SageMaker endpoint by going to the SageMaker AI console, choosing Inference, and then Endpoints. For example, given an input file input.json, you can invoke a SageMaker endpoint using the AWS CLI. This will route the request to the model hosted on HyperPod:

aws sagemaker-runtime invoke-endpoint \
    --endpoint-name "<ENDPOINT NAME>" \
    --body fileb://input.json \
    --content-type application/json \
    --accept application/json \
    output2.json
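Alternatively, you can invoke the same endpoint from Python with the SageMaker Runtime client in boto3. This is a minimal sketch; the region, endpoint name, and payload are placeholders, and the exact input schema depends on your serving container (the inputs/parameters fields shown follow a common text-generation convention):

import json
import boto3

# SageMaker Runtime client (region is a placeholder)
runtime = boto3.client("sagemaker-runtime", region_name="<REGION>")

# Example payload; the expected schema depends on the model serving container
payload = {
    "inputs": "Who won the world series in 2020?",
    "parameters": {"max_new_tokens": 128},
}

response = runtime.invoke_endpoint(
    EndpointName="<ENDPOINT NAME>",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)

print(response["Body"].read().decode("utf-8"))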

Invoking the model directly using the ALB

You can also invoke the load balancer directly instead of using the SageMaker endpoint. Download the generated certificate from S3 and include it in your trust store or request; you can also bring your own certificates.
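For example, if the operator wrote the certificate to the tlsCertificateOutputS3Uri shown earlier, you might fetch it with the AWS CLI. The object key below is illustrative; list the prefix to find the actual file name:

# List and download the generated TLS certificate (object key is a placeholder)
aws s3 ls s3://<bucket_name>/certificates/
aws s3 cp s3://<bucket_name>/certificates/<certificate-file>.pem ./cert.pem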

For example, you can invoke a vLLM container deployed with the invocationEndpoint value in the deployment YAML shown in the previous section set to /v1/chat/completions.

For example, using curl:

curl --cacert /path/to/cert.pem https://<name>.<region>.elb.amazonaws.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/opt/ml/model",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

User experience

These capabilities are designed with different user personas in mind: the one-click JumpStart experience in SageMaker Studio, native kubectl for Kubernetes-native teams, the HyperPod CLI, and the SageMaker Python SDK for programmatic workflows.

Observability

Amazon SageMaker HyperPod now provides a comprehensive, out-of-the-box observability solution that delivers deep insights into inference workloads and cluster resources. This unified observability solution automatically publishes key metrics from multiple sources, including inference containers, NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, and Kueue, to Amazon Managed Service for Prometheus and visualizes them in Amazon Managed Grafana dashboards. With a one-click installation of the HyperPod observability EKS add-on, users gain access to resource and cluster utilization metrics along with critical inference metrics such as time to first token, request latency, throughput, and model invocations.

These metrics capture model inference request and response data regardless of your model type or serving framework, as long as the model is deployed using the inference operators with metrics enabled. You can also expose container-specific metrics provided by the model container, such as TGI, LMI, and vLLM.

You can enable metrics in JumpStart deployments by setting the metrics.enabled: true parameter:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: JumpStartModel
metadata:
  name: mistral-model
  namespace: ns-team-a
spec:
  model:
    modelId: "huggingface-llm-mistral-7b-instruct"
    modelVersion: "3.19.0"
  metrics:
    enabled: true # Default: true (can be set to false to disable)

You can enable metrics for fine-tuned models deployed from S3 or FSx using the following configuration. The scrape interval, port, and path can be overridden through modelMetricsConfig:

apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
kind: InferenceEndpointConfig
metadata:
  name: inferenceendpoint-deepseeks
  namespace: ns-team-a
spec:
  modelName: deepseeks
  modelVersion: 1.0.1
  metrics:
    enabled: true # Default: true (can be set to false to disable)
    metricsScrapeIntervalSeconds: 30 # Optional: if overriding the default 15s
    modelMetricsConfig:
      port: 8000 # Optional: if overriding the default 8080
      path: "/custom-metrics" # Optional: if overriding the default "/metrics"
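To spot-check what the scraper will collect, you can port-forward to one of the model pods and request the configured path directly. This is a quick sanity check; the pod name is a placeholder, and the port and path match the overrides in the example above:

# Forward the metrics port of a model pod locally (pod name is a placeholder)
kubectl port-forward -n ns-team-a pod/<model-pod-name> 8000:8000

# In another terminal, fetch the metrics at the configured path
curl http://localhost:8000/custom-metrics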

For more details, check out the blog post on HyperPod observability and documentation.

Autoscaling

Effective autoscaling handles unpredictable traffic patterns with sudden spikes during peak hours, promotional events, or weekends. Without dynamic autoscaling, organizations must either overprovision resources, leading to significant costs, or risk service degradation during peak loads. LLMs require more sophisticated autoscaling approaches than traditional applications due to several unique characteristics. These models can take minutes to load into GPU memory, necessitating predictive scaling with appropriate buffer time to avoid cold-start penalties. Equally important is the ability to scale in when demand decreases to save costs. Two types of autoscaling are supported: the HyperPod inference operator and KEDA.

Autoscaling provided by HyperPod inference operator

The HyperPod inference operator provides built-in autoscaling capabilities for model deployments using metrics from Amazon CloudWatch and Amazon Managed Service for Prometheus (AMP). This provides a simple and quick way to set up autoscaling for models deployed with the inference operator. Check out the complete example to autoscale in the SageMaker documentation.

Autoscaling with KEDA

If you need more flexibility for complex scaling capabilities and need to manage autoscaling policies independently from model deployment specs, you can use Kubernetes Event-driven Autoscaling (KEDA). KEDA ScaledObject configurations support a wide range of scaling triggers including Amazon CloudWatch metrics, Amazon SQS queue lengths, Prometheus queries, and resource-based metrics like GPU and memory utilization. You can apply these configurations to existing model deployments by referencing the deployment name in the scaleTargetRef section of the ScaledObject specification. For more information, see the Autoscaling documentation.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nd-deepseek-llm-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: deepseek-llm-r1-distill-qwen-1-5b
    apiVersion: apps/v1
    kind: Deployment
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 30     # seconds between checks
  cooldownPeriod: 300     # seconds before scaling down
  triggers:
    - type: aws-cloudwatch
      metadata:
        namespace: AWS/ApplicationELB         # or your metric namespace
        metricName: RequestCount              # or your metric name
        dimensionName: LoadBalancer           # or your dimension key
        dimensionValue: app/k8s-default-albnddee-cc02b67f20/0991dc457b6e8447
        statistic: Sum
        threshold: "3"                        # change to your desired threshold
        minMetricValue: "0"                   # optional floor
        region: us-east-2                     # your AWS region
        identityOwner: operator               # use the IRSA SA bound to keda-operator

Task governance

With HyperPod task governance, you can optimize resource utilization by implementing priority-based scheduling. With this approach you can assign higher priority to inference workloads to maintain low-latency requirements during traffic spikes, while still allowing training jobs to utilize available resources during quieter periods. Task governance leverages Kueue for quota management, priority scheduling, and resource sharing policies. Through ClusterQueue configurations, administrators can establish flexible resource sharing strategies that balance dedicated capacity requirements with efficient resource utilization.

Teams can configure priority classes to define their resource allocation preferences. For example, teams should create a dedicated priority class for inference workloads, such as inference with a weight of 100, to ensure they are admitted and scheduled ahead of other task types. By giving inference pods the highest priority, they are positioned to preempt lower-priority jobs when the cluster is under load, which is essential for meeting low-latency requirements during traffic surges.

Additionally, teams must appropriately size their quotas. If inference spikes are expected within a shared cluster, the team should reserve a sufficient amount of GPU resources in their ClusterQueue to handle these surges. When the team is not experiencing high traffic, unused resources within their quota can be temporarily allocated to other teams' tasks. However, once inference demand returns, those borrowed resources can be reclaimed to prioritize pending inference pods.
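As an illustration of the underlying Kueue construct, a priority class like the inference example above could look like the following sketch. HyperPod task governance normally manages priority classes through its cluster policy, so treat this raw Kueue manifest as illustrative rather than the exact resource it creates:

apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: inference
value: 100                  # higher values are admitted and scheduled first
description: "Priority class for latency-sensitive inference workloads"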

Here is a sample screenshot that shows both training and deployment workloads running in the same cluster. Deployments use the inference-priority class, which is higher than the training-priority class, so a spike in inference requests has suspended the training job, allowing the deployments to scale up and handle the traffic.
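To see how task governance is admitting, borrowing, and preempting workloads in a shared cluster, you can also inspect the Kueue resources it manages directly, assuming you have kubectl access to the cluster:

# Quota and admission state managed by task governance (via Kueue)
kubectl get clusterqueues
kubectl get localqueues --all-namespaces
kubectl get workloads --all-namespaces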

For more information, see the SageMaker HyperPod documentation.

Cleanup

You will incur costs for the instances running in your cluster. You can scale down the instances or delete instances in your cluster to stop accruing costs.

Conclusion

With this launch, you can quickly deploy open-weights foundation models from SageMaker JumpStart, as well as custom models from S3 and FSx, to your SageMaker HyperPod cluster. SageMaker automatically provisions the infrastructure, deploys the model on your cluster, enables auto-scaling, and configures the SageMaker endpoint. You can use SageMaker to scale the compute resources up and down through HyperPod task governance as the traffic on model endpoints changes, and automatically publish metrics to the HyperPod observability dashboard to provide full visibility into model performance. With these capabilities you can seamlessly train, fine-tune, and deploy models on the same HyperPod compute resources, maximizing resource utilization across the entire model lifecycle.

You can start deploying models to HyperPod today in all AWS Regions where SageMaker HyperPod is available. To learn more, visit the Amazon SageMaker HyperPod documentation or try the HyperPod inference getting started guide in the AWS Management Console.

Acknowledgements:

We would like to acknowledge the key contributors for this launch: Pradeep Cruz, Amit Modi, Miron Perel, Suryansh Singh, Shantanu Tripathi, Nilesh Deshpande, Mahadeva Navali Basavaraj, Bikash Shrestha, Rahul Sahu.


About the authors

Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives Go-to-Market (GTM) and Outbound Product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and content for optimizing inference performance and GPU efficiency for hosting Large Language Models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Kareem Syed-Mohammed is a Product Manager at AWS. He focuses on enabling generative AI model development and governance on SageMaker HyperPod. Prior to this, at Amazon QuickSight, he led embedded analytics and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and as a management consultant at McKinsey.

Piyush Daftary is a Senior Software Engineer at AWS, working on Amazon SageMaker. His interests include databases, search, machine learning, and AI. He currently focuses on building performant, scalable inference systems for large language models. Outside of work, he enjoys traveling, hiking, and spending time with family.

Chaitanya Hazarey leads software development for inference on SageMaker HyperPod at Amazon, bringing extensive expertise in full-stack engineering, ML/AI, and data science. As a passionate advocate for responsible AI development, he combines technical leadership with a deep commitment to advancing AI capabilities while maintaining ethical considerations. His comprehensive understanding of modern product development drives innovation in machine learning infrastructure.

Andrew Smith is a Senior Cloud Support Engineer in the SageMaker, Vision & Other team at AWS, based in Sydney, Australia. He supports customers using many AI/ML services on AWS with expertise in working with Amazon SageMaker. Outside of work, he enjoys spending time with friends and family as well as learning about different technologies.
