AWS Machine Learning Blog | February 20
Best practices for Amazon SageMaker HyperPod task governance

This post describes best practices for running generative AI development tasks on Amazon SageMaker HyperPod on Amazon EKS. With SageMaker HyperPod task governance, administrators can manage the allocation of accelerated compute resources and enforce policies based on team and project priorities, reducing costs by up to 40%. The post covers both the administrator and data scientist experiences, including how to set up compute allocations, manage quotas, define cluster policies, and monitor resource utilization, as well as how data scientists use RBAC for access control and submit tasks. With these best practices, organizations can focus on accelerating their generative AI innovation and time to market.

🔑**Compute resource management**: Administrators set up compute allocations based on each team's task types, whether reserved capacity is needed, and priority. By adjusting a team's fair-share weight, administrators control how quickly the team gains access to unutilized resources; the higher the weight, the sooner the team gets resources.

📊**Quota and borrowing strategy**: Administrators set quotas that determine each team's allocation per instance type within the cluster's instance groups, and decide whether teams share or reserve their allotted capacity. The total reserved quota should not exceed the cluster's available capacity, so that quota management remains effective.

🤝**Shared resource pool**: To maintain a pool of resources that all teams can borrow from, set up a dedicated team whose resources bridge the gap between the other teams' allocations and the total cluster capacity. Make sure the participating teams' compute allocations are set to "Lend and Borrow" or "Lend" so they can share this common pool.

🛡️**Access control and permissions**: Data scientists assume specific roles, and each team needs its own role and associated RBAC on the cluster. RBAC prevents data scientists from submitting tasks to teams they do not belong to. Administrators should limit data scientists according to the principle of least privilege.

🚀**Task submission and the HyperPod CLI**: Data scientists can submit tasks to Amazon EKS orchestrated SageMaker HyperPod clusters using kubectl or the SageMaker HyperPod CLI. HyperPod CLI v2.0.0 enables faster iteration, with improved cluster and task management features and enhanced visibility into task priorities and accelerator quota allocations.

At AWS re:Invent 2024, we launched a new innovation in Amazon SageMaker HyperPod on Amazon Elastic Kubernetes Service (Amazon EKS) that enables you to run generative AI development tasks on shared accelerated compute resources efficiently and reduce costs by up to 40%. Administrators can use SageMaker HyperPod task governance to govern allocation of accelerated compute to teams and projects, and enforce policies that determine the priorities across different types of tasks. The resulting improvement in utilization of compute resources enables organizations to focus on accelerating their generative AI innovation and time to market, instead of spending time coordinating resource allocation and continuously replanning their generative AI development tasks.

In this post, we provide best practices to maximize the value of SageMaker HyperPod task governance and make the administration and data science experiences seamless. We also discuss common governance scenarios when administering and running generative AI development tasks.

Prerequisites

To get started with SageMaker HyperPod task governance on an existing SageMaker HyperPod cluster orchestrated by Amazon EKS, make sure you uninstall any existing Kueue installation and that your cluster is running Kubernetes version 1.30 or later.
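A quick way to verify both prerequisites from a terminal is shown below; the Helm release and namespace names are assumptions, and the exact uninstall step depends on how Kueue was originally installed:

# Check the Kubernetes control plane version (must be 1.30 or later)
kubectl version

# Look for an existing Kueue installation that must be removed first
kubectl get crds | grep kueue.x-k8s.io
kubectl get pods -n kueue-system

# If Kueue was installed with Helm (release and namespace names are assumptions),
# uninstall it before enabling task governance
helm uninstall kueue -n kueue-system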

Administration experience

Administrators are the first persona interacting with SageMaker HyperPod task governance. They are responsible for managing the cluster compute allocation according to the organization’s priorities and goals.

Managing compute

The first step to managing capacity across teams is to set up compute allocations. When setting up a compute allocation, keep in mind the following considerations:

When setting up a compute allocation, an administrator sets the team's fair-share weight, which provides relative prioritization compared to other teams when vying for the same idle compute. A higher weight enables a team to access unutilized resources within shared capacity sooner. As a best practice, set the fair-share weight higher for teams that will require access to capacity sooner than other teams.

After the fair-share weight is set, the administrator then sets up the quota and borrowing strategy. Quota determines the allocation per instance type within the cluster’s instance groups. Borrowing strategy determines whether a team will share or reserve their allotted capacity. To enforce proper quota management, the total reserved quota should not surpass the cluster’s available capacity for that resource. For instance, if a cluster comprises 20 ml.c5.2xlarge instances, the cumulative quota assigned to teams should remain under 20.

If the compute allocations for teams allow for "Lend and Borrow" or "Lend," the idle capacity is shared between these teams. For example, if Team A has a quota of 6 but is using only 2 for its tasks, and Team B has a quota of 5 and is using 4 for its tasks, then when a task requiring 4 resources is submitted to Team B, 3 will be borrowed from Team A based on its "Lend and Borrow" setting. If a team's compute allocation is set to "Don't Lend," the team will not be able to borrow any additional capacity beyond its reserved capacity.

To maintain a pool of resources that all teams can borrow from, users can set up a dedicated team with resources that bridge the gap between other teams' allocations and the total cluster capacity. Make sure that this cumulative resource allocation includes the appropriate instance types and doesn't exceed the total cluster capacity. To make sure these resources can be shared among teams, set the participating teams' compute allocations to "Lend and Borrow" or "Lend" for this common pool of resources. In addition, every time new teams are introduced, quota allocations are changed, or the cluster capacity changes, revisit the quota allocations of all teams to make sure the cumulative quota remains at or below cluster capacity.
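As an illustration with hypothetical numbers: on a 20-instance cluster where three project teams hold quotas of 6, 5, and 4, the common-pool team would be allocated the remaining 5 instances. If one team's quota later grows by 2, reduce the pool's quota by 2 so the cumulative total stays at 20.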

After compute allocations have been set, the administrator will also need to set a cluster policy, which consists of two components: task prioritization and idle compute allocation. Administrators set up task prioritization, which determines the priority level for tasks running in a cluster. Next, an administrator sets the idle compute allocation setting to either "first-come, first-served," in which tasks are not prioritized, or "fair-share allocation," in which idle compute is distributed to teams based on their fair-share weight.

Observability

To get started with observability, install the Amazon CloudWatch Observability add-on with Kueue metrics selected. The SageMaker HyperPod task governance dashboard provides a single-pane-of-glass view of cluster utilization across teams. At present, you can view running PyTorch, TensorFlow, and MPI tasks. Administrators can analyze the graphs within the dashboard to understand equity in resource sharing and utilization of resources.
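The add-on can be enabled from the console; as a hedged sketch, the equivalent AWS CLI call is shown below (the cluster name is a placeholder, and selecting Kueue metrics is an additional option exposed through the add-on's configuration):

aws eks create-addon \
  --cluster-name my-hyperpod-eks-cluster \
  --addon-name amazon-cloudwatch-observability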

To view utilization of resources, users can refer to the dashboard view showing GPU and vCPU utilization. These graphs inform administrators where teams can further maximize their GPU utilization. In this example, administrators observe GPU utilization of around 52%.

Administrators have a real-time view of instance utilization as tasks run or are moved to pending during preemption. In this example, the ML engineering team is borrowing 5 GPUs for its training task.

With SageMaker HyperPod, you can additionally set up observability tools of your choice. In our public workshop, we have steps on how to set up Amazon Managed Prometheus and Grafana dashboards.

Data scientist experience

Data scientists are the second persona interacting with SageMaker HyperPod clusters. Data scientists are responsible for the training, fine-tuning, and deployment of models on accelerated compute instances. It’s important to make sure data scientists have the necessary capacity and permissions when interacting with clusters of GPUs.

Access control

When working with SageMaker HyperPod task governance, data scientists will assume their specific role. Each data science team needs its own role and associated role-based access control (RBAC) on the cluster. RBAC prevents data scientists from submitting tasks to teams to which they do not belong. For more information about data science role permissions, see AWS Identity and Access Management for SageMaker HyperPod. As a best practice, administrators should limit data scientists according to the principle of least privilege. After roles and access entries are set up, data scientists can assume their associated AWS Identity and Access Management (IAM) role to submit tasks to the corresponding namespaces. Note that users interacting with the console dashboard who didn't create the associated EKS cluster will need to have their role added to the access entry list for the EKS cluster.
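As a minimal sketch of what team-scoped RBAC can look like (the namespace, role names, and the Kubernetes group mapped to the team's IAM role via an EKS access entry are all hypothetical), a Role and RoleBinding restrict task submission to the team's own namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: researchers-task-submitter
  namespace: hyperpod-ns-researchers
rules:
  # Allow submitting and managing batch and PyTorch training jobs in this namespace only
  - apiGroups: ["batch", "kubeflow.org"]
    resources: ["jobs", "pytorchjobs"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: researchers-task-submitter-binding
  namespace: hyperpod-ns-researchers
subjects:
  - kind: Group
    name: hyperpod-researchers   # Kubernetes group mapped to the team's IAM role via an EKS access entry
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: researchers-task-submitter
  apiGroup: rbac.authorization.k8s.io

A data scientist bound only in hyperpod-ns-researchers cannot create tasks in another team's namespace, which is what enforces the quota boundaries described earlier.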

Submitting tasks

There are two ways to submit tasks on Amazon EKS orchestrated SageMaker HyperPod clusters: kubectl and the SageMaker HyperPod CLI. With both options, data scientists need to reference their team's namespace and task priority class in the task configuration file in order to use their allocated quota with appropriate prioritization. If the user doesn't specify a priority class, SageMaker HyperPod task governance will automatically assume the lowest priority.

In the following code snippet, we show the labels required in a kubectl manifest file for the researchers namespace with inference priority. Priority classes will have -priority appended to the name set in the cluster policy. For further guidance on submitting tasks to SageMaker HyperPod task governance, refer to the SageMaker HyperPod documentation.

metadata:
    name: job-name
    namespace: hyperpod-ns-researchers
    labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-researchers-localqueue
        kueue.x-k8s.io/priority-class: inference-priority
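For context, a complete but minimal manifest carrying these labels might look like the following sketch; the batch Job kind, container image, command, and GPU count are illustrative assumptions (real training tasks typically use a PyTorchJob or similar):

apiVersion: batch/v1
kind: Job
metadata:
  name: job-name
  namespace: hyperpod-ns-researchers
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-researchers-localqueue
    kueue.x-k8s.io/priority-class: inference-priority
spec:
  suspend: true                   # Kueue admits the job by unsuspending it once quota is available
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: public.ecr.aws/docker/library/python:3.12   # placeholder image
          command: ["python", "-c", "print('hello from the researchers queue')"]
          resources:
            limits:
              nvidia.com/gpu: 1   # counted against the team's accelerator quota

Submitting the file with kubectl apply -f job.yaml queues the task against the team's local queue, where it is admitted as soon as reserved quota or borrowable idle capacity is available.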

HyperPod CLI

The HyperPod CLI was created to abstract the complexities of working with kubectl and enable developers using SageMaker HyperPod to iterate faster with custom commands. HyperPod CLI v2.0.0 introduces a new default scheduler type with autofill commands, auto discovery of namespaces, improved cluster and task management features, and enhanced visibility into task priorities and accelerator quota allocations. Data scientists can use the new HyperPod CLI to quickly submit tasks, iterate, and experiment in their generative AI development lifecycle.

Sample commands

The following is a short reference guide for helpful commands when interacting with SageMaker HyperPod task governance:
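Because task governance is built on Kueue, the following standard kubectl commands (namespace and workload names are placeholders) are a useful starting point for inspecting quotas and queued tasks:

# Cluster-wide quotas and their current usage
kubectl get clusterqueues

# The team's local queue and its submitted workloads
kubectl get localqueues -n hyperpod-ns-researchers
kubectl get workloads -n hyperpod-ns-researchers

# Inspect why a task is pending (quota, borrowing, or preemption details)
kubectl describe workload <workload-name> -n hyperpod-ns-researchers

# Pods scheduled for the team's tasks
kubectl get pods -n hyperpod-ns-researchers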

Common scenarios

SageMaker HyperPod task governance enables allocating compute quota to teams, increasing utilization of compute resources, reducing costs, and accelerating waiting tasks by priority, in turn accelerating time to market. To relate these value propositions to real-world scenarios, we will talk about an enterprise and a startup situation.

Enterprises have different teams working towards various business goals, each with budgets that limit their compute access. To maximize resource utilization within budget constraints, SageMaker HyperPod task governance allows enterprises to allocate compute quotas to teams for artificial intelligence and machine learning (AI/ML) tasks. When teams use up their allocation, they can access idle compute from other teams to accelerate waiting tasks, providing optimal resource utilization across the organization.

Startups aim to maximize compute resource utilization while achieving timely allocation for high-priority tasks. SageMaker HyperPod task governance’s prioritization feature allows you to assign priorities to different task types, such as prioritizing inference over training. This makes sure that high-priority tasks receive necessary compute resources before lower-priority ones, optimizing overall resource allocation.

Now we will walk you through two common scenarios for users interacting with SageMaker HyperPod task governance.

Scenario 1: Enterprise

In the first scenario, we have an enterprise company that wants to manage compute allocations to optimize for cost. This company has five teams sharing 80 GPUs, with the following configuration:

This sample configuration reserves capacity for teams that will constantly be using instances for high-priority tasks. In addition, a few teams have the option to lend and borrow idle compute from other teams; this improves cost optimization by reserving capacity as needed and allowing non-consistent workloads to run on available idle compute with prioritization.

Scenario 2: Startup

In the second scenario, we have a startup customer who wants to provide equitable compute allocation for members of their engineering and research teams. This company has three teams sharing 15 GPUs:

This sample configuration promotes equitable compute allocation across the company because all teams have the same fair-share weight and are able to preempt tasks with lower priority.

Conclusion

In this post, we discussed best practices for efficient use of SageMaker HyperPod task governance. We also provided certain patterns that you can adopt while administering generative AI tasks, whether you are aiming to optimize for cost or optimize for equitable compute allocation. To get started with SageMaker HyperPod task governance, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and SageMaker HyperPod task governance.


About the Authors

Nisha Nadkarni is a Senior GenAI Specialist Solutions Architect at AWS, where she guides companies through best practices when deploying large scale distributed training and inference on AWS. Prior to her current role, she spent several years at AWS focused on helping emerging GenAI startups develop models from ideation to production.

Chaitanya Hazarey leads software development for SageMaker HyperPod task governance at Amazon, bringing extensive expertise in full-stack engineering, ML/AI, and data science. As a passionate advocate for responsible AI development, he combines technical leadership with a deep commitment to advancing AI capabilities while maintaining ethical considerations. His comprehensive understanding of modern product development drives innovation in machine learning infrastructure.

 Kareem Syed-Mohammed is a Product Manager at AWS. He is focused on compute optimization and cost governance. Prior to this, at Amazon QuickSight, he led embedded analytics, and developer experience. In addition to QuickSight, he has been with AWS Marketplace and Amazon retail as a Product Manager. Kareem started his career as a developer for call center technologies, Local Expert and Ads for Expedia, and management consultant at McKinsey.
