Governing ML lifecycle at scale: Best practices to set up cost and usage visibility of ML workloads in multi-account environments

Cloud costs can significantly impact your business operations. Gaining real-time visibility into infrastructure expenses, usage patterns, and cost drivers is essential. This insight enables agile decision-making and optimized scalability, and it maximizes the value derived from cloud investments, supporting cost-effective and efficient cloud utilization as your organization grows. Cost visibility matters even more in the cloud because cloud usage is dynamic. That requires continuous cost reporting and monitoring to make sure costs don’t exceed expectations and you pay only for the usage you need. Additionally, you can measure the value the cloud delivers to your organization by quantifying the associated cloud costs.

For a multi-account environment, you can track costs at the AWS account level to associate expenses with each account. However, to allocate costs to individual cloud resources, a tagging strategy is essential. Combining AWS accounts and tags provides the best results. Implementing a cost allocation strategy early is critical for managing your expenses and for the future optimization activities that will reduce your spend.

This post outlines steps you can take to implement a comprehensive tagging governance strategy across accounts, using AWS tools and services that provide visibility and control. By setting up automated policy enforcement and checks, you can achieve cost optimization across your machine learning (ML) environment.

Implement a tagging strategy

A tag is a label you assign to an AWS resource. Each tag consists of a customer-defined key and an optional value that help you manage, search for, and filter resources. Both tag keys and tag values (for example, `Production`) are case sensitive.

It’s important to define a tagging strategy for your resources as soon as possible when establishing your cloud foundation. Tagging is an effective scaling mechanism for implementing cloud management and governance strategies. When defining your tagging strategy, you need to determine the right tags that will gather all the necessary information in your environment. You can remove tags when they’re no longer needed and apply new tags whenever required.

Categories for designing tags

Some of the common categories used for designing tags are as follows:

Cost allocation tags

Automation tags

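Access control tags (for example, for use with AWS Identity and Access Management policies)

Technical tags (for example, `environment`, `owner`, and AWS-generated tags with the reserved `aws:` prefix)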

Compliance tags

Business tags

A tagging strategy also defines a standardized convention and implementation of tags across all resource types.

When defining tags, use the following conventions:

Use all lowercase for consistency and to avoid confusion

Separate words with hyphens

Use a prefix to identify and separate AWS generated tags from third-party tool generated tags
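The conventions above can be checked mechanically before a tag key is added to your tagging dictionary. The following is a minimal sketch, assuming a three-part `prefix:category:name` key layout like the `anycompany:` examples used in this post. The `ORG_PREFIX` constant and the `is_valid_tag_key` function are hypothetical names for illustration.

```python
import re

# Hypothetical organization prefix, matching the `anycompany:` examples in this post.
ORG_PREFIX = "anycompany"

# Assumed layout: prefix:category:name. Each segment is lowercase letters or
# digits, with single hyphens separating words.
_SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"
_KEY_PATTERN = re.compile(rf"^{ORG_PREFIX}:{_SEGMENT}:{_SEGMENT}$")


def is_valid_tag_key(key: str) -> bool:
    """Return True if the key follows the lowercase, hyphenated, prefixed convention."""
    # The aws: prefix is reserved for AWS generated tags, so user-defined keys must not use it.
    if key.lower().startswith("aws:"):
        return False
    return _KEY_PATTERN.fullmatch(key) is not None


for candidate in ("anycompany:finance:cost-center", "anycompany:finance:CostCenter"):
    print(candidate, is_valid_tag_key(candidate))
```

A check like this could run in a pull request pipeline for infrastructure templates, so naming drift is caught before resources are created.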

Tagging dictionary

When defining a tagging dictionary, delineate between mandatory and discretionary tags. Mandatory tags help identify resources and their metadata, regardless of purpose. Discretionary tags are the tags that your tagging strategy defines, and they should be made available to assign to resources as needed. The following table provides examples of a tagging dictionary used for tagging ML resources.

Tag Type	Tag Key	Purpose	Cost Allocation	Mandatory
Workload	`anycompany:workload:application-id`	Identifies disparate resources that are related to a specific application	Y	Y
Workload	`anycompany:workload:environment`	Distinguishes between `dev`, `test`, and `production`	Y	Y
Financial	`anycompany:finance:owner`	Indicates who is responsible for the resource, for example `SecurityLead`, `SecOps`, `Workload-1-Development-team`	Y	Y
Financial	`anycompany:finance:business-unit`	Identifies the business unit the resource belongs to, for example `Finance`, `Retail`, `Sales`, `DevOps`, `Shared`	Y	Y
Financial	`anycompany:finance:cost-center`	Indicates cost allocation and tracking, for example `5045`, `Sales-5045`, `HR-2045`	Y	Y
Security	`anycompany:security:data-classification`	Indicates data confidentiality that the resource supports	N	Y
Automation	`anycompany:automation:encryption`	Indicates if the resource needs to store encrypted data	N	N
Workload	`anycompany:workload:name`	Identifies an individual resource	N	N
Workload	`anycompany:workload:cluster`	Identifies resources that share a common configuration or perform a specific function for the application	N	N
Workload	`anycompany:workload:version`	Distinguishes between different versions of a resource or application component	N	N
Operations	`anycompany:operations:backup`	Identifies if the resource needs to be backed up based on the type of workload and the data that it manages	N	N
Regulatory	`anycompany:regulatory:framework`	Requirements for compliance to specific standards and frameworks, for example NIST, HIPAA, or GDPR	N	N

You need to define what resources require tagging and implement mechanisms to enforce mandatory tags on all necessary resources. For multiple accounts, assign mandatory tags to each one, identifying its purpose and the owner responsible. Avoid personally identifiable information (PII) when labeling resources because tags remain unencrypted and visible.
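As an illustration of enforcing mandatory tags, the following sketch checks a resource's tags against the mandatory keys from the example tagging dictionary. It assumes tags are available as a plain key-to-value mapping and treats an empty value as missing, which is a policy choice rather than an AWS rule. The function name and sample values are hypothetical.

```python
# Mandatory keys taken from the example tagging dictionary in this post.
MANDATORY_TAG_KEYS = (
    "anycompany:workload:application-id",
    "anycompany:workload:environment",
    "anycompany:finance:owner",
    "anycompany:finance:business-unit",
    "anycompany:finance:cost-center",
    "anycompany:security:data-classification",
)


def missing_mandatory_tags(tags: dict[str, str]) -> list[str]:
    """Return mandatory keys that are absent or have an empty value.

    Tag keys are case sensitive, so the lookup is an exact match.
    """
    return [key for key in MANDATORY_TAG_KEYS if not tags.get(key, "").strip()]


# Hypothetical tag set for an ML training resource.
example_tags = {
    "anycompany:workload:application-id": "churn-model",
    "anycompany:workload:environment": "dev",
    "anycompany:finance:owner": "ml-platform-team",
    "anycompany:finance:Cost-Center": "5045",  # wrong case, so it does not satisfy the rule
}
print(missing_mandatory_tags(example_tags))
```

In this example the business unit, cost center, and data classification keys are reported as missing, because the cost center key was entered with different capitalization.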

Tagging ML workloads on AWS

When running ML workloads on AWS, primary costs are incurred from compute resources required, such as Amazon Elastic Compute Cloud (Amazon EC2) instances for hosting notebooks, running training jobs, or deploying hosted models. You also incur storage costs for datasets, notebooks, models, and so on stored in Amazon Simple Storage Service (Amazon S3).

A reference architecture for the ML platform with various AWS services is shown in the following diagram. This framework considers multiple personas and services to govern the ML lifecycle at scale. For more information about the reference architecture in detail, see Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker.

The reference architecture includes a landing zone and multi-account landing zone accounts. These should be tagged to track costs for governance and shared services.

The key contributors towards recurring ML cost that should be tagged and tracked are as follows:

Amazon DataZone

AWS Lake Formation

Amazon SageMaker

Amazon SageMaker Feature Store

Amazon SageMaker resources

Using tags lets you attribute the costs you incur to the business needs that drive them. Monitoring expenses this way gives insight into how budgets are consumed.

Enforce a tagging strategy

An effective tagging strategy uses mandatory tags and applies them consistently and programmatically across AWS resources. You can use both reactive and proactive approaches for governing tags in your AWS environment.

Proactive governance uses tools such as AWS CloudFormation, AWS Service Catalog, tag policies in AWS Organizations, or IAM resource-level permissions to make sure you apply mandatory tags consistently at resource creation. For example, you can use the CloudFormation Resource Tags property to apply tags to resource types. In Service Catalog, you can add tags that automatically apply when you launch a product.

Reactive governance finds resources that lack proper tags by using tools such as the AWS Resource Groups Tagging API, AWS Config rules, and custom scripts. To find resources manually, you can use Tag Editor and detailed billing reports.

Proactive governance

Proactive governance uses the following tools:

Service Catalog TagOptions

AWS CloudFormation Resource Tags property

Tag policies

AWS Resource Groups

Service control policies (SCPs)
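To make two of the tools above more concrete, the following sketch builds a tag policy document and a service control policy as Python dictionaries. It uses the `@@assign` operator from tag policy syntax and the `aws:RequestTag` condition key with the `Null` operator. The function names, the `Sid`, and the chosen SageMaker actions are assumptions for illustration. Confirm that each action you list supports tag conditions before relying on a policy like this.

```python
import json


def build_tag_policy(tag_key: str, allowed_values: list[str]) -> dict:
    """Build a tag policy that fixes a key's capitalization and its allowed values."""
    return {
        "tags": {
            tag_key: {
                "tag_key": {"@@assign": tag_key},
                "tag_value": {"@@assign": allowed_values},
            }
        }
    }


def build_require_tag_scp(tag_key: str, actions: list[str]) -> dict:
    """Build an SCP that denies the given create actions when the tag is not in the request."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCreateWithoutTag",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": actions,
                "Resource": "*",
                # Null=true matches requests in which the tag key is absent.
                "Condition": {"Null": {f"aws:RequestTag/{tag_key}": "true"}},
            }
        ],
    }


tag_policy = build_tag_policy(
    "anycompany:workload:environment", ["dev", "test", "production"]
)
scp = build_require_tag_scp(
    "anycompany:finance:cost-center",
    ["sagemaker:CreateTrainingJob", "sagemaker:CreateEndpoint"],  # example actions
)
print(json.dumps(tag_policy, indent=2))
print(json.dumps(scp, indent=2))
```

The two documents complement each other: the tag policy constrains which values a key may take, while the SCP blocks resource creation when the key is missing altogether.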

Reactive governance

Reactive governance uses the following tools:

AWS Config rules (for example, the `required-tags` managed rule)

AWS Resource Groups Tagging API (see Creating query-based groups in AWS Resource Groups)

Tag Editor (see Finding resources to tag)
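As one example of the reactive approach, the AWS Config `required-tags` managed rule takes its tag keys as parameters named `tag1Key` through `tag6Key`. The sketch below prepares those parameters as JSON strings and splits longer key lists into batches of six, one batch per rule. The function name is hypothetical, and creating the rules themselves is left out.

```python
import json

MAX_TAGS_PER_RULE = 6  # the required-tags managed rule accepts tag1Key through tag6Key


def required_tags_parameters(tag_keys: list[str]) -> list[str]:
    """Return one JSON parameter string per rule, with at most six tag keys in each."""
    batches = [
        tag_keys[i : i + MAX_TAGS_PER_RULE]
        for i in range(0, len(tag_keys), MAX_TAGS_PER_RULE)
    ]
    return [
        json.dumps({f"tag{n}Key": key for n, key in enumerate(batch, start=1)})
        for batch in batches
    ]


# Mandatory keys from the example tagging dictionary in this post.
mandatory_keys = [
    "anycompany:workload:application-id",
    "anycompany:workload:environment",
    "anycompany:finance:owner",
    "anycompany:finance:business-unit",
    "anycompany:finance:cost-center",
    "anycompany:security:data-classification",
]
for parameters in required_tags_parameters(mandatory_keys):
    print(parameters)
```

The six mandatory keys in the example dictionary fit in a single rule. A longer dictionary would need more than one rule under this approach.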

SageMaker tag propagation

Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps required to prepare data, as well as build, train, and deploy models. SageMaker Studio automatically copies and assigns tags to the SageMaker Studio notebooks created by users, so you can track and categorize the cost of SageMaker Studio notebooks.

Amazon SageMaker Pipelines allows you to create end-to-end workflows for managing and deploying SageMaker jobs. Each pipeline is composed of a sequence of steps that transform data into a trained model. Tags can be applied to pipelines similarly to how they are used for other SageMaker resources. When a pipeline is run, its tags can potentially propagate to the underlying jobs launched as part of the pipeline steps.

Model packages in Amazon SageMaker Model Registry can be tagged when you register a model version, and those tags become associated with the model package. Tags on model packages can potentially propagate to other resources that reference the model, such as endpoints created using the model.
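Because propagation is described here as potential rather than guaranteed, one option is to pass tags explicitly whenever you create a downstream resource. SageMaker APIs accept tags as a list of `Key` and `Value` pairs. The following sketch, with hypothetical helper names and sample values, merges tags from a parent resource with resource-specific tags and converts the result to that list form.

```python
def merge_tags(inherited: dict[str, str], overrides: dict[str, str]) -> dict[str, str]:
    """Combine tags from a parent resource with resource-specific tags; overrides win."""
    return {**inherited, **overrides}


def to_tag_list(tags: dict[str, str]) -> list[dict[str, str]]:
    """Convert a mapping into the [{"Key": ..., "Value": ...}] form, sorted by key."""
    return [{"Key": key, "Value": value} for key, value in sorted(tags.items())]


# Hypothetical tags carried by a pipeline.
pipeline_tags = {
    "anycompany:workload:application-id": "churn-model",
    "anycompany:finance:cost-center": "5045",
}
# Tags specific to one training step launched by that pipeline.
step_tags = merge_tags(pipeline_tags, {"anycompany:workload:name": "train-step"})
print(to_tag_list(step_tags))
```

The resulting list can be supplied as the `Tags` argument when creating a training job or endpoint, so cost allocation does not depend on propagation alone.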

Tag policy quotas

The number of tag policies that you can attach to an entity (root, OU, or account) is subject to AWS Organizations quotas. See Quotas and service limits for AWS Organizations for the current limits.

Monitor resources

To achieve financial success and accelerate business value realization in the cloud, you need complete, near real-time visibility of cost and usage information to make informed decisions.

Cost organization

You can apply meaningful metadata to your AWS usage with AWS cost allocation tags. Use AWS Cost Categories to create rules that logically group cost and usage information by account, tags, service, charge type, or other categories. Access the metadata and groupings in services like AWS Cost Explorer, AWS Cost and Usage Reports, and AWS Budgets to trace costs and usage back to specific teams, projects, and business initiatives.

Cost visualization

You can view and analyze your AWS costs and usage over the past 13 months using Cost Explorer. You can also forecast your likely spending for the next 12 months and receive recommendations for Reserved Instance purchases that may reduce your costs. Using Cost Explorer enables you to identify areas needing further inquiry and to view trends to understand your costs. For more detailed cost and usage data, use AWS Data Exports to create exports of your billing and cost management data, using SQL to select the columns and filter the rows you want to receive. Data exports are delivered on a recurring basis to your S3 bucket for use with your business intelligence (BI) or data analytics solutions.

You can use AWS Budgets to set custom budgets that track cost and usage for simple or complex use cases. AWS Budgets also lets you enable email or Amazon Simple Notification Service (Amazon SNS) notifications when actual or forecasted cost and usage exceed your set budget threshold. In addition, AWS Budgets integrates with Cost Explorer.
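The threshold logic described above can be sketched in a few lines. This is an illustration of the arithmetic only, not how AWS Budgets is implemented. The function name, the default thresholds of 80 and 100 percent, and the message format are assumptions.

```python
def budget_alerts(
    limit: float,
    actual: float,
    forecast: float,
    thresholds_pct: tuple[float, ...] = (80.0, 100.0),
) -> list[str]:
    """Return one message for each threshold crossed by actual or forecasted spend."""
    alerts = []
    for pct in thresholds_pct:
        threshold = limit * pct / 100.0
        if actual > threshold:
            alerts.append(
                f"ACTUAL spend {actual:.2f} exceeds {pct:.0f}% of budget {limit:.2f}"
            )
        elif forecast > threshold:
            alerts.append(
                f"FORECASTED spend {forecast:.2f} exceeds {pct:.0f}% of budget {limit:.2f}"
            )
    return alerts


# Hypothetical monthly budget of 1,000 with 850 spent so far and 1,100 forecasted.
for message in budget_alerts(limit=1000.0, actual=850.0, forecast=1100.0):
    print(message)
```

With these sample numbers, actual spend crosses the 80 percent threshold and the forecast crosses the 100 percent threshold, so two messages are produced.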

Cost allocation

Cost Explorer enables you to view and analyze your cost and usage data over time, up to 13 months, through the AWS Management Console. It provides premade views that give a quick picture of your cost trends and that you can customize to suit your needs. You can apply the available filters to view specific costs, and you can save any view as a report.

Monitoring in a multi-account setup

SageMaker supports cross-account lineage tracking. This allows you to associate and query lineage entities, like models and training jobs, owned by different accounts. It helps you track related resources and costs across accounts. Use the AWS Cost and Usage Report to track costs for SageMaker and other services across accounts. The report aggregates usage and costs based on tags, resources, and more so you can analyze spending per team, project, or other criteria spanning multiple accounts.

Cost Explorer allows you to visualize and analyze SageMaker costs from different accounts. You can filter costs by tags, resources, or other dimensions. You can also export the data to third-party BI tools for customized reporting.
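For programmatic reporting, the Cost Explorer `GetCostAndUsage` API can group results by a tag key, and each group key then takes the form `<tag-key>$<tag-value>`, with an empty value for untagged usage. The sketch below totals cost per tag value from a response of that shape. The sample response is hand-written for illustration and the function name is hypothetical. In practice the response would come from a call made in the management account or an account with access to consolidated billing data.

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_tag_value(response: dict, untagged_label: str = "(untagged)") -> dict[str, Decimal]:
    """Sum UnblendedCost per tag value across all time periods in the response."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            # Group keys look like "<tag-key>$<tag-value>"; the value is empty when untagged.
            _, _, value = group["Keys"][0].partition("$")
            amount = Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[value or untagged_label] += amount
    return dict(totals)


# Hand-written sample in the shape of a response grouped by a cost center tag.
TAG = "anycompany:finance:cost-center"
sample_response = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2024-05-01", "End": "2024-06-01"},
            "Groups": [
                {"Keys": [f"{TAG}$5045"], "Metrics": {"UnblendedCost": {"Amount": "120.50", "Unit": "USD"}}},
                {"Keys": [f"{TAG}$"], "Metrics": {"UnblendedCost": {"Amount": "30.25", "Unit": "USD"}}},
            ],
        },
        {
            "TimePeriod": {"Start": "2024-06-01", "End": "2024-07-01"},
            "Groups": [
                {"Keys": [f"{TAG}$5045"], "Metrics": {"UnblendedCost": {"Amount": "79.50", "Unit": "USD"}}},
            ],
        },
    ]
}
totals = cost_by_tag_value(sample_response)
for value, amount in sorted(totals.items()):
    print(value, amount)
```

Reporting the untagged bucket alongside the tagged totals is a deliberate choice here, because a growing untagged amount is often the first sign that tag enforcement has gaps.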

Conclusion

In this post, we discussed how to implement a comprehensive tagging strategy to track costs for ML workloads across multiple accounts. We discussed implementing tagging best practices by logically grouping resources and tracking costs by dimensions like environment, application, team, and more. We also looked at enforcing the tagging strategy using proactive and reactive approaches. Additionally, we explored the capabilities within SageMaker to apply tags. Lastly, we examined approaches to provide visibility of cost and usage for your ML workloads.

For more information about how to govern your ML lifecycle, see Part 1 and Part 2 of this series.

About the authors

Gunjan Jain, an AWS Solutions Architect based in Southern California, specializes in guiding large financial services companies through their cloud transformation journeys. He expertly facilitates cloud adoption, optimization, and implementation of Well-Architected best practices. Gunjan’s professional focus extends to machine learning and cloud resilience, areas where he demonstrates particular enthusiasm. Outside of his professional commitments, he finds balance by spending time in nature.

Ram Vittal is a Principal Generative AI Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, reliable, and scalable GenAI/ML systems to help enterprise customers improve their business outcomes. In his spare time, he rides his motorcycle and enjoys walking with his dogs!