Track, allocate, and manage your generative AI cost and usage with Amazon Bedrock

As enterprises increasingly embrace generative AI , they face challenges in managing the associated costs. With demand for generative AI applications surging across projects and multiple lines of business, accurately allocating and tracking spend becomes more complex. Organizations need to prioritize their generative AI spending based on business impact and criticality while maintaining cost transparency across customer and user segments. This visibility is essential for setting accurate pricing for generative AI offerings, implementing chargebacks, and establishing usage-based billing models.

Without a scalable approach to controlling costs, organizations risk unbudgeted usage and cost overruns. Manual spend monitoring and periodic usage limit adjustments are inefficient and prone to human error, leading to potential overspending. Although tagging is supported on a variety of Amazon Bedrock resources—including provisioned models, custom models, agents and agent aliases, model evaluations, prompts, prompt flows, knowledge bases, batch inference jobs, custom model jobs, and model duplication jobs—there was previously no capability for tagging on-demand foundation models. This limitation has added complexity to cost management for generative AI initiatives.

To address these challenges, Amazon Bedrock has launched a capability that organization can use to tag on-demand models and monitor associated costs. Organizations can now label all Amazon Bedrock models with AWS cost allocation tags, aligning usage to specific organizational taxonomies such as cost centers, business units, and applications. To manage their generative AI spend judiciously, organizations can use services like AWS Budgets to set tag-based budgets and alarms to monitor usage, and receive alerts for anomalies or predefined thresholds. This scalable, programmatic approach eliminates inefficient manual processes, reduces the risk of excess spending, and ensures that critical applications receive priority. Enhanced visibility and control over AI-related expenses enables organizations to maximize their generative AI investments and foster innovation.

Introducing Amazon Bedrock application inference profiles

Amazon Bedrock recently introduced cross-region inference, enabling automatic routing of inference requests across AWS Regions. This feature uses system-defined inference profiles (predefined by Amazon Bedrock), which configure different model Amazon Resource Names (ARNs) from various Regions and unify them under a single model identifier (both model ID and ARN). While this enhances flexibility in model usage, it doesn’t support attaching custom tags for tracking, managing, and controlling costs across workloads and tenants.

To bridge this gap, Amazon Bedrock now introduces application inference profiles, a new capability that allows organizations to apply custom cost allocation tags to track, manage, and control their Amazon Bedrock on-demand model costs and usage. This capability enables organizations to create custom inference profiles for Bedrock base foundation models, adding metadata specific to tenants, thereby streamlining resource allocation and cost monitoring across varied AI applications.

Creating application inference profiles

Application inference profiles allow users to define customized settings for inference requests and resource management. These profiles can be created in two ways:

Single model ARN configuration

Copy from system-defined inference profile

The application inference profile ARN has the following format, where the inference profile ID component is a unique 12-digit alphanumeric string generated by Amazon Bedrock upon profile creation.

arn:aws:bedrock:<region>:<account_id>:application-inference-profile/<inference_profile_id>

System-defined compared to application inference profiles

The primary distinction between system-defined and application inference profiles lies in their type attribute and resource specifications within the ARN namespace:

System-defined inference profiles

SYSTEM_DEFINED

inference-profile

{ …"inferenceProfileArn": "arn:aws:bedrock:us-east-1:<Account ID>:inference-profile/us-1.anthropic.claude-3-sonnet-20240229-v1:0","inferenceProfileId": "us-1.anthropic.claude-3-sonnet-20240229-v1:0","inferenceProfileName": "US-1 Anthropic Claude 3 Sonnet","status": "ACTIVE","type": "SYSTEM_DEFINED",…}

Application inference profiles

type

APPLICATION

application-inference-profile

{…"inferenceProfileArn": "arn:aws:bedrock:us-east-1:<Account ID>:application-inference-profile/<Auto generated ID>","inferenceProfileId": <Auto generated ID>,"inferenceProfileName": <User defined name>,"status": "ACTIVE","type": "APPLICATION"…}

These differences are important when integrating with Amazon API Gateway or other API clients to help ensure correct model invocation, resource allocation, and workload prioritization. Organizations can apply customized policies based on profile type, enhancing control and security for distributed AI workloads. Both models are shown in the following figure.

Establishing application inference profiles for cost management

Imagine an insurance provider embarking on a journey to enhance customer experience through generative AI. The company identifies opportunities to automate claims processing, provide personalized policy recommendations, and improve risk assessment for clients across various regions. However, to realize this vision, the organization must adopt a robust framework for effectively managing their generative AI workloads.

The journey begins with the insurance provider creating application inference profiles that are tailored to their diverse business units. By assigning AWS cost allocation tags, the organization can effectively monitor and track their Bedrock spend patterns. For example, the claims processing team established an application inference profile with tags such as dept:claims, team:automation, and app:claims_chatbot. This tagging structure categorizes costs and allows assessment of usage against budgets.

Users can manage and use application inference profiles using Bedrock APIs or the boto3 SDK:

CreateInferenceProfile

GetInferenceProfile

ListInferenceProfiles

TagResource

ListTagsForResource

UntagResource

Invoke models with application inference profiles

Converse API

ConverseStream API

InvokeModel API

InvokeModelWithResponseStream API

Note that application inference profile APIs cannot be accessed through the AWS Management Console.

Invoke model with application inference profile using Converse API

The following example demonstrates how to create an application inference profile and then invoke the Converse API to engage in a conversation using that profile –

def create_inference_profile(profile_name, model_arn, tags):    """Create Inference Profile using base model ARN"""    response = bedrock.create_inference_profile(        inferenceProfileName=profile_name,        description="test",        modelSource={'copyFrom': model_arn},        tags=tags    )    print("CreateInferenceProfile Response:", response['ResponseMetadata']['HTTPStatusCode']),    print(f"{response}\n")    return response# Create Inference Profileprint("Testing CreateInferenceProfile...")tags = [{'key': 'dept', 'value': 'claims'}]base_model_arn = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"claims_dept_claude_3_sonnet_profile = create_inference_profile("claims_dept_claude_3_sonnet_profile", base_model_arn, tags)# Extracting the ARN and retrieving Application Inference Profile IDclaims_dept_claude_3_sonnet_profile_arn = claims_dept_claude_3_sonnet_profile['inferenceProfileArn']def converse(model_id, messages):    """Use the Converse API to engage in a conversation with the specified model"""    response = bedrock_runtime.converse(        modelId=model_id,        messages=messages,        inferenceConfig={            'maxTokens': 300,  # Specify max tokens if needed        }    )        status_code = response.get('ResponseMetadata', {}).get('HTTPStatusCode')    print("Converse Response:", status_code)    parsed_response = parse_converse_response(response)    print(parsed_response)    return response# Example of Converse API with Application Inference Profileprint("\nTesting Converse...")prompt = "\n\nHuman: Tell me about Amazon Bedrock.\n\nAssistant:"messages = [{"role": "user", "content": [{"text": prompt}]}]response = converse(claims_dept_claude_3_sonnet_profile_arn, messages)

Tagging, resource management, and cost management with application inference profiles

Tagging within application inference profiles allows organizations to allocate costs with specific generative AI initiatives, ensuring precise expense tracking. Application inference profiles enable organizations to apply cost allocation tags at creation and support additional tagging through the existing TagResource and UnTagResource APIs, which allow metadata association with various AWS resources. Custom tags such as project_id, cost_center, model_version, and environment help categorize resources, improving cost transparency and allowing teams to monitor spend and usage against budgets.

Visualize cost and usage with application inference profiles and cost allocation tags

Leveraging cost allocation tags with tools like AWS Budgets, AWS Cost Anomaly Detection, AWS Cost Explorer, AWS Cost and Usage Reports (CUR), and Amazon CloudWatch provides organizations insights into spending trends, helping detect and address cost spikes early to stay within budget.

With AWS Budgets, organization can set tag-based thresholds and receive alerts as spending approach budget limits, offering a proactive approach to maintaining control over AI resource costs and quickly addressing any unexpected surges. For example, a $10,000 per month budget could be applied on a specific chatbot application for the Support Team in the Sales Department by applying the following tags to the application inference profile: dept:sales, team:support, and app:chat_app. AWS Cost Anomaly Detection can also monitor tagged resources for unusual spending patterns, making it easier to operationalize cost allocation tags by automatically identifying and flagging irregular costs.

The following AWS Budgets console screenshot illustrates an exceeded budget threshold:

For deeper analysis, AWS Cost Explorer and CUR enable organizations to analyze tagged resources daily, weekly, and monthly, supporting informed decisions on resource allocation and cost optimization. By visualizing cost and usage based on metadata attributes, such as tag key/value and ARN, organizations gain an actionable, granular view of their spending.

The following AWS Cost Explorer console screenshot illustrates a cost and usage graph filtered by tag key and value:

The following AWS Cost Explorer console screenshot illustrates a cost and usage graph filtered by Bedrock application inference profile ARN:

Organizations can also use Amazon CloudWatch to monitor runtime metrics for Bedrock applications, providing additional insights into performance and cost management. Metrics can be graphed by application inference profile, and teams can set alarms based on thresholds for tagged resources. Notifications and automated responses triggered by these alarms enable real-time management of cost and resource usage, preventing budget overruns and maintaining financial stability for generate AI workloads.

The following Amazon CloudWatch console screenshot highlights Bedrock runtime metrics filtered by Bedrock application inference profile ARN:

The following Amazon CloudWatch console screenshot highlights an invocation limit alarm filtered by Bedrock application inference profile ARN:

Through the combined use of tagging, budgeting, anomaly detection, and detailed cost analysis, organizations can effectively manage their AI investments. By leveraging these AWS tools, teams can maintain a clear view of spending patterns, enabling more informed decision-making and maximizing the value of their generative AI initiatives while ensuring critical applications remain within budget.

Retrieving application inference profile ARN based on the tags for Model invocation

Organizations often use a generative AI gateway or large language model proxy when calling Amazon Bedrock APIs, including model inference calls. With the introduction of application inference profiles, organizations need to retrieve the inference profile ARN to invoke model inference for on-demand foundation models. There are two primary approaches to obtain the appropriate inference profile ARN.

Static configuration approach:

AWS Systems Manager Parameter Store

AWS Secrets Manager

Dynamic retrieval using the Resource Groups API:

GetResources

However, there are considerations to keep in mind. The GetResources API has throttling limits, necessitating the implementation of a caching mechanism. Organizations should maintain a cache with a Time-To-Live (TTL) based on the API’s output to optimize performance and reduce API calls. Additionally, implementing thread safety is crucial to help ensure that organizations always read the most up-to-date inference profile ARNs when the cache is being refreshed based on the TTL.

As illustrated in the following diagram, this dynamic approach involves a client making a request to the Resource Groups service with specific resource type and tag filters. The service returns the corresponding application inference profile ARN, which is then cached for a set period. The client can then use this ARN to invoke the Bedrock model through the InvokeModel or Converse API.

By adopting this dynamic retrieval method, organizations can create a more flexible and scalable system for managing application inference profiles, allowing for more straightforward adaptation to changing requirements and growth in the number of profiles.

The architecture in the preceding figure illustrates two methods for dynamically retrieving inference profile ARNs based on tags. Let’s describe both approaches with their pros and cons:

Bedrock client maintaining the cache with TTL:

ResourceGroups

GetResources

Lambda-based Method:

ResourceGroups

Lambda Extensions core

ResourceGroups

Both methods use similar filtering criteria (resource-type-filter and tag-filters) to query the ResourceGroup API, allowing for precise retrieval of inference profile ARNs based on attributes such as tenant, model, and Region. The choice between these methods depends on factors such as the expected request volume, desired latency, cost considerations, and the need for additional processing or security measures. The Lambda-based approach offers more flexibility and optimization potential, while the direct API method is simpler to implement and maintain.

Overview of Amazon Bedrock resources tagging capabilities

The tagging capabilities of Amazon Bedrock have evolved significantly, providing a comprehensive framework for resource management across multi-account AWS Control Tower setups. This evolution enables organizations to manage resources across development, staging, and production environments, helping organizations track, manage, and allocate costs for their AI/ML workloads.

At its core, the Amazon Bedrock resource tagging system spans multiple operational components. Organizations can effectively tag their batch inference jobs, agents, custom model jobs, knowledge bases, prompts, and prompt flows. This foundational level of tagging supports granular control over operational resources, enabling precise tracking and management of different workload components. The model management aspect of Amazon Bedrock introduces another layer of tagging capabilities, encompassing both custom and base models, and distinguishes between provisioned and on-demand models, each with its own tagging requirements and capabilities.

With the introduction of application inference profiles, organizations can now manage and track their on-demand Bedrock base foundation models. Because teams can create application inference profiles derived from system-defined inference profiles, they can configure more precise resource tracking and cost allocation at the application level. This capability is particularly valuable for organizations that are running multiple AI applications across different environments, because it provides clear visibility into resource usage and costs at a granular level.

The following diagram visualizes the multi-account structure and demonstrates how these tagging capabilities can be implemented across different AWS accounts.

Conclusion

In this post we introduced the latest feature from Amazon Bedrock, application inference profiles. We explored how it operates and discussed key considerations. The code sample for this feature is available in this GitHub repository. This new capability enables organizations to tag, allocate, and track on-demand model inference workloads and spending across their operations. Organizations can label all Amazon Bedrock models using tags and monitoring usage according to their specific organizational taxonomy—such as tenants, workloads, cost centers, business units, teams, and applications. This feature is now generally available in all AWS Regions where Amazon Bedrock is offered.

About the authors

Kyle T. Blocksom is a Sr. Solutions Architect with AWS based in Southern California. Kyle’s passion is to bring people together and leverage technology to deliver solutions that customers love. Outside of work, he enjoys surfing, eating, wrestling with his dog, and spoiling his niece and nephew.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Introducing Amazon Bedrock application inference profiles

Creating application inference profiles

System-defined compared to application inference profiles

Establishing application inference profiles for cost management

Invoke model with application inference profile using Converse API

Tagging, resource management, and cost management with application inference profiles

Visualize cost and usage with application inference profiles and cost allocation tags

Retrieving application inference profile ARN based on the tags for Model invocation

Overview of Amazon Bedrock resources tagging capabilities

Conclusion

About the authors

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签