AWS Machine Learning Blog 2024年11月08日
Build a multi-tenant generative AI environment for your enterprise on AWS
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

随着生成式AI应用的不断拓展,企业面临着团队孤岛和定制化工作流程导致的采用速度缓慢问题。为了加速创新,企业需要构建强大的运营模型和整体方法来简化生成式AI生命周期。本文探讨了如何通过构建一个多租户的生成式AI环境,实现生成式AI组件的集中化,并深入探讨了访问模式、治理、负责任的AI、可观察性和常见解决方案设计,例如检索增强生成。该方案利用Amazon Bedrock、Amazon API Gateway、AWS Lambda和Amazon SageMaker等AWS服务,为不同业务部门提供对基础模型的访问,并支持自定义模型部署,从而加速生成式AI的采用和创新。

🤔 **构建多租户生成式AI环境的关键设计考虑因素:**该方案需要满足生成式AI工作负载和负责任的AI治理的独特需求,同时保持对企业政策、租户和数据隔离、访问管理和成本控制的遵守。

🚀 **加速生成式AI采用的解决方案:**通过快速实验、统一模型访问和通用生成式AI组件的可重用性,加快生成式AI的采用。提供租户灵活选择最优设计和技术实现,以满足其特定用例的需求。

🛡️ **集中式治理、防护和控制:**实现集中式治理,建立防护措施和控制机制,确保生成式AI应用的安全和合规性。

📊 **模型使用和成本跟踪:**允许跟踪和审计模型使用情况,并按租户、业务部门或基础模型提供商进行成本分配。

💡 **检索增强生成(RAG)的应用:**根据用例和数据隔离要求,租户可以选择共享或隔离知识库,并实施项目级或资源级数据隔离。租户可以选择他们有权访问的数据源,选择合适的切分策略,利用共享的生成式AI基础模型将数据转换为嵌入,并将嵌入存储在他们的向量存储中。

While organizations continue to discover the powerful applications of generative AI, adoption is often slowed down by team silos and bespoke workflows. To move faster, enterprises need robust operating models and a holistic approach that simplifies the generative AI lifecycle. In the first part of the series, we showed how AI administrators can build a generative AI software as a service (SaaS) gateway to provide access to foundation models (FMs) on Amazon Bedrock to different lines of business (LOBs). In this second part, we expand the solution and show to further accelerate innovation by centralizing common Generative AI components. We also dive deeper into access patterns, governance, responsible AI, observability, and common solution designs like Retrieval Augmented Generation.

Our solution uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API via a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. It also uses a number of other AWS services such as Amazon API Gateway, AWS Lambda, and Amazon SageMaker.

Architecting a multi-tenant generative AI environment on AWS

A multi-tenant, generative AI solution for your enterprise needs to address the unique requirements of generative AI workloads and responsible AI governance while maintaining adherence to corporate policies, tenant and data isolation, access management, and cost control. As a result, building such a solution is often a significant undertaking for IT teams.

In this post, we discuss the key design considerations and present a reference architecture that:

Solution overview

The proposed solution consists of two parts:

The following diagram illustrates an overview of the solution.

Generative AI gateway

Shared components lie in this part. Shared components refer to the functionality and features shared by all tenants. Each component in the previous diagram can be implemented as a microservice and is multi-tenant in nature, meaning it stores details related to each tenant, uniquely represented by a tenant_id. Some components are categorized in groups based on the type of functionality they exhibit.

The standalone components are:

The component groups are as follows.

Tenant

This part represents the tenants using the AI gateway capabilities. Each tenant has different requirements and needs and their own application stack. They can integrate their application with the generative AI gateway to embed generative AI capabilities in their application. The environment Admin has access to the generative AI gateway and interacts with the core services.

Solution walkthrough

The following sections examine each part of the solution in more depth.

HTTPS endpoint

This serves as the entry point for the generative AI gateway. Incoming requests to the gateway go through this point. There are different approaches you can follow when designing the endpoint:

Tenants and access patterns

Tenants, such as LOBs or teams, use the shared services to access APIs and integrate generative AI capabilities into their applications. They can also use the playground UI to assess the suitability of generative AI for their specific use case before diving into full-fledged application development.

Here you also have the data sources, processing pipelines, vector stores, and data governance mechanisms that allow tenants to securely discover, access, andthe data they need for their specific use case. At this point, you need to consider the use case and data isolation requirements. Some applications may need to access data with personal identifiable information (PII) while others may rely on noncritical data. You also need to consider the operational characteristics and noisy neighbor risks.

Take Retrieval Augmented Generation (RAG) as an example. Depending on the use case and data isolation requirements, tenants can have a pooled knowledge base or a siloed one and implement item-level isolation or resource level isolation for the data respectively. Tenants can select data from the data sources they have access to, choose the right chunking strategy for their application, use the shared generative AI FMs for converting the data into embeddings, and store the embeddings in their vector store.

To answer user questions in real time, tenants can implement caching mechanisms to reduce latency and costs for frequent queries. Additionally, they can implement custom logic to retrieve information about previous sessions, the state of the interaction, and information specific to the end user. To generate the final response, they can again access the models and re-ranking functionality available through the gateway.

The following diagram illustrates a potential implementation of a chat-based assistant application with this approach. The tenant application uses FMs available through the generative AI gateway and its own vector store to provide personalized, relevant responses to the end user.

Shared services

The following section describes the shared services groups.

Model components

The goal of this component group is to expose a unified API to tenants for accessing underlying models irrespective of where these are hosted. It abstracts invocation details and accelerates application development. It consists of one or more components depending on the number of FM providers and number and types of custom models used. These components are illustrated in the following diagram.

In terms of how to offer FMs to your tenants, with AWS you have several options:

With AWS PrivateLink, you can create a private connection between your virtual private cloud (VPC) and Amazon Bedrock and SageMaker endpoints.

Generative AI application components

This group contains components linked to the unique requirements of generative AI applications. They’re illustrated in the following figure.

Responsible AI components

This group contains key components for Responsible AI, as shown in the following diagram.

If you use Amazon Bedrock, you can enable model invocation logging to collect input and output data and use Amazon Bedrock evaluation to run model evaluation jobs. Alternatively, you can use AWS Lambda and implement your own logic, or use open source tools such as fmeval. In SageMaker, you can enable data capture for your SageMaker real-time endpoint and use SageMaker Clarify to run the model evaluation jobs or implement your own evaluation logic. Both Amazon Bedrock and SageMaker integrate with SageMaker Ground Truth, which helps you gather ground truth data and human feedback for model responses. AWS Step Functions can help you orchestrate the end-to-end monitoring workflow.

Core services

Core services represent a collection of administrative and management components or modules. These components are designed to provide oversight, control, and governance over various aspects of the system’s operation, resource management, user and tenant administration, and model management. These are illustrated in the following diagram.

Tenant management and identity

Tenant management is a crucial aspect of multi-tenant systems, where a single instance of an application or environment serves multiple tenants or customers, each with their own isolated and secure environment. The tenant management component is responsible for managing and administering these tenants within the system.

In the proposed solution, you also need an authorization flow that establishes the identity of the user making the request. With AWS IAM Identity Center, you can create or connect workforce users and centrally manage their access across their AWS accounts and applications. With Amazon Cognito, you can authenticate and authorize users from the built-in user directory, from your enterprise directory, and from other consumer identity providers. AWS Identity and Access Management (IAM) provides fine-grained access control. You can use IAM to specify who can access which FMs and resources to maintain least privilege permissions.

For example, in one common scenario with Cognito that accesses resources with API Gateway and Lambda with a user pool. In the following diagram, when your user signs in to an Amazon Cognito user pool, your application receives JSON Web Tokens (JWTs). You can use groups in a user pool to control permissions with API Gateway by mapping group membership to IAM roles. You can submit your user pool tokens with a request to API Gateway for verification by an Amazon Cognito authorizer Lambda function. For more information, see Using API Gateway with Amazon Cognito user pools.

It is recommended that you don’t use API keys for authentication or authorization to control access to your APIs. Instead, use an IAM role, a Lambda authorizer, or an Amazon Cognito user pool.

Model onboarding

A key aspect of the generative AI gateway is allowing controlled access to foundation and custom models across tenants. For FMs available through Amazon Bedrock, the model onboarding component maintains an allowlist of approved models that tenants can access. You can use a service such as Amazon DynamoDB to track allowlisted models. Similarly, for custom models deployed on Amazon SageMaker, the component tracks which tenants have access to which model versions through entries in the DynamoDB registry table.

To enforce access control, you can use AWS Lambda authorizers with Amazon API Gateway. When a tenant application calls the model invocation API, the Lambda authorizer verifies the tenant’s identity and checks if they have permission to access the requested model based on the DynamoDB registry table. If access is permitted, temporary credentials are issued, which scope down the tenant’s permissions to just the allowed model(s). This prevents tenants from accessing models they shouldn’t have access to. The authorizer logic can be customized based on an organization’s model access policies and governance requirements.

This approach supports model end of life. By managing the model from the allowlist in the DynamoDB registry table for all or selected tenants, models not included aren’t usable automatically, with no further code changes required in the solution.

Model registry

A model registry helps manage and track different versions of custom models. Services such as Amazon SageMaker Model Registry and Amazon DynamoDB help track available models, associated generated model artifacts, and lineage. A model registry offers the following:

    Version control – To track different versions of the generative AI models. Model lineage and provenance – To track the lineage and provenance of each model version, including information about the training data, hyperparameters, model architecture, and other relevant metadata that describes the model’s origin and characteristics. Model deployment and rollback – To facilitate the deployment and usage of new model versions into production environments and the rollback to previous versions if necessary. This makes sure that models can be updated or reverted seamlessly without disrupting the system’s operation. Model governance and compliance – To verify that model versions are properly documented, audited, and conform to relevant policies or regulations. This is particularly useful in regulated industries or environments with strict compliance requirements.

Observability

Observability is crucial for monitoring the health of your application, troubleshooting issues, usage of FMs, and optimizing performance and costs.

Logging and monitoring

Amazon CloudWatch is a powerful monitoring and observability service that allows you to collect and analyze logs from your application components, including API Gateway, Amazon Bedrock, Amazon SageMaker, and custom services. Using CloudWatch to capture tenant identity in the logs across the whole stack helps you gain insights into the performance and health of your generative AI gateway down to the tenant level and proactively identify and resolve issues before they escalate. You can also set up alarms to get notified in case of unexpected behavior. Both Amazon SageMaker and Amazon Bedrock are integrated with AWS CloudTrail.

Metering

Metering helps collect, aggregate, and analyze operational and usage data and performance metrics from different parts of the solution. In systems that offer pay-per-use or subscription-based models, metering is crucial for accurately measuring and reporting resource consumption for billing purposes across the different tenants.

In this solution, you need to track the usage of FMs to effectively manage costs and optimize resource utilization. Collecting information related to the models used, number of tokens provided as input, tokens generated as output, AWS Region used, and applying tags related to the team helps you streamline the cost allocation and billing processes. You can log structured data during interactions with the FMs and collect this usage information. The following diagram shows an implementation where the Lambda function logs per tenant information in Amazon CloudWatch and invokes Amazon Bedrock. The invocation generates an AWS CloudTrail event.

Auditing

You can use an AWS Lambda function to aggregate the data from Amazon CloudWatch and store it in S3 buckets for long-term storage and further analysis. Amazon S3 provides a highly durable, scalable, and cost-effective object storage solution, making it an ideal choice for storing large volumes of data. For implementation details, refer to part 1 of this series, Build an internal SaaS service with cost and usage tracking for foundation models on Amazon Bedrock.

Once the data is in Amazon S3, you can use AWS analytics services such as Amazon Athena, AWS Glue Data Catalog, and Amazon QuickSight to uncover patterns in the cost and usage data, generate reports, visualize trends, and make informed decisions about resource allocation, budget forecasting, and cost optimization strategies. With AWS Glue Data Catalog, a centralized metadata repository, and Amazon Athena, an interactive query service, you can run one-time SQL queries directly on the data stored in Amazon S3. The following example describes usage and cost per model per tenant in Athena.

Scaling across the enterprise

The following are some design considerations for when you scale this solution across hundreds of LOBs and teams within an organization.

Conclusion

A generative AI multi-tenant architecture helps you maintain security, governance, and cost controls while scaling the use of generative AI across multiple use cases and teams. In this post, we presented a reference multi-tenant architecture to help you accelerate generative AI adoption. We showed how to standardize common generative AI components and how to expose them as shared services. The proposed architecture also addressed key aspects of governance, security, observability, and responsible AI. Finally, we discussed key considerations when scaling this architecture to hundreds of teams.

If you want to read more about this topic, check out also the following resources:

Let us know what you think in the comments section!


About the authors

Anastasia Tzeveleka is a Senior Generative AI/ML Specialist Solutions Architect at AWS. As part of her work, she helps customers across EMEA build foundation models and create scalable generative AI and machine learning solutions using AWS services.

Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS, working with Healthcare and Life Sciences customers. Hasan helps design, deploy and scale Generative AI and Machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise include: Machine Learning end to end, Machine Learning Industrialization, and Generative AI. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations

Vikesh Pandey is a Principal Generative AI/ML Solutions architect, specialising in financial services where he helps financial customers build and scale Generative AI/ML platforms and solution which scales to hundreds to even thousands of users. In his spare time, Vikesh likes to write on various blog forums and build legos with his kid.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

生成式AI Amazon Bedrock 多租户 基础模型 AWS
相关文章