AWS Machine Learning Blog
Governing the ML lifecycle at scale, Part 4: Scaling MLOps with security and governance controls

This post describes how to build and govern a multi-account machine learning (ML) platform that addresses the challenges data science teams face when moving models from the development environment to production. By providing secure self-service environments, predefined templates that accelerate model development, a centralized model registry for collaboration and reuse, and standardized model approval and deployment processes, the platform helps enterprises manage the ML lifecycle securely, nimbly, and efficiently. Role separation and permission controls ensure that every stage conforms to the organization's security, monitoring, and governance standards, ultimately improving cross-functional collaboration.

🔑 **Clear roles and responsibilities**: The post details the roles involved in the ML lifecycle, such as the lead data scientist, data scientist, ML engineer, governance officer, and platform engineer, and defines each role's specific responsibilities in model development, deployment, and governance, ensuring the whole process runs smoothly.

🗂️ **Multi-account architecture design**: The post introduces the architecture of the multi-account ML platform, including the ML Shared Services Account, ML Dev Account, ML Test Account, ML Prod Account, and data governance account. Each account has a clearly defined function and purpose, and together they form a secure, isolated, and efficient ML environment.

🚀 **Automation accelerates model delivery**: The platform automates many repetitive manual steps, for example by using SageMaker project templates to speed up model training and deployment, letting data scientists focus on model building and data insights rather than infrastructure management. A centralized model registry promotes cross-team collaboration and reduces duplicated work.

Data science teams often face challenges when transitioning models from the development environment to production. These include difficulties integrating the data science team's models into the IT team's production environment, the need to retrofit data science code to meet enterprise security and governance standards, gaining access to production-grade data, and maintaining repeatability and reproducibility in machine learning (ML) pipelines, which can be difficult without proper platform infrastructure and standardized templates.

This post, part of the “Governing the ML lifecycle at scale” series (Part 1, Part 2, Part 3), explains how to set up and govern a multi-account ML platform that addresses these challenges. The platform provides self-service provisioning of secure environments for ML teams, accelerated model development with predefined templates, a centralized model registry for collaboration and reuse, and standardized model approval and deployment processes.

An enterprise might have the following roles involved in the ML lifecycle. The functions of each role can vary from company to company. In this post, we assign functions to each role in terms of the ML lifecycle as follows:

This ML platform provides several key benefits. First, it enables every step in the ML lifecycle to conform to the organization’s security, monitoring, and governance standards, reducing overall risk. Second, the platform gives data science teams the autonomy to create accounts, provision ML resources, and access them as needed, reducing the resource constraints that often hinder their work.

Additionally, the platform automates many of the repetitive manual steps in the ML lifecycle, allowing data scientists to focus their time and effort on building ML models and discovering insights from the data rather than managing infrastructure. The centralized model registry also promotes collaboration across teams and enables centralized model governance, increasing visibility into models developed throughout the organization and reducing duplicated work.

Finally, the platform standardizes the process for business stakeholders to review and consume models, streamlining collaboration between the data science and business teams. This makes sure models can be quickly tested, approved, and deployed to production to deliver value to the organization.

Overall, this holistic approach to governing the ML lifecycle at scale provides significant benefits in terms of security, agility, efficiency, and cross-functional alignment.

In the next section, we provide an overview of the multi-account ML platform and how the different roles collaborate to scale MLOps.

Solution overview

The following architecture diagram illustrates the solution for a multi-account ML platform and how the different personas collaborate within this platform.

There are five accounts illustrated in the diagram: the ML Shared Services Account, the ML Dev Account, the ML Test Account, the ML Prod Account, and the data governance account.

Key activities and actions are numbered in the preceding diagram. Some of these activities are performed by various personas, whereas others are automatically triggered by AWS services.

    1. ML engineers create the pipelines in GitHub repositories, and the platform engineer converts them into two different Service Catalog portfolios: the ML Admin Portfolio and the SageMaker Projects Portfolio. The ML Admin Portfolio will be used by the lead data scientist to create AWS resources (for example, SageMaker domains). The SageMaker Projects Portfolio has SageMaker projects that data scientists and ML engineers can use to accelerate model training and deployment.
    2. The platform engineer shares the two Service Catalog portfolios with workload accounts in the organization.
    3. The data engineer prepares and governs datasets using services such as Amazon S3, AWS Lake Formation, and Amazon DataZone for ML.
    4. The lead data scientist uses the ML Admin Portfolio to set up SageMaker domains and the SageMaker Projects Portfolio to set up SageMaker projects for their teams.
    5. Data scientists subscribe to datasets, and use SageMaker notebooks to analyze data and develop models.
    6. Data scientists use the SageMaker projects to build model training pipelines. These SageMaker projects automatically register the models in the model registry.
    7. The lead data scientist approves the model locally in the ML Dev Account. This triggers an event bus in Amazon EventBridge that ships the event to the ML Shared Services Account. The event in EventBridge triggers an AWS Lambda function that copies the model artifacts (managed by SageMaker, or Docker images) from the ML Dev Account into the ML Shared Services Account, creates a model package there, and registers the new model in the model registry in the ML Shared Services Account. (A sketch of the event rule follows this list.)
    8. ML engineers review and approve the new model in the ML Shared Services Account for testing and deployment. This action triggers a pipeline that was set up using a SageMaker project. The approved models are first deployed to the ML Test Account, where integration tests are run and the endpoint is validated before the model is approved for production deployment.
    9. After testing, the governance officer approves the new model in CodePipeline. The pipeline then continues to deploy the model into the ML Prod Account and creates a SageMaker endpoint.
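The cross-account promotion in step 7 relies on an EventBridge rule in the ML Dev Account that matches model package approval events and forwards them to the event bus in the ML Shared Services Account. The following is a minimal AWS CLI sketch of such a rule; the rule name, target ID, and cross-account events role are hypothetical placeholders, and in this solution the equivalent wiring is provisioned for you by the platform templates:

    aws events put-rule \
        --name model-approved-to-shared-services \
        --event-pattern '{
            "source": ["aws.sagemaker"],
            "detail-type": ["SageMaker Model Package State Change"],
            "detail": {"ModelApprovalStatus": ["Approved"]}
        }'

    # Forward matching events to the default event bus in the ML Shared Services
    # Account; the role must allow events:PutEvents on the destination bus.
    aws events put-targets \
        --rule model-approved-to-shared-services \
        --targets 'Id=shared-services-bus,Arn=arn:aws:events:<region>:<ML Shared Services account id>:event-bus/default,RoleArn=arn:aws:iam::<ML Dev account id>:role/<cross-account-events-role>'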

The following sections provide details on the key components of this diagram, how to set them up, and sample code.

Set up the ML Shared Services Account

The ML Shared Services Account helps the organization standardize management of artifacts and resources across data science teams. This standardization also helps enforce controls across resources consumed by data science teams.

The ML Shared Services Account has the following features:

Service Catalog portfolios – This includes the ML Admin Portfolio and the SageMaker Projects Portfolio, which are shared with workload accounts.

The following diagram illustrates this architecture.

As the first step, the cloud admin sets up the ML Shared Services Account by using one of the blueprints for customizations in AWS Control Tower account vending, as described in Part 1.

In the following sections, we walk through how to set up the ML Admin Portfolio. The same steps can be used to set up the SageMaker Projects Portfolio.

Bootstrap the infrastructure for two portfolios

After the ML Shared Services Account has been set up, the ML platform admin can bootstrap the infrastructure for the ML Admin Portfolio using sample code in the GitHub repository. The code contains AWS CloudFormation templates that can be later deployed to create the SageMaker Projects Portfolio.

Complete the following steps:

    Clone the GitHub repo to a local directory:
    git clone https://github.com/aws-samples/data-and-ml-governance-workshop.git
    Change the directory to the portfolio directory:
    cd data-and-ml-governance-workshop/module-3/ml-admin-portfolio
    Install dependencies in a separate Python environment using your preferred Python package manager:
    python3 -m venv env
    source env/bin/activate
    pip install -r requirements.txt
    Bootstrap your deployment target account using the following command:
    cdk bootstrap aws://<target account id>/<target region> --profile <target account profile>

    If you already have a role and AWS Region from the account set up, you can use the following command instead:

    cdk bootstrap
    Lastly, deploy the stack:
    cdk deploy --all --require-approval never

When it’s ready, you can see the MLAdminServicesCatalogPipeline stack on the AWS CloudFormation console.

Navigate to the AWS CodeStar Connections page on the console, where you can see a connection named codeconnection-service-catalog. The connection must be linked to GitHub before you can integrate it with your pipelines and start pushing code. Choose Update pending connection to integrate the connection with your GitHub account.

When that is done, create empty GitHub repositories to start pushing code to. For example, you can create a repository called ml-admin-portfolio-repo. Every project you deploy needs a repository created in GitHub beforehand.
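If you use the GitHub CLI, you can create the empty repository from the command line instead; a minimal sketch, assuming gh is installed and authenticated and <your-org> stands in for your GitHub organization or user name:

    gh repo create <your-org>/ml-admin-portfolio-repo --private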

Trigger CodePipeline to deploy the ML Admin Portfolio

Complete the following steps to trigger the pipeline to deploy the ML Admin Portfolio. We recommend creating a separate folder for the different repositories that will be created in the platform.

    Get out of the cloned repository and create a parallel folder called platform-repositories:
    cd ../../.. # (as many .. as directories you have moved in)
    mkdir platform-repositories
    Clone and fill the empty created repository:
    cd platform-repositories
    git clone https://github.com/example-org/ml-admin-service-catalog-repo.git
    cd ml-admin-service-catalog-repo
    cp -aR ../../ml-platform-shared-services/module-3/ml-admin-portfolio/. .
    Push the code to the Github repository to create the Service Catalog portfolio:
    git add .
    git commit -m "Initial commit"
    git push -u origin main

After it is pushed, the GitHub repository we created earlier is no longer empty. The new code push triggers the pipeline named cdk-service-catalog-pipeline to build and deploy artifacts to Service Catalog.

It takes about 10 minutes for the pipeline to finish running. When it’s complete, you can find a portfolio named ML Admin Portfolio on the Portfolios page on the Service Catalog console.
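You can also monitor progress from the command line while the pipeline runs; for example, using the pipeline name from the previous step:

    aws codepipeline get-pipeline-state \
        --name cdk-service-catalog-pipeline \
        --query 'stageStates[].{stage:stageName,status:latestExecution.status}'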

Repeat the same steps to set up the SageMaker Projects Portfolio; this time, use the sample code (sagemaker-projects-portfolio) and create a new code repository (with a name such as sm-projects-service-catalog-repo).

Share the portfolios with workload accounts

You can share the portfolios with workload accounts in Service Catalog. Again, we use ML Admin Portfolio as an example.

    On the Service Catalog console, choose Portfolios in the navigation pane. Choose the ML Admin Portfolio.
    On the Share tab, choose Share.
    In the Account info section, provide the following information:
      For Select how to share, select Organization node.
      Choose Organizational Unit, then enter the organizational unit (OU) ID of the workloads OU.
    In the Share settings section, select Principal sharing, then choose Share. Selecting the Principal sharing option allows you to specify the AWS Identity and Access Management (IAM) roles, users, or groups by name for which you want to grant permissions in the shared accounts.
    On the portfolio details page, on the Access tab, choose Grant access.
    For Select how to grant access, select Principal Name.
    In the Principal Name section, choose role/ for Type and enter the name of the role that the ML admin will assume in the workload accounts for Name.
    Choose Grant access.
    Repeat these steps to share the SageMaker Projects Portfolio with workload accounts.
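The same sharing can be scripted with the AWS CLI; the following is a minimal sketch, where the portfolio ID, OU ID, and role name are placeholders to replace with your own values:

    aws servicecatalog create-portfolio-share \
        --portfolio-id <portfolio id> \
        --organization-node Type=ORGANIZATIONAL_UNIT,Value=<workloads OU id> \
        --share-principals

    # Grant access by principal name; the account-less ARN pattern is applied
    # in every account the portfolio is shared with.
    aws servicecatalog associate-principal-with-portfolio \
        --portfolio-id <portfolio id> \
        --principal-type IAM_PATTERN \
        --principal-arn "arn:aws:iam:::role/<ML admin role name>"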

Confirm available portfolios in workload accounts

If the sharing was successful, you should see both portfolios available on the Service Catalog console, on the Portfolios page under Imported portfolios.
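From a workload account, you can also confirm the imported portfolios with the AWS CLI; a quick check, assuming the portfolios were shared through AWS Organizations:

    aws servicecatalog list-accepted-portfolio-shares --portfolio-share-type AWS_ORGANIZATIONS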

Now that the service catalogs in the ML Shared Services Account have been shared with the workloads OU, the data science team can provision resources such as SageMaker domains using the templates and set up SageMaker projects to accelerate ML model development while complying with the organization’s best practices.

We demonstrated how to create and share portfolios with workload accounts. However, the journey doesn’t stop here. The ML engineer can continue to evolve existing products and develop new ones based on the organization’s requirements.

The following sections describe the processes involved in setting up ML Development Accounts and running ML experiments.

Set up the ML Development Account

The ML Development account setup consists of the following tasks and stakeholders:

    The team lead requests the cloud admin to provision the ML Development Account.
    The cloud admin provisions the account.
    The team lead uses the shared Service Catalog portfolios to provision SageMaker domains, set up IAM roles and grant access, and get access to data in Amazon S3, Amazon DataZone, or AWS Lake Formation, or a central feature group, depending on which solution the organization decides to use (see the provisioning sketch after this list).
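Provisioning from a shared portfolio can be done on the Service Catalog console or scripted with the AWS CLI. The following is a minimal sketch; the product name, artifact version, and parameters are hypothetical and depend on how the products in your ML Admin Portfolio are defined:

    aws servicecatalog provision-product \
        --product-name "<SageMaker domain product name>" \
        --provisioning-artifact-name "<artifact version, for example v1>" \
        --provisioned-product-name team-a-sagemaker-domain \
        --provisioning-parameters Key=<parameter name>,Value=<parameter value>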

Run ML experiments

Part 3 in this series described multiple ways to share data across the organization. The current architecture allows data access using the following methods:

You can choose which option to use depending on your setup. For options 2, 3, and 4, the SageMaker Projects Portfolio provides project templates to run ML experiment pipelines, with steps including data ingestion, model training, and registering the model in the model registry.

In the following example, we use option 2 to demonstrate how to build and run an ML pipeline using a SageMaker project that was shared from the ML Shared Services Account.

    On the SageMaker Studio domain, under Deployments in the navigation pane, choose Projects.
    Choose Create project. There is a list of projects that serve various purposes. Because we want to access data stored in an S3 bucket for training the ML model, choose the project that uses data in an S3 bucket on the Organization templates tab.
    Follow the steps to provide the necessary information, such as Name, Tooling Account (the ML Shared Services account ID), and S3 bucket (for MLOps), and then create the project.

It takes a few minutes to create the project.

After the project is created, a SageMaker pipeline is triggered to perform the steps specified in the SageMaker project. Choose Pipelines in the navigation pane to see the pipeline. You can choose the pipeline to see the Directed Acyclic Graph (DAG) of the pipeline. When you choose a step, its details show in the right pane.
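You can also follow the pipeline run from the command line; a minimal sketch, assuming the pipeline name shown in SageMaker Studio:

    aws sagemaker list-pipeline-executions --pipeline-name <pipeline name>

    # Inspect the status and details of a specific run
    aws sagemaker describe-pipeline-execution --pipeline-execution-arn <pipeline execution arn>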

The last step of the pipeline is registering the model in the current account’s model registry. As the next step, the lead data scientist will review the models in the model registry, and decide if a model should be approved to be promoted to the ML Shared Services Account.

Approve ML models

The lead data scientist should review the trained ML models and approve the candidate model in the model registry of the development account. After an ML model is approved, it triggers a local event, the event buses in EventBridge send model approval events to the ML Shared Services Account, and the artifacts of the model are copied to the central model registry. A model card is created if the model is new; otherwise, a new version is added to the existing model card.
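The approval can be done in the SageMaker Studio model registry UI or programmatically. The following is a minimal CLI sketch, where the model package group name and ARN are placeholders:

    # Find the latest model package in the group
    aws sagemaker list-model-packages \
        --model-package-group-name <model package group name> \
        --sort-by CreationTime --sort-order Descending

    # Approving the package emits the model package state change event
    # that EventBridge forwards to the ML Shared Services Account
    aws sagemaker update-model-package \
        --model-package-arn <model package arn> \
        --model-approval-status Approved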

The following architecture diagram shows the flow of model approval and model promotion.

Model deployment

After the previous step, the model is available in the central model registry in the ML Shared Services Account. ML engineers can now deploy the model.

If you used the sample code to bootstrap the SageMaker Projects Portfolio, you can use the Deploy real-time endpoint from ModelRegistry – Cross account, test and prod option in SageMaker Projects to set up a pipeline that deploys the model to the target test account and production account.

    On the SageMaker Studio console, choose Projects in the navigation pane.
    Choose Create project. On the Organization templates tab, you can view the templates that were populated earlier from Service Catalog when the domain was created.
    Select the template Deploy real-time endpoint from ModelRegistry – Cross account, test and prod and choose Select project template.
    Fill in the template:
      The SageMakerModelPackageGroupName is the model group name of the model promoted from the ML Dev Account in the previous step.
      Enter the Deployments Test Account ID for PreProdAccount, and the Deployments Prod Account ID for ProdAccount.

The pipeline for deployment is now ready. The ML engineer reviews the newly promoted model in the ML Shared Services Account. If the ML engineer approves the model, it triggers the deployment pipeline. You can see the pipeline on the CodePipeline console.


The pipeline first deploys the model to the test account and then pauses for manual approval before deploying to the production account. The ML engineer can test the performance and the governance officer can validate the model results in the test account. If the results are satisfactory, the governance officer can approve the model in CodePipeline to deploy it to the production account.
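As a quick validation in the test account, you can check the endpoint status and send a sample request; a minimal sketch, where the endpoint name and request body are placeholders that depend on your model:

    aws sagemaker describe-endpoint \
        --endpoint-name <endpoint name> \
        --query EndpointStatus

    # Send a sample inference request (AWS CLI v2)
    aws sagemaker-runtime invoke-endpoint \
        --endpoint-name <endpoint name> \
        --content-type application/json \
        --cli-binary-format raw-in-base64-out \
        --body '{"instances": [[0.5, 1.2, 3.4]]}' \
        response.json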

Conclusion

This post provided detailed steps for setting up the key components of a multi-account ML platform. This includes configuring the ML Shared Services Account, which manages the central templates, model registry, and deployment pipelines; sharing the ML Admin and SageMaker Projects Portfolios from the central Service Catalog; and setting up the individual ML Development Accounts where data scientists can build and train models.

The post also covered the process of running ML experiments using the SageMaker Projects templates, as well as the model approval and deployment workflows. Data scientists can use the standardized templates to speed up their model development, and ML engineers and stakeholders can review, test, and approve the new models before promoting them to production.

This multi-account ML platform design follows a federated model, with a centralized ML Shared Services Account providing governance and reusable components, and a set of development accounts managed by individual lines of business. This approach gives data science teams the autonomy they need to innovate, while providing enterprise-wide security, governance, and collaboration.

We encourage you to test this solution by following the AWS Multi-Account Data & ML Governance Workshop to see the platform in action and learn how to implement it in your own organization.


About the authors

Jia (Vivian) Li is a Senior Solutions Architect at AWS, specializing in AI/ML. She currently supports customers in the financial industry. Prior to joining AWS in 2022, she had 7 years of experience helping enterprise customers use AI/ML in the cloud to drive business results. Vivian has a BS from Peking University and a PhD from the University of Southern California. In her spare time, she enjoys water activities and hiking in the beautiful mountains of her home state, Colorado.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 3 decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, and reliable AI/ML and big data solutions that help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys riding his motorcycle and walking with his dogs.

Dr. Alessandro Cerè is a GenAI Evaluation Specialist and Solutions Architect at AWS. He assists customers across industries and regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations. Bringing a unique perspective to the field of AI, Alessandro has a background in quantum physics and research experience in quantum communications and quantum memories. In his spare time, he pursues his passion for landscape and underwater photography.

Alberto Menendez is a DevOps Consultant in Professional Services at AWS. He helps accelerate customers’ journeys to the cloud and achieve their digital transformation goals. In his free time, he enjoys playing sports, especially basketball and padel, spending time with family and friends, and learning about technology.

Sovik Kumar Nath is an AI/ML and generative AI senior solutions architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He holds master's degrees from the University of South Florida and the University of Fribourg, Switzerland, and a bachelor's degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Viktor Malesevic is a Senior Machine Learning Engineer within AWS Professional Services, leading teams to build advanced machine learning solutions in the cloud. He’s passionate about making AI impactful, overseeing the entire process from modeling to production. In his spare time, he enjoys surfing, cycling, and traveling.
