AWS Machine Learning Blog, November 23, 2024
Governing the ML lifecycle at scale, Part 3: Setting up data governance at scale

 

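This post explores how to use Amazon DataZone to set up data governance at scale for a data mesh. A data mesh is a modern approach to data management that decentralizes data ownership and treats data as a product. Different business units within an organization can then create, share, and govern their own data assets, which promotes self-service analytics and reduces the time needed to turn data experiments into production-ready applications. The data mesh architecture aims to increase the return on investment in data teams, processes, and technology, ultimately driving business value through innovative analytics and ML projects across the enterprise. Using the financial services industry as an example, the post also shows how Amazon DataZone helps banks securely access and use customer datasets to design and run marketing campaigns tailored to specific customer needs and preferences.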

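🤔 **Traditional data management and governance face challenges:** Traditional approaches rely on tedious manual processes, custom scripts, and disconnected tools. As a result, data assets are hard to discover, data policies are hard to enforce, data lineage is hard to understand, and centralized data governance is missing, which leads to data silos, compliance problems, and inefficient data use.

💡 **Amazon DataZone simplifies data governance and sharing:** Amazon DataZone provides a comprehensive solution that automatically discovers and catalogs data assets across multiple AWS accounts and VPCs, defines and enforces consistent governance policies, tracks data lineage, and shares data securely with fine-grained access controls, all from a single platform.

🏦 **Financial services marketing use case:** Effective marketing campaigns are crucial in the financial services industry. The data governance capabilities of Amazon DataZone let banks securely access and use their comprehensive customer datasets to design and implement targeted marketing campaigns for financial products such as certificates of deposit, investment portfolios, and loan offerings.

⚙️ **Multi-account ML platform reference architecture:** The post presents an ML platform reference architecture built on various AWS services and highlights the role of Amazon DataZone in the data management and governance layer, which includes the management account, the data governance account, the data lake accounts (producers), and the data science team accounts (consumers).

🚀 **Amazon DataZone setup guide:** The post provides step-by-step guidance for setting up Amazon DataZone in a multi-account environment, covering account setup, blueprint enablement, user management, and project configuration for data publishers and subscribers.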

This post is part of an ongoing series about governing the machine learning (ML) lifecycle at scale. To view this series from the beginning, start with Part 1. This post dives deep into how to set up data governance at scale using Amazon DataZone for the data mesh. The data mesh is a modern approach to data management that decentralizes data ownership and treats data as a product. It enables different business units within an organization to create, share, and govern their own data assets, promoting self-service analytics and reducing the time required to convert data experiments into production-ready applications. The data mesh architecture aims to increase the return on investments in data teams, processes, and technology, ultimately driving business value through innovative analytics and ML projects across the enterprise.

Organizations across industries are increasingly using data and ML to drive innovation, improve decision-making, and gain a competitive advantage. However, as data volumes and complexity continue to grow, effective data governance becomes a critical challenge. Organizations must make sure their data assets are properly managed, secured, and compliant with regulatory requirements, while also enabling seamless access and collaboration among various teams and stakeholders.

This post explores the role of Amazon DataZone, a comprehensive data management and governance service, in addressing these challenges at scale. We dive into a real-world use case from the financial services industry, where effective marketing campaigns are crucial for acquiring and retaining customers, as well as cross-selling products. By taking advantage of the data governance capabilities of Amazon DataZone, financial institutions like banks can securely access and use their comprehensive customer datasets to design and implement targeted marketing campaigns tailored to individual customer needs and preferences.

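We explore the following key aspects: the data management challenges that Amazon DataZone addresses, a financial services marketing use case, the multi-account ML platform reference architecture, and the setup of Amazon DataZone in a multi-account environment.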

By the end of this post, you will have a comprehensive understanding of how Amazon DataZone can empower organizations to establish centralized data governance, enforce consistent policies, and facilitate secure data sharing across teams and accounts, ultimately unlocking the full potential of your data assets while maintaining compliance and security.

Challenges in data management

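Traditionally, managing and governing data across multiple systems involved tedious manual processes, custom scripts, and disconnected tools. This approach was not only time-consuming but also prone to errors and difficult to scale. Organizations often struggled to discover data assets, enforce data policies, understand data lineage, and establish centralized governance, which led to data silos, compliance problems, and inefficient data use.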

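Amazon DataZone addresses these problems with a single platform for data management and governance. It automatically discovers and catalogs data assets across multiple AWS accounts and VPCs, defines and enforces consistent governance policies, tracks data lineage, and shares data securely with fine-grained access controls.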

Use case

In the competitive banking and financial services industry, effective marketing campaigns are crucial for acquiring new customers, retaining existing ones, and cross-selling products. With the data governance capabilities of Amazon DataZone, banks can securely access and use their own comprehensive customer datasets to design and implement targeted marketing campaigns for financial products, such as certificates of deposit, investment portfolios, and loan offerings. In this post, we discuss how banks can establish a centralized data catalog, enabling data publishers to share customer datasets and marketing teams to subscribe to relevant data using Amazon DataZone.

The following diagram gives a high-level illustration of the use case.

The diagram shows several accounts and personas as part of the overall infrastructure. In this use case, the accounts are the management account, the data governance account, the data lake accounts (producers), and the data science team accounts (consumers).

By separating these accounts and their responsibilities, the organization can maintain a clear separation of duties, enforce appropriate access controls, and make sure data governance policies are consistently applied across the entire data lifecycle. The data governance account, acting as the central hub, enables seamless data sharing and collaboration between the data producers (data lake accounts) and data consumers (data science team accounts), while meeting data privacy, security, and compliance requirements.
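One way to picture this separation of duties is as a lookup from account type to the actions it may perform. The sketch below is illustrative only: the account types come from this post, but the action names and the `is_allowed` helper are assumptions made up for the example and are not part of Amazon DataZone.

```python
from enum import Enum


class Account(Enum):
    GOVERNANCE = "data governance account"
    DATA_LAKE = "data lake account (producer)"
    DATA_SCIENCE = "data science team account (consumer)"


# Hypothetical action names, chosen only to illustrate separation of duties.
ALLOWED_ACTIONS = {
    Account.GOVERNANCE: {"define_policy", "manage_catalog"},
    Account.DATA_LAKE: {"ingest_data", "publish_asset"},
    Account.DATA_SCIENCE: {"request_subscription", "consume_asset"},
}


def is_allowed(account: Account, action: str) -> bool:
    """Return True if the given account type may perform the action."""
    return action in ALLOWED_ACTIONS.get(account, set())


print(is_allowed(Account.DATA_LAKE, "publish_asset"))     # True
print(is_allowed(Account.DATA_SCIENCE, "publish_asset"))  # False
```

Keeping the mapping in one place mirrors the idea of a central hub: a change to who may do what happens once, rather than in each account.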

Solution overview

The following diagram illustrates the ML platform reference architecture using various AWS services. The functional architecture with different capabilities is implemented using a number of AWS services, including AWS Organizations, Amazon SageMaker, AWS DevOps services, and a data lake. For more information about the architecture in detail, refer to Part 1 of this series. In this post, we focus on the highlighted Amazon DataZone section.

The data management services function is organized through the data lake accounts (producers) and data science team accounts (consumers).

The data lake accounts are responsible for storing and managing the enterprise’s raw, curated, and aggregated datasets. Data engineers and data publishers work within these accounts to ingest, process, and publish data assets that can be consumed by other teams, such as the marketing team or data science teams. In the bank marketing use case, the data lake accounts would store and manage the bank’s customer data, including raw data from various sources, curated datasets with customer profiles, and aggregated datasets for marketing segmentation.

As producers, data engineers in these accounts are responsible for creating, transforming, and managing data assets that will be cataloged and governed by Amazon DataZone. They make sure data is produced consistently and reliably, adhering to the organization’s data governance rules and standards set up in the data governance account. Data engineers contribute to the data lineage process by providing the necessary information and metadata about the data transformations they perform.

Amazon DataZone maintains the data lineage information itself, building it from the metadata that data engineers provide. This lineage enables traceability and impact analysis of data transformations across the organization.
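To make impact analysis concrete, the following minimal sketch models lineage as a directed graph and walks it to find every dataset derived from a given source. It is a toy model under stated assumptions, not how Amazon DataZone stores lineage: the `LineageGraph` class, its methods, and the dataset names are all hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage store: edges point from a source dataset to a derived one."""

    def __init__(self):
        self._downstream = defaultdict(set)

    def record_transformation(self, source: str, target: str) -> None:
        """Register that `target` was derived from `source`."""
        self._downstream[source].add(target)

    def impacted_by(self, dataset: str) -> list:
        """Return every dataset derived, directly or indirectly, from `dataset`."""
        seen = set()
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            for child in self._downstream.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return sorted(seen)


# Hypothetical dataset names loosely based on the bank marketing use case.
lineage = LineageGraph()
lineage.record_transformation("raw_customers", "curated_customer_profiles")
lineage.record_transformation("curated_customer_profiles", "marketing_segments")
lineage.record_transformation("raw_transactions", "marketing_segments")

print(lineage.impacted_by("raw_customers"))
# ['curated_customer_profiles', 'marketing_segments']
```

A breadth-first walk is enough here because the question is only which datasets are reachable; a real lineage service would also record the transformation details on each edge.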

The data science team accounts are used by data analysts, data scientists, or marketing teams to access and consume the published data assets from the data lake accounts. Within these accounts, they can perform analyses, build models, or design targeted marketing campaigns by using the governed and curated datasets made available through the data sharing and access control mechanisms of Amazon DataZone. For example, in the bank marketing use case, the data science team accounts would be used by the bank’s marketing teams to access and analyze customer datasets, build predictive models for targeted marketing campaigns, and design personalized financial product offerings based on the shared customer data.
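The publish-and-subscribe flow between producers and consumers can be sketched as a small state machine. This is a simplified model, not the Amazon DataZone API: the `Catalog` class, its method names, and the explicit approval step are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy catalog: tracks published assets and who may read them."""

    published: set = field(default_factory=set)
    pending: set = field(default_factory=set)  # (consumer, asset) awaiting approval
    granted: set = field(default_factory=set)  # (consumer, asset) approved

    def publish(self, asset: str) -> None:
        """A producer makes an asset discoverable in the catalog."""
        self.published.add(asset)

    def request_subscription(self, consumer: str, asset: str) -> None:
        """A consumer asks for access to a published asset."""
        if asset not in self.published:
            raise KeyError(f"{asset!r} is not published in the catalog")
        self.pending.add((consumer, asset))

    def approve(self, consumer: str, asset: str) -> None:
        """Move a pending request to granted; raises KeyError if none is pending."""
        self.pending.remove((consumer, asset))
        self.granted.add((consumer, asset))

    def can_read(self, consumer: str, asset: str) -> bool:
        return (consumer, asset) in self.granted


catalog = Catalog()
catalog.publish("customer_profiles")
catalog.request_subscription("marketing_team", "customer_profiles")
print(catalog.can_read("marketing_team", "customer_profiles"))  # False
catalog.approve("marketing_team", "customer_profiles")
print(catalog.can_read("marketing_team", "customer_profiles"))  # True
```

The point of the sketch is that access is never implied by discovery: a consumer can see that an asset exists, but reads are allowed only after an explicit grant.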

Using Amazon DataZone in a multi-account ML platform

You can find practical, step-by-step instructions for implementing this setup in module 2 of the AWS Multi-Account Data & ML Governance Workshop. The workshop provides detailed guidance on setting up Amazon DataZone in the central governance account.

Conclusion

Effective governance is crucial for organizations to unlock their data’s potential while maintaining compliance and security. Amazon DataZone provides a comprehensive solution for data management and governance at scale, automating complex tasks like data cataloging, policy enforcement, lineage tracking, and secure data sharing.

As demonstrated in the financial services use case, Amazon DataZone empowers organizations to establish a centralized data catalog, enforce consistent governance policies, and facilitate secure data sharing between data producers and consumers. Financial institutions can use Amazon DataZone to gain a competitive edge by designing and implementing effective, tailored marketing campaigns while adhering to data privacy and compliance regulations.

The multi-account ML platform architecture, combined with Amazon DataZone and other AWS services, provides a scalable and secure foundation for governing data and ML workflows effectively. By following the outlined steps, you can streamline the setup and management of Amazon DataZone, enabling seamless collaboration between stakeholders involved in the data and ML lifecycle.

As data generation and utilization continue to grow, robust data governance solutions become paramount. Amazon DataZone offers a powerful approach to data management and governance, empowering organizations to unlock their data’s true value while maintaining the highest standards of security, compliance, and data privacy.


About the Authors

Ajit Mungale is a Senior Solutions Architect at Amazon Web Services with specialization in AI/ML/Generative AI, IoT, and .NET technologies. At AWS, he helps customers build, migrate, and create new cost-effective cloud solutions. He possesses extensive experience in developing distributed applications and has worked with multiple cloud platforms. With his deep technical knowledge and business understanding, Ajit guides organizations in leveraging the full capabilities of the cloud.

Ram Vittal is a Principal Generative AI Solutions Architect at AWS. He has over three decades of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure, scalable, reliable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he rides his motorcycle and walks with his sheep-a-doodle!
