AWS Machine Learning Blog, September 19, 2024
Building an efficient MLOps platform with OSS tools on Amazon ECS with AWS Fargate

Zeta Global is a leading data-driven, cloud-based marketing technology company that uses machine learning and artificial intelligence to enhance its marketing platform and improve customer engagement. This post describes the architecture of Zeta Global's MLOps platform, which is built on open-source tools such as Airflow, Feast, dbt, and MLflow, and uses Amazon ECS and Fargate for scalability and efficiency. The platform streamlines ML workflows, including data ingestion, feature management, model training, and deployment, and improves data quality and consistency.

🤔 Zeta Global's MLOps platform aims to automate and monitor all stages of the machine learning lifecycle, which it achieves by integrating open-source tools such as Airflow, Feast, dbt, and MLflow.

🤖 The platform uses Amazon ECS and Fargate to provide a scalable, serverless platform where ML workflows run in containers. This setup not only simplifies infrastructure management but also ensures efficient use of resources, scaling up or down as needed.

🚀 Airflow handles workflow orchestration, scheduling and managing complex workflows by defining tasks and dependencies in Python code. For example, a directed acyclic graph (DAG) can automate data ingestion, processing, model training, and deployment tasks, ensuring each step runs in the correct order and at the right time.

📊 Feast acts as a feature store for storing and serving features, ensuring that models in both training and production environments use consistent, up-to-date data. It simplifies feature access for model training and inference, significantly reducing the time and complexity of managing data pipelines.

🗃️ dbt is used to transform data within the data warehouse, allowing data teams to define complex data models in SQL. It promotes a disciplined approach to data modeling, making it easier to ensure data quality and consistency across the ML pipelines.

📈 MLflow tracks experiments and manages models. It provides a unified interface for logging parameters, code versions, metrics, and artifacts, making it easier to compare experiments and manage the model lifecycle.

☁️ Amazon ECS offers a highly scalable and secure environment for running containerized applications. Fargate eliminates the need to manage underlying infrastructure, letting teams focus on deploying and running containers.

💪 Amazon ECS and Fargate integrate seamlessly with other AWS services, such as Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), and AWS Lambda, creating a cohesive ecosystem for deploying and managing applications.

🔒 Amazon ECS and Fargate integrate with CloudWatch, providing comprehensive monitoring and logging capabilities for containerized tasks.

💰 Using Fargate reduces costs because you pay only for the resources (vCPU and memory) that your containers use, which is more cost-effective than maintaining idle resources.

🛡️ Each Fargate task runs in its own isolated environment, improving security; the underlying compute resources aren't shared with other tenants.

🚀 Fargate automatically scales applications based on demand, ensuring optimal performance without manual intervention.

🏆 By using Amazon ECS and Fargate, Zeta Global was able to build a robust MLOps platform that helps the company improve efficiency, reduce costs, and improve the performance of its machine learning models.

🚀 This architecture is a strong example for any organization looking to streamline ML workflows, improve data quality and consistency, and build a scalable, efficient ML platform.

This post has been co-written with Artem Sysuev, Danny Portman, Matúš Chládek, and Saurabh Gupta from Zeta Global.

Zeta Global is a leading data-driven, cloud-based marketing technology company that empowers enterprises to acquire, grow and retain customers. The company’s Zeta Marketing Platform (ZMP) is the largest omnichannel marketing platform with identity data at its core. The ZMP analyzes billions of structured and unstructured data points to predict consumer intent by using sophisticated artificial intelligence (AI) to personalize experiences at scale. For more information, see Zeta Global’s home page.

What Zeta has accomplished in AI/ML

In the fast-evolving landscape of digital marketing, Zeta Global stands out with its groundbreaking advancements in artificial intelligence. Zeta’s AI innovations over the past few years span 30 pending and issued patents, primarily related to the application of deep learning and generative AI to marketing technology. Using AI, Zeta Global has revolutionized how brands connect with their audiences, offering solutions that aren’t just innovative, but also incredibly effective. As an early adopter of large language model (LLM) technology, Zeta released Email Subject Line Generation in 2021. This tool enables marketers to craft compelling email subject lines that significantly boost open rates and engagement, tailored perfectly to the audience’s preferences and behaviors.

Further expanding the capabilities of AI in marketing, Zeta Global has developed AI Lookalikes. This technology allows companies to identify and target new customers who closely resemble their best existing customers, thereby optimizing marketing efforts and improving their return on investment (ROI). The backbone of these advancements is ZOE, Zeta’s Optimization Engine. ZOE is a multi-agent LLM application that integrates with multiple data sources to provide a unified view of the customer, simplify analytics queries, and facilitate marketing campaign creation. Together, these AI-driven tools and technologies aren’t just reshaping how brands perform marketing tasks; they’re setting new benchmarks for what’s possible in customer engagement.

In addition to its groundbreaking AI innovations, Zeta Global has harnessed Amazon Elastic Container Service (Amazon ECS) with AWS Fargate to deploy a multitude of smaller models efficiently.

Zeta’s AI innovation is powered by a proprietary machine learning operations (MLOps) system, developed in-house.

Context

In early 2023, Zeta’s machine learning (ML) teams shifted from traditional vertical teams to a more dynamic horizontal structure, introducing the concept of pods comprising diverse skill sets. This paradigm shift aimed to accelerate project delivery by fostering collaboration and synergy among teams with varied expertise. The need for a centralized MLOps platform became apparent as ML and AI applications proliferated across various teams, leading to a maze of maintenance complexities and hindering knowledge transfer and innovation.

To address these challenges, the organization developed an MLOps platform based on four key open-source tools: Airflow, Feast, dbt, and MLflow. Hosted on Amazon ECS with tasks run on Fargate, this platform streamlines the end-to-end ML workflow, from data ingestion to model deployment. This blog post delves into the details of this MLOps platform, exploring how the integration of these tools facilitates a more efficient and scalable approach to managing ML projects.

Architecture overview

Our MLOps architecture is designed to automate and monitor all stages of the ML lifecycle. At its core, it integrates four open-source tools: Airflow for workflow orchestration, Feast for feature management, dbt for data transformation, and MLflow for experiment tracking and model management.

These components interact within the Amazon ECS environment, providing a scalable and serverless platform where ML workflows are run in containers using Fargate. This setup not only simplifies infrastructure management, but also ensures that resources are used efficiently, scaling up or down as needed.

The following figure shows the MLOps architecture.

Architectural deep dive

The following sections dive deep into each of the components used in this architecture.

Airflow for workflow orchestration

Airflow schedules and manages complex workflows, defining tasks and dependencies in Python code. An example directed acyclic graph (DAG) might automate data ingestion, processing, model training, and deployment tasks, ensuring that each step runs in the correct order and at the right time.

It's worth mentioning, though, that Airflow doesn't run the workloads itself at runtime, as it typically would for extract, transform, and load (ETL) tasks. Instead, every Airflow task calls an Amazon ECS task with some overrides. Additionally, we use a custom Airflow operator called ECSTaskLogOperator, which allows us to process Amazon CloudWatch logs using downstream systems.

model_training = ECSTaskLogOperator(
    task_id=<...>,
    task_definition=<...>,
    cluster=<...>,
    launch_type="FARGATE",
    aws_conn_id=<...>,
    overrides={
        "containerOverrides": [
            {
                "name": "<...>",
                "environment": [
                    {"name": "MLFLOW_TRACKING_URI", "value": "<...>"},
                ],
                "command": ["mlflow", "run", <...>],
            }
        ],
    },
)
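
For context, operators like the one above are wired into an ordinary Airflow DAG. The following is a minimal sketch of how such tasks might be chained; the DAG name, schedule, and task breakdown are illustrative assumptions rather than Zeta's actual pipeline, and <...> marks omitted arguments as in the snippet above:

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="ml_pipeline",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task launches a Fargate container; Airflow only orchestrates
    # and doesn't run the workload itself.
    data_ingestion = ECSTaskLogOperator(task_id="data_ingestion", <...>)
    model_training = ECSTaskLogOperator(task_id="model_training", <...>)
    batch_inference = ECSTaskLogOperator(task_id="batch_inference", <...>)

    # Dependencies ensure each step runs in the correct order.
    data_ingestion >> model_training >> batch_inference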

Feast for feature management

Feast acts as a central repository for storing and serving features, ensuring that models in both training and production environments use consistent and up-to-date data. It simplifies feature access for model training and inference, significantly reducing the time and complexity involved in managing data pipelines.

Additionally, Feast promotes feature reuse, which greatly reduces the time spent on data preparation.

from datetime import timedelta

from feast import Entity, FeatureView, FeatureService, Field, SnowflakeSource
from feast.types import Float64

entities = [
    Entity(name="site_id", join_keys=["SITE_ID"]),
    Entity(name="user_id", join_keys=["USER_ID"]),
]

def create_feature_view(name, table, field_name, schema_name):
    return FeatureView(
        name=name,
        entities=entities,
        ttl=timedelta(days=30),
        schema=[Field(name=field_name, dtype=Float64)],
        source=SnowflakeSource(
            database="<...>",
            schema=schema_name,
            table=table,
            timestamp_field="<...>",
        ),
        tags={"<...>": "<...>"},
    )

feature_view_1 = create_feature_view("<...>", "<...>", "<...>", "<...>")
feature_view_2 = create_feature_view("<...>", "<...>", "<...>", "<...>")

my_feature_service = FeatureService(
    name="my_feature_service",
    features=[feature_view_1, feature_view_2],
    description="This is my Feature Service",
    owner="<...>",
)
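
Once features are registered, training and serving code retrieves them through the same store, which is what keeps offline and online data consistent. A minimal sketch of online retrieval at inference time, with the feature reference and entity values as hypothetical examples:

from feast import FeatureStore

# Point the client at the feature repository defined above.
store = FeatureStore(repo_path=".")

# Fetch the latest feature values for specific entities.
online_features = store.get_online_features(
    features=["my_feature_view:my_feature"],  # hypothetical view:feature reference
    entity_rows=[{"SITE_ID": "site_123", "USER_ID": "user_456"}],
).to_dict()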

dbt for data transformation

dbt is used for transforming data within the data warehouse, allowing data teams to define complex data models in SQL. It promotes a disciplined approach to data modeling, making it easier to ensure data quality and consistency across the ML pipelines. Moreover, it provides a straightforward way to track data lineage, so we can foresee which datasets will be affected by newly introduced changes. The following figure shows a schema definition and a model that references it.
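
As a rough illustration of that pattern, a dbt project pairs a YAML schema definition with a SQL model that references it; the source, table, column, and model names below are hypothetical, not Zeta's actual models:

# models/schema.yml: declares a raw source table and a test for the
# model built on top of it
version: 2

sources:
  - name: raw
    tables:
      - name: user_events

models:
  - name: user_engagement_features
    columns:
      - name: user_id
        tests:
          - not_null

-- models/user_engagement_features.sql: references the source above,
-- which lets dbt track lineage from raw data to the derived model
select
    user_id,
    count(*) as event_count
from {{ source('raw', 'user_events') }}
group by user_id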

MLflow for experiment tracking and model management

MLflow tracks experiments and manages models. It provides a unified interface for logging parameters, code versions, metrics, and artifacts, making it easier to compare experiments and manage the model lifecycle.

Similarly to Airflow, MLflow is also used only partially. The main parts we use are the tracking server and the model registry. From our experience, the artifact server has some limitations, such as limits on artifact size (because artifacts are sent over the REST API). As a result, we opted to use it only partially.

We don’t extensively use the deployment capabilities of MLflow, because in our current setup, we build custom inference containers.
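To make the tracking and registry usage concrete, here is a minimal sketch of how a training task might log to the tracking server and register a model. The tracking URI comes from the MLFLOW_TRACKING_URI environment variable set in the ECS task overrides shown earlier; the experiment, parameter, metric, and model names are illustrative assumptions:

import mlflow

# MLflow reads MLFLOW_TRACKING_URI from the environment, which the
# Airflow ECS task overrides inject into the container (see above).
mlflow.set_experiment("example-experiment")  # hypothetical experiment name

with mlflow.start_run() as run:
    mlflow.log_param("learning_rate", 0.01)  # example hyperparameter
    mlflow.log_metric("auc", 0.87)           # example evaluation metric
    # ... train and log the model under the "model" artifact path ...

# Register the logged model in the model registry, assuming a model
# was logged under the "model" artifact path during the run.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "example-model")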

Hosting on Amazon ECS with Fargate

Amazon ECS offers a highly scalable and secure environment for running containerized applications. Fargate eliminates the need for managing underlying infrastructure, allowing us to focus solely on deploying and running the containers. This abstraction layer simplifies the deployment process, enabling seamless scaling based on workload demands while optimizing resource utilization and cost efficiency.

We found it optimal to run the components of our ML workflows that don't require GPUs or distributed processing on Fargate. These include dbt pipelines, data gathering jobs, and training, evaluation, and batch inference jobs for smaller models.

Furthermore, Amazon ECS and Fargate seamlessly integrate with other AWS services, such as Amazon Elastic Container Registry (Amazon ECR) for container image management and AWS Systems Manager Parameter Store for securely storing and managing secrets and configurations. Using Parameter Store, we can centralize configuration settings, such as database connection strings, API keys, and environment variables, eliminating the need for hardcoding sensitive information within container images. This enhances security and simplifies maintenance, because secrets and configuration values can be dynamically retrieved by containers at runtime, ensuring consistency across deployments.
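For example, rather than baking configuration into container images, a task can fetch it from Parameter Store at startup. A minimal boto3 sketch, with the parameter name as a hypothetical example:

import boto3

ssm = boto3.client("ssm")

# Retrieve a secret (for example, a database connection string) at
# runtime; WithDecryption handles SecureString parameters.
response = ssm.get_parameter(
    Name="/mlops/prod/db_connection_string",  # hypothetical parameter name
    WithDecryption=True,
)
db_connection_string = response["Parameter"]["Value"]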

Moreover, integrating Amazon ECS and Fargate with CloudWatch enables comprehensive monitoring and logging capabilities for containerized tasks. This can be achieved by enabling the awslogs log driver within the logConfiguration parameters of the task definitions.
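
Concretely, enabling the awslogs driver is part of the task definition. The following is a minimal sketch of registering such a task definition with boto3; the family, container, log group, and region names are illustrative assumptions, and <...> marks omitted values:

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="ml-training",               # hypothetical task family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="4096",
    executionRoleArn="<...>",           # role that lets ECS write logs
    containerDefinitions=[
        {
            "name": "trainer",          # hypothetical container name
            "image": "<...>",           # image pulled from Amazon ECR
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/ml-training",
                    "awslogs-region": "us-east-1",  # assumed region
                    "awslogs-stream-prefix": "ecs",
                },
            },
        }
    ],
)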

Why ECS with Fargate is the solution of choice

    Serverless model: There are no servers or clusters to provision or manage. Fargate runs each task on demand and scales automatically based on demand, ensuring optimal performance without manual intervention.
    Cost efficiency: You pay only for the vCPU and memory resources that your containers actually use, which is more cost-effective than maintaining idle resources.
    Enhanced security: Each Fargate task runs in its own isolated environment, and the underlying compute resources aren't shared with other tenants.
    Integration with the AWS ecosystem: Amazon ECS and Fargate integrate seamlessly with other AWS services, such as Amazon S3, Amazon RDS, and AWS Lambda, creating a cohesive ecosystem for deploying and managing applications.

Configuring Amazon ECS with Fargate for ML workloads

Configuring Amazon ECS with Fargate for ML workloads involves the following steps.

    Docker images: ML models and applications are containerized using Docker. This includes all dependencies, libraries, and configurations needed to run the ML workload.
    Creating task definitions: Define an Amazon ECS task definition for each workload, specifying the container image, CPU and memory requirements, and runtime configuration such as environment variables and log settings.
    IAM roles: Assign appropriate AWS Identity and Access Management (IAM) roles to the tasks for accessing other AWS resources securely.
    Logging using CloudWatch: Use CloudWatch for logging and monitoring the performance and health of ML workloads (a launch sketch follows this list).
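
Once images are built and a task definition is registered (see the registration sketch in the previous section), launching a workload on Fargate is a single API call. A minimal boto3 sketch, with the cluster name, subnets, and security groups as illustrative placeholders:

import boto3

ecs = boto3.client("ecs")

# Launch the containerized ML workload as a one-off Fargate task.
ecs.run_task(
    cluster="mlops-cluster",          # hypothetical cluster name
    launchType="FARGATE",
    taskDefinition="ml-training",     # family registered earlier
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-<...>"],
            "securityGroups": ["sg-<...>"],
            "assignPublicIp": "DISABLED",
        }
    },
)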

Future development and addressing emerging challenges

As the field of MLOps continues to evolve, it’s essential to anticipate and address upcoming challenges to ensure that the platform remains efficient, scalable, and user-friendly. Two primary areas of future development for our platform include:

    Enhancing bring your own model (BYOM) capabilities for external clients
    Reducing the learning curve for data scientists

This section outlines those challenges and proposes directions for future enhancements.

Enhancing BYOM capabilities

As machine learning becomes more democratized, there is a growing need for platforms to easily integrate models developed externally by Zeta’s clients.

Future directions include making it easier for Zeta's clients to bring externally developed models onto the platform.

Reducing the learning curve for data scientists

The incorporation of multiple specialized tools (Airflow, Feast, dbt, and MLflow) into the MLOps pipeline can present a steep learning curve for data scientists, potentially hindering their productivity and the overall efficiency of the ML development process.

Future directions: we plan several initiatives to help reduce this learning curve.

Conclusion

Integrating Airflow, Feast, dbt, and MLflow into an MLOps platform hosted on Amazon ECS with AWS Fargate presents a robust solution for managing the ML lifecycle. This setup not only streamlines operations but also enhances scalability and efficiency, allowing data science teams to focus on innovation rather than infrastructure management.

Additional resources

For those looking to dive deeper, we recommend exploring the official documentation and tutorials for each tool: Airflow, Feast, dbt, MLflow, and Amazon ECS. These resources are invaluable for understanding the capabilities and configurations of each component in our MLOps platform.


About the authors

Varad Ram holds the position of Senior Solutions Architect at Amazon Web Services. He possesses extensive experience encompassing application development, cloud migration strategies, and information technology team management. Recently, his primary focus has shifted towards assisting clients in navigating the process of productizing generative artificial intelligence use cases.

Artem Sysuev is a Lead Machine Learning Engineer at Zeta, passionate about creating efficient, scalable solutions. He believes that effective processes are key to success, which led him to focus on both machine learning and MLOps. Starting with machine learning, Artem developed skills in building predictive models. Over time, he saw the need for strong operational frameworks to deploy and maintain these models at scale, which drew him to MLOps. At Zeta, he drives innovation by automating workflows and improving collaboration, ensuring smooth integration of machine learning models into production systems.

Saurabh Gupta is a Principal Engineer at Zeta Global. He is passionate about machine learning engineering, distributed systems, and big-data technologies. He has built scalable platforms that empower data scientists and data engineers, focusing on low-latency, resilient systems that streamline workflows and drive innovation. He holds a B.Tech degree in Electronics and Communication Engineering from the Indian Institute of Technology (IIT), Guwahati, and has deep expertise in designing data-driven solutions that support advanced analytics and machine learning initiatives.

Matúš Chládek is a Senior Engineering Manager for ML Ops at Zeta Global. With a career that began in Data Science, Matúš has developed a strong foundation in analytics and machine learning. Over the years, Matúš transitioned into more engineering-focused roles, eventually becoming a Machine Learning Engineer before moving into Engineering Management. Matúš’s leadership focuses on building robust, scalable infrastructure that streamlines workflows and supports rapid iteration and production-ready delivery of machine learning projects. Matúš is passionate about driving innovation at the intersection of Data Science and Engineering, making advanced analytics accessible and scalable for internal users and clients alike.

Dr. Danny Portman is a recognized thought leader in AI and machine learning, with over 30 patents focused on Deep Learning and Generative AI applications in advertising and marketing technology. He holds a Ph.D. in Computational Physics, specializing in high-performance computing models for simulating complex astrophysical systems. With a strong background in quantitative research, Danny brings a wealth of experience in applying data-driven approaches to solve problems across various sectors. As VP of Data Science and Head of AI/ML at Zeta Global, Dr. Portman leads the development of AI-driven products and strategies, and spearheads the company’s cutting-edge Generative AI R&D efforts to deliver innovative solutions for marketers.
