ByteByteGo · April 15, 23:44
How Netflix Orchestrates Millions of Workflow Jobs with Maestro

This article describes how Netflix designed and built Maestro, its next-generation workflow orchestration system, to overcome the limitations of its predecessor, Meson. Maestro uses a microservices architecture with strong scalability, allowing it to handle workflows and jobs at massive scale. Its core components are the Workflow Engine, Time-Based Scheduling, and the Signal Service, which handle workflow lifecycle management, time-based triggering, and event-driven orchestration, respectively. Maestro also relies on distributed queues and CockroachDB for horizontal scaling and durable storage, and it offers multiple execution abstractions that simplify how users define and run workflows. Together, these design choices let Maestro keep up with Netflix's ever-growing data processing needs.

⚙️ Maestro uses a microservices architecture built from three core services: the Workflow Engine, Time-Based Scheduling, and the Signal Service. The Workflow Engine manages workflow definitions, execution, and lifecycle; Time-Based Scheduling provides scheduled triggering; and the Signal Service enables event-driven orchestration, making workflows more flexible.

🚀 Maestro decouples its components with distributed queues and persists state in CockroachDB, ensuring high availability and scalability. This design lets Maestro scale horizontally and handle massive workloads without compromising performance or reliability.

💡 Maestro provides multiple execution abstractions, such as predefined Step Types, that simplify how users define and run workflows. Users supply only the required parameters and Maestro handles the underlying details, lowering the barrier to entry for users with different technical backgrounds.

🔄 Maestro supports foreach loops and nested foreach loops, letting users build large-scale workflows with millions or even billions of steps. This greatly expands its capacity for complex tasks and meets Netflix's growing data processing needs.


Disclaimer: The details in this post have been derived from the articles written by the Netflix engineering team. All credit for the technical details goes to the Netflix Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Netflix originally used an orchestrator named Meson to manage its growing number of data and machine learning (ML) workflows. 

For several years, Meson served the company well. It was responsible for scheduling and managing a large volume of tasks, approximately 70,000 workflows and 500,000 daily jobs. These jobs supported various business functions, including data transformation, recommendation models, A/B tests, and other ML-based decision systems.

However, as Netflix’s use of data grew, so did the pressure on Meson. The system began to show signs of strain, especially during periods of peak traffic, such as midnight UTC, when a large number of workflows were typically triggered. During these times, the orchestrator experienced slowdowns that led to increased operational overhead. Engineers on call had to closely monitor the system, especially during off-hours, to ensure it didn’t fail under the load.

A key limitation of Meson was its architecture. It was built using a single-leader model, which meant there was one main node responsible for coordinating all activity. Although it had high availability, the team had to vertically scale (i.e., upgrade to more powerful machines) to keep up with the growing demand. Eventually, they approached the limits of what AWS instance types could offer, making it clear that vertical scaling was not a sustainable long-term solution.

At the same time, the rate of growth was dramatic. The number of workflows in the system was doubling year over year, which added to the urgency of finding a better solution.

This need led to the creation of Maestro: a next-generation workflow orchestrator designed to overcome Meson’s bottlenecks.

Maestro was built with a distributed, scalable architecture from the ground up, capable of handling massive volumes of workflows and jobs while maintaining high reliability and low operational overhead. It was also designed to support a wide range of users and use cases, from engineers and data scientists to business analysts and content producers, making it a more versatile platform for the future of data and ML at Netflix.

In this article, we’ll look at how Netflix designed Maestro and the challenges they faced along the way.




Maestro Architecture

At the heart of Maestro's design is a microservices-based architecture that allows the system to scale efficiently.

The system is broken down into three primary services, each responsible for a different part of the workflow orchestration. The diagram below shows this architecture in detail:

Workflow Engine

The Workflow Engine is the core component of Maestro. It manages the full lifecycle of workflows from the initial definition to step-by-step execution and completion. 

In Maestro, a workflow is represented as a Directed Acyclic Graph (DAG) made up of individual units of work called steps (also referred to as jobs). These steps can include metadata, dependencies, runtime parameters, conditions, and branching logic.

A few key responsibilities of the Workflow Engine include:

Under the hood, the engine uses Netflix’s open-source Conductor library to manage the workflow state machine.
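
To make the DAG idea concrete, here is a minimal sketch (not Maestro's actual data model) of a workflow represented as steps with dependencies, plus a topological sort that yields a valid execution order and rejects cycles. The workflow and step names are invented for illustration.

```python
# Hypothetical sketch of a Maestro-style workflow definition: a DAG of
# steps, each with an id, a dependency list, and runtime parameters.
workflow = {
    "id": "daily_metrics",
    "steps": [
        {"id": "extract", "depends_on": [], "params": {"table": "plays"}},
        {"id": "transform", "depends_on": ["extract"], "params": {}},
        {"id": "load", "depends_on": ["transform"], "params": {"target": "report"}},
    ],
}

def execution_order(wf):
    """Topologically sort steps; raise if the graph contains a cycle."""
    deps = {s["id"]: set(s["depends_on"]) for s in wf["steps"]}
    order = []
    while deps:
        ready = [sid for sid, d in deps.items() if not d]
        if not ready:
            raise ValueError("cycle detected: workflow is not a DAG")
        for sid in sorted(ready):
            order.append(sid)
            del deps[sid]
            for d in deps.values():
                d.discard(sid)  # this step is done; unblock its dependents
    return order
```

An engine built this way only ever launches a step once everything it depends on has finished, which is exactly the property a DAG guarantees.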

Time-Based Scheduling


Maestro supports workflows that need to be triggered at regular time intervals through the time-based scheduling service. Users can define when workflows should run using cron expressions or predefined time intervals such as hourly or weekly schedules.

This service is lightweight and designed to scale, offering an at-least-once triggering guarantee. 

However, the Maestro engine deduplicates triggers: even if the scheduler fires the same trigger multiple times, the workflow is executed only once. In practice, this provides exactly-once execution.
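
The deduplication idea can be sketched as follows (an assumed design, not Netflix's implementation): each workflow run is keyed by a deduplication id derived from the workflow and its scheduled time, so a retried at-least-once trigger becomes a no-op.

```python
# Sketch of trigger deduplication: the scheduler may fire the same trigger
# more than once, but the engine keys each run by a dedup id, so duplicates
# return the existing run instead of starting a second one.
class Engine:
    def __init__(self):
        self.runs = {}  # dedup_id -> run record

    def trigger(self, workflow_id, scheduled_time):
        dedup_id = f"{workflow_id}@{scheduled_time}"
        if dedup_id in self.runs:
            return self.runs[dedup_id]  # duplicate trigger: ignored
        run = {"workflow": workflow_id, "time": scheduled_time, "status": "RUNNING"}
        self.runs[dedup_id] = run
        return run

engine = Engine()
first = engine.trigger("daily_report", "2024-01-01T00:00")
second = engine.trigger("daily_report", "2024-01-01T00:00")  # scheduler retry
```

This is how an at-least-once delivery guarantee upstream can still yield exactly-once execution downstream.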

Signal Service

Time-based scheduling is especially useful for recurring tasks, but it may not always be the most efficient method. For instance, running a workflow at midnight might not make sense if the data it depends on isn’t ready yet. To address that, Maestro includes another option for triggering workflows: the signal service.

The signal service provides support for event-driven orchestration, which complements time-based scheduling by triggering workflows or steps based on real-time events instead of a fixed time.

A signal is essentially a small message that carries information (such as metadata or parameter values) related to a triggering event. This service plays a crucial role in enabling conditional, responsive execution. Signals can come from:

See the diagram below:

The Signal Service is used in two primary ways:

In addition to these capabilities, the Signal Service also tracks signal lineage. This means it can trace which workflows were triggered by which signals, creating a dependency map between upstream and downstream workflows.
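
A minimal sketch of this behavior, with invented names: workflows subscribe to signal names, an emitted signal triggers every subscriber, and each delivery is recorded so lineage can be traced later.

```python
# Hypothetical sketch of signal-based triggering. A signal is a small message
# (a name plus parameters); the service triggers subscribed workflows and
# records (signal, workflow) pairs to build a lineage map.
class SignalService:
    def __init__(self):
        self.subscriptions = {}  # signal name -> list of workflow ids
        self.lineage = []        # (signal, workflow) pairs, for tracing

    def subscribe(self, workflow_id, signal_name):
        self.subscriptions.setdefault(signal_name, []).append(workflow_id)

    def emit(self, signal_name, params):
        """Deliver a signal; return the workflow runs it triggered."""
        triggered = list(self.subscriptions.get(signal_name, []))
        for wf in triggered:
            self.lineage.append((signal_name, wf))
        return [{"workflow": wf, "params": params} for wf in triggered]

svc = SignalService()
svc.subscribe("daily_report", "table_plays_updated")
runs = svc.emit("table_plays_updated", {"partition": "2024-01-01"})
```

Because the lineage list maps signals to the workflows they triggered, the dependency graph between upstream producers and downstream consumers falls out of it directly.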

Scalability Techniques for Maestro

One of the core strengths of Maestro is its ability to scale horizontally and handle very large workloads without compromising reliability or performance. 

There are a few ways Maestro achieves this:

See the diagram below that shows the forEach design for Maestro.
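
One way to think about the foreach design (a sketch under assumptions, not Maestro's storage layout): a single step template expands lazily into many iterations, and nesting two loops multiplies the counts, which is how step totals can reach into the billions without materializing everything at once.

```python
import itertools

def expand_foreach(step_template, values):
    # A single step definition expands into one lightweight iteration record
    # per value; generating lazily keeps huge loops cheap.
    for i, v in enumerate(values):
        yield {"iteration": i, "step": step_template, "param": v}

def nested_foreach(outer, inner):
    # Nested foreach: the cross product of two ranges, also generated lazily,
    # so a 1000 x 1000 loop never exists in memory all at once.
    for o, i in itertools.product(outer, inner):
        yield {"outer": o, "inner": i}

iters = list(expand_foreach("process_partition", range(5)))
first_three = list(itertools.islice(nested_foreach(range(1000), range(1000)), 3))
```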

Maestro Execution Abstractions

Maestro is designed to serve a wide range of users, from beginner data analysts to advanced engineers. It therefore provides multiple execution abstractions that make it easier to define and run workflows.

Some abstractions are as follows:

Step Types

Maestro includes a set of predefined step types that encapsulate common tasks. 

These step types represent standard operations, such as running a Spark job, moving data between systems (like from a database to a spreadsheet), or executing a SQL query. 

When a user chooses a step type, they only need to supply the required parameters (such as the query string, memory allocation, or table name), and Maestro handles everything else behind the scenes.

For example, if a user selects the "Spark" step type, they can simply provide the Spark SQL query, along with resource requirements, and Maestro will schedule and execute the job accordingly. If the internal logic for Spark jobs ever needs to be updated (such as changing how jobs are submitted), the Maestro team can make that change once, and it will automatically apply to all workflows that use that step type.
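
The step-type abstraction might be sketched like this (the handler names and spec fields are assumptions for illustration): each registered step type turns a small set of user parameters into a full job spec, so updating a handler once updates every workflow that uses it.

```python
# Hypothetical registry of step types. Users pick a type and supply only the
# required parameters; the handler fills in the submission details.
STEP_TYPES = {
    "spark": lambda p: {"engine": "spark", "sql": p["query"],
                        "memory": p.get("memory", "4g")},
    "sql": lambda p: {"engine": "warehouse", "sql": p["query"]},
}

def build_job(step_type, params):
    if step_type not in STEP_TYPES:
        raise ValueError(f"unknown step type: {step_type}")
    return STEP_TYPES[step_type](params)

job = build_job("spark", {"query": "SELECT 1", "memory": "8g"})
```

Changing the "spark" lambda in one place changes how every Spark step is submitted, which mirrors the centralized-update benefit described above.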

Notebook Execution

Maestro also supports direct execution of Jupyter notebooks, which is useful for users who prefer working in a notebook environment or want to run code that has already been developed interactively. 

These users can pass their notebooks to Maestro, and the system will schedule and run them as part of a workflow. This is especially helpful for data scientists or analysts who may not want to write workflow logic from scratch but still need to run periodic jobs.

Notebook execution integrates with other workflow steps, allowing users to include notebooks alongside other jobs within a larger DAG. Parameters can also be passed into notebooks dynamically, making them flexible components in a pipeline.
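
Dynamic notebook parameters are often implemented by injecting a parameters cell into the notebook's JSON before execution (tools like papermill work in this spirit). The sketch below is not Maestro's actual mechanism, just an illustration of the idea.

```python
import json

def inject_parameters(notebook, params):
    # Prepend a code cell that assigns each parameter, so the rest of the
    # notebook sees the values as ordinary variables.
    cell = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": [f"{k} = {json.dumps(v)}\n" for k, v in params.items()],
        "outputs": [],
        "execution_count": None,
    }
    patched = dict(notebook)
    patched["cells"] = [cell] + notebook["cells"]
    return patched  # original notebook is left untouched

nb = {"cells": [{"cell_type": "code", "source": ["print(run_date)\n"]}],
      "nbformat": 4, "nbformat_minor": 5, "metadata": {}}
patched = inject_parameters(nb, {"run_date": "2024-01-01"})
```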

Docker Jobs

For more advanced users who need complete control over the execution environment, Maestro supports running Docker containers. These users can package their custom business logic into a Docker image, define the necessary input parameters, and let Maestro schedule and manage the container’s execution.

This option allows users to run any logic, regardless of programming language or dependencies, as long as it is packaged inside a container.
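
A container step ultimately boils down to translating a job spec into a container invocation. The sketch below shows one plausible translation; the spec fields and image name are invented, and a real orchestrator would launch containers through its scheduler rather than a local `docker run`.

```python
def docker_command(spec):
    # Build the argv for a one-shot container run from a simple job spec.
    cmd = ["docker", "run", "--rm"]
    for k, v in spec.get("env", {}).items():
        cmd += ["-e", f"{k}={v}"]        # pass input parameters as env vars
    cmd.append(spec["image"])            # user-supplied image with the logic
    cmd += spec.get("args", [])          # extra arguments for the entrypoint
    return cmd

cmd = docker_command({
    "image": "mycompany/etl:1.2",
    "env": {"RUN_DATE": "2024-01-01"},
    "args": ["--mode", "backfill"],
})
```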

Maestro DSL Interface

To make it easy for users with different technical backgrounds to define workflows, Maestro supports multiple ways to create and manage them. Users can choose from a set of Domain Specific Languages (DSLs), graphical interfaces, or programmatic APIs, depending on what works best for their needs.

Maestro supports the following DSLs:

Apart from these DSLs, users can also define workflows in the following ways:

Parameterized Workflow Example

A key feature of Maestro is its support for parameterized workflows, which allow workflows to adapt dynamically at runtime based on input values. This reduces duplication, simplifies maintenance, and enables powerful use cases.

One example is a backfill window, which is commonly used to process historical data across a range of dates.

Here’s how it works:

See the diagram below:

This setup allows users to create a single workflow definition that can dynamically expand into hundreds or thousands of job runs, each tailored to a different date, without having to manually write out every step.
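
The backfill expansion can be sketched directly (workflow and parameter names are invented): one definition plus a date range yields one run per day, each carrying its own date parameter.

```python
from datetime import date, timedelta

def backfill_runs(workflow_id, start, end):
    # Expand a single workflow definition into one run per day in [start, end],
    # each parameterized with its own run_date.
    runs = []
    day = start
    while day <= end:
        runs.append({"workflow": workflow_id,
                     "params": {"run_date": day.isoformat()}})
        day += timedelta(days=1)
    return runs

runs = backfill_runs("daily_metrics", date(2024, 1, 1), date(2024, 1, 3))
```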

Aggregated View and Rollup

When dealing with large-scale workflows, especially those that include subworkflows, foreach loops, or deeply nested execution paths, it can become difficult for users to keep track of what's happening during and after execution. To solve this, Maestro provides two helpful features: Aggregated View and Rollup.

Aggregated Views

The aggregated view is designed to give users a high-level summary of how a workflow has performed across multiple runs. 

See the diagram below:

Instead of manually going through every instance of a workflow execution, users can see a combined overview of statuses across all runs. This includes information such as how many instances:

This helps users quickly understand the overall behavior and health of a workflow over time, making it easier to spot patterns or recurring issues.
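
At its core, an aggregated view is a status summary over many run records, as in this sketch (the status names are illustrative):

```python
from collections import Counter

def aggregate(runs):
    # Collapse per-run statuses into one summary, so users don't have to
    # inspect every workflow instance individually.
    return Counter(run["status"] for run in runs)

summary = aggregate([
    {"id": 1, "status": "SUCCEEDED"},
    {"id": 2, "status": "SUCCEEDED"},
    {"id": 3, "status": "FAILED"},
    {"id": 4, "status": "RUNNING"},
])
```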

Rollup

The rollup feature goes a step further by providing a flattened summary of all the steps in a workflow, even when the workflow contains subworkflows, foreach iterations, or nested loops. 

In large workflows, a single step can itself be a foreach loop or a call to another workflow, which in turn contains its own steps. Rollup pulls all of this structure together into a unified view.

With rollup, users can:
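
The flattening idea behind rollup can be sketched as a recursive walk (the tree shape here is an assumption): any step with children is a subworkflow or loop, and rollup collects only the leaf steps.

```python
def rollup(step):
    # Recursively flatten a workflow whose steps may themselves be
    # subworkflows or foreach loops into one flat list of leaf steps.
    children = step.get("children", [])
    if not children:
        return [step]
    leaves = []
    for child in children:
        leaves.extend(rollup(child))
    return leaves

wf = {"id": "root", "children": [
    {"id": "a"},
    {"id": "sub", "children": [{"id": "b"}, {"id": "c"}]},
]}
leaf_ids = [s["id"] for s in rollup(wf)]
```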

Event Publishing

Maestro is designed not just to run workflows but also to integrate smoothly with other systems at Netflix. One of the key ways it achieves this is through a robust event publishing system, which keeps various services informed about what is happening inside the orchestration platform in real time.

See the diagram below:

There are two main types of events that Maestro generates.

1 - Internal Events

These are events that describe the internal lifecycle of a workflow. They include things like:

These events are used within Maestro itself and by other internal Netflix systems to track execution status, handle retries, and support user notifications or UI updates. They help ensure that the orchestration engine behaves correctly and reliably, especially under high workloads.

2 - External Events

Maestro can also publish events to external messaging systems such as Kafka or SNS (Simple Notification Service). 

These events are typically consumed by other systems that need to react to changes in workflow state. For example:

Some examples of events published externally include:

By publishing these events, Maestro allows other services in the Netflix ecosystem to stay in sync with what's happening in the orchestration platform.
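
The two-tier publishing model might be sketched as follows, with a plain list standing in for the external bus (Kafka or SNS in the article) and the event shapes invented for illustration:

```python
class EventPublisher:
    def __init__(self):
        self.internal_subscribers = []
        self.external_bus = []  # stand-in for Kafka/SNS

    def subscribe(self, handler):
        self.internal_subscribers.append(handler)

    def publish(self, event, external=False):
        # Internal lifecycle events always go to in-process subscribers;
        # selected events are additionally forwarded to the external bus.
        for handler in self.internal_subscribers:
            handler(event)
        if external:
            self.external_bus.append(event)

pub = EventPublisher()
seen = []
pub.subscribe(seen.append)
pub.publish({"type": "STEP_COMPLETED", "step": "extract"})
pub.publish({"type": "WORKFLOW_FINISHED", "workflow": "daily_metrics"},
            external=True)
```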

Conclusion

Maestro is Netflix’s answer to the growing complexity and scale of modern data and machine learning workflows. 

Designed as a horizontally scalable, cloud-native orchestrator, it addresses the limitations of its predecessor, Meson, by providing a robust architecture built on stateless microservices, distributed queues, and a strong consistency layer via CockroachDB. 

Its support for high-level execution abstractions, dynamic parameterization, and deep modularity through features like foreach, subworkflows, and reusable templates allows users to build scalable, maintainable pipelines.

With multiple DSLs, a visual UI, and integrations like Metaflow, Maestro is accessible to a broad range of users, from engineers to analysts. Its advanced monitoring features, such as signal lineage, rollups, and event publishing, ensure transparency and real-time observability across workflows. 

As Netflix continues to grow its data infrastructure, Maestro lays a strong, flexible foundation for the future of intelligent workflow orchestration. It has also been made open-source.

References:


