ByteByteGo, July 23, 2024
How Stripe Scaled to 5 Million Database Queries Per Second

 


WorkOS: modern identity platform for B2B SaaS (Sponsored)

Start selling to enterprises with just a few lines of code.

→ WorkOS provides a complete User Management solution along with SSO, SCIM, RBAC, & FGA.

→ Unlike other auth providers that rely on user-centric models, WorkOS is designed for B2B SaaS with an org modeling approach.

→ The APIs are flexible, easy-to-use, and modular. Pick and choose what you need and integrate in minutes.

→ Best of all, User Management is free up to 1 million MAUs and comes standard with RBAC, bot protection, impersonation, MFA, & more.

Get started today


Disclaimer: The details in this post have been derived from the Stripe Engineering Blog. All credit for the architectural details goes to Stripe’s engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

As of 2023, only 19 countries had a GDP surpassing $1 trillion. Also, in 2023, Stripe alone processed $1 trillion in total payment value.

To make the achievement even more remarkable, they managed these numbers while supporting 5 million database queries per second at five-nines (99.999%) of availability.

What was behind the success of Stripe’s infrastructure?

The secret lies in the horizontal scaling capabilities of their database.

Stripe’s database infrastructure team built an internal database-as-a-service (DBaaS) offering called DocDB. It was created as an extension of MongoDB’s community edition because of MongoDB’s flexibility and ability to handle a massive volume of real-time data at scale.

In this post, we’ll explore how DocDB works and the various features it provides that allow Stripe to operate at such an incredible scale.


Why the Need for DocDB?

The first question that comes up when looking at DocDB is this: what forced Stripe to build its own DBaaS offering?

Stripe launched in 2011. At the time, they chose MongoDB as the online database because its schema-less approach made it more productive for developers than standard relational databases. MongoDB also supports horizontal scaling through its robust sharding architecture, which is shown in the diagram below:

However, to unlock the best developer experience, Stripe needed a database service that could work like a product for the development teams. MongoDB Atlas didn't exist in 2011, and they couldn't find an off-the-shelf DBaaS that met key requirements such as:

    Maintaining the highest levels of availability, durability, and performance.

    Exposing a minimal feature set to developers, to prevent suboptimal client-side queries.

    Supporting sharding for horizontal scaling.

    Supporting multi-tenancy with quotas.

    Providing strong security, including authorization policies.

The solution was to build DocDB with MongoDB as the underlying storage engine. The DocDB deployment was also highly customized to provide low latency and diverse access. Some interesting stats related to DocDB are as follows:

At the heart of DocDB is the Data Movement Platform. It was originally built as a horizontal scaling solution to overcome the vertical scaling limits of MongoDB.

The Data Movement Platform made it possible to transition from running a small number of database shards (each storing tens of terabytes of data) to thousands of database shards (each with a fraction of the original data).

The platform performs multiple functions such as:

For reference, bin packing is an optimization problem where the goal is to pack a set of objects (in this case, data) into a minimum number of bins (database shards) of a fixed capacity. The objective is to minimize the number of bins used while ensuring that the total size or weight of the objects in each bin does not exceed its capacity.
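
As a rough illustration of the idea (not Stripe's actual placement logic), here is a minimal first-fit-decreasing bin-packing sketch in Go that assigns chunk sizes to shards of fixed capacity:

```go
// Illustrative only: a first-fit-decreasing heuristic for packing chunks
// (by size) onto shards of fixed capacity. Stripe's actual placement
// algorithm is not described in the article.
package main

import (
	"fmt"
	"sort"
)

func packChunks(chunkSizes []float64, shardCapacity float64) [][]float64 {
	// Sort chunks largest-first; placing large items early tends to pack better.
	sorted := append([]float64(nil), chunkSizes...)
	sort.Sort(sort.Reverse(sort.Float64Slice(sorted)))

	var shards [][]float64 // each inner slice holds the chunks assigned to one shard
	var free []float64     // remaining capacity of each shard

	for _, size := range sorted {
		placed := false
		for i := range shards {
			if free[i] >= size { // first shard with enough room wins
				shards[i] = append(shards[i], size)
				free[i] -= size
				placed = true
				break
			}
		}
		if !placed { // no existing shard fits, open a new one
			shards = append(shards, []float64{size})
			free = append(free, shardCapacity-size)
		}
	}
	return shards
}

func main() {
	chunks := []float64{40, 10, 25, 35, 5, 60, 20}
	fmt.Println(packChunks(chunks, 100)) // [[60 40] [35 25 20 10 5]]
}
```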

How Do Applications Access DocDB?

DocDB leverages sharding to achieve horizontal scalability for its database infrastructure. With thousands of database shards distributed across Stripe’s product applications, sharding enables efficient data distribution and parallel processing.

However, the use of database sharding introduces a challenge for applications when determining the appropriate destination shard for their queries. 

To address this issue, Stripe’s database infrastructure team developed a fleet of database proxy servers implemented in Golang. These proxy servers handle the task of routing queries to the correct shard.

The diagram shows DocDB’s high-level infrastructure overview.

When an application sends a query to a database proxy server, the proxy performs the following steps:

    Parse the query.

    Route it to one or more shards.

    Merge the results received from the shards.

    Return the final result to the application.

But how do database proxy servers make the routing decisions?

The database proxy servers rely on the chunk metadata service to make routing decisions. 

A chunk represents a small subset of data within a larger collection. Each shard contains a fraction of the total data, and these fractions are referred to as chunks.

For example, consider that Stripe has a large collection called “Transactions” that contains millions of documents representing financial transactions. To scale this collection horizontally, they might split the data into chunks based on a sharding key, such as customer ID or the transaction timestamp. Each chunk would then be assigned to a specific database shard.

The chunk metadata service manages the mapping between these chunks and their corresponding shards. It keeps track of which chunk resides on which shard, allowing the proxy servers to route queries and requests to the appropriate shard.
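
As a rough illustration (the types and the range-based routing rule are assumptions for the sketch, not the actual chunk metadata service API), a proxy-side lookup could look like this in Go:

```go
// Hypothetical sketch of a proxy consulting chunk metadata to route a query.
package main

import "fmt"

// Chunk maps a contiguous range of shard-key values to a shard.
type Chunk struct {
	MinKey, MaxKey string // inclusive min, exclusive max of the shard-key range
	Shard          string // shard that currently owns this chunk
}

// ChunkMetadata stands in for the chunk metadata service.
type ChunkMetadata struct {
	chunks []Chunk // assumed sorted by MinKey
}

// RouteQuery returns the shard responsible for a given shard-key value.
func (m *ChunkMetadata) RouteQuery(shardKey string) (string, error) {
	for _, c := range m.chunks {
		if shardKey >= c.MinKey && shardKey < c.MaxKey {
			return c.Shard, nil
		}
	}
	return "", fmt.Errorf("no chunk owns key %q", shardKey)
}

func main() {
	meta := &ChunkMetadata{chunks: []Chunk{
		{MinKey: "cus_A", MaxKey: "cus_H", Shard: "shard-01"},
		{MinKey: "cus_H", MaxKey: "cus_P", Shard: "shard-02"},
		{MinKey: "cus_P", MaxKey: "cus_ZZZZ", Shard: "shard-03"},
	}}
	shard, _ := meta.RouteQuery("cus_K1X2") // a customer-ID shard key
	fmt.Println(shard)                      // shard-02
}
```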

Data Organization in DocDB

At Stripe, product teams use an in-house tool called the document database control plane to create and manage their databases. When a team provisions a new database using this tool, they are creating a “logical database.”

A logical database is like a virtual container holding one or more data collections known as DocDB collections. Each DocDB collection contains related documents that serve a specific purpose for the product team.

Even though a logical database appears as a single entity to the product team, the data within the collections is spread across multiple physical databases behind the scenes. These physical databases are the actual databases running on Stripe’s infrastructure.

The diagram below shows this arrangement:

Each physical database contains a small portion (or “chunk”) of the data from the DocDB collection and is deployed on a shard. The shard consists of a primary database node and several secondary database nodes. These nodes work together as a replica set. 

The primary node handles all the write operations and replicates the data to the secondary nodes. If the primary node fails, one of the secondary nodes automatically takes over as the new primary, ensuring continuous operation and availability. 
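
To make the hierarchy concrete, here is a rough sketch of it in Go types; the names and fields are illustrative, not Stripe's actual schema:

```go
// Rough types for the hierarchy described above: logical database ->
// collections -> chunks on physical shards, each shard a replica set.
package docdb

// ReplicaSet is one shard: a primary node plus secondaries that replicate it.
type ReplicaSet struct {
	Primary     string   // address of the node taking writes
	Secondaries []string // replicas; one is promoted if the primary fails
}

// Chunk is the slice of a collection's data stored on a single shard.
type Chunk struct {
	Collection string
	Shard      *ReplicaSet
}

// Collection groups related documents that serve one purpose for a product team.
type Collection struct {
	Name   string
	Chunks []Chunk // spread across many physical shards
}

// LogicalDatabase is the virtual container a team provisions via the
// document database control plane.
type LogicalDatabase struct {
	Name        string
	Collections []Collection
}
```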

The diagram below shows a different representation of the database hierarchy.


Latest articles

If you’re not a paid subscriber, here’s what you missed.

    A Crash Course on Relational Database Design

    A Crash Course on Distributed Systems

    A Crash Course in Database Scaling Strategies

    A Crash Course in Database Sharding

    A Crash Course on Microservice Communication Patterns

    A Crash Course on Cell-based Architecture

To receive all the full articles and support ByteByteGo, consider subscribing:

Subscribe now


The Data Movement Platform

What’s the most important ability required to build a DBaaS platform that is horizontally scalable and highly elastic?

It’s the ability to migrate data across database shards with zero downtime and no impact on the client. 

Stripe achieved this ability with their Data Movement Platform. The platform had a few important requirements such as:

    Ensuring that the migrated data is consistent and complete between the source and target shards.

    Avoiding extended downtime, since millions of businesses rely on Stripe around the clock to accept payments from their customers.

    Supporting the migration of any number of chunks from any number of source shards to target shards, at high throughput.

    Avoiding any performance impact on the source shards during the migration.

The diagram below shows the architecture of the Data Movement Platform:

The heart of the platform is the Coordinator component, which is responsible for orchestrating the various steps involved in online data migrations.

Step 1: Chunk Migration Registration

The first step is registering a request to migrate database chunks from source shards to target shards.

Once the request is created, indexes are built on the target shards. 

An index is a data structure that improves the speed of data retrieval operations on a database table or collection. Building the indexes first on the target shard has some advantages:

    Query performance: creating the indexes on the target shard before the data arrives means the data is organized by the index from the start, improving query performance.

    Data consistency: the indexes on the target shard can be kept consistent with those on the source shard throughout the migration.

Step 2: Bulk Data Import

The next step involves taking a snapshot of the chunks on the source shard at a specific point in time, call it time T. This snapshot is used to load the data onto one or more target database shards.

The service performing the bulk data import accepts data filters, allowing for the selective import of chunks that satisfy the specified filtering criteria. This step initially appeared straightforward. However, Stripe’s infra team encountered throughput limitations when they tried to bulk load data onto a DocDB shard.

Efforts to address the issue by batching writes and adjusting DocDB engine parameters were not successful.

A significant breakthrough came when the team explored ways to optimize the insertion order, exploiting the fact that DocDB organizes its data in a B-tree. By sorting the data on the most common index attributes in the collections and inserting it in sorted order, write proximity improved, resulting in a 10X boost in write throughput.
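
The idea can be sketched as follows; the document type, the batch size, and the insertBatch write function are assumptions for illustration, not Stripe's actual import code:

```go
// Sketch of the insertion-order optimization: sort documents by the most
// commonly indexed attribute before bulk-inserting, so consecutive writes
// land close together in the B-tree on the target shard.
package main

import (
	"fmt"
	"sort"
)

type Document struct {
	CustomerID string // assumed to be the most commonly indexed attribute
	Payload    string
}

func bulkImport(docs []Document, batchSize int, insertBatch func([]Document)) {
	// Sorting by the index key improves write locality on the target shard.
	sort.Slice(docs, func(i, j int) bool {
		return docs[i].CustomerID < docs[j].CustomerID
	})
	for start := 0; start < len(docs); start += batchSize {
		end := start + batchSize
		if end > len(docs) {
			end = len(docs)
		}
		insertBatch(docs[start:end]) // one batched write to the target shard
	}
}

func main() {
	docs := []Document{{"cus_C", "..."}, {"cus_A", "..."}, {"cus_B", "..."}}
	bulkImport(docs, 2, func(batch []Document) { fmt.Println(batch) })
}
```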

Step 3: Async Replication

After the bulk data import step is completed, the next step ensures that any subsequent writes or mutations that occur on the source shard after time T are replicated to the target shard.

This is where async replication comes into play.

Stripe’s async replication systems rely on the Change Data Capture (CDC) mechanism to capture and replicate the mutations from the source shards to the target shards. 

Here’s how it works:

Step 4: Correctness Check

After the replication sync between the source and target shard, the Coordinator conducts a comprehensive check for data completeness and correctness.

This is done by comparing point-in-time snapshots of the source and target shards. Comparing snapshots, rather than querying the live shards, was a deliberate design choice to avoid impacting the shards' throughput.
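
As a simplified illustration of such a check, the sketch below compares two snapshots document by document; the data shapes are assumptions, not Stripe's actual verification code:

```go
// Illustrative completeness-and-correctness check between two point-in-time
// snapshots (document ID -> document) of the source and target chunks.
package main

import (
	"fmt"
	"reflect"
)

// snapshotsMatch reports whether the two snapshots contain exactly the same documents.
func snapshotsMatch(source, target map[string]map[string]any) bool {
	if len(source) != len(target) {
		return false // a missing or extra document means the copy is incomplete
	}
	for id, doc := range source {
		if !reflect.DeepEqual(doc, target[id]) {
			return false // same ID but different contents
		}
	}
	return true
}

func main() {
	src := map[string]map[string]any{"txn_1": {"amount": 100}}
	dst := map[string]map[string]any{"txn_1": {"amount": 100}}
	fmt.Println(snapshotsMatch(src, dst)) // true
}
```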

Step 5: Traffic Switch

The next step is to switch the traffic of incoming requests from the source shard to the target shard.

The Coordinator orchestrates the traffic switch after the data is imported to the target shard and the mutations are replicated. The process consists of three steps:

The traffic switch protocol uses the concept of versioned gating. 

To support this, the infra team added a custom patch to MongoDB that allows a shard to enforce a version number check before serving a request. Each proxy server annotates requests to the DocDB shard with a version token number. The shard first checks the version token number and serves the request only if the token number is newer than the earlier one.
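
A rough sketch of the gating idea is shown below, with illustrative names and a simplified comparison rule rather than the actual MongoDB patch:

```go
// Sketch of versioned gating: the proxy annotates each request with a version
// token, and the shard refuses requests whose token is older than the version
// it has already seen, forcing stale proxies to refresh their routing view.
package main

import (
	"fmt"
	"sync"
)

type Request struct {
	VersionToken int64 // set by the proxy from its routing metadata version
	Query        string
}

// Shard enforces the version check before serving a request.
type Shard struct {
	mu          sync.Mutex
	lastVersion int64 // highest version token this shard has accepted
}

func (s *Shard) Serve(req Request) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if req.VersionToken < s.lastVersion {
		// A stale token means the proxy's routing view predates the traffic
		// switch; the client retries, and the retry reaches the right shard.
		return fmt.Errorf("stale version token %d (shard at %d)", req.VersionToken, s.lastVersion)
	}
	s.lastVersion = req.VersionToken
	return nil // execute the query as usual
}

func main() {
	shard := &Shard{}
	fmt.Println(shard.Serve(Request{VersionToken: 7, Query: "find"})) // <nil>
	fmt.Println(shard.Serve(Request{VersionToken: 6, Query: "find"})) // stale version token ...
}
```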

The diagram below shows the detailed process flow for the traffic switch protocol:

Here’s how the process works:

The entire traffic switch protocol takes less than two seconds to execute. Any failed reads and writes to the source shard succeed on retries that go to the target shard.

Step 6: Chunk Migration Deregistration

Finally, the migration process is concluded by marking the migration as complete in the chunk metadata service.

Also, the chunk data is dropped from the source shard to reclaim the resources.

Conclusion

Stripe’s custom-built database-as-a-service, DocDB, and its Data Movement Platform have been instrumental in achieving 99.999% uptime while enabling zero-downtime data migrations.

Some key takeaways are as follows:

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing hi@bytebytego.com






