ByteByteGo — July 11, 2024
How PayPal Scaled Kafka to 1.3 Trillion Daily Messages




Disclaimer: The details in this post have been derived from the article originally published on the PayPal Tech Blog. All credit for the details about PayPal’s architecture goes to their engineering team. The link to the original article is present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

During Black Friday 2022, PayPal's Kafka setup handled a peak traffic volume of 21 million messages per second.

This was about 1.3 trillion messages in a single day. 

Similarly, Cyber Monday saw 19 million messages per second, totaling around 1.23 trillion messages in a single day.

How did PayPal scale Kafka to achieve these incredible numbers?

In this post, we will go through the complete details of PayPal’s high-performance Kafka setup that made it possible.


Kafka at PayPal

Apache Kafka is an open-source distributed event streaming platform.

PayPal adopted Kafka in 2015 and uses it for building data streaming pipelines, data integration, and data ingestion.

At present, PayPal's Kafka fleet consists of over 1,500 brokers hosting 20,000 topics across 85+ clusters, which are expected to maintain 99.99% availability.

Over the years, PayPal has seen tremendous growth in streaming data and they wanted to ensure high availability, fault tolerance, and optimal performance.

Some of the specific use cases where PayPal uses Kafka are as follows: 

How PayPal Operates Kafka

PayPal’s infrastructure is spread across multiple geographically distributed data centers and security zones. Kafka clusters are deployed across these zones.

There are some key differences between data centers and security zones:

In the context of PayPal, Kafka clusters handling sensitive payment data may be placed in a high-security zone with restricted access. Clusters processing less sensitive data may reside in a different security zone.

One thing, however, is common.

Whether it is data centers or security zones, MirrorMaker is used to mirror the data across the data centers, which helps with disaster recovery and communication across security zones. 

For reference, Kafka MirrorMaker is a tool for mirroring data between Apache Kafka clusters. It leverages the Kafka Connect framework to replicate data, which improves resiliency. 
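The source doesn't show PayPal's actual replication configuration, but a minimal MirrorMaker 2 properties file illustrates the idea of mirroring one cluster into another for disaster recovery. All hostnames and cluster aliases below are placeholders:

```properties
# mm2.properties — minimal MirrorMaker 2 sketch (illustrative values,
# not PayPal's actual configuration)
clusters = primary, backup
primary.bootstrap.servers = primary-broker1:9092,primary-broker2:9092
backup.bootstrap.servers = backup-broker1:9092,backup-broker2:9092

# Replicate all topics from primary to backup for disaster recovery
primary->backup.enabled = true
primary->backup.topics = .*

# Keep the replicated data durable on the target cluster
replication.factor = 3
```

Such a file would be passed to the stock `connect-mirror-maker.sh` launcher that ships with Apache Kafka.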

See the diagram below to get an idea about PayPal’s Kafka setup across data centers and security zones:

Operating Kafka at the scale of PayPal is a challenging task. To manage the ever-growing fleet of Kafka clusters, PayPal has focused on some key areas such as:

The diagram below shows a high-level view of PayPal’s Kafka Landscape:

In the subsequent sections, we will look at each area in greater detail.

Cluster Management

Cluster management deals with controlling Kafka clusters and reducing operational overhead. Some of the key improvements were done in areas like:

Let’s look at each improvement in more detail.




Kafka Config Service

PayPal built a special Kafka config service to make it easy for clients to connect to the Kafka clusters.

Before the config service, clients had to hardcode the broker IPs in the connection configuration. 

This created a maintenance nightmare due to a couple of reasons:

The Kafka config service solved these issues. The service makes it easy for the Kafka clients to follow a standard configuration. Ultimately, it reduces operational and support overhead across teams.

The diagram below shows the Kafka client retrieving bootstrap servers and configuration from the config service using the topic information.
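To make the lookup concrete, here is a minimal sketch of the config-service idea: clients resolve connection settings by topic name instead of hardcoding broker IPs. The dictionary stands in for the real HTTP service, and all topic names and broker addresses are illustrative assumptions:

```python
# Sketch of a Kafka config service: clients ask for settings by topic name.
# The dict stands in for the real service; all values are illustrative.
CONFIG_SERVICE = {
    "payments.events": {
        "bootstrap.servers": "kafka-a1:9092,kafka-a2:9092,kafka-a3:9092",
        "security.protocol": "SSL",
    },
    "risk.signals": {
        "bootstrap.servers": "kafka-b1:9092,kafka-b2:9092",
        "security.protocol": "SSL",
    },
}

def client_config(topic: str) -> dict:
    """Return the standard client configuration for a topic.

    If brokers move, only the config service changes -- clients keep
    asking by topic name and always receive current settings.
    """
    try:
        return dict(CONFIG_SERVICE[topic])
    except KeyError:
        raise ValueError(f"topic {topic!r} is not onboarded") from None

cfg = client_config("payments.events")
print(cfg["bootstrap.servers"])
```

The payoff is that a broker replacement becomes a single update in the config service rather than a coordinated change across every client team.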

Kafka ACLs

Initially, any PayPal application could connect to any of the existing Kafka topics. This was an operational risk for the platform considering that it streams business-critical data.

To ensure controlled access to Kafka clusters, ACLs were introduced.

For reference, ACLs are used to define which users, groups, or processes have access to specific objects such as files, directories, applications, or network resources. It’s like a table or list specifying a particular object's permissions.

With the introduction of ACLs, applications had to authenticate and authorize access to Kafka clusters and topics. 

Apart from making the platform secure, ACLs also provided a record of every application accessing a particular topic or cluster.
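Conceptually, an ACL is just an auditable table mapping an authenticated principal to the operations it may perform on a resource. The sketch below models that idea in plain Python; it is not PayPal's implementation, and all application and topic names are made up:

```python
# Conceptual sketch of topic-level ACLs: each entry records which principal
# may perform which operation on which topic, making access both restricted
# and auditable. Names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Acl:
    principal: str   # authenticated application identity
    operation: str   # e.g. "READ" or "WRITE"
    topic: str

ACLS = {
    Acl("checkout-service", "WRITE", "payments.events"),
    Acl("fraud-detector", "READ", "payments.events"),
}

def is_allowed(principal: str, operation: str, topic: str) -> bool:
    """Deny by default: access is granted only if an ACL entry exists."""
    return Acl(principal, operation, topic) in ACLS

print(is_allowed("fraud-detector", "READ", "payments.events"))   # allowed
print(is_allowed("fraud-detector", "WRITE", "payments.events"))  # denied
```

Because every grant is an explicit entry, the same table doubles as the record of which applications touch which topics.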

PayPal Kafka Libraries 

It was important to ensure that Kafka clusters and the clients connecting to them operate securely. Also, there was a need to ensure easy integration to multiple frameworks and programming languages. They didn’t want each engineering team to reinvent the wheel.

Supported tech stack for Kafka Libraries. Source: PayPal Tech Blog

To facilitate these needs, PayPal built a few important libraries:

QA Platform

One of the great things PayPal did was to set up a production-like QA platform for Kafka for developers to test changes confidently.

This is a common problem in many organizations where the testing performed by developers is hardly indicative of the production environment, resulting in issues after launch. 

A dedicated QA platform solves this by providing a direct mapping between production and QA clusters. 

The same security standards are followed. The same topics are hosted on the clusters with the brokers spread across multiple zones within the Google Cloud Platform.

Monitoring and Alerting

Monitoring and alerting are extremely important aspects for systems operating at a high scale. Teams want to know about issues and incidents quickly so that cascading failures can be avoided.

At PayPal, the Kafka platform is integrated with the monitoring and alerting systems.

Apache Kafka exposes a large number of metrics. However, PayPal has identified a subset of metrics that helps them detect issues faster.

The Kafka Metrics library filters out the metrics and sends them to the SignalFX backend via SignalFX agents running on all brokers, Zookeepers, MirrorMakers, and Kafka clients. Individual alerts associated with these metrics are triggered whenever abnormal thresholds are breached. 
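The filter-then-alert pipeline described above can be sketched in a few lines. The metric names and thresholds below are illustrative assumptions, not PayPal's actual curated list:

```python
# Sketch of the described pipeline: keep only a curated subset of the many
# metrics Kafka emits, and raise alerts when a value crosses its threshold.
# Metric names and thresholds are illustrative.

# Curated metrics and the values that should trigger an alert.
THRESHOLDS = {
    "under_replicated_partitions": 0,   # alert if > 0
    "offline_partitions": 0,            # alert if > 0
    "consumer_lag": 10_000,             # alert if lag exceeds this
}

def filter_metrics(raw: dict) -> dict:
    """Drop everything the alerting backend does not need."""
    return {name: v for name, v in raw.items() if name in THRESHOLDS}

def alerts(metrics: dict) -> list[str]:
    """Return an alert message for every breached threshold."""
    return [
        f"ALERT {name}={value} (threshold {THRESHOLDS[name]})"
        for name, value in metrics.items()
        if value > THRESHOLDS[name]
    ]

raw = {"under_replicated_partitions": 2, "consumer_lag": 500,
       "bytes_in_per_sec": 1.2e9}       # the last metric is filtered out
fired = alerts(filter_metrics(raw))
print(fired)  # one alert fires, for the under-replicated partitions
```

Curating the subset up front keeps the alerting backend focused on signals that actually indicate trouble, rather than drowning it in every JMX counter Kafka can emit.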

Configuration Management

Operating a critical system requires one to guard against data loss. This is not only applicable to the application data but also to the infrastructure information.

What if the infrastructure gets wiped out and we have to rebuild it from scratch?

At PayPal, configuration management helps them store the complete infrastructure details. This is the single source of truth that allows PayPal to rebuild the clusters in a couple of hours if needed.

They store Kafka metadata such as topic details, clusters, and applications in an internal configuration management system. The metadata is also backed up so that the most recent data is available if clusters and topics ever need to be re-created during recovery.
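The metadata-as-source-of-truth idea can be sketched as a backup that records every topic definition, plus a rebuild step that replays those definitions against a fresh cluster. The admin-client call is represented by a plain function here, and all names and values are illustrative:

```python
# Sketch of rebuilding clusters from a configuration backup. The real
# system would call a Kafka admin client; here a callback stands in for it.
import json

# What a backup of the configuration management system might look like.
METADATA_BACKUP = json.dumps({
    "cluster": "payments-cluster-1",
    "topics": [
        {"name": "payments.events", "partitions": 64, "replication": 3},
        {"name": "risk.signals", "partitions": 32, "replication": 3},
    ],
})

def rebuild(backup_json: str, create_topic) -> int:
    """Re-create every topic recorded in the backup; return the count."""
    meta = json.loads(backup_json)
    for t in meta["topics"]:
        create_topic(t["name"], t["partitions"], t["replication"])
    return len(meta["topics"])

created = []
rebuild(METADATA_BACKUP, lambda name, parts, repl: created.append(name))
print(created)
```

Because the backup is the single source of truth, the rebuild is deterministic: the same file always produces the same set of topics.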

Enhancements and Automation

Large-scale systems require special tools to carry out operational tasks as quickly as possible.

PayPal built multiple such tools for operating their Kafka cluster. Let’s look at a few important ones:

Patching Security Vulnerabilities

PayPal deploys Kafka brokers on bare-metal hosts and runs Zookeeper and MirrorMaker instances on virtual machines.

As we can expect, all of these hosts need to be patched at frequent intervals to fix any security vulnerabilities. 

Patching requires a host restart, which can cause partitions to lag. If multiple brokers go down at once, this can even lead to data loss for Kafka topics configured with a replication factor of three.

They built a plugin that queries whether any partition is lagging before a host is patched, ensuring that only a single broker is patched at a time with no chance of data loss.
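The guard described above can be sketched as a check that a broker is only restartable when every partition it hosts has its full set of in-sync replicas. The partition data below is illustrative, not from PayPal's systems:

```python
# Sketch of a pre-patch safety check: a broker may be restarted only if no
# partition it hosts is lagging (i.e. its in-sync replica set is complete).
# All topic and broker data here is illustrative.

def safe_to_patch(broker_id: int, partitions: list[dict]) -> bool:
    """Return True only if every partition hosted on this broker
    currently has all of its replicas in sync."""
    for p in partitions:
        if broker_id in p["replicas"]:
            if set(p["isr"]) != set(p["replicas"]):
                return False   # partition is lagging; postpone the patch
    return True

partitions = [
    {"topic": "payments.events", "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "risk.signals",    "replicas": [2, 3, 4], "isr": [2, 3]},  # lagging
]

print(safe_to_patch(1, partitions))  # True: its partition is fully in sync
print(safe_to_patch(4, partitions))  # False: broker 4 hosts a lagging partition
```

Running this check between each broker restart is what makes a rolling patch safe: the next host is touched only once replication has fully caught up.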

Topic Onboarding

Application teams require topics for their application functionality. To make this process standardized, PayPal built an Onboarding Dashboard to submit a new topic request.

The diagram below shows the onboarding workflow for a topic.

A special review team checks the capacity requirements for the new topic and onboards it to one of the available clusters. They use a capacity analysis tool integrated into the onboarding workflow to make the decision.

For each new application being onboarded to the Kafka system, a unique token is generated. This token is used to authenticate the client’s access to the Kafka topic. As discussed earlier, an ACL is created for the specific application and topic based on their role.
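The token-plus-ACL onboarding step can be sketched as follows. The token scheme, field names, and application names are illustrative assumptions, not PayPal's actual design:

```python
# Sketch of application onboarding: each new application receives a unique
# token, and an ACL entry ties the application to its topic and role.
# All names and the token scheme are illustrative.
import secrets

acl_table = []

def onboard(app: str, topic: str, role: str) -> str:
    """Issue a unique token for the app and record its ACL for the topic."""
    token = secrets.token_hex(16)          # unique per application
    acl_table.append({"app": app, "topic": topic, "role": role})
    return token

t1 = onboard("checkout-service", "payments.events", "producer")
t2 = onboard("fraud-detector", "payments.events", "consumer")
print(t1 != t2)  # tokens are unique per application
print(acl_table)
```

Tying the token and the ACL record together at onboarding time means authentication and authorization are provisioned in one step, and the ACL table doubles as an inventory of who uses which topic.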

MirrorMaker Onboarding

As mentioned earlier, PayPal uses MirrorMaker for mirroring the data from one cluster to another.

For this setup, developers also use the Kafka Onboarding UI to register their requirements. After due checks by the Kafka team, the MirrorMaker instances are provisioned.

The diagram below shows the process flow for the same:

Conclusion

The Kafka platform at PayPal is a key ingredient for enabling seamless integration between multiple applications and supporting the scale of their operation.

Some important learnings to take away from this study are as follows:

References



