ByteByteGo, December 4, 2024
How LinkedIn Customizes Its 7 Trillion Message Kafka Ecosystem

 

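LinkedIn, where Kafka was first developed, uses Kafka throughout its vast infrastructure to handle enormous volumes of data, processing more than 7 trillion messages per day. The article looks at how LinkedIn manages more than 100 Kafka clusters and over 4,000 servers, and how it deals with the scalability and operability challenges that come with them. It describes LinkedIn's Kafka ecosystem, including the Kafka clusters, applications, REST proxy, Schema Registry, Brooklin, and Cruise Control, and explains how LinkedIn maintains its own Kafka releases through upstream-first and LinkedIn-first patch strategies and a defined Kafka development workflow. LinkedIn's practices show how the challenges of very large Kafka clusters can be handled and offer ideas for improving your own Kafka deployment and management.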

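🤔 **LinkedIn's Kafka ecosystem:** LinkedIn runs more than 100 Kafka clusters on over 4,000 servers, handling more than 100,000 topics and 7 million partitions and processing more than 7 trillion messages per day. The ecosystem comprises Kafka clusters, applications, a REST proxy, a Schema Registry, Brooklin, Cruise Control, and other components that keep Kafka stable and scalable.

⚙️ **LinkedIn Kafka version management:** LinkedIn maintains its own Kafka releases, based on official Apache Kafka releases, with patches added for its specific needs. These patches are managed either “upstream first” or “LinkedIn first”: the former changes the official code first, while the latter fixes an urgent problem first and then considers contributing the fix to the official release.

🔄 **Kafka development workflow:** When LinkedIn engineers modify Kafka or add a new feature, they choose the “upstream first” or “LinkedIn first” strategy based on urgency and time cost. Urgent production fixes usually go “LinkedIn first”; new features are encouraged to go “upstream first” so that the contribution flows back to the community.

📊 **LinkedIn Kafka patch examples:** To meet the needs of its large Kafka clusters, LinkedIn has developed a number of patches, such as scalability improvements (reducing controller memory usage, speeding up broker startup and shutdown) and operability improvements (simplifying the process of removing and adding brokers).

📈 **Continuous improvement of LinkedIn's Kafka:** The LinkedIn engineering team continuously improves and customizes its Kafka deployment to meet specific needs, and contributes many enhancements back to the open-source Apache Kafka project, helping the Kafka community grow.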


Disclaimer: The details in this post have been derived from the LinkedIn Engineering Blog. All credit for the technical details goes to the LinkedIn engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

LinkedIn uses Apache Kafka, an open-source stream processing platform, as a key part of its infrastructure. 

Kafka was first developed at LinkedIn and later open-sourced. Many companies now use Kafka, but LinkedIn uses it on an exceptionally large scale.

LinkedIn uses Kafka for multiple tasks like:

LinkedIn runs over 100 Kafka clusters with more than 4,000 servers (called brokers), handling over 100,000 topics and 7 million partitions. In total, LinkedIn's Kafka system processes more than 7 trillion messages every day.
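To put these figures in perspective, the short calculation below converts them into average rates. It is a back-of-the-envelope sketch only: it treats the quoted lower bounds as exact values and assumes load is spread evenly across brokers, which real traffic never is. Whether the 7 million partitions include replicas is not stated in the article.

```python
# Figures quoted in the article, treated as exact for a rough estimate.
MESSAGES_PER_DAY = 7_000_000_000_000   # "more than 7 trillion"
BROKERS = 4_000                        # "more than 4,000 servers"
PARTITIONS = 7_000_000                 # "7 million partitions"
SECONDS_PER_DAY = 24 * 60 * 60         # 86,400

# Average message rate across the whole fleet.
avg_messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY

# Assumption: perfectly even distribution across brokers.
avg_messages_per_second_per_broker = avg_messages_per_second / BROKERS
avg_partitions_per_broker = PARTITIONS / BROKERS

print(f"fleet-wide: {avg_messages_per_second:,.0f} msgs/s")
print(f"per broker: {avg_messages_per_second_per_broker:,.0f} msgs/s")
print(f"partitions per broker: {avg_partitions_per_broker:,.0f}")
```

Under these assumptions the fleet averages roughly 81 million messages per second, or about 20,000 messages per second and 1,750 partitions per broker.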

Operating Kafka at this scale creates challenges in scalability and operability. To tackle them, LinkedIn maintains its own version of Kafka, tailored to its production needs and scale. This includes LinkedIn-specific release branches that contain patches for its production requirements and feature needs.

In this post, we’ll look at how LinkedIn manages its Kafka releases running in production and how it develops new patches to improve Kafka for the community and internal usage.

LinkedIn’s Kafka Ecosystem

Let us first take a high-level look at LinkedIn’s Kafka ecosystem.

LinkedIn's Kafka ecosystem is a crucial part of its technology stack, enabling it to handle more than 7 trillion messages per day.

The ecosystem consists of several key components that work together to ensure smooth operation and scalability. See the diagram below:

The components include the Kafka clusters themselves, the applications that use them, a REST proxy, a Schema Registry, Brooklin, and Cruise Control.

All these components work together to form LinkedIn’s robust and scalable Kafka ecosystem. 

The ecosystem enables LinkedIn to handle the massive volume of real-time data generated by its users and systems while maintaining high performance and reliability.

The LinkedIn engineering team continuously improves and customizes its Kafka deployment to meet specific needs. The team also contributes many enhancements back to the open-source Apache Kafka project.



LinkedIn’s Kafka Release Branches

LinkedIn maintains its own versions of Kafka, which are based on the official open-source Apache Kafka releases. These versions are called LinkedIn Kafka Release Branches.

Each LinkedIn Kafka Release Branch starts from a specific version of Apache Kafka. For example, they might create a branch called “LinkedIn Kafka 2.3.0.x”, which is based on the Apache Kafka 2.3.0 release.

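LinkedIn makes changes and adds extra code (called “patches”) to these branches to help Kafka work better for its specific needs. There are two main ways of adding these patches: upstream first, where the change is made in the official Apache Kafka code first, and LinkedIn first, where an urgent problem is fixed in the LinkedIn branch first and the fix is then considered for contribution to Apache Kafka.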

The diagram below shows the approach to managing the releases:

In the LinkedIn Kafka Release Branches, one can find a mix of different types of patches:

When LinkedIn creates a new Kafka Release Branch, it starts from the latest Apache Kafka release. The team then goes through the previous LinkedIn branch and brings over any LinkedIn patches that have not yet been added to the official Apache Kafka code.

They use special notes in the code changes to keep track of which patches have been added to the official release and which are still just in the LinkedIn version. Also, they regularly check the Apache Kafka code and bring in new changes to keep their branch up to date.
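The carry-over step can be pictured as a simple filter over the previous branch's patches. The sketch below is illustrative only and is not LinkedIn's tooling: the `Patch` record, its `merged_upstream_in` field (standing in for the tracking notes mentioned above), and the function names are all hypothetical. It also assumes that a patch merged upstream in a release newer than the new base still has to be carried.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Patch:
    commit_id: str
    summary: str
    # Hypothetical tracking note: the Apache Kafka release that contains
    # this patch, or None if it exists only in the LinkedIn branch.
    merged_upstream_in: Optional[str]


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.3.0' into (2, 3, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def patches_to_carry_over(previous_branch: List[Patch], new_base: str) -> List[Patch]:
    """Return the patches the new base release does not already contain."""
    base = parse_version(new_base)
    return [
        patch
        for patch in previous_branch
        if patch.merged_upstream_in is None
        or parse_version(patch.merged_upstream_in) > base
    ]


previous = [
    Patch("a1", "LinkedIn-only hotfix", None),
    Patch("b2", "already upstream in an older release", "2.3.0"),
    Patch("c3", "upstream in the new base release", "2.4.0"),
    Patch("d4", "upstream only in a later release", "2.5.0"),
]
carried = patches_to_carry_over(previous, new_base="2.4.0")
print([patch.commit_id for patch in carried])  # ['a1', 'd4']
```

Comparing versions as integer tuples avoids the string-comparison trap where "2.10.0" would sort before "2.4.0".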

Finally, they test the new LinkedIn Kafka Release Branch intensively, using real data and usage patterns, to confirm that it works correctly and performs well before it is used in production at LinkedIn.

Kafka Development Workflow at LinkedIn

When LinkedIn engineers want to make a change or add a new feature to Kafka, they first have to decide whether to make the change in the official Apache Kafka code (called "upstream-first") or to make the change in LinkedIn's version of Kafka first (called "LinkedIn-first" or "hotfix approach").
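As a rough illustration, the choice can be modeled as a function of urgency. This is a minimal sketch under the assumption that urgency is the only input; the names `Strategy`, `choose_strategy`, and `plan_steps`, and the wording of the steps, are hypothetical and do not come from LinkedIn's tooling.

```python
from enum import Enum
from typing import List


class Strategy(Enum):
    UPSTREAM_FIRST = "upstream-first"
    LINKEDIN_FIRST = "LinkedIn-first"


def choose_strategy(urgent_production_fix: bool) -> Strategy:
    """Urgent production fixes go LinkedIn-first; other changes go upstream-first."""
    if urgent_production_fix:
        return Strategy.LINKEDIN_FIRST
    return Strategy.UPSTREAM_FIRST


def plan_steps(strategy: Strategy) -> List[str]:
    """Order in which a patch lands in each code base (illustrative wording)."""
    if strategy is Strategy.LINKEDIN_FIRST:
        return [
            "commit the fix to the LinkedIn release branch",
            "propose the change to Apache Kafka afterwards",
        ]
    return [
        "commit the change to Apache Kafka",
        "bring the upstream commit into the LinkedIn release branch",
    ]


for urgent in (True, False):
    strategy = choose_strategy(urgent)
    print(strategy.value, "->", plan_steps(strategy))
```

In practice the decision also weighs how long the upstream route would take, so a real version of `choose_strategy` would take more than one input.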

Here’s what the decision-making process looks like on a high level:

The diagram below shows the entire flow in more detail.

In summary, the decision between upstream-first and LinkedIn-first depends on factors such as the urgency of the change and the time each route would take.

Patch Examples from LinkedIn

LinkedIn has made several changes (called “patches”) to Kafka to help it work better for its specific needs. These patches fall into a few main categories:

Scalability Improvements

LinkedIn has some very large Kafka clusters, with over 140 brokers and millions of replicas in a single cluster.

With clusters this big, the controller (the broker that coordinates the cluster) is sometimes slow or runs out of memory. Brokers also often take a long time to start up or shut down.

To fix these problems, LinkedIn made patches to reduce the controller's memory footprint and to speed up broker startup and shutdown.

Operational Improvements

Sometimes, LinkedIn needs to remove brokers from a cluster and add new brokers. When they remove a broker, they want to make sure all the data on that broker is copied to other brokers first, so no data is lost. 

However, this was hard to do, because even while they were trying to move data off a broker, new data was constantly being added to it. 

To solve this, they created a new broker mode called “maintenance mode”. When a broker is in maintenance mode, no new data is added to it, which makes it much easier to move all the data off the broker before shutting it down.
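The effect of maintenance mode can be sketched as a placement rule that skips flagged brokers when replicas for new partitions are assigned. This is a toy model, not LinkedIn's patch or Kafka's actual assignment algorithm: the round-robin placement and the names `eligible_brokers` and `assign_replicas` are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set


def eligible_brokers(brokers: Iterable[int], in_maintenance: Set[int]) -> List[int]:
    """Brokers that may receive new replicas, in a stable order."""
    return sorted(b for b in brokers if b not in in_maintenance)


def assign_replicas(
    num_partitions: int,
    replication_factor: int,
    brokers: Iterable[int],
    in_maintenance: Set[int],
) -> Dict[int, List[int]]:
    """Place replicas round-robin, skipping brokers in maintenance mode."""
    candidates = eligible_brokers(brokers, in_maintenance)
    if replication_factor > len(candidates):
        raise ValueError("not enough brokers outside maintenance mode")
    return {
        partition: [
            candidates[(partition + replica) % len(candidates)]
            for replica in range(replication_factor)
        ]
        for partition in range(num_partitions)
    }


# Broker 3 is being drained, so new partitions never land on it.
assignment = assign_replicas(
    num_partitions=3,
    replication_factor=2,
    brokers=[1, 2, 3, 4],
    in_maintenance={3},
)
print(assignment)  # {0: [1, 2], 1: [2, 4], 2: [4, 1]}
```

Because the flagged broker receives nothing new, the set of replicas left to move off it only shrinks, which is what makes the drain terminate.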

New Features for Apache Kafka

LinkedIn has added several brand new features to their version of Kafka such as:

The LinkedIn engineering team also contributed many improvements directly to the Apache Kafka project, so everyone can benefit from them. 

Some major examples include:

Conclusion

To summarize, LinkedIn customizes Kafka heavily to handle the immense scale at which it operates. 

It also contributes many improvements upstream while maintaining release branches to rapidly address issues. Their development workflow and release branching are designed to balance urgency with contributions going back to the open-source community.

References:

