ByteByteGo 2024年09月17日
How Netflix Warms Petabytes of Cache Data
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍了Netflix的多种技术,包括EVCache、缓存暖器等,以优化低延迟应用和处理各种数据情况。

🎯EVCache是Netflix的分布式内存缓存解决方案,支持多种用例,如Lookaside Cache、Transient Data Store、Primary Store和High Volume Data等,可减少流媒体应用的延迟。

🔥Netflix为解决原有缓存扩展方法的问题,创建了缓存暖器(Cache Warmer),具有Replica Warmer 和 Instance Warmer 两大功能,能快速为新存储单元填充数据。

🧩缓存暖器系统由Controller、Dumper和Populator三个主要组件构成,Controller负责管理整个缓存暖过程,Dumper从现有缓存中提取数据,Populator负责将数据填充到新缓存中。

🌟Netflix广泛使用缓存暖器进行扩展,Instance Warmer也在生产中运行,提高了系统的效率和稳定性。

Free tickets to P99 CONF — 60+ low-latency engineering talks (Sponsored)

P99 CONF is the technical conference for anyone who obsesses over high-performance, low-latency applications. Engineers from Disney, Shopify, LinkedIn, Netflix, Google, Meta, Uber + more will be sharing 60+ talks on topics like Rust, Go, Zig, distributed data systems, Kubernetes, and AI/ML.

Join 20K of your peers for an unprecedented opportunity to learn from experts like Michael Stonebraker, Bryan Cantrill, Avi Kivity, Liz Rice & Gunnar Morling & more – for free, from anywhere.

GET YOUR FREE TICKET

Bonus: Registrants are eligible to enter to win 1 of 300 free swag packs, get 30-day access to the complete O’Reilly library & learning platform, plus free digital books.


Disclaimer: The details in this post have been derived from the Netflix Tech Blog. All credit for the technical details goes to the Netflix engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

The goal of Netflix is to keep users streaming for as long as possible. 

However, a user’s typical attention span is just 90 seconds. If the streaming application is not responding fast enough, there is a high chance that the user will drop off.

EVCache is one tool that helps Netflix reduce latency on their streaming app by supporting multiple use cases such as:

The diagram below shows the various use cases in more detail.

Netflix uses EVCache as a tier-1 cache. For reference, it is a distributed in-memory caching solution based on memcached that is integrated with Netflix OSS and AWS EC2 infrastructure.

At Netflix, EVCache holds petabytes of data comprising thousands of nodes and hundreds of clusters in production. These clusters are routinely scaled up due to the increasing growth of the Netflix user base and the user data generated.

The earlier process to make the cache bigger followed the below steps:


FusionAuth: Auth. Built for Devs, by Devs. (Sponsored)

FusionAuth is a complete auth & user platform that has 10M+ downloads and is trusted by industry leaders!

Start for Free


See the diagram below for reference:

This approach worked, but it was expensive because they had to run two systems simultaneously for a while. Also, this solution had a few more problems:

To fix these problems, Netflix created a new tool called the cache warmer. This tool had two main features:

For Instance Warmer, they copy data to the new empty node from another replica in the system. This is done because the other nodes in the system have continued to operate and update data while this node was down or being replaced. By choosing a replica that doesn’t include this node, they ensure they’re getting the most up-to-date data from nodes that have been continuously operational.


Cache Warmer Design

The diagram below shows the architectural overview of the cache warming system built by Netflix.

The cache warming system consists of three main components:

Let’s look at each component one by one.

Controller

The Controller is the manager of the entire cache warming process. It sets up the environment and creates a communication channel (SQS queue) between the Dumper and Populator. 

Some of the key functions performed by the controller are as follows:

Dumper

The Dumper is part of the EVCache sidecar, which is a separate service running alongside the main cache (memcached instance) on each node. Its job is to extract data from the existing cache.

The dumping process works as follows:

The Dumper supports configurable chunk sizes for flexibility. It is also capable of parallel processing of multiple key chunks.

Populator

The Populator is responsible for filling the new cache with data from the Dumper.

Here’s how it works:

The Populator starts working as soon as data is available and doesn’t wait for the full dump. It can auto-scale based on the amount of data available to process.

Instance Warmer

In a large EVCache deployment, individual nodes (servers) can sometimes be terminated or replaced due to hardware issues or other problems. This can cause:

To minimize these issues, the Netflix engineering team also developed an Instance Warmer that can quickly fill up replaced or restarted nodes with data.

The diagram below shows the overall architecture of the Instance Warmer.

The Instance Warmer works as follows:

The instance warming process is more efficient than full replica warming because it deals with a smaller fraction of data on each node. Also, it is targeted to specific nodes rather than entire replicas.

The Implementation Results

Netflix is extensively using the cache warmer for scaling if the TTL is greater than a few hours. This has helped them handle the holiday traffic efficiently.

The chart below shows the warming up of two new replicas from one of the two existing replicas. Existing replicas had about 500 million items and 12 Terabytes of data. The warm-up took around 2 hours to complete.

The Instance Warmer is also running in production and warming up a few instances every day. The chart below shows an instance getting replaced at around 5.27. It was warmed up in less than 15 minutes with about 2.2 GB of data and 15 million items.

Conclusion

In this post, we’ve looked at the cache-warming architecture implemented by Netflix in detail.

This flexible cache-warming architecture has allowed Netflix to warm petabytes of cache data by copying it from existing replicas to one or more new replicas. Also, it has made it easy to warm specific nodes that were terminated or replaced due to hardware issues.

One of the key takeaways is the cache warmer following a loosely coupled design where the Dumper and Populator are integrated through SQS. This shows how loosely coupled systems provide flexibility and extensibility in the long run as Netflix continues to enhance its system for greater efficiency.

References:

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Netflix EVCache 缓存暖器 低延迟应用
相关文章