ByteByteGo, July 11, 2024
Scaling to 1.2 Billion Daily API Requests with Caching at RevenueCat

 

RevenueCat, a platform for managing in-app subscriptions and purchases, handles over 1.2 billion API requests daily. To ensure speed and reliability at this scale, they implemented a sophisticated caching system using Memcached. This article dives into their architecture, highlighting key goals such as low latency, keeping cache servers warm, and data consistency, along with the strategies employed to achieve them.

🎯 **Low Latency:** RevenueCat prioritizes low latency by maintaining a pool of pre-established connections to cache servers, avoiding the overhead of TCP handshakes on each request. They also implement a 'fail-fast' approach, immediately marking unresponsive servers as down, preventing cascading failures and prioritizing system stability over perfect efficiency.

🎯 **Keeping Cache Servers Warm:** To ensure cache servers remain available and full of frequently accessed data, RevenueCat uses mirrored and gutter cache pools for failover and temporary caching. They also employ dedicated cache pools for high-value data, guaranteeing space and ensuring its availability. This strategy balances availability, cost, and resource utilization.

🎯 **Maintaining Data Consistency:** RevenueCat keeps the cache and backend systems in sync using techniques such as write failure tracking and carefully ordered CRUD operations. They also use sharding with consistent hashing to distribute data across multiple cache servers, reducing the impact of individual server failures.

🎯 **Monitoring and Observability:** RevenueCat emphasizes monitoring and observability to ensure the caching system's health and performance. This includes tracking key metrics like cache hit rate, latency, and server utilization to identify bottlenecks and optimize the system.



Disclaimer: The details in this post have been derived from the article originally published on the RevenueCat Engineering Blog. All credit for the details about RevenueCat’s architecture goes to their engineering team. The link to the original article is present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

RevenueCat is a platform that makes it easy for mobile app developers to implement and manage in-app subscriptions and purchases. 

The staggering part is that they handle over 1.2 billion API requests per day from the apps. 

At this massive scale, fast and reliable performance becomes critical. Some of it is achieved by distributing the workload uniformly across multiple servers.

However, an efficient caching solution is also essential.

Caching allows frequently accessed data to be quickly retrieved from fast memory rather than slower backend databases and systems. This can dramatically speed up response times. 

But caching also adds complexity since the cached data must be kept consistent with the source of truth in the databases. Stale or incorrect data in the cache can lead to serious issues.

For an application operating at the scale of RevenueCat, even small inefficiencies or inconsistencies in the caching layer can have a huge impact. 

In this post, we will look at how RevenueCat overcame multiple challenges to build a truly reliable and scalable caching solution using Memcached.


The Three Key Goals of Caching

RevenueCat has three key goals for its caching infrastructure:

    Low latency

    Keeping cache servers warm

    Maintaining data consistency

While these main goals are highly relevant to applications operating at scale, a robust caching solution also needs supporting features such as monitoring and observability, optimization, and some sort of automated scaling.

Let’s look at each of these goals in more detail and how RevenueCat’s engineering team achieved them.

Low Latency

There’s no doubt that latency has a huge impact on user experience. 

According to an often-cited Amazon statistic, every 100 ms of latency costs them 1% in sales. Whether or not the exact figure holds, there's no denying that latency impacts user experience.

Even small delays of a few hundred milliseconds can make an application feel sluggish and unresponsive. As latency increases, user engagement and satisfaction plummet.

RevenueCat achieves low latency in its caching layer through two key techniques.

1 - Pre-established connections

Their cache client maintains a pool of open connections to the cache servers. 

When the application needs to make a cache request, it borrows a connection from the pool instead of establishing a new TCP connection. A TCP handshake can nearly double cache response times, so reusing an existing connection avoids that overhead on every request.

But no decision comes without some tradeoff.

Keeping connections open consumes memory and other resources on both the client and server. Therefore, it’s important to carefully tune the number of connections to balance resource usage with the ability to handle traffic spikes.
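Here's a minimal sketch of the pre-established connection idea in Python. It is not RevenueCat's actual client; the pool size, timeout, and class name are assumptions for illustration.

```python
import queue
import socket

class MemcacheConnectionPool:
    """Minimal pool of pre-established TCP connections to one cache server."""

    def __init__(self, host, port, pool_size=8, timeout=0.1):
        # Open all connections up front so requests never pay for a TCP handshake.
        self._pool = queue.Queue(maxsize=pool_size)
        for _ in range(pool_size):
            conn = socket.create_connection((host, port), timeout=timeout)
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            self._pool.put(conn)

    def borrow(self):
        # Blocks briefly if every connection is busy; raises queue.Empty on timeout.
        return self._pool.get(timeout=0.05)

    def give_back(self, conn):
        self._pool.put(conn)
```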

2 - Fail-fast approach

If a cache server becomes unresponsive, the client immediately marks it as down for a few seconds and fails the request, treating it as a cache miss. 

In other words, the client will not retry the request or attempt to establish new connections to the problematic server during this period.

The key insight here is that even brief retry delays of 100ms can cause cascading failures under heavy load. Requests pile up, servers get overloaded, and the "retry storm" can bring the whole system down. Though it might sound counterintuitive, failing fast is crucial for a stable system.

But what’s the tradeoff here?

There may be a slight increase in cache misses when servers have temporary issues. But this is far better than risking a system-wide outage. A 99.99% cache hit rate is meaningless if 0.01% of requests trigger cascading failures. Prioritizing stability over perfect efficiency is the right call.
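Below is a rough sketch of the fail-fast behavior, assuming a cache client object that exposes get() and raises on network errors; the wrapper class and the mark-down window are illustrative, not RevenueCat's implementation.

```python
import time

class FailFastClient:
    """If a cache server errors, mark it down for a few seconds and treat
    requests as misses instead of retrying or reconnecting."""

    def __init__(self, client, mark_down_seconds=5.0):
        self.client = client                  # assumed: get(key), raises on errors
        self.mark_down_seconds = mark_down_seconds
        self.down_until = 0.0

    def get(self, key):
        if time.monotonic() < self.down_until:
            return None                       # server marked down: fail fast as a miss
        try:
            return self.client.get(key)
        except (ConnectionError, TimeoutError, OSError):
            self.down_until = time.monotonic() + self.mark_down_seconds
            return None                       # no retries, no new connections for now
```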

One potential enhancement here could be circuit breaking, where requests to misbehaving servers are disabled based on error rates and latency measurements. This is something Uber uses in its integrated cache solution, CacheFront.

However, aggressive timeouts and well-managed connection pools likely achieve similar results with far less complexity.

Keeping Cache Servers Warm

The next goal RevenueCat had was keeping the cache servers warm.

They employed several strategies to achieve this.

1 - Planning for Failure with Mirrored and Gutter Pools

RevenueCat uses fallback cache pools to handle failures. 

Their strategy is designed to handle cache server failures and maintain high availability. The two approaches they use are as follows:

    Mirrored pool: a secondary pool that receives a copy of every write, so it stays warm and can immediately serve reads if a primary server fails.

    Gutter pool: a small, normally empty pool that temporarily holds entries with short TTLs for keys whose primary server is down, absorbing traffic until the server recovers.

Here also, there are trade-offs to consider concerning server size.

Smaller servers limit the blast radius of a failure: losing one affects only a small slice of the key space, so the extra load that spills over to the backend stays modest. However, smaller servers are more easily overwhelmed by hot keys and mean more nodes to provision, monitor, and keep connections to.

(The diagram in RevenueCat's original article compares the two options.)

Bigger servers invert these trade-offs: there are fewer nodes to manage and more headroom per node, but each failure wipes out a larger fraction of the cache and sends a bigger traffic spike to the backend.

This is where the strategy of using a mirrored pool for fast failover and a gutter pool for temporary caching strikes a balance between availability and cost. 

The mirrored pool ensures immediate availability. The gutter pool, on the other hand, provides a cost-effective way to handle failures temporarily.

Generally speaking, it's better to design the cache tier based on a solid understanding of the backend capacity. Also, when sharding, the cache and backend sharding schemes should be orthogonal so that a single cache server going down translates into only a moderate load increase on each backend server.
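The sketch below shows how the mirrored and gutter pools could fit together, assuming pymemcache-style clients with get() and set(key, value, expire=...); the class, the 30-second gutter TTL, and the failover order are illustrative assumptions rather than RevenueCat's exact design.

```python
GUTTER_TTL = 30  # short TTL so gutter entries vanish soon after the primary recovers

class FallbackCache:
    """Primary pool, a mirrored pool kept warm for fast failover, and a small
    gutter pool that temporarily absorbs keys whose primary server is down."""

    def __init__(self, primary, mirror, gutter):
        self.primary, self.mirror, self.gutter = primary, mirror, gutter

    def set(self, key, value, ttl):
        self.primary.set(key, value, expire=ttl)
        self.mirror.set(key, value, expire=ttl)      # mirroring writes keeps the standby warm

    def get(self, key):
        try:
            return self.primary.get(key)
        except ConnectionError:
            value = self.mirror.get(key)             # warm standby serves immediately
            if value is None:
                value = self.gutter.get(key)         # last resort: temporary gutter entry
            return value

    def store_in_gutter(self, key, value):
        # Called after a backend fetch while the primary server for this key is down.
        self.gutter.set(key, value, expire=GUTTER_TTL)
```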




2 - Dedicated Pools

Another technique they employ to keep cache servers warm is to use dedicated cache pools for certain use cases.

Here's how the strategy works: certain high-value or expensive-to-recompute data is given its own cache pool instead of sharing space with the general-purpose pool, with keys routed to it based on their type or prefix.

The dedicated pools strategy has several advantages: the high-value data has guaranteed space and cannot be evicted by unrelated traffic, its hit rate becomes predictable, and each pool can be sized and tuned independently for its workload.
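A minimal sketch of how dedicated pools might be wired up, assuming generic cache clients and hypothetical key prefixes; RevenueCat's actual routing rules are not described at this level of detail.

```python
class PoolRouter:
    """Route high-value key prefixes to their own pool so they never compete
    for space with the long tail of general-purpose keys."""

    def __init__(self, default_pool, dedicated_pools):
        self.default_pool = default_pool
        self.dedicated_pools = dedicated_pools   # e.g. {"subscriber:": subscriber_pool}

    def _pool_for(self, key):
        for prefix, pool in self.dedicated_pools.items():
            if key.startswith(prefix):
                return pool
        return self.default_pool

    def get(self, key):
        return self._pool_for(key).get(key)

    def set(self, key, value, ttl):
        return self._pool_for(key).set(key, value, expire=ttl)
```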

3 - Handling Hot Keys

Hot keys are a common challenge in caching systems. 

They refer to keys that are accessed more frequently than others, leading to a high concentration of requests on a single cache server. This can cause performance issues and overload the server, potentially impacting the overall system.

There are two main strategies for handling hot keys:

Key Splitting

Here's how key splitting works: the hot key is split into multiple copies by appending a suffix (for example, key#1 through key#N). Each copy hashes to a different cache server, so read traffic spreads across the pool. Readers pick one copy at random, while writes update every copy so all readers see the same value.
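Here is a short sketch of key splitting, assuming a generic client with get()/set(); the split factor and key suffix format are illustrative.

```python
import random

SPLIT_FACTOR = 8  # number of copies for a known-hot key (assumed value)

def split_key(key, i):
    return f"{key}#{i}"

def read_hot_key(client, key):
    # Each reader picks one copy at random, spreading load across cache servers.
    return client.get(split_key(key, random.randrange(SPLIT_FACTOR)))

def write_hot_key(client, key, value, ttl):
    # Writes must update every copy so all readers observe the new value.
    for i in range(SPLIT_FACTOR):
        client.set(split_key(key, i), value, expire=ttl)
```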

Local Caching

Local caching is simpler than key splitting.

Here's how it works: when a client detects that a key is hot, it caches the value in its own process memory with a very short TTL. Most reads for that key are then served locally without touching the cache servers at all, at the cost of the value being slightly stale for up to the local TTL.
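A minimal in-process cache sketch follows; the hot-key detection is assumed to happen elsewhere, and the one-second local TTL is an illustrative choice.

```python
import time

class LocalHotKeyCache:
    """Tiny in-process cache in front of Memcached for keys flagged as hot.
    Values can be slightly stale, so the local TTL is kept very short."""

    def __init__(self, remote, local_ttl=1.0):
        self.remote = remote
        self.local_ttl = local_ttl
        self.entries = {}                      # key -> (value, expires_at)

    def get(self, key, is_hot):
        if is_hot:
            hit = self.entries.get(key)
            if hit and hit[1] > time.monotonic():
                return hit[0]                  # served from process memory, no network hop
        value = self.remote.get(key)
        if is_hot and value is not None:
            self.entries[key] = (value, time.monotonic() + self.local_ttl)
        return value
```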

Avoiding Thundering Herds

When a popular key expires, all clients may request it from the backend simultaneously, causing a spike. This is known as a "thundering herd".

RevenueCat largely avoids this situation because it maintains cache consistency by updating the cache during writes rather than relying on expiration. However, for systems that depend on low TTLs and invalidations driven by database changes, thundering herds can cause serious problems.

Some other potential solutions to avoid thundering herds are as follows:

    Leases or locks, so that only one client recomputes the value while the others wait briefly or serve the old value.

    Serving a stale value while a single background refresh repopulates the cache.

    Adding jitter to TTLs so that related keys don't all expire at the same moment.
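As an illustration of the first option, here is a lease-style sketch that uses Memcached's atomic add as a lock so that only one caller recomputes an expired value; the lock TTL, key format, and sleep interval are assumptions.

```python
import time

LOCK_TTL = 10  # seconds; long enough for a single recomputation

def get_with_recompute_lock(client, key, recompute, ttl):
    value = client.get(key)
    if value is not None:
        return value
    # 'add' succeeds for exactly one caller, so only one request hits the backend.
    if client.add(f"lock:{key}", "1", expire=LOCK_TTL):
        value = recompute()
        client.set(key, value, expire=ttl)
        client.delete(f"lock:{key}")
        return value
    time.sleep(0.05)                           # lost the race: wait briefly for the winner
    return client.get(key) or recompute()
```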

Cache Server Migrations

Sometimes cache servers have to be replaced while minimizing impact on hit rates and user experience.

RevenueCat has built a coordinated cache server migration system that consists of the following steps:

    Warming up the new cluster:

      Before switching traffic, the team starts warming up the new cache cluster.

      They populate the new cluster by mirroring all the writes from the existing cluster.

      This ensures that the new cluster has the most up-to-date data before serving any requests.

    Switching a percentage of reads:

      After the new cluster is sufficiently warm, the team gradually switches a percentage of read traffic to it.

      This allows them to test the new cluster’s performance and stability under real-world load.

    Flipping all traffic:

      Once the new cluster has proven its stability and performance, the traffic is flipped over to it.

      At this point, the new cluster becomes the primary cache cluster, serving all read and write requests.

      The old cluster is kept running for a while, with writes still being mirrored to it. This allows quick fallback in case of any issues.

    Decommissioning the old cluster:

      After a period of stable operation with the new cluster as the primary, the old cluster is decommissioned.

      This frees up resources and completes the migration process.

The diagram in RevenueCat's article shows the entire migration process.
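The migration flow can be summarized with a small router sketch: writes are mirrored to both clusters while reads shift gradually. The class and the read-percentage knob are hypothetical; they are just one way to express the steps above.

```python
import random

class MigrationRouter:
    """Mirror every write to the new cluster and send a configurable
    percentage of reads to it while it warms up."""

    def __init__(self, old_cluster, new_cluster, read_pct=0.0):
        self.old, self.new = old_cluster, new_cluster
        self.read_pct = read_pct   # 0.0 while warming, raised gradually, 1.0 after the flip

    def set(self, key, value, ttl):
        self.old.set(key, value, expire=ttl)
        self.new.set(key, value, expire=ttl)   # keeps the new cluster warm; after the flip,
                                               # the same mirroring keeps the old one as fallback

    def get(self, key):
        if random.random() < self.read_pct:
            return self.new.get(key)
        return self.old.get(key)
```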

Maintaining Data Consistency

Maintaining data consistency is one of the biggest challenges when using caching in distributed systems. 

The fundamental issue is that data is stored in multiple places - the primary data store (like a database) and the cache. Keeping the data in sync across these locations in the face of concurrent reads and writes is a non-trivial problem.

Here is a simple example of how a race condition can cause the cache and the database to diverge: a reader misses the cache and fetches a value from the database; before it writes that value back to the cache, another client updates the same row and refreshes the cache; the slow reader then populates the cache with the old value, leaving it stale until the entry expires.

RevenueCat uses two main strategies to maintain cache consistency.

1 - Write Failure Tracking

In RevenueCat's system, a cache write failure is a strong signal that there may be an inconsistency between the cache and the primary store. 

However, simply retrying the write is not a good option because, as discussed earlier, retries can lead to cascading failures and overload.

Instead, RevenueCat's caching client records all write failures. After recording, it deduplicates them and ensures that the affected keys are invalidated in the cache at least once (retrying as needed until successful). This guarantees that the next read for those keys will fetch fresh data from the primary store, resynchronizing the cache.

This write failure tracking allows them to treat cache writes as if they should always succeed, significantly simplifying their consistency model. They can assume the write succeeded, and if it didn't, the tracker will ensure eventual consistency.
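A simplified sketch of the idea follows, assuming a client whose delete() raises on network errors; RevenueCat's production tracker is certainly more involved (for example, persistence and background retries).

```python
import threading

class WriteFailureTracker:
    """Record keys whose cache writes failed, deduplicate them, and keep
    invalidating until each key has been invalidated at least once."""

    def __init__(self, client):
        self.client = client
        self.failed_keys = set()
        self.lock = threading.Lock()

    def record_failure(self, key):
        with self.lock:
            self.failed_keys.add(key)          # the set deduplicates repeated failures

    def flush(self):
        with self.lock:
            pending, self.failed_keys = self.failed_keys, set()
        for key in pending:
            try:
                self.client.delete(key)        # next read repopulates from the primary store
            except (ConnectionError, TimeoutError, OSError):
                self.record_failure(key)       # keep it queued; retry on the next flush
```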

2 - Consistent CRUD Operations

For each type of data operation (Create, Read, Update, Delete), they have developed a strategy to keep the cache and primary store in sync.

For reads, they use the standard cache-aside pattern: read from the cache, and on a miss, read from the primary store and populate the cache. They always use an "add" operation to populate, which only succeeds if the key doesn't already exist, to avoid overwriting newer values.
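In Python, the read path might look like the sketch below, assuming a pymemcache-style client where add() only stores the value if the key is absent; db_read is a placeholder for the primary-store lookup.

```python
def read_through(cache, db_read, key, ttl):
    value = cache.get(key)
    if value is not None:
        return value
    value = db_read(key)                   # cache miss: fetch from the primary store
    cache.add(key, value, expire=ttl)      # no-op if a concurrent writer got there first,
                                           # so a newer value is never overwritten
    return value
```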

For updates, they use a clever strategy as follows:

    Reduce the TTL of the existing cache entry to a short window of a few seconds.

    Write the update to the primary store.

    Write the new value to the cache, restoring the normal TTL.

If a failure occurs between steps 1 and 2, the cache remains consistent as the update never reaches the primary store. If a failure occurs between 2 and 3, the cache will be stale, but only for a short time until the reduced TTL expires. Also, any complete failures are caught by the write failure tracker that we talked about earlier.
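Here is what those three steps could look like in code, assuming a pymemcache-style client that supports touch() for adjusting a key's TTL; the 30-second window and the function names are illustrative.

```python
SHORT_TTL = 30  # seconds; bounds how long a stale entry can survive after a failure

def update(cache, db_write, key, new_value, ttl):
    # Step 1: shrink the TTL so any failure later leaves only a briefly-stale entry.
    cache.touch(key, expire=SHORT_TTL)
    # Step 2: write the new value to the primary store (the source of truth).
    db_write(key, new_value)
    # Step 3: refresh the cache with the new value and restore the normal TTL.
    cache.set(key, new_value, expire=ttl)
```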

For deletes, they use a similar TTL reduction strategy before the primary store delete. 

For creates, they rely on the primary store to provide unique IDs to avoid conflicts.

Conclusion

RevenueCat’s approach illustrates the complexities of running caches at a massive scale. While some details may be specific to their Memcached setup, the high-level lessons are widely relevant.

Here are some key takeaways to consider from this case study:

    Keep latency low with pre-established connections and a fail-fast policy; retry storms are more dangerous than a few extra cache misses.

    Plan for failure: mirrored, gutter, and dedicated pools keep caches warm and available when individual servers go down.

    Handle hot keys deliberately, either by splitting them across servers or by caching them locally in the client.

    Migrate cache clusters gradually: mirror writes, shift reads in percentages, and keep the old cluster around until the new one has proven itself.

    Treat consistency as a first-class concern: track write failures, and order CRUD operations (TTL reduction before updates, "add" on reads) so that races leave the cache stale only briefly.

References:















