ByteByteGo, September 24, 2024
How Uber Scaled Cassandra for Tens of Millions of Queries Per Second?

Uber’s Cassandra service supports millions of users and tens of millions of food and grocery deliveries, handling tens of millions of queries per second and storing petabytes of data. To reach this scale, Uber’s engineering team overcame numerous challenges, including node replacement issues, Lightweight Transaction error rates, and data inconsistency. This article takes a deep look at Uber’s Cassandra architecture and how the team addressed these challenges to achieve high availability and scalability.

**The node replacement challenge**: Node replacement is a critical part of any large-scale cluster, and it must preserve data consistency and service availability. The Uber team ran into several problems during Cassandra node replacements, such as nodes getting stuck in the decommissioning state, failed node additions, and data inconsistencies. To solve these problems, the team made improvements to Cassandra, such as proactively purging hint files for orphan nodes, dynamically adjusting the hint transfer rate limiter, and improving the bootstrap and decommission flows, ultimately raising node replacement reliability to 99.99%.

**Cassandra Lightweight Transaction error rates**: The Uber team found that Cassandra’s Lightweight Transactions had a relatively high error rate in certain use cases. These errors usually occurred when multiple nodes were replaced at the same time, causing the Gossip protocol to misbehave. To fix this, the team improved the Gossip protocol’s error handling and made it more robust.

**Data inconsistency issues**: Slow Cassandra repairs were causing data inconsistencies. To solve this, the Uber team implemented repair orchestration inside Cassandra itself, ensuring that all tables and ranges get repaired. The mechanism uses a dedicated thread pool and maintains a new replicated table that records node status and repair history.

**Lessons from scaling the Cassandra service**: Uber’s experience shows that scaling a Cassandra service means solving many challenges. The team achieved high availability and scalability by improving Cassandra itself and building a strong control plane. These lessons are also useful for other companies that run Cassandra.

✂️ Cut your QA cycles down to minutes with QA Wolf (Sponsored)

If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.

Their AI-native approach gets engineering teams to 80% automated end-to-end test coverage and helps them ship 5x faster by reducing QA cycles from hours to minutes.

QA Wolf takes testing off your plate. They can get you:

The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.

With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.

Schedule a demo to learn more


Disclaimer: The details in this post have been derived from the Uber Engineering Blog. All credit for the technical details goes to the Uber engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Uber’s app is a wonderful piece of engineering, enabling the movement of millions of people around the world and tens of millions of food and grocery deliveries.

One common thing across each trip and delivery is the need for low-latency and highly reliable database interactions. Uber made this a reality with their Cassandra database as a service platform.

Supporting a multitude of Uber’s mission-critical OLTP workloads for more than six years, the Cassandra database as a service has achieved some amazing stats:

However, this scale wasn’t achieved overnight. Over the years, the Uber engineering team faced numerous operational challenges. 

In this post, we’ll pull back the curtain on the architecture of Uber’s Cassandra setup and the multiple problems the team solved to reach the desired scale.

The Cassandra Team Responsibilities

As mentioned, Uber runs Cassandra as a managed service. A dedicated team takes care of the platform's day-to-day operations. Here are the responsibilities of the team:

Architecture of Uber’s Cassandra Setup

The diagram below shows the overall Cassandra ecosystem at Uber.

At a high level, a Cassandra cluster spans regions, with data replicated between them.

Uber’s in-house stateful management system, Odin, handles the configuration and orchestration of thousands of clusters. Together, these components form the Cassandra Managed Service, which powers different types of workloads, ranging from read-skewed to mixed to write-skewed.

See the diagram below to understand Odin’s role:

Let’s now look at the major components of the Cassandra database as a service architecture:

Cassandra Framework

This is an in-house framework developed by Uber and is responsible for the lifecycle of running Cassandra in Uber’s production environment.

The framework is powered by Uber’s stateful control plane Odin. It adheres to Odin’s standards and abstracts the complexity of Cassandra’s various functionalities such as:

Cassandra Client

The Cassandra client is the interface between the applications and the Cassandra clusters. 

Uber forked the Go and Java open-source Cassandra clients and adapted them to work in Uber’s ecosystem. These clients use the service discovery mechanism to find the initial nodes to connect to. This way, there’s no need to hardcode Cassandra endpoints in the application layer.
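As a rough illustration of this idea (not Uber’s actual client code), the sketch below builds a session from contact points returned by a hypothetical DiscoveryClient. The DiscoveryClient interface, resolveNodes method, cluster name, and datacenter name are assumptions; the session builder calls come from the open-source DataStax Java driver.

import java.net.InetSocketAddress;
import java.util.List;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.CqlSessionBuilder;

public class DiscoveredSessionFactory {

    // Hypothetical stand-in for a service discovery lookup.
    public interface DiscoveryClient {
        List<InetSocketAddress> resolveNodes(String clusterName);
    }

    public static CqlSession connect(DiscoveryClient discovery, String clusterName) {
        // Seed the driver with whatever nodes discovery returns right now;
        // the driver then learns the full ring topology from these contact points.
        CqlSessionBuilder builder = CqlSession.builder()
                .withLocalDatacenter("dc1"); // placeholder datacenter name
        for (InetSocketAddress node : discovery.resolveNodes(clusterName)) {
            builder = builder.addContactPoint(node);
        }
        return builder.build();
    }
}

The driver only needs a handful of live contact points to bootstrap; it discovers the rest of the ring on its own, which is why the application never has to hardcode endpoints.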

Some of the enhancements made by Uber within these clients are around additional observability features such as:

Service Discovery

Service discovery is a critical piece of large-scale distributed systems. It helps discover service instances on the fly, avoiding hardcoded endpoints and unnecessary configuration.

See the diagram below that shows the concept of service discovery: 

At Uber, each Cassandra cluster is uniquely identified, and its nodes can be discovered in real time.

Whenever a Cassandra node changes its status (Up, Down, or Decommission), the framework notifies the service discovery about the change. Applications (or service consumers) use the service discovery as the first contact point to connect to a Cassandra node. As the nodes change their status, the service discovery adjusts the list.
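The notification path could look roughly like the following sketch; NodeState, ServiceDiscovery, updateNode, and onNodeStateChange are all hypothetical names, since the actual integration between Uber’s framework and its discovery system isn’t public.

public class NodeStateNotifier {

    // Hypothetical node states mirroring Up, Down, and Decommission.
    public enum NodeState { UP, DOWN, DECOMMISSIONED }

    // Hypothetical interface to the service discovery system.
    public interface ServiceDiscovery {
        void updateNode(String clusterId, String nodeId, NodeState state);
    }

    private final ServiceDiscovery discovery;

    public NodeStateNotifier(ServiceDiscovery discovery) {
        this.discovery = discovery;
    }

    // Called by the Cassandra framework whenever a node changes state, so that
    // applications resolving the cluster always receive a fresh node list.
    public void onNodeStateChange(String clusterId, String nodeId, NodeState state) {
        discovery.updateNode(clusterId, nodeId, state);
    }
}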

Challenges of Scaling Uber’s Cassandra Service

Since the inception of the Cassandra service at Uber, it has continued to grow every year, and more critical use cases have been added.

The service was hit with significant reliability challenges. Let’s look at some of the most important ones in more detail.

1 - Unreliable Node Replacement

Node replacement is a critical part of any large-scale fleet. There are multiple reasons for node replacement:

Cassandra provides a graceful node replacement approach that decommissions the existing node and adds a new node. But there were a few hiccups such as:

Not every node replacement faces these issues, but even a small failure percentage can impact the entire fleet and adds operational overhead for the team. For example, a 95% success rate still means 5 failures out of every 100 node replacements. If 500 nodes are replaced every day, that’s 25 failures, which can easily keep 2 engineers busy just recovering from them.

It was important to fix the root problem: Cassandra was not cleaning up hint files for orphan nodes.

Think of it like this: You have a big family message board at home. When a family member moves out, everyone keeps their old sticky notes about them on the board. When you decide to get a new message board, you carefully copy all these notes to the new board.

The same was happening in Cassandra. A legitimate node N1 may store hint files locally for a peer node that was once part of the Cassandra ring. However, even after the peer node leaves the ring, N1 does not purge those hint files. When N1 is decommissioned, it transfers all the orphan hint files to its successor. Over time, the hint files keep growing, resulting in terabytes of garbage hints whose transfer can take multiple days.
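Conceptually, the root fix is to check whether the node a hint file targets is still a ring member and delete the file if it is not. The sketch below only illustrates that idea with hypothetical RingMembership and HintStore types; it is not the actual patch applied to Cassandra’s hints subsystem.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.UUID;

public class OrphanHintPurger {

    // Hypothetical view of the ring: answers whether a host is still a member.
    public interface RingMembership {
        boolean isMember(UUID hostId);
    }

    // Hypothetical view of local hint storage: hint files grouped by target host.
    public interface HintStore {
        List<UUID> targetHosts();
        List<Path> hintFiles(UUID hostId);
    }

    // Delete hint files whose target host has already left the ring, so they
    // are never shipped to a successor node during decommission.
    public static void purgeOrphanHints(RingMembership ring, HintStore store) throws IOException {
        for (UUID host : store.targetHosts()) {
            if (!ring.isMember(host)) {
                for (Path file : store.hintFiles(host)) {
                    Files.deleteIfExists(file);
                }
            }
        }
    }
}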

The team made a few changes in Cassandra:

Another fix was related to the node decommission step erroring out due to parallel activity, such as a rolling restart during fleet upgrades. The control plane wasn’t able to probe Cassandra for the decommission state. To handle this, the team improved Cassandra’s bootstrap and decommission paths by exposing their state, so the control plane could get the current status and take the necessary action.

See the snippet below that explains the approximate change:

public boolean isDecommissionFailed()
{
    if (operationMode == Mode.LEAVING && hasDecommissionFailed)
    {
        return true;
    }
    return false;
}

public boolean isBootstrapFailed()
{
    if (operationMode == Mode.JOINING && hasBootstrapFailed)
    {
        return true;
    }
    return false;
}

Source: Uber’s Engineering Blog

After these changes, the node replacement became 99.99% reliable.

2 - Cassandra’s Lightweight Transactions Error Rate

A few business use cases relied on Cassandra’s Lightweight Transactions at scale. However, these use cases suffered elevated error rates every other week.

There was a general belief that Cassandra’s Lightweight Transactions were unreliable. One of the errors was caused by pending ranges resulting from multiple simultaneous node replacements. It’s similar to a mismanaged library where too many librarians are replaced at once and keeping track of the books becomes difficult.

When a new node N2 replaces an old node N1, the Gossip code path on N2 keeps pointing to N1’s IP address, resulting in DNS resolution errors. Ultimately, N2 could not function as expected, and restarting it was the only option.

Uber’s engineering team improved the error handling inside the Gossip protocol, making Cassandra Lightweight Transactions more robust.
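The exact patch lives inside Cassandra’s gossip code, but its spirit can be sketched as follows; GossipPeer and resolveEndpoint are hypothetical names used only to show the idea of tolerating a stale address instead of letting the failure wedge the node.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;

public class GossipEndpointResolver {

    // Hypothetical peer record carrying the last known hostname for a ring member.
    public record GossipPeer(String hostname) {}

    // Resolve a peer's address, treating a stale DNS entry (for example, the old
    // hostname of a replaced node) as a recoverable condition rather than an
    // error that leaves the node wedged until a restart.
    public static Optional<InetAddress> resolveEndpoint(GossipPeer peer) {
        try {
            return Optional.of(InetAddress.getByName(peer.hostname()));
        } catch (UnknownHostException staleEntry) {
            // The peer was likely replaced; skip it for now and let a later
            // gossip round supply the successor's fresh address.
            return Optional.empty();
        }
    }
}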

3 - Data Inconsistency Issues

Another problem was related to data inconsistency due to sluggish Cassandra repairs.

Cassandra repairs are an important activity for every Cassandra cluster to fix data inconsistencies. While there are open-source solutions to trigger repairs, Uber did not want a control-plane-based solution. In their view, repair should be an integral part of Cassandra, similar to compaction.

With this goal, the Uber Cassandra team implemented the repair orchestration inside Cassandra itself. See the diagram below:

On a high level, they assigned a dedicated thread pool to the repair scheduler. The repair scheduler inside Cassandra maintains a new replicated table for the node status and repair history.

The scheduler picks the node whose last repair ran the earliest and continues orchestration to ensure that every table and all of its ranges get repaired.
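As a rough sketch of what this bookkeeping might look like (the real table name, columns, and scheduling policy in Uber’s fork are not public, so everything below is hypothetical), the scheduler could keep a replicated repair-history table and always pick the target whose last repair is oldest:

import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class RepairScheduler {

    // Hypothetical schema for the replicated bookkeeping table.
    static final String REPAIR_HISTORY_DDL =
            "CREATE TABLE IF NOT EXISTS repair_meta.history ("
          + "  node_id uuid, keyspace_name text, table_name text, range_start text,"
          + "  last_repaired_at timestamp, status text,"
          + "  PRIMARY KEY ((node_id), keyspace_name, table_name, range_start))";

    // One repairable unit: a table range on a node plus its last repair time.
    public record RepairTarget(String table, String range, Instant lastRepairedAt) {}

    // Pick the least recently repaired target so that every table and every
    // range eventually gets its turn, much like compaction running continuously.
    public static Optional<RepairTarget> nextTarget(List<RepairTarget> targets) {
        return targets.stream()
                .min(Comparator.comparing(RepairTarget::lastRepairedAt));
    }
}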

Since the implementation of a fully automated repair scheduler inside Cassandra, there has been no dependency on the control plane, which reduced the operational overhead significantly.

Conclusion

Uber’s Cassandra setup is a testament to the importance of incremental changes to build a large-scale fleet. 

In this post, we’ve taken a deep look at the architecture of Uber’s Cassandra setup and the design of the managed service. We also looked at multiple challenges the engineering team overcame to make the Cassandra database service more reliable at scale.

Reference: 
