ByteByteGo, December 11, 2024
How Tinder Recommends To 75 Million Users with Geosharding


Building AI Apps on Postgres? Start with pgai (Sponsored)

pgai is a PostgreSQL extension that brings more AI workflows to PostgreSQL, like embedding creation and model completion. pgai empowers developers with AI superpowers, making it easier to build search and retrieval-augmented generation (RAG) applications. It automates embedding creation with pgai Vectorizer, keeping your embeddings up to date as your data changes, with no manual syncing required. Available free on GitHub or fully managed in Timescale Cloud.

pgai on GitHub


Disclaimer: The details in this post have been derived from the Tinder Technical Blog. All credit for the technical details goes to the Tinder engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Tinder is a dating app that handles billions of swipes daily, matching over 75 million users worldwide.

Their recommendation engine must deliver matches with speed and precision. However, managing a global user base and ensuring seamless performance isn’t easy, especially when searches involve massive amounts of data spread across different regions. 

To overcome the challenge, Tinder adopted Geosharding: a method of dividing user data into geographically bound "shards." 

This approach enabled the recommendation engine to focus searches only on relevant data, dramatically improving performance and scalability. The system now handles 20 times more computations than before while maintaining low latency and delivering matches faster than ever.

In this post, we’ll explore how Geosharding works, the architecture behind it, and the techniques Tinder uses to ensure data consistency.


The Initial Single-Index Approach

When Tinder started using Elasticsearch to manage its recommendation system, it stored all user data in a single "index".

Think of this index as one massive database holding information about every Tinder user worldwide. While this worked fine when the platform was smaller, it caused significant problems as Tinder grew.

The biggest problem was that every query had to scan this single global index, even though only a small, location-bound slice of the data was relevant to any given user. As the user base grew, searches became slower and the index became increasingly expensive to scale.


The MOST Hands-On Training on AI Tools you’ll ever attend, for free (Sponsored)

The biggest MYTH about AI is that it is for people in tech. Or people who can code.

In reality, AI can be game-changing for you whether you are a curious 11-year-old kid who wants to research or a 50-year-old professional who wants to stay relevant in 2024.

Join the 3 hour AI Tools Training (usually $399) but free for the first 100 readers.

By the way, here's a sneak peek into what's inside the workshop, and a lot more that you're not ready for, all in just 3 hours!

1.5 Million people are already RAVING about this hands-on Training on AI Tools. Don’t take our word for it? Attend for yourself and see.

Register here

(first 100 people get it for free + $500 bonus)


The Geosharding Solution

Geosharding was the method Tinder adopted to address the inefficiencies of using a single large index for its recommendation system. 

The idea was to divide the global user base into smaller, localized groups, or "shards," based on location. Each shard contains data for users in a specific region, allowing the system to focus its searches only on the most relevant data for a particular query.

How Geosharding Works

Imagine Tinder’s user base as a map of the world. 

Instead of putting all user data into one massive database (like a single, global shelf), the map is divided into smaller sections or shards. Each shard corresponds to a geographic region. 

For example, users in New York City belong to one shard, while users in London belong to another.

By organizing users this way, the system searches only the shard that matches a user's location, avoiding the need to sift through irrelevant data.

Geosharding improves query performance because each query now scans a small, location-relevant index instead of the entire global one: smaller indices mean faster lookups, and the work of serving queries is spread across many shards rather than concentrated on a single index.

Optimizing Shard Size

Despite the obvious benefits of sharding, finding the "right" size for a shard was crucial for maintaining system efficiency. 

Several factors went into this balance, such as query efficiency and how evenly user activity is spread across regions: shards that are too large bring back the original problem of scanning irrelevant data, while shards that are too small fragment queries and add operational overhead.

Algorithm and Tools Used For Geosharding

To implement Geosharding effectively, Tinder needed tools and algorithms that could efficiently divide the world into geographic shards while ensuring these shards were balanced in terms of user activity. 

Two key components made this possible: Google’s S2 Library and a container-based load balancing method. Let’s look at them both in more detail.

The S2 Library: Mapping the Globe Into Cells

The S2 Library is a powerful tool developed by Google for spatial mapping. It divides the Earth’s surface into a hierarchical system of cells, which are smaller regions used for geographic calculations. 

Source: S2 Geometry

S2 projects the Earth's surface onto a hierarchy of cells: level 0 consists of the six faces of a cube wrapped around the globe, and each level subdivides its parent into four smaller cells, down to cells of roughly a square centimeter at level 30. The cells are numbered along a Hilbert space-filling curve, so cells with nearby IDs are also geographically close to one another.

This structure suited Tinder's use case well: the hierarchy lets engineers pick a cell level whose size matches the desired shard granularity, the space-filling-curve ordering preserves locality so nearby users fall into nearby cells, and mapping a latitude/longitude pair to its cell is a fast, purely computational operation.
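To make the idea concrete, here is a minimal sketch, not taken from Tinder's codebase, of how a user's coordinates can be mapped to an S2 cell token that serves as a geoshard key. It assumes the s2sphere package (a Python port of the S2 library); the cell level below is an arbitrary illustrative value, not Tinder's actual setting.

```python
import s2sphere

SHARD_CELL_LEVEL = 7  # hypothetical level; coarser levels produce larger geoshards

def geoshard_key(lat: float, lng: float) -> str:
    """Map a latitude/longitude pair to the S2 cell token of its geoshard."""
    point = s2sphere.LatLng.from_degrees(lat, lng)
    leaf = s2sphere.CellId.from_lat_lng(point)        # level-30 leaf cell
    return leaf.parent(SHARD_CELL_LEVEL).to_token()   # walk up to the shard-level cell

# Nearby users land in the same cell; a distant user lands in a different one.
print(geoshard_key(40.7580, -73.9855))  # Times Square, New York
print(geoshard_key(40.7484, -73.9857))  # Empire State Building, New York
print(geoshard_key(51.5074, -0.1278))   # central London
```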

Load Balancing: Evenly Distributing Users Across Geoshards

Once the world was divided into S2 cells, the next challenge was to balance the "load" across these cells. 

Load refers to the activity or number of users within each shard. Without proper balancing, some shards could become "hot" (overloaded with users), while others remained underutilized. 

To solve this, Tinder used a container-based load-balancing approach: each S2 cell is assigned a load score based on user activity within it, and cells are then grouped into a fixed number of "containers" (the geoshards) so that every container ends up carrying roughly the same total load.
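The following is a simplified sketch of the container idea, not Tinder's implementation: cells are walked in their space-filling-curve order and packed into the current container until a target load is reached, so every geoshard carries roughly equal load while remaining geographically contiguous. The cell tokens and load numbers are made up purely for illustration.

```python
def build_geoshards(cells, num_shards):
    """cells: list of (cell_token, load) pairs, in space-filling-curve order."""
    total_load = sum(load for _, load in cells)
    target = total_load / num_shards

    shards, current, current_load = [], [], 0.0
    for token, load in cells:
        current.append(token)
        current_load += load
        # Close this container once it reaches the target load,
        # unless it is the last container (which takes whatever remains).
        if current_load >= target and len(shards) < num_shards - 1:
            shards.append(current)
            current, current_load = [], 0.0
    if current:
        shards.append(current)
    return shards


# Example: six cells with uneven load packed into three geoshards.
cells = [("89c25", 120), ("89c27", 30), ("89c2b", 90),
         ("47d69", 60), ("47d6b", 10), ("47d6d", 140)]
print(build_geoshards(cells, num_shards=3))
# [['89c25', '89c27'], ['89c2b', '47d69'], ['47d6b', '47d6d']]
```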

The Abstraction Layer

After finalizing the Geosharding algorithm, Tinder designed a scalable and efficient architecture that relies on an abstraction layer to handle user data seamlessly across Geoshards. 

This layer simplifies interactions between the application, the recommendation system, and the geosharded Elasticsearch cluster. 

See the diagram below:

In practice, the application and the recommendation engine send requests to this layer rather than to individual geoshards; the layer determines which geoshard (or geoshards) a request concerns, routes the query to the corresponding indices, and combines the results before returning them.

Multi-Index vs Multi-Cluster

When building the Geosharded recommendation system, Tinder had to decide how to organize its data infrastructure to manage the geographically separated shards efficiently. 

This led to a choice between two approaches: multi-index and multi-cluster. 

Each approach had pros and cons, but Tinder ultimately chose the multi-index approach. Let’s break it down in simple terms.

What is the Multi-Index Approach?

In the multi-index setup, all geoshards live inside a single Elasticsearch cluster, with one index per geoshard. Because the cluster is shared, a single search request can target several geoshard indices at once, which keeps cross-shard queries simple and native to Elasticsearch.
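As a rough sketch of what this enables, the snippet below uses the Elasticsearch Python client to query several geoshard indices in a single request. The index naming convention, field names, and client setup are assumptions for illustration, not Tinder's actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_geoshards(shard_tokens, min_age, max_age, size=100):
    # Hypothetical convention: one index per geoshard, named after its cell token.
    indices = ",".join(f"recs_geoshard_{token}" for token in shard_tokens)
    # One request can target several indices at once, so cross-shard searches
    # stay native to Elasticsearch instead of requiring cross-cluster plumbing.
    return es.search(
        index=indices,
        query={"range": {"age": {"gte": min_age, "lte": max_age}}},
        size=size,
    )

# Example: a user near a shard boundary searches their own shard plus a neighbor.
results = search_geoshards(["89c25", "89c27"], min_age=25, max_age=35)
print(results["hits"]["total"])
```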

The alternative was a multi-cluster setup that involved creating separate Elasticsearch clusters for each geoshard. While this approach has some advantages, it wasn’t the right fit for Tinder due to a lack of native support for cross-cluster queries and higher maintenance overhead.

To overcome the limitations of the multi-index setup, particularly the risk of uneven load distribution, Tinder leaned on load balancing: the container-based balancing described earlier keeps user activity even across geoshards, and the randomized shard placement described in the next section keeps traffic even across physical nodes.

Handling Time Zones: Balancing Traffic Across Geoshards

One of the key challenges Tinder faced with Geosharding was the variation in traffic patterns across Geoshards due to time zones. 

See the diagram below that shows the traffic pattern of two Geoshards during a 24-hour time span:

Users within the same geoshard are typically in the same or adjacent time zones, meaning their active hours tend to overlap. For example, a geoshard covering Japan peaks during the Japanese evening, while a geoshard covering the eastern United States peaks many hours later.

If shards were assigned directly to physical servers without considering time zone effects, some servers would be overloaded during peak hours for one shard, while others would remain idle.

To solve this problem, Tinder implemented a randomized distribution of shards and replicas across physical nodes in the Elasticsearch cluster. 

Because each physical node ends up hosting shards and replicas drawn from many different time zones, every node sees a mix of peak and off-peak traffic at any given moment, so its aggregate load stays far flatter over a 24-hour period than if it served a single region.
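A toy simulation, not from the Tinder article, illustrates the effect. Each shard's traffic is modeled as a simple daily curve that peaks in its local evening; a node holding shards from one time zone sees them all peak together, while a node holding shards from spread-out time zones sees a much flatter combined load.

```python
import math

def shard_load(hour, peak_hour):
    """Toy daily traffic curve for one shard, peaking at `peak_hour` local time."""
    return 50 + 50 * math.cos((hour - peak_hour) / 24 * 2 * math.pi)

hours = range(24)

# Node A holds three shards from the same time zone: they all peak together.
same_tz_peak = max(sum(shard_load(h, p) for p in (21, 21, 21)) for h in hours)

# Node B holds three shards whose regions are roughly 8 hours apart.
mixed_tz_peak = max(sum(shard_load(h, p) for p in (5, 13, 21)) for h in hours)

print("peak node load, shards from one time zone:", round(same_tz_peak))     # ~300
print("peak node load, shards from mixed time zones:", round(mixed_tz_peak)) # ~150
```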

The Overall Cluster Design

Tinder’s Geosharded recommendation system was built using a carefully designed cluster architecture to handle billions of daily swipes while maintaining speed, reliability, and scalability. 

The cluster architecture divides the responsibilities among different types of nodes, each playing a specific role in ensuring the system’s performance and fault tolerance.

See the diagram below:

The key components of the cluster are as follows:

Master Nodes

These are the managers of the cluster, responsible for overall health and coordination.

They keep track of which shards are stored on which data nodes. They also monitor the status of the nodes and redistribute shards if a node fails.

Master nodes don’t handle user queries directly. Their role is to manage the system and ensure everything runs smoothly.

Coordinating Nodes

The coordinating nodes act as the traffic controllers of the system.

When a user performs a search or swipe, the request goes to a coordinating node. This node determines which geoshard indices the query needs to touch and which data nodes currently hold those shards, then forwards the query to them.

Once the results are collected from the data nodes, the coordinating node aggregates them and sends the final response back to the user.

Data Nodes

These are the workers of the cluster, responsible for storing the actual data and processing queries. Data nodes execute the search queries sent by coordinating nodes and return the results.

Each data node holds multiple Geoshards along with replicas of shards whose primary copies live on other nodes. Replicas provide redundancy: if a node fails, its shards can still be served from the copies stored elsewhere in the cluster.
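As a rough illustration of how replicas are declared, the snippet below creates one geoshard index with a single primary shard and two replicas using the Elasticsearch Python client. The index name and the shard/replica counts are illustrative assumptions, not Tinder's actual configuration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One Elasticsearch index per geoshard; replicas are extra copies that
# Elasticsearch places on different data nodes for redundancy and read throughput.
es.indices.create(
    index="recs_geoshard_89c25",      # hypothetical geoshard index name
    settings={
        "number_of_shards": 1,        # primary shards for this geoshard index
        "number_of_replicas": 2,      # copies kept on other data nodes
    },
)
```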

Consistency Challenges

Maintaining data consistency was another significant challenge in Tinder’s Geosharded system.

This complexity arises because users frequently move between locations, and their data must be dynamically shifted between Geoshards. Without proper handling, these transitions could lead to inconsistencies, such as failed writes, outdated information, or mismatches between data locations.

The key challenges were keeping writes for the same user in order when several updates arrive close together, reading data that has only just been written, and moving a user's record between shards without losing or duplicating it.

Tinder implemented a combination of strategies to address these issues and ensure consistent, reliable data handling across Geoshards. Let’s look at a few of those techniques in detail.

1 - Guaranteed Write Ordering with Apache Kafka

When multiple updates occur for the same user data, ensuring they are processed in the correct order is critical to avoid inconsistencies.

Apache Kafka was used as the backbone for managing data updates because it guarantees that messages within a partition are delivered in the same order they were sent.

Tinder assigns each user a unique key (for example, their user ID) and uses consistent hashing to map updates for that user to a specific Kafka partition. This means all updates for a particular user are sent to the same partition, ensuring they are processed in the order they are produced.

See the diagram below:

The consumers (parts of the system that read and process messages) read updates sequentially from the partitions, ensuring that no updates are skipped or processed out of order.

This guarantees that the latest update always reflects the most recent user activity.
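A minimal sketch of this pattern with the kafka-python client is shown below; the topic name and message shape are assumptions for illustration. Because each message is keyed by user ID, the default partitioner hashes the key to a fixed partition, which is what preserves per-user ordering.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_user_update(user_id: str, update: dict):
    # Same key -> same partition -> updates for one user are consumed in order.
    producer.send("user-updates", key=user_id, value=update)

publish_user_update("user-123", {"event": "location_change", "lat": 40.75, "lng": -73.98})
publish_user_update("user-123", {"event": "profile_update", "bio": "new bio"})
producer.flush()
```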

2 - Strongly Consistent Reads with the Elasticsearch Get API

Elasticsearch is a "near real-time" search engine, meaning recently written data may not be immediately available for queries. This delay can lead to inconsistencies when trying to retrieve or update data.

Tinder, however, leveraged Elasticsearch’s Get API, which forces the index to refresh before retrieving data. This ensures that any pending updates are applied before the data is accessed.
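In client code this is a plain document GET rather than a search. A minimal sketch with the Elasticsearch Python client follows; the index name and document ID are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Unlike a search, the Get API is real-time: if the document has pending
# changes, Elasticsearch refreshes the relevant segment before returning it,
# so the caller always sees the latest indexed version.
doc = es.get(index="recs_geoshard_89c25", id="user-123")
print(doc["_source"])
```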

3 - Reindexing and Refeeding

In cases where data inconsistencies occurred due to upstream failures or mismatches between shards, a mechanism was needed to realign the datastore with the source of truth.

Data from the source shard was copied to the target shard when a user moved between Geoshards.

If inconsistencies were detected (e.g., missing data in a shard), the system periodically re-synced the search datastore with the source datastore using a background process.

This ensured that even if temporary errors occurred, the system would self-correct over time.
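A hedged sketch of the shard-to-shard move is shown below, not Tinder's actual code: read the user's document from the source geoshard's index with a real-time GET, write it into the target geoshard's index, then remove the original. Index names and error handling are simplified assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def move_user(user_id: str, source_index: str, target_index: str):
    doc = es.get(index=source_index, id=user_id)                        # latest version, real-time read
    es.index(index=target_index, id=user_id, document=doc["_source"])   # copy into the new geoshard
    es.delete(index=source_index, id=user_id)                           # remove from the old geoshard

# Hypothetical example: a user relocates from a New York cell to a London cell.
move_user("user-123", "recs_geoshard_89c25", "recs_geoshard_48761")
```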

Conclusion

Tinder’s implementation of Geosharding demonstrates the complexities of scalability, performance, and data consistency in a global application. 

By dividing its user base into geographically bound shards, Tinder optimized its recommendation engine to handle billions of daily swipes while maintaining lightning-fast response times. Leveraging tools like the S2 Library and Apache Kafka, along with algorithms for load balancing and consistency, the platform transformed its infrastructure to support a seamless user experience across the globe.

This architecture improved performance by handling 20 times more computations than the previous system. It also addressed challenges like traffic imbalances across time zones and potential data inconsistencies during shard migrations. With randomized shard distribution, dynamic replica adjustments, and intelligent reindexing, Tinder ensured reliability, fault tolerance, and scalability.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com.
