ByteByteGo, February 5
How Google Spanner Powers Trillions of Rows with 5 Nines Availability



Disclaimer: The details in this post have been derived from Google Blogs and Research Papers. All credit for the technical details goes to the Google engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Cloud Spanner is a revolutionary database system developed by Google that uniquely combines the strengths of traditional relational databases with the scalability typically associated with NoSQL systems. 

Designed to handle massive workloads across multiple regions, Cloud Spanner provides a globally distributed, strongly consistent, and highly available platform for data management. Its standout feature is its ability to offer SQL-based queries and relational database structures while achieving horizontal scalability. This makes it suitable for modern, high-demand applications.

Here are some features of Cloud Spanner:

- Global distribution with synchronous replication across zones and regions.
- Strong (external) consistency for both reads and writes.
- Full SQL support with relational schemas and ACID transactions.
- Horizontal scalability through automatic sharding.
- Up to 99.999% availability in multi-region configurations.

Overall, Google Spanner is a powerful solution for enterprises that need a database capable of handling global-scale operations while maintaining the robustness and reliability of traditional relational systems.

In this article, we’ll learn about Google Cloud Spanner's architecture and how it supports the various capabilities that make it a compelling database option.

The Architecture of Cloud Spanner

The architecture of Cloud Spanner is designed to support its role as a globally distributed, highly consistent, and scalable database.

At the highest level, Spanner is organized into what is called a universe, a logical entity that spans multiple physical or logical locations known as zones. 

Each zone operates semi-independently and contains spanservers. These are specialized servers that handle data storage and transactional operations. Spanservers are built on concepts from Bigtable, Google’s earlier distributed storage system, and include enhancements to support complex transactional needs and multi-versioned data.

Some of the key architectural components of Spanner are as follows:

1 - Data Sharding and Tablets

Cloud Spanner manages data by breaking it into smaller chunks called tablets, distributed across multiple spanservers.

Each tablet holds data as key-value pairs, with a timestamp for versioning. This structure allows Spanner to act as a multi-version database where old versions of data can be accessed if needed.
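
The multi-version idea above can be sketched in a few lines. This is a toy model for illustration, not Spanner's actual storage engine:

```python
# A toy sketch of a multi-versioned key-value tablet: every write is kept
# under (key, timestamp), and a read returns the newest version at or
# before the requested timestamp.
class MultiVersionStore:
    def __init__(self):
        self._data = {}  # key -> list of (timestamp, value), sorted by timestamp

    def write(self, key, timestamp, value):
        versions = self._data.setdefault(key, [])
        versions.append((timestamp, value))
        versions.sort()  # keep versions ordered by timestamp

    def read(self, key, timestamp):
        """Return the newest value written at or before `timestamp`."""
        result = None
        for ts, value in self._data.get(key, []):
            if ts <= timestamp:
                result = value
            else:
                break
        return result

store = MultiVersionStore()
store.write("user:7", 100, "Seven")
store.write("user:7", 200, "Seven v2")
print(store.read("user:7", 150))  # -> Seven (the old version is still readable)
print(store.read("user:7", 250))  # -> Seven v2
```

Because old versions are retained, a read at an earlier timestamp still sees the data as it was then, which is exactly what makes lock-free snapshot reads possible later in the article.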

Tablets are stored on Colossus, Google’s distributed file system. Colossus provides fault-tolerant and high-performance storage, enabling Spanner to scale storage independently of compute resources.

2 - Dynamic Partitioning

Data within tables is divided into splits, which are ranges of contiguous keys. These splits can be dynamically adjusted based on workload or size.

When a split grows too large or experiences high traffic, it is automatically divided into smaller splits and redistributed across spanservers. This process, known as dynamic sharding, ensures even load distribution and optimal performance.
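
As a rough illustration, a split that exceeds a threshold can divide itself at its midpoint. The row-count threshold here is made up; real Spanner splits on size and load:

```python
# Illustrative sketch of dynamic splitting with a hypothetical threshold.
SPLIT_THRESHOLD = 4

def maybe_split(row_keys):
    """Return one split, or two halves if the split has grown too large."""
    if len(row_keys) <= SPLIT_THRESHOLD:
        return [row_keys]
    mid = len(row_keys) // 2
    return [row_keys[:mid], row_keys[mid:]]  # two contiguous key ranges

hot_split = [1000, 2000, 3000, 4000, 5000, 6000]
print(maybe_split(hot_split))  # -> [[1000, 2000, 3000], [4000, 5000, 6000]]
```

The two resulting splits can then be placed on different spanservers, which is how the system spreads load without any manual resharding.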

Each split is replicated across zones for redundancy and fault tolerance.

3 - Paxos-Based Replication

Spanner uses the Paxos consensus algorithm to manage replication across multiple zones. Each split has multiple replicas, and Paxos ensures that these replicas remain consistent.

Among these replicas, one is chosen as the leader, responsible for managing all write transactions for that split. The leader coordinates updates to ensure they are applied in a consistent order.

If the leader fails, Paxos elects a new leader, ensuring continued availability without manual intervention. The replicas not serving as leaders can handle read operations, reducing the workload on the leader and improving scalability.
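
The majority-quorum property that makes this work can be stated in one line. This is a deliberate simplification that leaves out proposal numbering, log ordering, and leader election:

```python
# A write commits once a majority of the Paxos group (leader included)
# has acknowledged it; any two majorities overlap, which is what keeps
# replicas consistent across leader changes.
def write_committed(acks, group_size):
    """True if `acks` acknowledgments form a majority of the group."""
    return acks > group_size // 2

# With 5 replicas, a write survives up to 2 replica failures:
print(write_committed(acks=3, group_size=5))  # -> True
print(write_committed(acks=2, group_size=5))  # -> False
```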

4 - Multi-Zone Deployments

Spanner instances span multiple zones within a region, with replicas distributed across these zones. This setup enhances availability because even if one zone fails, other zones can continue serving requests.

For global deployments, data can be replicated across continents, providing low-latency access to users worldwide.

5 - Colossus Distributed File System

All data is stored on Colossus, which is designed for distributed and replicated file storage. Colossus ensures high durability by replicating data across physical machines, making it resilient to hardware failures.

The file system is decoupled from the compute resources, allowing the database to scale independently and perform efficiently.

Paxos Mechanism in Spanner

The Paxos Mechanism is a critical component of Spanner’s architecture.

It operates on the principle of distributed consensus, where a group of replicas (known as a Paxos group) agrees on a single value, such as a transaction's commit or the leader responsible for handling updates.

The leadership assignment works as follows:

- The replicas of a split form a Paxos group, and the group elects one replica as its leader.
- The leader holds a time-based lease (around 10 seconds in Spanner's design) and keeps renewing it while healthy, which avoids running an election for every operation.

The key responsibilities of the Paxos leader are as follows:

- Acquiring locks and coordinating every write transaction for its split.
- Assigning commit timestamps and proposing writes to the rest of the group.
- Serving strong reads that must reflect the latest committed data.

Failures are inevitable in distributed systems, but Paxos ensures that Spanner remains available and consistent despite such issues.

If the current leader fails due to a machine or zone outage, the Paxos group detects the failure and elects a new leader. The new leader is chosen from the remaining replicas in the Paxos group. This process avoids downtime and ensures that the data remains accessible.

Read and Write Transactions in Spanner

Cloud Spanner manages transactions with a robust approach that ensures strong consistency, reliability, and high performance. 

Let’s look at how write and read transactions work in more detail:

1 - Write Transactions

Write transactions in Cloud Spanner are designed to guarantee atomicity (all-or-nothing execution) and consistency (all replicas agree on the data). These transactions are managed by Paxos leaders coordinating the process to ensure data integrity even during failures.

Here are the steps involved in the process:

- The client sends the write to the Paxos leader of the split that owns the affected rows.
- The leader acquires locks on those rows to prevent conflicting writes.
- The leader assigns a commit timestamp using TrueTime and proposes the write to the other replicas through Paxos.
- Once a majority of replicas acknowledge the write, the transaction commits and the leader responds to the client.
- The remaining replicas apply the change shortly afterward.

Spanner handles a write confined to a single split differently from a write that spans multiple splits.

For example, in a single-split write, suppose a user wants to add a row with ID 7 and value "Seven" to a table. The request goes to the Paxos leader of the split that owns row 7; the leader locks the row, proposes the write to its replicas, and commits once a majority of them acknowledge it. No cross-split coordination is needed.

However, for a multi-split write, if a transaction modifies rows in multiple splits (for example, writing to rows 2000, 3000, and 4000), Spanner uses a two-phase commit protocol:

- The Paxos leader of one split acts as the coordinator, and the leaders of the other splits act as participants.
- In the prepare phase, every participant acquires the necessary locks and durably records its intent to commit.
- If all participants vote to commit, the coordinator chooses a commit timestamp and tells every participant to apply the write; if any participant cannot prepare, the transaction is rolled back on all splits.
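
The multi-split flow can be sketched as a minimal two-phase commit. Names like `Participant` and `split-2000` are illustrative only:

```python
# Minimal two-phase commit: the coordinator commits only if every
# participant split successfully prepares.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: acquire locks and durably record intent to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        # Phase 2: apply the write or roll it back.
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]
    decision = all(votes)  # a single "no" vote aborts the transaction
    for p in participants:
        p.finish(decision)
    return decision

splits = [Participant("split-2000"), Participant("split-3000"),
          Participant("split-4000")]
print(two_phase_commit(splits))  # -> True: all three splits committed
```

Note the all-or-nothing behavior: one failed prepare rolls back every split, which is the atomicity guarantee described above.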

2 - Read Transactions

Read transactions in Spanner are optimized for speed and scalability. They provide strong consistency without requiring locks, which allows reads to be processed efficiently even under high workloads.

The different types of reads are as follows:

- Strong reads return the latest committed data. They execute at the current timestamp, so the serving replica may need to wait until it has caught up to that timestamp.
- Stale reads return data as of a timestamp in the past (for example, 15 seconds ago). They can be answered by the nearest replica without coordination, trading a bounded amount of freshness for lower latency.

See the diagram below that shows the strongly consistent read scenario.

Also, the diagram below shows the stale reads scenario.
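
In both scenarios, the routing decision boils down to one comparison between the replica's applied timestamp and the requested read timestamp. A sketch, with made-up timestamp values:

```python
# A replica may serve a read only if it has applied all writes up to the
# read timestamp. Strong reads use the current timestamp; stale reads pick
# a timestamp in the past, so a slightly lagging replica can still answer.
def can_serve(replica_applied_ts, read_ts):
    return replica_applied_ts >= read_ts

now = 1000
lagging_replica_ts = 990  # this replica is 10 time units behind

print(can_serve(lagging_replica_ts, read_ts=now))       # -> False: strong read must wait or go elsewhere
print(can_serve(lagging_replica_ts, read_ts=now - 15))  # -> True: a 15-unit stale read is served locally
```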

3 - Deadlock Prevention

Spanner avoids deadlocks (a situation where two or more transactions wait for each other to release locks) by using the wound-wait algorithm. Here’s how it works:

- Each transaction gets a priority based on its start timestamp: the older the transaction, the higher its priority.
- If an older transaction requests a lock held by a younger one, it "wounds" the younger transaction, forcing it to abort and restart.
- If a younger transaction requests a lock held by an older one, it simply waits.
- Since waits only ever flow from younger to older transactions, a cycle of mutual waiting can never form, so deadlocks are impossible.
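
The wound-wait decision itself is tiny; a sketch:

```python
# Wound-wait: priority is transaction age (an older transaction has a
# smaller start timestamp).
def on_lock_conflict(requester_start_ts, holder_start_ts):
    """Decide what the requesting transaction does when the lock is held."""
    if requester_start_ts < holder_start_ts:
        return "wound"  # older requester aborts the younger holder
    return "wait"       # younger requester waits for the older holder

print(on_lock_conflict(requester_start_ts=10, holder_start_ts=20))  # -> wound
print(on_lock_conflict(requester_start_ts=30, holder_start_ts=20))  # -> wait
```

Because the wounded transaction restarts with its original timestamp, it keeps aging and will eventually win its conflicts, so no transaction starves.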

4 - Reliability and Durability

Spanner’s design ensures that data remains consistent and available even during failures.

All writes are stored in Google’s Colossus distributed file system, which replicates data across multiple physical machines. Even if one machine or zone fails, the data can be recovered from other replicas.

TrueTime ensures that all transactions occur in a globally consistent order, even in a distributed environment. This guarantees that once a transaction is visible to one client, it is visible to all clients.

The TrueTime API

The TrueTime API is one of the key innovations in Cloud Spanner, enabling it to function as a globally distributed, strongly consistent database. 

TrueTime solves one of the most challenging problems in distributed systems: providing a globally synchronized and consistent view of time across all nodes in a system, even those spread across multiple regions and data centers.

TrueTime is based on a combination of atomic clocks and GPS clocks, which work together to provide highly accurate and reliable time synchronization. 

By using both atomic and GPS clocks, TrueTime mitigates the weaknesses of each system. For example:

- GPS receivers can fail because of antenna or receiver faults, radio interference, or GPS system outages.
- Atomic clocks can drift gradually over time.

Because the two technologies fail in different and largely uncorrelated ways, combining them keeps the overall time uncertainty small.

Here’s how it works:

- Each datacenter runs a set of time master machines: most are equipped with GPS receivers, while a smaller number carry atomic clocks.
- Every machine runs a time daemon that polls multiple masters, rejects outliers, and synchronizes its local clock.
- From this process, each machine derives a bound on its own clock uncertainty, referred to as epsilon (ε).

Time Representation and Uncertainty

TrueTime represents time as an interval instead of a single point, explicitly acknowledging the uncertainty inherent in distributed systems. A call to TT.now() returns an interval [earliest, latest] that is guaranteed to contain the true current time, with the interval's width bounded by the clock uncertainty ε.
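
A toy model of the interval idea and the resulting commit-wait rule. The `tt_now` function and the `EPSILON` value below are stand-ins, not the real API:

```python
import time

# EPSILON is an illustrative uncertainty bound; Google has reported
# values in the low milliseconds in practice.
EPSILON = 0.005  # seconds

def tt_now():
    """Return an interval [earliest, latest] containing the true time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_ts):
    """Block until commit_ts is definitely in the past on every clock.

    This is the 'commit wait' rule: a transaction only becomes visible
    once its timestamp can no longer be in any node's future."""
    while tt_now()[0] < commit_ts:
        time.sleep(0.001)

_, commit_ts = tt_now()   # take `latest` as a safe commit timestamp
commit_wait(commit_ts)    # waits roughly 2 * EPSILON
assert tt_now()[0] >= commit_ts
```

Commit wait is the price of external consistency: each transaction stalls for about twice the clock uncertainty, which is why keeping ε small matters so much.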

Key Features Enabled By TrueTime

TrueTime provides some important features that make it so useful:

- External consistency: if transaction T1 commits before transaction T2 starts, T1's commit timestamp is guaranteed to be smaller than T2's.
- Lock-free snapshot reads: consistent reads of past data can be served without blocking writes.
- Globally meaningful timestamps: commit timestamps can be compared across servers, zones, and regions.
- Bounded-staleness reads: clients can trade freshness for latency with well-defined guarantees.

Conclusion

Google Spanner stands as a great achievement in database engineering, seamlessly blending the reliability and structure of traditional relational databases with the scalability and global availability often associated with NoSQL systems. 

Its innovative architecture, supported by the Paxos consensus mechanism and the TrueTime API, provides a solid foundation for handling distributed transactions, ensuring external consistency, and maintaining high performance at a global scale.

Ultimately, Google Spanner redefines what is possible in distributed database systems, setting a standard for scalability, reliability, and innovation.

References:


