How Meta Built Threads to Support 100 Million Signups in 5 Days

 

Meta’s social platform Threads went from idea to launch in just five months and gained more than 100 million signups within a week of release. This remarkable growth was made possible by Meta’s robust infrastructure, in particular two key platforms: ZippyDB and Async. ZippyDB is a distributed key-value store responsible for managing state and search, while Async is a serverless compute engine that processes billions of background tasks. Together they supported Threads’ rapid growth and showcase Meta’s deep expertise in building large-scale systems.

🚀 ZippyDB, Meta’s internal distributed key-value store, gave Threads its critical data storage and retrieval capabilities. It is built on RocksDB, replicates data through Data Shuttle, and relies on Shard Manager for shard placement and failover, ensuring high availability and geographic redundancy.

🌍 ZippyDB uses a multi-tenant design that supports many use cases, including metadata services and product feature state, improving hardware utilization while isolating workloads from one another. Data is divided into shards, with strongly consistent writes and region-local read optimization to suit different workload needs.

🔄 ZippyDB achieves finer-grained data management through μshards (micro-shards) and supports two mapping strategies: compact mapping and Akkio mapping. Akkio mapping can dynamically adjust data placement to optimize latency and load, which is especially useful for global products.

🛡️ ZippyDB offers multiple consistency levels, including strong consistency, bounded staleness (eventual consistency), read-your-writes consistency, and a fast-acknowledgment write mode, to meet different applications’ consistency and performance needs.

📝 ZippyDB supports transactions, using optimistic concurrency control to guarantee atomic read-modify-write operations across multiple keys. All transactions are serializable by default; conflict checks and retries avoid lock contention while keeping data consistent.

Generate your MCP server with Speakeasy (Sponsored)

Like it or not, your API has a new user: AI agents. Make accessing your API services easy for them with an MCP (Model Context Protocol) server. Speakeasy uses your OpenAPI spec to generate an MCP server with tools for all your API operations to make building agentic workflows easy.

Once you've generated your server, use the Speakeasy platform to develop evals, prompts and custom toolsets to take your AI developer platform to the next level.

Start Generating


Disclaimer: The details in this post have been derived from the articles written by the Meta engineering team. All credit for the technical details goes to the Meta/Threads Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Threads, Meta’s newest social platform, launched on July 5, 2023, as a real-time, public conversation space. 

Built in under five months by a small engineering team, the product gained immediate momentum, and infrastructure teams had to respond at once to the incredible demand.

When a new app hits 100 million signups in under a week, the instinct is to assume someone built a miracle backend overnight. That’s not what happened with Threads. There was no time to build new systems or bespoke scaling plans. The only option was to trust the machinery already in place.

And that machinery worked smoothly. As millions signed up in five days, the backend systems held up, and everything worked as intended from the user’s perspective.

Threads didn’t scale because it was lucky. It scaled because it inherited Meta’s hardened infrastructure: platforms shaped by a decade of lessons from Facebook, Instagram, and WhatsApp. 

This article explores two of those platforms that played a key role in the successful launch of Threads:

- ZippyDB, Meta’s distributed key-value store
- Async (XFaaS), Meta’s serverless compute platform

Neither of these systems was built for Threads. But Threads wouldn’t have worked without them.

ZippyDB was already managing billions of reads and writes daily across distributed regions. Also, Async had been processing trillions of background jobs across more than 100,000 servers, quietly powering everything from feed generation to follow suggestions.

ZippyDB: Key-Value at Hyperscale

ZippyDB is Meta’s internal, distributed key-value store designed to offer strong consistency, high availability, and geographical resilience at massive scale. 

At its core, it builds on RocksDB for storage, extends replication with Meta’s Data Shuttle (a Multi-Paxos-based protocol), and manages placement and failover through a system called Shard Manager.

Unlike purpose-built datastores tailored to single products, ZippyDB is a multi-tenant platform. Dozens of use cases (from metadata services to product feature state) share the same infrastructure. This design ensures higher hardware utilization, centralized observability, and predictable isolation across workloads.

The Architecture of ZippyDB

ZippyDB doesn’t treat deployment as a monolith. It’s split into deployment tiers: logical groups of compute and storage resources distributed across geographic regions. 

Each tier serves one or more use cases and provides fault isolation, capacity management, and replication boundaries. The most commonly used is the wildcard tier, which acts as a multi-tenant default, balancing hardware utilization with operational simplicity. Dedicated tiers exist for use cases with strict isolation or latency constraints.

Within each tier, data is broken into shards, the fundamental unit of distribution and replication. Each shard is independently managed and:

- replicated synchronously across regions via Data Shuttle’s Multi-Paxos protocol, so a write commits only once a quorum acknowledges it
- supplemented by follower replicas that receive updates asynchronously and serve low-latency, region-local reads

This hybrid replication model (strong quorum-based writes paired with regional read optimization) gives ZippyDB flexibility across a spectrum of workloads.

See the diagram below that shows the concept of region-based replication supported by ZippyDB.

To push scalability even further, ZippyDB introduces a layer of logical partitioning beneath shards: μshards (micro-shards). These are small, related key ranges that provide finer-grained control over data locality and mobility.

Applications don’t deal directly with physical shards. Instead, they write to μshards, which ZippyDB dynamically maps to underlying storage based on access patterns and load balancing requirements.
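
To make the μshard indirection concrete, here is a minimal sketch in Python. All names (MicroShardRouter, the hashing scheme, the mapping table) are illustrative assumptions; ZippyDB’s actual client library is internal to Meta.

```python
import hashlib

# Hypothetical sketch of μshard routing. Applications address μshards (small,
# related key ranges); a mapping layer decides which physical shard currently
# hosts each μshard.

class MicroShardRouter:
    def __init__(self, num_microshards: int, microshard_to_shard: dict):
        self.num_microshards = num_microshards
        # Mapping maintained by the placement layer.
        self.microshard_to_shard = microshard_to_shard

    def microshard_for_key(self, key: str) -> int:
        # Deterministically bucket a key into a μshard.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % self.num_microshards

    def shard_for_key(self, key: str) -> int:
        # Application code never picks a physical shard; the router does.
        return self.microshard_to_shard[self.microshard_for_key(key)]

# Example: 1,024 μshards spread over 8 physical shards.
mapping = {m: m % 8 for m in range(1024)}
router = MicroShardRouter(1024, mapping)
print(router.shard_for_key("user:42:profile"))

# When the placement layer migrates a μshard closer to its readers, only the
# mapping changes; keys and application code stay exactly the same.
mapping[router.microshard_for_key("user:42:profile")] = 3
print(router.shard_for_key("user:42:profile"))   # now routed to shard 3
```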

ZippyDB supports two primary strategies for managing μshard-to-shard mapping:

- Compact mapping: a largely static assignment, used when placement rarely changes, for example during planned shard splits or load rebalancing
- Akkio mapping: a dynamic assignment managed by Meta’s Akkio placement service, which migrates μshards toward the regions that access them most, optimizing latency and load for globally distributed products

In ZippyDB, the Shard Manager acts as the external controller for leadership and failover. It doesn’t participate in the data path but plays a critical role in keeping the system coordinated.

The Shard Manager assigns a Primary replica to each shard and defines an epoch: a versioned leadership lease. The epoch ensures only one node has write authority at any given time. When the Primary changes (for example, due to failure), Shard Manager increments the epoch and assigns a new leader. The Primary sends regular heartbeats to the Shard Manager. If the heartbeats stop, the Shard Manager considers the Primary unhealthy and triggers a leader election by promoting a new node and bumping the epoch.
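
The sketch below illustrates the epoch idea in Python. Class names, timeouts, and the heartbeat mechanics are hypothetical simplifications of what Shard Manager does; the essential point is that every write carries the epoch it was issued under, so replicas can fence off a deposed primary.

```python
import time

# Hypothetical sketch of epoch-based leadership for a single shard. The
# manager assigns a primary and bumps the epoch on every failover; replicas
# reject writes stamped with a stale epoch (fencing).

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.highest_epoch_seen = 0

    def apply_write(self, epoch: int, key: str, value: str) -> str:
        if epoch < self.highest_epoch_seen:
            raise RuntimeError(f"stale epoch {epoch}; write rejected")
        self.highest_epoch_seen = epoch
        return f"{self.name} applied {key}={value} @epoch {epoch}"

class ShardManagerSketch:
    def __init__(self, replica_names):
        self.replica_names = replica_names
        self.epoch = 0
        self.primary = None
        self.last_heartbeat = time.monotonic()

    def assign_primary(self, name: str) -> int:
        self.epoch += 1                      # new, versioned leadership lease
        self.primary = name
        self.last_heartbeat = time.monotonic()
        return self.epoch

    def heartbeat(self, name: str) -> None:
        if name == self.primary:
            self.last_heartbeat = time.monotonic()

    def maybe_failover(self, timeout_s: float = 5.0) -> int:
        # If heartbeats stop, promote another replica under a higher epoch.
        if time.monotonic() - self.last_heartbeat > timeout_s:
            candidates = [n for n in self.replica_names if n != self.primary]
            return self.assign_primary(candidates[0])
        return self.epoch

manager = ShardManagerSketch(["replica-a", "replica-b", "replica-c"])
epoch = manager.assign_primary("replica-a")
print(Replica("replica-b").apply_write(epoch, "post:1", "hello threads"))
```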

See the diagram below that shows the role of the Shard Manager:

Consistency and Durability in ZippyDB

In distributed systems, consistency is rarely black-and-white. ZippyDB embraces this by giving clients per-request control over consistency and durability levels, allowing teams to tune system behavior based on workload characteristics.

1 - Strong Consistency

Strong consistency in ZippyDB ensures that reads always reflect the latest acknowledged writes, regardless of where the read or write originated. To achieve this, ZippyDB routes these reads to the primary replica, which holds the current Paxos lease for the shard. The lease ensures that only one primary exists at any time, and only it can serve linearizable reads.

If the lease state is unclear (for example, during a leadership change), the read may fall back to a quorum check to avoid split-brain scenarios. This adds some latency, but maintains correctness.
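
As a rough illustration, here is a hedged Python sketch of that routing decision. The PrimaryReplica class and the quorum logic are assumptions made for the example, not ZippyDB’s actual read path.

```python
# Hypothetical sketch: a strongly consistent read goes to the lease-holding
# primary; if the lease is in doubt (e.g. during a failover), fall back to a
# quorum read so a deposed primary can never serve a stale value.

class PrimaryReplica:
    def __init__(self, lease_valid: bool, data: dict):
        self.lease_valid = lease_valid
        self.data = data

def strong_read(primary, quorum_replicas, key):
    if primary is not None and primary.lease_valid:
        return primary.data.get(key)          # single-hop linearizable read
    # Slow path during leadership changes: agree with a majority of replicas.
    votes = [replica.get(key) for replica in quorum_replicas]
    return max(set(votes), key=votes.count)   # majority value wins

replicas = [{"k": "v2"}, {"k": "v2"}, {"k": "v1"}]
print(strong_read(PrimaryReplica(True, {"k": "v2"}), replicas, "k"))  # fast path
print(strong_read(None, replicas, "k"))                               # quorum fallback
```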

2 - Bounded Staleness (Eventual Consistency)

Eventual consistency in ZippyDB isn’t the loose promise it implies in other systems. Here, it means bounded staleness: a read may be slightly behind the latest write, but it will never serve stale data beyond a defined threshold.

Follower replicas (often located closer to users) serve these reads. ZippyDB uses heartbeats to monitor follower lag, and only serves reads from replicas that are within an acceptable lag window. This enables fast, region-local reads without compromising on the order of operations.
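
A minimal sketch of this routing policy follows, assuming a hypothetical lag threshold and replica names; the real lag window and heartbeat plumbing are internal to ZippyDB.

```python
import random

# Hypothetical sketch of bounded-staleness routing: prefer a nearby follower,
# but only if its replication lag (tracked via heartbeats in the real system)
# is inside the allowed window; otherwise fall back to the primary.

class Follower:
    def __init__(self, name: str, lag_ms: int):
        self.name = name
        self.lag_ms = lag_ms

def pick_read_replica(followers, primary_name: str, max_lag_ms: int = 200) -> str:
    fresh_enough = [f for f in followers if f.lag_ms <= max_lag_ms]
    if fresh_enough:
        # Region-local read: fast, and never more than max_lag_ms behind.
        return random.choice(fresh_enough).name
    return primary_name       # every follower is too stale; pay the latency

followers = [Follower("eu-follower", 40), Follower("apac-follower", 900)]
print(pick_read_replica(followers, "us-primary"))   # -> "eu-follower"
```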

3 - Read-Your-Writes Consistency

For clients that need causal guarantees, ZippyDB supports a read-your-writes model. 

After a write, the server returns a version number (based on Paxos sequence ordering). The client caches this version and attaches it to subsequent reads. ZippyDB then ensures that reads reflect data at or after that version.

This model works well for session-bound workloads, like profile updates followed by an immediate refresh.
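
Here is a small Python sketch of the session pattern. The VersionedStore and SessionClient classes are invented for illustration; they only mimic the idea of Paxos-ordered version numbers being cached by the client.

```python
# Hypothetical sketch of read-your-writes sessions: the client caches the
# version returned by its last write and demands at-least-that-fresh reads.

class VersionedStore:
    """Stand-in for a replica; versions mimic Paxos sequence ordering."""
    def __init__(self):
        self.version = 0
        self.data = {}

    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version                 # version handed back to the client

    def read(self, key, min_version=0):
        if self.version < min_version:
            raise RuntimeError("replica too stale for this session; retry")
        return self.data.get(key)

class SessionClient:
    def __init__(self, store):
        self.store = store
        self.last_seen_version = 0          # cached per client session

    def write(self, key, value):
        self.last_seen_version = self.store.write(key, value)

    def read(self, key):
        # Attach the cached version so the read reflects our own writes.
        return self.store.read(key, min_version=self.last_seen_version)

session = SessionClient(VersionedStore())
session.write("profile:bio", "hello threads")
print(session.read("profile:bio"))          # guaranteed to see the update
```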

4 - Fast-Ack Write Mode

In scenarios where write latency matters more than durability, ZippyDB offers a fast-acknowledgment write mode. Writes are acknowledged as soon as they are enqueued on the primary for replication, not after they’re fully persisted in the quorum.

This boosts throughput and responsiveness, but comes with trade-offs:

- Durability is weaker: a write acknowledged before reaching a quorum can be lost if the primary fails at the wrong moment
- Consistency is weaker: a read issued immediately after the acknowledgment may not yet see the write

This mode fits well in systems that can tolerate occasional loss or use an idempotent retry.
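
The sketch below uses a plain in-process queue as a stand-in for the replication pipeline to show why the mode is fast and where the risk lives. Everything here is illustrative; ZippyDB’s actual write path is far more involved.

```python
import queue
import threading

# Hypothetical sketch of fast-ack writes: acknowledge as soon as the write is
# enqueued for replication, and rely on idempotent retries to tolerate loss.

replication_queue = queue.Queue()
store = {}

def fast_ack_write(key, value):
    replication_queue.put((key, value))     # not yet durable in the quorum
    return "ACK"                            # caller unblocks immediately

def replicator():
    # Background drain; if the primary dies before this runs, the write is
    # lost, which is why callers must tolerate loss or retry idempotently.
    while True:
        key, value = replication_queue.get()
        store[key] = value                  # idempotent: same key, same value
        replication_queue.task_done()

threading.Thread(target=replicator, daemon=True).start()
print(fast_ack_write("counter_hint:u42", 17))
replication_queue.join()                    # demo only: wait for the drain
print(store)
```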

Transactions and Conditional Writes

ZippyDB supports transactional semantics for applications that need atomic read-modify-write operations across multiple keys. Unlike systems that offer tunable isolation levels, ZippyDB keeps things simple and safe: all transactions are serializable by default.

Transactions

ZippyDB implements transactions using optimistic concurrency control:

- The client reads the keys it needs from a consistent snapshot, noting the version of each key it observed
- Writes are buffered on the client rather than taking locks
- At commit time, the primary checks whether any key in the read set has changed since the snapshot, and applies the buffered writes atomically only if none has

If a conflict is detected, the transaction is rejected, and the client retries. This avoids lock contention but works best when write conflicts are rare.

ZippyDB maintains recent write history to validate transaction eligibility. To keep overhead low, it prunes older states periodically. Transactions spanning epochs (i.e., across primary failovers) are automatically rejected, which simplifies correctness guarantees at the cost of some availability during leader changes.
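
A compact Python sketch of the optimistic pattern follows. The OCCStore class, version bookkeeping, and retry count are assumptions made for the example; they are not ZippyDB’s internal implementation.

```python
# Hypothetical sketch of optimistic concurrency control: read from a snapshot,
# buffer writes locally, and commit only if nothing read has changed since.

class OCCStore:
    def __init__(self):
        self.data = {}                       # key -> (value, version)

    def snapshot_read(self, key):
        return self.data.get(key, (None, 0))

    def commit(self, read_set, write_set):
        # Validate: every key we read must still be at the version we saw.
        for key, seen_version in read_set.items():
            _, current_version = self.data.get(key, (None, 0))
            if current_version != seen_version:
                return False                 # conflict: caller retries
        for key, value in write_set.items():
            _, version = self.data.get(key, (None, 0))
            self.data[key] = (value, version + 1)
        return True

def move_one_follower(store, src_counter, dst_counter, retries=3):
    # Atomic read-modify-write across two keys, retried on conflict.
    for _ in range(retries):
        src_val, src_ver = store.snapshot_read(src_counter)
        dst_val, dst_ver = store.snapshot_read(dst_counter)
        read_set = {src_counter: src_ver, dst_counter: dst_ver}
        write_set = {src_counter: (src_val or 0) - 1,
                     dst_counter: (dst_val or 0) + 1}
        if store.commit(read_set, write_set):
            return True
    return False                             # surface the failure after retries

print(move_one_follower(OCCStore(), "followers:a", "followers:b"))
```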

Conditional Writes

For simpler use cases, ZippyDB exposes a conditional write API that maps internally to a server-side transaction. This API allows operations like:

- writing a key only if it does not already exist
- updating a key only if it currently exists
- updating a key only if its current value matches an expected value

These operations avoid the need for client-side reads and round-trips. Internally, ZippyDB evaluates the precondition, checks for conflicts, and commits the write as a transaction if it passes.

This approach simplifies client code and improves performance in cases where logic depends on the current key's presence or state.
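
For illustration, here is a toy Python version of what such an API might look like from the client’s point of view. The method names (put_if_absent, put_if_equals) are hypothetical, not ZippyDB’s actual API.

```python
# Hypothetical sketch of a conditional-write API: the server evaluates the
# precondition and applies the write atomically, so the client needs no
# separate read round trip.

class ConditionalStore:
    def __init__(self):
        self.data = {}

    def put_if_absent(self, key, value) -> bool:
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def put_if_equals(self, key, expected, value) -> bool:
        if self.data.get(key) != expected:
            return False
        self.data[key] = value
        return True

store = ConditionalStore()
# Claim a handle only if nobody holds it yet.
print(store.put_if_absent("handle:zuck", "user:1"))            # True
print(store.put_if_absent("handle:zuck", "user:2"))            # False, taken
# Flip a flag only if it is still in the state we expect.
store.data["flag:threads_beta"] = "off"
print(store.put_if_equals("flag:threads_beta", "off", "on"))   # True
```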

Why Was ZippyDB Critical to Threads?

Threads didn’t have months to build a custom data infrastructure. It needed to read and write at scale from day one. ZippyDB handled several core responsibilities, such as:

- storing and serving state for core product features
- backing search and metadata lookups

What made ZippyDB valuable wasn’t just performance but adaptability. As a multi-tenant system, it supported fast onboarding of new services. Teams didn’t have to provision custom shards or replicate schema setups. They configured what they needed and got the benefit of global distribution, consistency guarantees, and monitoring from day one.

At launch, Threads was expected to grow. But few predicted the velocity: 100 million users in under a week. That kind of growth doesn’t allow for manual shard planning or last-minute migrations.

ZippyDB’s resharding protocol turned a potential bottleneck into a non-event. Its clients map data into logical shards, which are dynamically routed to physical machines. When load increases, the system can:

- split hot logical shards into smaller ones
- move shards onto newly added machines
- rebalance placement across the fleet

No changes are required in the application code. The system handles remapping and movement transparently. Automation tools orchestrate these transitions, enabling horizontal scale-out at the moment it's needed, not hours or days later.

This approach allowed Threads to start small, conserve resources during early development, and scale adaptively as usage exploded, without risking outages or degraded performance.

During Threads’ launch window, the platform absorbed thousands of machines in a matter of hours. Multi-tenancy played a key role here. Slack capacity from lower-usage keyspaces was reallocated, and isolation boundaries ensured that Threads could scale up without starving other workloads.

Async: Serverless at Meta Scale

Async is Meta’s internal serverless compute platform, formally known as XFaaS (eXtensible Function-as-a-Service). 

At peak, Async handles trillions of function calls per day across more than 100,000 servers. It supports multiple languages such as HackLang, Python, Haskell, and Erlang.

What sets Async apart is that it abstracts away everything between writing a function and running it at global scale. There is no need for service provisioning: drop code into Async, and it inherits Meta-grade scaling, execution guarantees, and disaster resilience.

Async’s Role in Threads

When Threads launched, one of the most critical features wasn’t visible in the UI. It was the ability to replicate a user’s Instagram follow graph with a single tap. Behind that one action sat millions of function calls: each new Threads user potentially following hundreds or thousands of accounts in bulk.

Doing that synchronously would have been a non-starter. Blocking the UI on graph replication would have led to timeouts, poor responsiveness, and frustrated users. Instead, Threads offloaded that work to Async.

Async queued those jobs, spread them across the fleet, and executed them in a controlled manner. That same pattern repeated every time a celebrity joined Threads—millions of users received follow recommendations and notifications, all piped through Async without spiking database load or flooding services downstream.
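
The pattern itself is simple to sketch. The Python below uses a thread pool as a stand-in for Async’s fleet; the function names and the signup handler are hypothetical, and the real platform adds queueing, rate control, and retries on top.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the offloading pattern: the signup request only
# enqueues a job; copying the Instagram follow graph runs in the background.

def create_follow_edge(user_id: str, target_id: str) -> None:
    print(f"{user_id} now follows {target_id}")

def copy_follow_graph(user_id: str, followed_ids: list) -> None:
    # Executed later, in a controlled manner, possibly in small batches.
    for target in followed_ids:
        create_follow_edge(user_id, target)

background = ThreadPoolExecutor(max_workers=4)   # stand-in for Async's fleet

def handle_signup(user_id: str, instagram_follows: list) -> dict:
    # The request path returns immediately; the UI never blocks on the copy.
    background.submit(copy_follow_graph, user_id, instagram_follows)
    return {"status": "ok", "user": user_id}

print(handle_signup("threads:42", ["acct:1", "acct:2", "acct:3"]))
background.shutdown(wait=True)                   # demo only: wait for completion
```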

How Async Handled the Surge

Async didn’t need to be tuned for Threads. It scaled the way it always does.

Several features were key to the success of Threads:

- Queueing and deferred execution, which let huge bursts of jobs be absorbed and drained over time instead of being executed all at once
- Automatic flow control, which matched execution rates to what downstream databases and services could safely handle
- Capacity-aware scheduling, which spread work across Async’s global fleet rather than concentrating it on any single region

This wasn’t reactive tuning. It was a proactive adaptation. Async observed system load and adjusted flow rates automatically. Threads engineers didn’t need to page anyone or reconfigure services. Async matched its execution rate to what the ecosystem could handle.
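
As a rough illustration of load-aware dispatching, here is a small AIMD-style rate adjuster in Python. The utilization thresholds and rates are made-up numbers; Async’s real control loops are internal to Meta.

```python
# Hypothetical sketch of load-aware flow control: each tick, the dispatcher
# reads downstream utilization and adjusts how many queued jobs it releases.

def next_dispatch_rate(current_rate: int, downstream_utilization: float,
                       min_rate: int = 10, max_rate: int = 10_000) -> int:
    if downstream_utilization > 0.85:
        # Back off multiplicatively when databases/services run hot.
        return max(min_rate, current_rate // 2)
    if downstream_utilization < 0.60:
        # Ramp up additively while there is spare capacity.
        return min(max_rate, current_rate + 100)
    return current_rate

rate = 1_000
for utilization in (0.55, 0.58, 0.92, 0.70, 0.50):   # simulated load readings
    rate = next_dispatch_rate(rate, utilization)
    print(f"utilization={utilization:.2f} -> dispatch {rate} jobs/sec")
```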

Developer Experience

One of the most powerful aspects of Async is that the Threads engineers didn’t need to think about scale. Once business logic was written and onboarded into Async, the platform handled the rest:

- scheduling, queueing, and load balancing across the fleet
- retries and execution guarantees when individual runs failed
- capacity management and disaster resilience across regions

These guarantees allowed engineers to focus on product behavior, not operational limits. Async delivered a layer of predictable elasticity, absorbing traffic spikes that would have crippled a less mature system.

Conclusion

The Threads launch was a stress test for Meta’s infrastructure, and the results spoke for themselves: a hundred million users joined in less than a week with no major outages.

That kind of scale doesn’t happen by chance. 

ZippyDB and Async weren’t built with Threads in mind. But Threads only succeeded because those systems were already in place, hardened by a decade of serving billions of users across Meta’s core apps. They delivered consistency, durability, elasticity, and observability without demanding custom effort from the product team. 

This is the direction high-velocity engineering is heading: modular infrastructure pieces that are composable and deeply battle-tested. Not all systems need to be cutting-edge. However, the ones handling critical paths, such as state, compute, and messaging, must be predictable, reliable, and invisible.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com.
