ByteByteGo, December 18, 2024
How Statsig Streams 1 Trillion Events A Day

Statsig's event volume has grown 20X over the past year, driven by the trust of high-profile customers such as OpenAI and Atlassian. To cope with this flood of data, Statsig built an efficient streaming architecture around three core components: the request recorder, log processing, and routing. The data ingestion layer receives, validates, and stores data, using load balancing and rebatching to ensure nothing is lost; the message queue layer combines Pub/Sub with GCS to move data efficiently at low cost; the business logic layer handles processing, deduplication, and customization; and the routing layer distributes data to different destinations, such as data warehouses or real-time streams, to satisfy different needs. The architecture delivers high reliability while remaining cost-effective.

📥 Data ingestion layer: receives, validates, and stores data, using load balancing, authentication, and rebatching to ensure no data is lost. Data first passes through a load balancer, is authenticated, added to an internal queue, then rebatched and written to Google Cloud Storage (GCS) to exploit GCS's cost advantages.

✉️ Message queue layer: combines Pub/Sub and GCS for efficient data transfer and storage. Pub/Sub holds metadata pointing to the actual data in GCS, while GCS stores the compressed batched data, cutting cost and operational overhead. Zstandard (ZSTD) compression further improves storage efficiency.

⚙️ Business logic layer: processes and customizes data through rebatching, deduplication, and business logic plugins. Caching with Memcached enables efficient deduplication, and different teams can add custom plugins to cover diverse business scenarios. Processed data is then written to its target storage.

🚀 Routing and integration layer: routes data to different destinations, including data warehouses and real-time event streams, based on its requirements. Latency-insensitive data is processed by bulk load jobs, while time-sensitive data is delivered through the storage write API, with support for webhooks to third-party systems.

The newsletter is a collaboration between ByteByteGo and members of the Statsig engineering team (Pablo Beltran and Brent Echols).

Statsig is a modern feature management, experimentation, and analytics platform that enables teams to improve their product velocity.

Over the past year, Statsig has seen a staggering 20X growth in its event volume. This growth has been driven by high-profile customers such as OpenAI, Atlassian, Flipkart, and Figma, who rely on Statsig’s platform for experimentation and analytics.

Statsig currently processes over a trillion events daily, a remarkable achievement for any organization, let alone a startup of Statsig's size.

However, this rapid growth comes with immense challenges.

Statsig must not only scale its infrastructure to handle the sheer volume of data but also ensure that its systems remain reliable and maintain high uptime. It also needs to keep the costs in check to stay competitive.

In this post, we’ll look at Statsig’s streaming architecture, which helps it handle the event volume. We’ll also look at the cost-efficiency steps taken by Statsig’s team.

Statsig’s Streaming Architecture

On a high level, Statsig's streaming architecture consists of three main components: the request recorder, log processing, and routing.

See the diagram below for a quick overview of the architecture:

The Architectural Components

While the high-level view gives a broad look at Statsig’s pipeline architecture, let us also look at each component in more detail to gain a better understanding.

1 - Data Ingestion Layer

The data ingestion layer in Statsig's pipeline is the first and one of the most crucial stages of the system. 

It is responsible for receiving, authenticating, organizing, and securely storing data in a way that prevents loss, even under challenging conditions. 

The request recorder is a key piece of the data ingestion layer and the first step in handling incoming data. Beyond it, the layer also covers the load balancer, authentication, rebatching, and persistence.

See the diagram below:

Here’s a simple breakdown of the various steps:

- Incoming requests first pass through a load balancer.
- Each request is authenticated.
- Validated events are added to an internal queue.
- Queued events are rebatched into larger chunks.
- The rebatched data is written to Google Cloud Storage (GCS), taking advantage of GCS's cost efficiency.
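
To make this flow concrete, here is a minimal sketch of a request recorder along these lines. Everything in it, including the class name, the batch threshold, the bucket name, and the API-key check, is an illustrative assumption rather than Statsig's actual implementation:

```python
import json
import queue
import threading
import uuid

from google.cloud import storage  # pip install google-cloud-storage

MAX_BATCH_EVENTS = 10_000              # assumed rebatching threshold
BUCKET_NAME = "example-events-bucket"  # hypothetical bucket name


class RequestRecorder:
    """Accepts authenticated event payloads, queues them in memory,
    and rebatches them into large GCS objects (a sketch, not Statsig's code)."""

    def __init__(self) -> None:
        self._queue: queue.Queue = queue.Queue()
        self._gcs = storage.Client()
        threading.Thread(target=self._drain_loop, daemon=True).start()

    def record(self, api_key: str, events: list[dict]) -> None:
        if not self._is_authenticated(api_key):
            raise PermissionError("invalid API key")
        self._queue.put(events)  # answer the caller fast; persistence is async

    def _is_authenticated(self, api_key: str) -> bool:
        return api_key.startswith("secret-")  # placeholder check only

    def _drain_loop(self) -> None:
        batch: list[dict] = []
        while True:
            batch.extend(self._queue.get())
            if len(batch) >= MAX_BATCH_EVENTS:
                self._flush(batch)
                batch = []

    def _flush(self, batch: list[dict]) -> None:
        # One large object per batch means fewer, cheaper writes.
        blob = self._gcs.bucket(BUCKET_NAME).blob(f"batches/{uuid.uuid4()}.json")
        blob.upload_from_string(json.dumps(batch))
```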

2 - Message Queue Layer

The Message Queue Layer is a critical stage in Statsig’s pipeline that manages how data flows between different components. This layer is designed to handle enormous volumes of data efficiently while keeping operational costs low.

See the diagram below:

As you can see, it consists of two main components:

The Pub/Sub Topic

Pub/Sub is a serverless messaging system that facilitates communication between different parts of the pipeline. Since it is serverless, there’s no need to worry about maintaining servers or managing complex deployments. This reduces overhead for the engineering team.

Pub/Sub receives metadata about the data stored in Google Cloud Storage (GCS). Instead of directly storing all the event data, it acts as a pointer system, referring downstream systems to the actual data stored in GCS.

GCS Bucket

Directly using Pub/Sub for all data storage would be prohibitively expensive. Therefore, Statsig offloads most of the data to GCS to reduce storage and operational costs.

The pipeline writes bulk data into GCS in compressed batches. Pub/Sub stores only the metadata (like file pointers) needed to locate this data in GCS. Downstream components can then use these pointers to retrieve the data when required.

The GCS Bucket stores the actual data in a compressed format, using Zstandard (ZSTD) compression for efficiency. For reference, Zstandard compression is highly efficient, providing better compression rates (around 95%) than other methods like zlib, with lower CPU usage. This ensures data is stored in a smaller footprint while maintaining high processing speeds.
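
As a rough illustration of this pointer pattern, the sketch below compresses a batch with zstd, uploads it to GCS, and publishes only the object's coordinates through Pub/Sub. The bucket and topic names are made up, and the real pipeline's message format is certainly richer:

```python
import uuid

import zstandard as zstd  # pip install zstandard
from google.cloud import pubsub_v1, storage  # pip install google-cloud-pubsub google-cloud-storage

BUCKET = "example-events-bucket"                        # hypothetical bucket
TOPIC = "projects/example-project/topics/event-batches" # hypothetical topic


def publish_batch(raw_batch: bytes) -> None:
    """Compress a batch, store it in GCS, and publish only a pointer."""
    object_name = f"batches/{uuid.uuid4()}.zst"

    # 1. Store the heavy payload in cheap object storage, zstd-compressed.
    compressed = zstd.ZstdCompressor().compress(raw_batch)
    storage.Client().bucket(BUCKET).blob(object_name).upload_from_string(compressed)

    # 2. Publish only lightweight metadata (file pointer) through Pub/Sub.
    publisher = pubsub_v1.PublisherClient()
    publisher.publish(TOPIC, data=b"", bucket=BUCKET, object=object_name).result()


def consume(message) -> bytes:
    """A downstream consumer follows the pointer back to GCS."""
    blob = storage.Client().bucket(message.attributes["bucket"]).blob(
        message.attributes["object"])
    return zstd.ZstdDecompressor().decompress(blob.download_as_bytes())
```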

3 - Business Logic Layer

The Business Logic Layer is where the heavy lifting happens in Statsig's pipeline. 

This layer is designed to process data while ensuring accuracy and preparing it for final use by various downstream systems. It handles complex logic, customization, and data formatting.

See the diagram below that shows the various steps that happen in this layer:

Let’s look at each step in more detail:

Rebatching

This step combines smaller batches of incoming data into larger ones for processing. By handling larger batches, the system reduces the overhead of dealing with multiple small data chunks.

The system is designed with an “at least once” guarantee. This means that even if something goes wrong during processing, the data is not lost. It will be retried until successfully processed.
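
A minimal sketch of that guarantee, assuming a broker-style message object with `.data` and `.ack()`: the durable write is retried until it succeeds, and messages are acknowledged only afterward, so a crash before the ack simply triggers redelivery.

```python
import time


def rebatch_at_least_once(messages, write_batch, target_size=1000):
    """Combine small messages into large batches and persist them with
    at-least-once semantics. All names here are illustrative."""
    batch = []
    for msg in messages:
        batch.append(msg)
        if len(batch) >= target_size:  # rebatch small inputs into big ones
            _flush(batch, write_batch)
            batch = []
    if batch:
        _flush(batch, write_batch)


def _flush(batch, write_batch):
    while True:  # retry until the durable write succeeds
        try:
            write_batch([m.data for m in batch])
            break
        except IOError:
            time.sleep(1)  # back off, then retry; nothing is dropped
    for m in batch:
        m.ack()  # acknowledge only after the write has succeeded
```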

Stateful Processing To Remove Redundancy

This step focuses on deduplication, which involves filtering out repeated or redundant data. For instance, if the same event gets recorded multiple times, this step ensures only one instance is kept.

To achieve this, the system uses caching solutions like Memcached. Memcached provides quick access to previously processed data, enabling the system to identify duplicates efficiently.

Ultimately, deduplication reduces unnecessary processing.
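
Here is a sketch of how Memcached can serve as the dedup store, using its atomic `add` operation (which fails if the key already exists) as the "have we seen this event?" test. The endpoint, key scheme, and 24-hour window are assumptions:

```python
from pymemcache.client.base import Client  # pip install pymemcache

cache = Client(("localhost", 11211))  # illustrative Memcached endpoint
DEDUP_TTL_SECONDS = 24 * 3600         # assumed deduplication window


def is_duplicate(event_id: str) -> bool:
    # `add` is atomic: it stores the key only if it does not exist yet,
    # so a failed add means this event was already processed.
    stored = cache.add(f"dedup:{event_id}", b"1",
                       expire=DEDUP_TTL_SECONDS, noreply=False)
    return not stored


batch = [{"id": "evt-1"}, {"id": "evt-2"}, {"id": "evt-1"}]  # last is a dupe
unique_events = [e for e in batch if not is_duplicate(e["id"])]
```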

Business Logic Plugins

This layer allows different teams within Statsig to insert custom business logic tailored to their specific needs. For example, one team might add specific tags or attributes to the data, while another might modify event structures for a particular customer.

By using plugins, the system can support diverse use cases without requiring a separate pipeline for each customer. This makes the pipeline both scalable and versatile.
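
One common way to structure such plugins is a registry of small transform functions applied in order. The sketch below shows that generic pattern under that assumption; it is not Statsig's plugin API:

```python
from typing import Callable

Event = dict
Plugin = Callable[[Event], Event]

PLUGINS: list[Plugin] = []


def register(plugin: Plugin) -> Plugin:
    """Decorator that adds a transform to the shared pipeline."""
    PLUGINS.append(plugin)
    return plugin


@register
def tag_environment(event: Event) -> Event:
    # One team's plugin: attach extra attributes to each event.
    event.setdefault("tags", []).append("production")
    return event


@register
def rename_legacy_field(event: Event) -> Event:
    # Another team's plugin: reshape events for a specific customer.
    if "old_name" in event:
        event["new_name"] = event.pop("old_name")
    return event


def apply_plugins(event: Event) -> Event:
    for plugin in PLUGINS:
        event = plugin(event)
    return event
```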

Writer

Once the data has been cleaned, transformed, and customized, the Writer finalizes it by writing it to the appropriate destination. 

This could be a database, a data warehouse, or an analytics tool, depending on where the data is needed.

4 - Routing and Integration Layer

The routing and integration layer in Statsig's pipeline is responsible for directing processed data to its final destination.

See the diagram below:

Let’s look at each branch in more detail:

Warehouse Router

The Warehouse Router is responsible for deciding where the data should go based on factors like customer preferences, event types, and priority. It dynamically routes data to various destinations such as BigQuery or other data warehouses.

Here’s how it works: latency-insensitive data is accumulated and delivered through cost-efficient bulk load jobs, while time-sensitive data is streamed through the storage write API for low-latency delivery.

The Warehouse Router guarantees efficient resource utilization by distinguishing between latency-sensitive and latency-insensitive data. It saves costs without compromising on performance for urgent tasks.
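
A stripped-down sketch of that routing decision, with hypothetical stand-ins for the two write paths:

```python
def stream_low_latency(batch: list[dict]) -> None:
    ...  # would call BigQuery's storage write API (fast, more expensive)


def enqueue_load_job(batch: list[dict]) -> None:
    ...  # would stage the batch in GCS and submit a bulk load job (cheap)


def route(batch: list[dict], latency_sensitive: bool) -> None:
    # Urgent data takes the expensive low-latency path; everything
    # else is deferred to cost-efficient batch loads.
    if latency_sensitive:
        stream_low_latency(batch)
    else:
        enqueue_load_job(batch)
```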

The Side Effects Service

This service handles external integrations triggered by specific events in the pipeline, such as sending webhooks to third-party systems when particular events arrive.

It supports any kind of event-level trigger, making it highly customizable for customer-specific workflows. 

The Real-Time Event Stream

This service is designed for situations where data needs to be accessed almost instantaneously.

It uses Redis, a fast in-memory data store, to cache and retrieve data in real-time so that customers querying the data experience minimal delays. 

The Shadow Pipeline

The Shadow Pipeline is an important testing feature in Statsig’s event streaming system.

It acts as a safety net to ensure that any updates or changes to the system don’t disrupt its ability to process over a trillion events a day. 

In essence, proposed changes run against a shadow copy of the pipeline that processes a mirror of production traffic, so their behavior can be validated before they are promoted to the live system.

Statsig’s Cost Optimization Strategies

Statsig employed multiple cost optimization strategies to handle the challenge of processing over a trillion events daily while keeping operational expenses as low as possible. 

These strategies involve a mix of technical solutions, infrastructure choices, and design decisions. 

Let’s break down each key effort in more detail:

GCS Upload via Pub/Sub

Instead of sending all event data directly into Pub/Sub, Statsig writes the majority of the data to Google Cloud Storage (GCS) in a compressed format.

Using GCS is significantly cheaper than relying solely on Pub/Sub for storing large amounts of data. It helps reduce costs while maintaining scalability. Pub/Sub is used only to pass file pointers (metadata) that direct downstream systems to retrieve the data from GCS.

Async Workloads on Spot Nodes

Statsig runs non-time-sensitive (asynchronous) workloads on spot nodes: spare-capacity virtual machines offered at a steep discount in exchange for the possibility of preemption.

Leveraging spot nodes reduces VM costs without compromising performance for less urgent processes. Also, since these workloads don’t require constant uptime, occasional interruptions don’t impact the system’s overall functionality.

Deduplication with Memcache

A large portion of incoming events may include duplicates, which add unnecessary processing overhead. Deduplication is a key feature that saves processing resources and ensures downstream systems only handle unique data.

To handle deduplication, Statsig uses Memcache to identify and discard duplicate events early in the pipeline.

Zstandard (zstd) Compression

Statsig switched from using zlib compression to ZSTD, a more efficient compression algorithm.

ZSTD achieves better compression rates (around 95%) while using less CPU power, compared to zlib’s 90% compression. This improvement reduced storage requirements and processing power.
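
The snippet below compares the two on a synthetic, repetitive JSON payload. Exact ratios on Statsig's real event logs will differ, but the relative behavior (smaller output and less CPU for zstd) is representative:

```python
import json
import zlib

import zstandard as zstd  # pip install zstandard

# A synthetic, repetitive payload standing in for batched event logs.
payload = json.dumps([{"event": "page_view", "user": i % 100}
                      for i in range(50_000)]).encode()

zlib_out = zlib.compress(payload, 6)                  # zlib, default-ish level
zstd_out = zstd.ZstdCompressor(level=3).compress(payload)  # zstd, fast level

for name, out in [("zlib", zlib_out), ("zstd", zstd_out)]:
    saved = 100 * (1 - len(out) / len(payload))
    print(f"{name}: {len(out):>9,} bytes ({saved:.1f}% smaller)")
```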

Batching Efficiency via CPU Optimization

Statsig also adjusted the CPU allocation for its request recorder (from 2 CPU to 12 CPU), enabling it to handle larger batches of data more efficiently.

This is because larger batches reduce the number of write operations to storage systems, improving cost efficiency while maintaining high throughput.

Load Jobs vs. Live Streaming

For data that doesn’t need to be processed immediately, Statsig uses load jobs to process and upload data in bulk, which is much cheaper.

On the other hand, for time-sensitive data, they use the storage write API, which provides low-latency delivery but at a higher cost.

Differentiating between these two types of data saves money while meeting customer requirements for both real-time and batch processing.

Optimized CPU and Memory Utilization

Statsig tunes CPU and memory usage based on actual host utilization rather than pod utilization.

Also, pods are configured without strict usage limits, allowing them to make full use of available resources when needed. This prevents underutilization of expensive hardware resources and maximizes cost-effectiveness.

Aggressive Host-Level Resource Stacking

Statsig stacks multiple pods onto a single host aggressively to use every bit of available CPU and memory.

By fine-tuning flow control and concurrency settings, they prevent resource contention while maintaining high performance. This approach helps achieve cost efficiency at the host level by reducing the number of machines needed.

Conclusion

Statsig’s journey to streaming over a trillion events daily shows how, through innovative engineering, a company can achieve massive scale without compromising efficiency.

By designing a robust data pipeline with key components like a reliable ingestion layer, scalable message queues, and cost-optimized routing and integration layers, Statsig has built an infrastructure capable of supporting rapid growth while maintaining high reliability and performance. Also, leveraging features and tools like Pub/Sub, GCS, and advanced compression techniques, the platform balances the challenges of low latency, data integrity, and cost-effectiveness.

A key differentiator for Statsig is its approach to cost optimization and scalability, achieved through strategies such as using spot nodes, implementing deduplication, and differentiating latency-sensitive from latency-insensitive workloads. These efforts not only ensure the system's resilience but also allow them to offer their platform at competitive prices to a wide range of customers.


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com.




