ByteByteGo, August 13, 2024
Counting Billions of Content Usage at Canva

 

Canva's Creators Program processes billions of content usages every month and requires accurate counting to ensure creators are paid fairly. This article describes how Canva's engineering team evolved their content usage counting service from an initial MySQL-based architecture, through a migration to DynamoDB, to a final OLAP-based Snowflake solution, along with the challenges they faced and the lessons they learned.

📈 **The initial MySQL-based architecture** Canva's original architecture was built on MySQL: a MySQL database stored the usage data, and separate worker services handled the different pipeline stages (data collection, deduplication, and aggregation). The architecture was simple and easy to understand, but it ran into scalability problems as data volume grew.

- Every usage record required at least one database round trip, so processing N records cost O(N) database queries, which became increasingly problematic as volume grew.
- MySQL RDS does not support horizontal scaling through partitioning; each time more storage was needed, the RDS instance size had to be doubled, creating significant operational overhead.
- Once the MySQL RDS instance reached multiple terabytes, it became very expensive to maintain.
- During incidents, finding and fixing problems was difficult, because engineers had to inspect the database and repair bad data by hand.

💻 **Migration to DynamoDB** To address the scalability limits of the MySQL-based counting service, the Canva team first considered DynamoDB as a potential solution. DynamoDB is known for handling large-scale, high-throughput workloads, a good fit for Canva's rapidly growing data needs.

- The team migrated the raw usage events from the data collection stage to DynamoDB, which immediately relieved the storage constraints.
- After careful evaluation, however, Canva decided against a full migration: DynamoDB could effectively solve the storage scalability problem, but not the underlying problem of processing scalability.
- The team found it hard to eliminate the dependence on frequent database round trips, a key bottleneck in their existing system.

📰 **The OLAP-based counting service** Canva's latest architecture adopts an OLAP-based Snowflake solution that separates storage from compute and follows an ELT (Extract, Load, Transform) approach.

- Raw usage data is extracted from various sources and loaded into Snowflake through a reliable data replication pipeline provided by Canva's data platform team.
- Snowflake's computational power and DBT (Data Build Tool) are used for complex transformations, written as SQL-like queries that allow end-to-end calculations directly on the source data.
- Intermediary outputs were eliminated; intermediate transformation results are instead materialized as SQL views.
- The main benefits of this architecture include:
  - Pipeline latency dropped from more than a day to under an hour.
  - Incident handling became much more manageable; most problems can be resolved by simply re-running the entire pipeline, without manual database intervention.
  - Stored data shrank by more than 50%, and thousands of lines of deduplication and aggregation code were eliminated.
  - Incidents dropped to one every few months or fewer.

📁 **Challenges of the new solution** Despite its many advantages, the new solution also introduced new challenges:

- Transformation complexity: data transformation jobs written in SQL-like languages and deployed as standalone services introduced new deployment and compatibility considerations.
- Data offloading: Canva needed to build a reliable mechanism to offload data from Snowflake to other systems for further analysis or processing.

Hands-on Rust Developer Workshop: Build a Low-Latency Social Media App (Sponsored)

During this free interactive workshop, aimed at developers, engineers, and architects, you will learn how to:

If you're an application developer with an interest in Rust, Tokio, and event-driven architectures, this workshop is for you! This is a great way to discover the NoSQL strategies used by top teams and apply them in a guided, supportive environment.

Register for Free


Disclaimer: The details in this post have been derived from the Canva Engineering Blog. All credit for the technical details goes to the Canva engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

What if incorrect counting results in incorrect payments? 

Either the company making the payment or the users receiving it lose money because of the incorrect count. Both scenarios are problematic from a business perspective.

This is exactly the situation that Canva faced when they launched the Creators Program.

As you might already know, Canva is a tool that makes design accessible to everyone worldwide. One of the main ways they make this possible is through the Canva Creators Program.

In the three years since its launch, the use of content from this program has doubled every 18 months. Canva processes billions of content uses monthly, and creators are paid based on these counts. This includes the use of templates, images, videos, and more.

It is a critical requirement for Canva to count the usage data of this content accurately since the payments made to the creators depend on this data. However, it also presents some big challenges:

In this post, we will look at the various architectures Canva’s engineering team experimented with to implement a robust counting service and the lessons they learned in the process.


The Initial Counting Service Design

Canva's original design for the content usage counting service was built on a MySQL database, a familiar and widely used technology stack.

This initial design comprised several key components: a MySQL database for storing the usage data, and separate worker services for the different pipeline stages.

The process flow for the solution could be broken down into three main steps:

- Data collection: raw usage events are gathered and stored in the MySQL database.
- Deduplication: collected records are scanned sequentially and duplicate events are discarded.
- Aggregation: deduplicated records are rolled up into usage counts.

The diagram below shows the architecture at a high level.

This architecture employed a single-threaded sequential process for deduplication, using a pointer to track the latest scanned record. While this approach made it easier to reason about and verify data processing, especially during troubleshooting or incident recovery, it faced significant scalability challenges. 

The system required at least one database round trip per usage record, resulting in O(N) database queries for N records, which became increasingly problematic as data volume grew. 
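The bottleneck described above can be illustrated with a short sketch. This is not Canva's actual code; `FakeDb` and every name below are hypothetical stand-ins for the MySQL access layer, used only to show why the cost grows as O(N) round trips:

```python
# Illustrative sketch of the original pipeline's bottleneck: a single-threaded
# deduplication pass that issues at least one database query per usage record.
# FakeDb is a hypothetical stand-in for MySQL; it counts round trips.

class FakeDb:
    def __init__(self, raw_events):
        self.raw_events = raw_events   # collected usage records
        self.deduped = []              # deduplicated output "table"
        self.queries = 0               # number of database round trips

    def fetch(self, pointer):
        """Read the next unscanned record (one round trip)."""
        self.queries += 1
        return self.raw_events[pointer] if pointer < len(self.raw_events) else None

    def insert_if_new(self, event):
        """Insert unless a record with the same id exists (one round trip)."""
        self.queries += 1
        if all(e["id"] != event["id"] for e in self.deduped):
            self.deduped.append(event)

def deduplicate(db):
    pointer = 0  # tracks the latest scanned record, as in the original design
    while (event := db.fetch(pointer)) is not None:
        db.insert_if_new(event)
        pointer += 1
    return pointer

events = [{"id": 1}, {"id": 2}, {"id": 1}, {"id": 3}]
db = FakeDb(events)
scanned = deduplicate(db)
# For N records the loop issues roughly 2N round trips (one fetch plus one
# insert per record), plus a final fetch that returns None.
```

Batching reads and writes would cut the query count, but the single-threaded pointer design traded that efficiency for processing that was easy to reason about and verify.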

The initial MySQL-based architecture prioritized simplicity and familiarity over scalability. It allowed for quick implementation but created substantial challenges as the system expanded. 


Latest articles

If you’re not a paid subscriber, here’s what you missed.

    A Crash Course on Microservices Design Patterns

    A Crash Course on Domain-Driven Design

    "Tidying" Code

    A Crash Course on Relational Database Design

    A Crash Course on Distributed Systems

To receive all the full articles and support ByteByteGo, consider subscribing:

Subscribe now


Migration to DynamoDB

Faced with the scalability limitations of the MySQL-based counting service, the team initially looked to DynamoDB as a potential solution.

This decision was primarily driven by DynamoDB's reputation for handling large-scale, high-throughput workloads - a perfect fit for Canva's rapidly growing data needs. 

The migration process began with moving raw usage events from the data collection stage to DynamoDB, which provided immediate relief to the storage constraints. This initial success prompted the team to consider moving the entirety of their data to DynamoDB. It was a move that would have necessitated a substantial rewrite of their codebase.

However, after careful evaluation, Canva decided against a full migration to DynamoDB. 

While DynamoDB could have effectively addressed the storage scalability issues, it wouldn't have solved the fundamental problem of processing scalability. The team found it challenging to eliminate the need for frequent database round trips, which was a key bottleneck in their existing system. 

This reveals a crucial lesson in system design: sometimes, what appears to be a storage problem is a processing problem in disguise.

Canva's approach clearly shows the importance of thoroughly analyzing the root causes of system limitations before committing to major architectural changes. It also highlights the complexity of scaling data-intensive applications, where the interplay between storage and processing capabilities can be subtle and non-obvious.

The OLAP-based Counting Service

Canva's latest architecture for the content usage counting service shows a shift from traditional OLTP databases to an OLAP-based solution, specifically using Snowflake. 

The change came after realizing that previous attempts with MySQL and DynamoDB couldn't adequately address their scalability and processing needs. The new architecture altered how Canva processed and stored the usage data, adopting an ELT (Extract, Load, Transform) approach.

The diagram below shows the new architecture:

In the extraction phase, Canva pulled raw usage data from various sources, including web browsers and mobile apps. This data was then loaded into Snowflake using a reliable data replication pipeline provided by Canva’s data platform team. The reliability of this data replication was crucial, as it formed the foundation for all subsequent processing.

The transformation phase used Snowflake's powerful computational capabilities. It also utilized DBT (Data Build Tool) to define complex transformations. 

These transformations were written as SQL-like queries, allowing for end-to-end calculations directly on the source data. For example, one transformation aggregated usages per brand using a SQL query that selected data from a previous step named 'daily_template_usages' and grouped it by 'day_id' and 'template_brand'.

The SQL below shows the aggregate query.
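A plausible shape for that query, based on the description above (only 'daily_template_usages', 'day_id', and 'template_brand' come from the text; 'usage_count' and the output alias are assumptions):

```sql
-- Hypothetical dbt-style reconstruction of the per-brand aggregation.
-- usage_count is an assumed measure column on daily_template_usages.
SELECT
    day_id,
    template_brand,
    SUM(usage_count) AS total_usages
FROM daily_template_usages
GROUP BY day_id, template_brand
```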

The main steps in the transformation process were as follows:

A key aspect of this new architecture was the elimination of intermediary outputs. Instead of persisting data at various pipeline stages, Canva materialized intermediate transformation outputs as SQL Views. 
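In Snowflake, materializing an intermediate step as a view rather than a persisted table might look like the following sketch (all object and column names here are assumptions, not Canva's actual schema):

```sql
-- Define the intermediate step as a view: no intermediate data is persisted,
-- and Snowflake computes it on demand during downstream queries.
-- All names below are hypothetical.
CREATE OR REPLACE VIEW daily_template_usages AS
SELECT
    DATE_TRUNC('day', used_at) AS day_id,
    template_brand,
    COUNT(*) AS usage_count
FROM raw_usage_events
GROUP BY day_id, template_brand;
```

In a dbt project, the same effect comes from setting `materialized='view'` in the model configuration rather than issuing the DDL by hand.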

The Advantage of OLAP Database

The separation of storage and computing in an OLAP database like Snowflake was a game-changer for Canva. It enabled them to scale computational resources independently. 

As a result, they could now aggregate billions of usage records within minutes, a task that previously took over a day. This improvement was largely due to most of the computation being done in memory, which is several orders of magnitude faster than the database round trips required in their previous architecture.

There were several improvements, such as:

- Pipeline latency dropped from more than a day to under an hour.
- Incident handling became much more manageable; most problems could be resolved by simply re-running the entire pipeline, without manual database intervention.
- Stored data shrank by more than 50%, and thousands of lines of deduplication and aggregation code were eliminated.
- Incidents dropped to one every few months or fewer.

Challenges of the New Solution

Despite the advantages, the solution also introduced new challenges, such as:

- Transformation complexity: data transformation jobs written in SQL-like languages and deployed as standalone services introduced new deployment and compatibility considerations.
- Data offloading: Canva needed to build a reliable mechanism to offload data from Snowflake to other systems for further analysis or processing.

Conclusion

Canva’s journey of implementing the counting service for the Creators Program is full of learning for software developers and architects. 

Some of the key points to take away are as follows:

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing sponsorship@bytebytego.com
