Make Data Pipelines Debuggable by Storing All Source References

This post shares the experience of improving a data pipeline and its debuggability by storing references to source data in the database. The author worked at a company that collected data from disparate sources, then processed and deduplicated it for use by data science and customer support teams. To address problems with missing and mismatched data, the team began linking all output data back to its original sources; with those references indexed in the database, the provenance of any piece of data could be traced easily, significantly improving debugging efficiency and lowering maintenance costs.

💾 **Source data storage:** All raw data is stored in an S3 bucket, organized by source and date, for later reprocessing and traceability.

🔗 **Data linkage:** A `source` table ties input data to the output data in the database, ensuring the provenance of every row can be tracked.

🔍 **Debugging efficiency:** Database queries can readily answer questions like "Why is data missing from the output CSV?" and "Why doesn't the output data match expectations?", making it fast to locate and fix data quality issues.

💰 **Cost effectiveness:** Storage is relatively cheap; compared to the engineer time it saves, the database storage is well worth it, especially at startup scale.

Published on June 8, 2025 4:16 AM GMT

A few jobs ago, I worked at a company that collected data from disparate sources, then processed and deduplicated it into spreadsheets for ingestion by the data science and customer support teams. Some common questions the engineering team got were:

* Why is the data in some input CSV missing in the output?
* Why is data in the output CSV not matching what we expect?

To debug these problems, the process was to try to reverse engineer where the data came from, then guess which path that data took through the monolithic data processor.

This is the story of how we stopped doing that, and started storing references to all source data for every piece of output data.

(This is reconstructed from memory, so I no longer claim 100% accuracy.)

Get Source Data Into Your Database

The good news is that before I even started at this company, it was well understood that we needed to keep our data somewhere durable, since we might need to reprocess it one day. This involved putting all of the data in an S3 bucket, organized by source and date.
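For illustration, the layout might have looked something like this (hypothetical bucket and key names, not the company's actual ones):

```
s3://raw-source-data/client-x/2024-07-01/export.csv
s3://raw-source-data/client-x/2024-08-01/export.csv
s3://raw-source-data/client-y/2024-08-01/export.csv
```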

In order to tie our outputs to our inputs, we needed to first get our inputs into the database. We previously had some intermediate formats, but we wanted to know exactly where our data was coming from. So we added a source table:

csv_source

| id | filename |
|----|----------|
| 1 | client-x-2024-07-01.csv |
| 2 | client-x-2024-08-01.csv |

csv_data

| id | csv_source_id | first_name | last_name | job |
|----|---------------|------------|-----------|-----|
| 10 | 1 | Brendan | Long | Basket Weaver |
| 11 | 2 | Brendan | Long | Senior Basket Weaver |
| 12 | 2 | Example | McExampleton | Basket Weaver |

(OK, so we didn't literally upload the raw CSV; we only loaded it after mapping it to standardized column names, sort of like this.)
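As a minimal sketch, the input side of the schema might have looked something like this in PostgreSQL (reconstructed guesswork; the types, constraints, and the date column are assumptions, not the actual schema):

```sql
-- One row per ingested file, so every datum can be traced back to S3.
CREATE TABLE csv_source (
    id       bigserial PRIMARY KEY,
    filename text NOT NULL UNIQUE,
    date     date NOT NULL  -- export date; the debugging queries below filter on it
);

-- One row per standardized input row, pointing back at the file it came from.
CREATE TABLE csv_data (
    id            bigserial PRIMARY KEY,
    csv_source_id bigint NOT NULL REFERENCES csv_source (id),
    first_name    text,
    last_name     text,
    job           text
);
```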

Tie Your Output Data to the Source Data

Now that we had our source data, the next step was to tie our output data to it, so we could start to answer questions like "Where did this Brendan Long guy come from?" and "Where did this Basket Weaver job come from?". Our output data frequently needed to deduplicate data from multiple sources, and the obvious choice here would be to link to the source of the winning data, but for debugging we don't just want to know about the winning data; we want to know about all of it.

So we stored all of the sources for each piece of data. We did this with join tables, but the examples will show them as inline arrays to keep this readable (the join tables themselves are sketched after the tables below).

jobs

| id | csv_data_ids | name |
|----|--------------|------|
| 20 | {10,12} | Basket Weaver |
| 21 | {11} | Senior Basket Weaver |

employees

| id | csv_data_ids | first_name | last_name | job_id |
|----|--------------|------------|-----------|--------|
| 30 | {10,11} | Brendan | Long | 21 |
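Since the real implementation used join tables rather than array columns, the underlying tables presumably looked more like this (a hypothetical sketch, not the actual schema):

```sql
-- Join table linking each output job to every input row that mentioned it.
CREATE TABLE jobs_csv_data (
    job_id      bigint NOT NULL REFERENCES jobs (id),
    csv_data_id bigint NOT NULL REFERENCES csv_data (id),
    PRIMARY KEY (job_id, csv_data_id)
);

-- The same pattern for employees.
CREATE TABLE employees_csv_data (
    employee_id bigint NOT NULL REFERENCES employees (id),
    csv_data_id bigint NOT NULL REFERENCES csv_data (id),
    PRIMARY KEY (employee_id, csv_data_id)
);
```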

Profit

Now that we have our data linked, we can answer the questions above trivially with database queries.

Why is the data in some input CSV missing in the output?

For example, why don't we see a row with Brendan the Basket Weaver from 2024-07-01 in our output?

```sql
SELECT employees.*, jobs.*
FROM csv_source
LEFT JOIN csv_data ON csv_source.id = csv_data.csv_source_id
LEFT JOIN employees ON csv_data.id = ANY(employees.csv_data_ids)
LEFT JOIN jobs ON employees.job_id = jobs.id
WHERE csv_source.date = '2024-07-01'
  AND csv_data.first_name = 'Brendan'
  AND csv_data.last_name = 'Long'
```

Result:

| id | csv_data_ids | first_name | last_name | job_id | id | csv_data_ids | name |
|----|--------------|------------|-----------|--------|----|--------------|------|
| 30 | {10,11} | Brendan | Long | 21 | 21 | {11} | Senior Basket Weaver |

We can see from this output that we do have data for Brendan Long, but we see it in multiple places (csv_data_id = 10 and 11), and the winning row shows that he's a Senior Basket Weaver now.

In cases like this, we left it implicit that the newest data always wins, but in cases where it was non-obvious, we would also include a column to show why we picked one source over another.
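For instance, a "newest wins" rule could be implemented with something like the following (a hypothetical sketch, assuming a date column on csv_source as in the query above, not the company's actual code):

```sql
-- Keep the most recent row per (first_name, last_name) as the winner.
-- DISTINCT ON returns the first row of each group under the ORDER BY,
-- so sorting by date DESC within each group picks the newest source.
SELECT DISTINCT ON (csv_data.first_name, csv_data.last_name)
    csv_data.*
FROM csv_data
JOIN csv_source ON csv_source.id = csv_data.csv_source_id
ORDER BY csv_data.first_name, csv_data.last_name, csv_source.date DESC;
```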

Why is data in the output CSV not matching what we expect?

For example, where did the data for this Brendan Long the Senior Basket Weaver come from?

We can answer this one by starting from the employees table and joining back to the source data:

```sql
SELECT csv_source.*, csv_data.*
FROM employees
LEFT JOIN csv_data ON csv_data.id = ANY(employees.csv_data_ids)
LEFT JOIN csv_source ON csv_source.id = csv_data.csv_source_id
LEFT JOIN jobs ON employees.job_id = jobs.id
WHERE employees.first_name = 'Brendan'
  AND employees.last_name = 'Long'
```

Result:

| id | filename | id | csv_source_id | first_name | last_name | job |
|----|----------|----|---------------|------------|-----------|-----|
| 1 | client-x-2024-07-01.csv | 10 | 1 | Brendan | Long | Basket Weaver |
| 2 | client-x-2024-08-01.csv | 11 | 2 | Brendan | Long | Senior Basket Weaver |

From this, we can easily see that Brendan Long the Senior Basket Weaver came from client-x-2024-08-01.csv, and that this overwrote data from client-x-2024-07-01.csv.

But What About the Storage Costs?

Storage is insanely cheap compared to the time of 2+ employees, at least at startup scale. RDS storage costs around $0.115 per GB-month, so we probably paid the equivalent of a few hours of an engineer's salary to store this data the entire time I was at the company.

Go Forth and Debug

With these changes, much effort was saved in debugging data quality issues. The data scientists and client support specialists could run queries to answer questions for themselves, and the engineers were able to go back to the important work of writing database clients in OCaml[1].

1. ^ Claude thought this was a joke when I asked it for editing feedback, but this isn't a joke.


