MarkTechPost@AI, 14 hours ago
Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV)

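This article explains how to use SDV (Synthetic Data Vault), an open-source Python library, to generate high-quality synthetic data. Because real-world data is often costly, messy, and restricted by privacy rules, synthetic data offers an effective alternative. The article walks through the steps of generating synthetic data with SDV: installing the library, loading the data and metadata, training a model, and evaluating the quality of the synthetic data. Using example code and charts, it shows how SDV produces synthetic data with patterns and distributions similar to the real data, which helps overcome data privacy and availability challenges and supports robust data analysis and machine learning workflows.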

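🛠️ SDV (Synthetic Data Vault) is an open-source Python library that uses machine learning to generate realistic tabular data. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training, which addresses the cost, messiness, and privacy restrictions of real data.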

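📊 SDV uses metadata to interpret the data. The metadata includes the table name, the primary key, the data type of each column, and optional column formats (such as datetime patterns or ID patterns). Metadata can be loaded from a JSON file or inferred automatically by SDV, though it may need manual review and updating to ensure accuracy.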

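📈 The article shows how to train a model with SDV and generate synthetic data, using the `num_rows` parameter to control the number of rows generated. SDV also provides tools for evaluating synthetic data quality: it can generate a quality report and, through its built-in plotting tools, compare the distributions of synthetic and real data, for example for sales and monthly trends.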

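🔍 By comparing the synthetic data with the real data, the article verifies that key metrics (such as the sales distribution and monthly trends) remain consistent. It uses the `evaluate_quality` function to generate a quality report and the `get_column_plot` function to visualize the comparison for specific columns, checking the quality and usability of the synthetic data.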

Real-world data is often costly, messy, and limited by privacy rules. Synthetic data offers a solution, and it is already widely used.

SDV (Synthetic Data Vault) is an open-source Python library that generates realistic tabular data using machine learning. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training.

In this tutorial, we’ll use SDV to generate synthetic data step by step.

We will first install the sdv library, for example with `pip install sdv`.

from sdv.io.local import CSVHandler

connector = CSVHandler()
FOLDER_NAME = '.'  # If the data is in the same directory

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

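Next, we import the necessary module and connect to our local folder containing the dataset files. This reads the CSV files from the specified folder and stores them as pandas DataFrames in a dictionary. In this case, we access the main dataset using data['data']; the key is the table name, which SDV takes from the CSV file name, so this assumes the folder contains a file named data.csv.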
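To make the shape of that result concrete, here is a minimal stand-in that mimics the idea using only the standard library: one entry per CSV file, keyed by the file name without its extension. The helper name read_csv_folder, the column names, and the sample rows are all hypothetical, and the rows are plain dicts rather than pandas DataFrames. It is a sketch of the behaviour described above, not SDV's implementation.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_folder(folder_name):
    """Hypothetical helper: map each CSV file's stem to its list of rows."""
    tables = {}
    for path in sorted(Path(folder_name).glob("*.csv")):
        with path.open(newline="") as handle:
            tables[path.stem] = list(csv.DictReader(handle))
    return tables


# Build a throwaway folder holding a single file, data.csv (made-up rows).
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "data.csv").write_text(
        "Order ID,Date,Sales\n"
        "T000001,05-01-2023,120.5\n"
        "T000002,17-02-2023,80.0\n"
    )
    data = read_csv_folder(folder)

# Same access pattern as data['data'] in the tutorial.
sales_rows = data["data"]
print(list(data), len(sales_rows))
```

If the file were named sales.csv instead, the table would be reached as data['sales'].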

from sdv.metadata import Metadata

metadata = Metadata.load_from_json('metadata.json')

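We now import the metadata for our dataset. This metadata is stored in a JSON file and tells SDV how to interpret your data. It includes the table name, the primary key, the data type of each column, and optional column formats such as datetime patterns or ID patterns.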

Here is a sample metadata.json format:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}

from sdv.metadata import Metadata

metadata = Metadata.detect_from_dataframes(data)

Alternatively, we can use the SDV library to automatically infer the metadata. However, the results may not always be accurate or complete, so you might need to review and update it if there are any discrepancies.
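One way to approach that review is to check the metadata against a row of real data. The sketch below works on a plain dictionary in the same layout as the sample metadata.json shown earlier; the table name, column names, sample row, and the review_table helper are hypothetical, and the checks are only a small subset of what a careful manual review would cover.

```python
import re
from datetime import datetime

# Metadata in the same layout as the sample file; all names are placeholders.
metadata_dict = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "sales": {
            "primary_key": "order_id",
            "columns": {
                "order_id": {"sdtype": "id", "regex_format": "T[0-9]{6}"},
                "Date": {"sdtype": "datetime", "datetime_format": "%d-%m-%Y"},
                "Category": {"sdtype": "categorical"},
                "Sales": {"sdtype": "numerical"},
            },
        }
    },
}

# One made-up row standing in for the real dataset.
sample_row = {"order_id": "T000123", "Date": "05-01-2023",
              "Category": "Toys", "Sales": 120.5}


def review_table(table_meta, row):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    columns = table_meta["columns"]
    if table_meta.get("primary_key") not in columns:
        problems.append("primary key is not a declared column")
    for name in sorted(row.keys() - columns.keys()):
        problems.append(f"column {name!r} missing from metadata")
    for name, spec in columns.items():
        value = row.get(name)
        if value is None:
            problems.append(f"column {name!r} missing from data")
        elif "regex_format" in spec:
            if not re.fullmatch(spec["regex_format"], str(value)):
                problems.append(f"{name!r} does not match its regex_format")
        elif "datetime_format" in spec:
            try:
                datetime.strptime(str(value), spec["datetime_format"])
            except ValueError:
                problems.append(f"{name!r} does not match its datetime_format")
    return problems


problems = review_table(metadata_dict["tables"]["sales"], sample_row)
print(problems or "metadata looks consistent with the sample row")
```

A mismatch such as a date stored as 2023-01-05 while the metadata declares %d-%m-%Y is a typical discrepancy worth correcting before training.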

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)

With the metadata and original dataset ready, we can now use SDV to train a model and generate synthetic data. The model learns the structure and patterns in your real dataset and uses that knowledge to create synthetic records.

You can control how many rows to generate using the num_rows argument.
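For intuition about what "learning the structure and patterns" can mean, the sketch below implements a toy Gaussian copula for two numeric columns with the standard library only. It maps each column to normal scores through its ranks, measures the correlation of those scores, samples correlated normal values, and maps them back to observed values. This illustrates the general technique under simplifying assumptions; it is not how GaussianCopulaSynthesizer is implemented, and the data and function names are made up.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def to_normal_scores(values):
    """Map each value to a standard-normal score via its rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    scores = [0.0] * len(values)
    for rank, index in enumerate(order):
        scores[index] = STD_NORMAL.inv_cdf((rank + 0.5) / len(values))
    return scores


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the observed value at probability u."""
    position = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[position]


def fit_and_sample(col_a, col_b, num_rows, seed=0):
    """Fit the toy copula on two columns and draw num_rows synthetic pairs."""
    rng = random.Random(seed)
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    noise_scale = max(0.0, 1.0 - rho ** 2) ** 0.5  # guard against rounding
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(num_rows):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + noise_scale * rng.gauss(0, 1)
        rows.append((empirical_quantile(sorted_a, STD_NORMAL.cdf(z1)),
                     empirical_quantile(sorted_b, STD_NORMAL.cdf(z2))))
    return rows


# Made-up columns: units sold and sales amount, positively related.
units = [1, 2, 2, 3, 4, 5, 6, 8, 9, 12]
sales = [12.0, 9.5, 25.0, 31.0, 28.5, 52.0, 47.0, 79.0, 93.5, 88.0]
synthetic = fit_and_sample(units, sales, num_rows=5)
print(synthetic)
```

As with num_rows in SDV, the number of sampled rows is independent of the size of the training data, because rows are drawn from the fitted model rather than copied from the original table.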

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata
)

The SDV library also provides tools to evaluate the quality of your synthetic data by comparing it to the original dataset. A good place to start is generating a quality report.
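To get a feel for the kind of comparison such a report can rest on, the sketch below computes two simple similarity scores between a real and a synthetic column: a Kolmogorov-Smirnov complement for numeric values and a total variation complement for categories. Both range from 0 to 1, with 1 meaning identical distributions. That these correspond to the metrics behind the report's per-column scores is an assumption worth checking against the SDMetrics documentation; the functions here are standalone illustrations with made-up inputs, not calls into SDV.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synthetic):
    """1 minus the largest gap between the two empirical CDFs (numeric data)."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    max_gap = 0.0
    for x in sorted(set(real_sorted) | set(synth_sorted)):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def tv_complement(real, synthetic):
    """1 minus the total variation distance between category frequencies."""
    real_freq, synth_freq = Counter(real), Counter(synthetic)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in set(real_freq) | set(synth_freq)
    )
    return 1.0 - distance


# Made-up columns for illustration.
numeric_score = ks_complement([1, 2, 3, 4], [3, 4, 5, 6])
category_score = tv_complement(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(numeric_score, category_score)
```

A score close to 1 for a column suggests the synthetic values follow the real distribution closely, while a low score points to a column worth inspecting with a plot.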

You can also visualize how the synthetic data compares to the real data using SDV’s built-in plotting tools. For example, import get_column_plot from sdv.evaluation.single_table to create comparison plots for specific columns:

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name='Sales',
    metadata=metadata
)

fig.show()

We can observe that the distribution of the 'Sales' column in the real and synthetic data is very similar. To explore further, we can use matplotlib to create more detailed comparisons, such as visualizing the average monthly sales trends across both datasets.

import pandas as pd
import matplotlib.pyplot as plt

# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format='%d-%m-%Y')
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format='%d-%m-%Y')

# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')

# Merge the two series into a DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label='Actual Average Sales', marker='o')
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label='Synthetic Average Sales', marker='o')
plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0)  # y-axis starts at 0
plt.tight_layout()
plt.show()

This chart also shows that the average monthly sales in both datasets are very similar, with only minimal differences.
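A visual impression like this can be backed by a number. The sketch below computes, for each month present in both datasets, the relative gap between the synthetic and the real average. It uses plain (month, sales) pairs and the standard library so that it runs on its own; the helper names and the figures are made up for illustration.

```python
from collections import defaultdict


def monthly_average(rows):
    """rows: iterable of (month, sales) pairs -> {month: average sales}."""
    totals = defaultdict(lambda: [0.0, 0])
    for month, sales in rows:
        totals[month][0] += sales
        totals[month][1] += 1
    return {month: total / count for month, (total, count) in totals.items()}


def relative_gaps(real_rows, synthetic_rows):
    """Per-month |synthetic - real| / real, for months present in both."""
    real_avg = monthly_average(real_rows)
    synth_avg = monthly_average(synthetic_rows)
    return {
        month: abs(synth_avg[month] - real_avg[month]) / real_avg[month]
        for month in sorted(real_avg.keys() & synth_avg.keys())
        if real_avg[month] != 0
    }


# Made-up rows: (year-month, sales amount).
real_rows = [("2023-01", 100.0), ("2023-01", 120.0), ("2023-02", 90.0)]
synthetic_rows = [("2023-01", 104.5), ("2023-01", 115.5), ("2023-02", 99.0)]

gaps = relative_gaps(real_rows, synthetic_rows)
worst_month = max(gaps, key=gaps.get)
print(gaps, worst_month)
```

Restricting the comparison to months found in both datasets avoids treating a month that is missing from one side as an average of zero, which is what filling gaps with 0 does in the plotted series.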

In this tutorial, we demonstrated how to prepare your data and metadata for synthetic data generation using the SDV library. By training a model on your original dataset, SDV can create high-quality synthetic data that closely mirrors the real data’s patterns and distributions. We also explored how to evaluate and visualize the synthetic data, confirming that key metrics like sales distributions and monthly trends remain consistent. Synthetic data offers a powerful way to overcome privacy and availability challenges while enabling robust data analysis and machine learning workflows.



The post Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data Vault (SDV) appeared first on MarkTechPost.
