AWS Machine Learning Blog
Build an enterprise synthetic data strategy using Amazon Bedrock

As AI technology advances rapidly, synthetic data is becoming key to solving data privacy challenges. This post explores how to use Amazon Bedrock to generate synthetic data that drives AI model training, analytics, and application development while protecting sensitive data. It examines the advantages and challenges of synthetic data, the key steps for building high-quality synthetic datasets, and the importance of differential privacy in synthetic data generation. With synthetic data, enterprises can unlock the value of their data while complying with strict privacy regulations.

💡 **Advantages of synthetic data**: Synthetic data mimics the statistical properties and patterns of real data while being entirely fictitious. This lets enterprises safely train AI models, run analyses, and develop applications without exposing sensitive information, effectively bridging the gap between data utility and privacy protection.

🤔 **Challenges of synthetic data**: Generating high-quality synthetic data involves multiple challenges, including ensuring data quality, managing bias, balancing privacy with utility, validating accuracy, and closing the gap with the real world. These factors require careful consideration so that synthetic data accurately reflects real-world complexity.

🛠️ **The Amazon Bedrock solution**: Amazon Bedrock offers a broad set of capabilities for building generative AI applications with a focus on security, privacy, and responsible AI. With Amazon Bedrock, you can define data rules, generate code, and assemble and scale synthetic datasets, producing diverse, realistic synthetic data for system testing, machine learning model training, and more.

🛡️ **Why differential privacy matters**: Even synthetic data can carry privacy risk. Differential privacy injects calibrated noise into the data generation process so that nothing can be inferred about any single data point or confidential information in the source dataset. This protects the privacy of the original data while preserving the statistical properties that keep the synthetic data useful for analysis and modeling.

The AI landscape is rapidly evolving, and more organizations are recognizing the power of synthetic data to drive innovation. However, enterprises looking to use AI face a major roadblock: how to safely use sensitive data. Stringent privacy regulations make it risky to use such data, even with robust anonymization. Advanced analytics can potentially uncover hidden correlations and reveal real data, leading to compliance issues and reputational damage. Additionally, many industries struggle with a scarcity of high-quality, diverse datasets needed for critical processes like software testing, product development, and AI model training. This data shortage can hinder innovation, slowing down development cycles across various business operations.

Organizations need innovative solutions to unlock the potential of data-driven processes without compromising ethics or data privacy. This is where synthetic data comes in—a solution that mimics the statistical properties and patterns of real data while being entirely fictitious. By using synthetic data, enterprises can train AI models, conduct analyses, and develop applications without the risk of exposing sensitive information. Synthetic data effectively bridges the gap between data utility and privacy protection. However, creating high-quality synthetic data comes with significant challenges:

    Ensuring data quality so the generated data is realistic and internally consistent
    Managing bias carried over from, or introduced into, the generation process
    Balancing privacy protection against data utility
    Validating that the generated data is accurate and representative
    Bridging the gap between synthetic data and real-world complexity

In this post, we explore how to use Amazon Bedrock for synthetic data generation, considering these challenges alongside the potential benefits to develop effective strategies for various applications across multiple industries, including AI and machine learning (ML). Amazon Bedrock offers a broad set of capabilities to build generative AI applications with a focus on security, privacy, and responsible AI. Built within the AWS landscape, Amazon Bedrock is designed to help maintain the security and compliance standards required for enterprise use.

Attributes of high-quality synthetic data

To be truly effective, synthetic data must be both realistic and reliable. This means it should accurately reflect the complexities and nuances of real-world data while maintaining complete anonymity. A high-quality synthetic dataset preserves the statistical properties, distributions, and inter-field relationships of the original data while containing nothing that can be traced back to real records.

Solution overview

Generating useful synthetic data that protects privacy requires a thoughtful approach. The following figure represents the high-level architecture of the proposed solution. The process involves three key steps:

    1. Identify validation rules that define the structure and statistical properties of the real data.
    2. Use those rules to generate code with Amazon Bedrock that creates synthetic data subsets.
    3. Combine multiple synthetic subsets into full datasets.

Let’s explore these three key steps for creating useful synthetic data in more detail.

Step 1: Define data rules and characteristics

To create synthetic datasets, start by establishing clear rules that capture the essence of your target data:

    Use domain-specific knowledge to identify key attributes and relationships.
    Study existing public datasets, academic resources, and industry documentation.
    Use tools like AWS Glue DataBrew, Amazon Bedrock, or open source alternatives (such as Great Expectations) to analyze data structures and patterns.
    Develop a comprehensive rule-set covering:
      Data types and value ranges
      Inter-field relationships
      Quality standards
      Domain-specific patterns and anomalies

This foundational step makes sure your synthetic data accurately reflects real-world scenarios in your industry.
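For illustration, such a rule-set can be captured directly in code. The following is a minimal sketch using a plain Python dictionary; the field names, ranges, and patterns are assumptions based on the EBS volume example later in this post, not an authoritative schema:

import re

# Minimal, hypothetical rule-set sketch; names, ranges, and patterns are illustrative assumptions.
ebs_volume_rules = {
    "Region": {"type": str, "allowed": ["us-east-1", "us-west-2", "eu-west-1", "ap-southeast-1"]},
    "Volume ID": {"type": str, "pattern": r"^vol-[0-9a-f]+$"},
    "Volume Type": {"type": str, "allowed": ["gp2", "gp3", "io1", "io2", "st1", "sc1", "standard"]},
    "Volume Size": {"type": int, "min": 1, "max": 16384, "unit": "GB"},
    "Monthly Storage Cost": {"type": float, "min": 0.0, "depends_on": ["Volume Size", "Volume Type"]},
    "Snapshot ID": {"type": str, "pattern": r"^snap-[0-9a-f]+$", "nullable": True},
    "Snapshot Age": {"type": int, "min": 1, "max": 365, "unit": "days", "nullable": True},
}

def check_value(field, value):
    """Validate a single value against the rule-set (sketch only)."""
    rule = ebs_volume_rules[field]
    if value is None:
        return rule.get("nullable", False)
    if "allowed" in rule and value not in rule["allowed"]:
        return False
    if "pattern" in rule and not re.match(rule["pattern"], str(value)):
        return False
    if "min" in rule and value < rule["min"]:
        return False
    if "max" in rule and value > rule["max"]:
        return False
    return True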

Step 2: Generate code with Amazon Bedrock

Transform your data rules into functional code using Amazon Bedrock language models:

    Choose an appropriate Amazon Bedrock model based on code generation capabilities and domain relevance.
    Craft a detailed prompt describing the desired code output, including data structures and generation rules.
    Use the Amazon Bedrock API to generate Python code based on your prompts.
    Iteratively refine the code by:
      Reviewing for accuracy and efficiency
      Adjusting prompts as needed
      Incorporating developer input for complex scenarios

The result is a tailored script that generates synthetic data entries matching your specific requirements and closely mimicking real-world data in your domain.
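As a rough sketch of what this step can look like programmatically, the following code sends a rule-set prompt to a model through the Amazon Bedrock Converse API with boto3. The model ID, Region, and prompt wording are assumptions; substitute whichever code-capable model you have enabled.

import boto3

# Sketch: ask an Amazon Bedrock model to write data-generation code from a rule-set prompt.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # Region is an assumption

rules_text = "<your rule-set from Step 1>"  # placeholder
prompt = (
    "Write Python code that builds a pandas DataFrame of synthetic records "
    "following these rules:\n" + rules_text
)

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model ID; use any enabled model
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 2048, "temperature": 0.2},
)

generated_code = response["output"]["message"]["content"][0]["text"]
print(generated_code)  # always review generated code before running it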

Step 3: Assemble and scale the synthetic dataset

Transform your generated data into a comprehensive, real-world representative dataset:

    Use the code from Step 2 to create multiple synthetic subsets for various scenarios.
    Merge subsets based on domain knowledge, maintaining realistic proportions and relationships.
    Align temporal or sequential components and introduce controlled randomness for natural variation.
    Scale the dataset to required sizes, reflecting different time periods or populations.
    Incorporate rare events and edge cases at appropriate frequencies.
    Generate accompanying metadata describing dataset characteristics and the generation process.

The end result is a diverse, realistic synthetic dataset for uses like system testing, ML model training, or data analysis. The metadata provides transparency into the generation process and data characteristics. Together, these measures result in a robust synthetic dataset that closely parallels real-world data while avoiding exposure of direct sensitive information. This generalized approach can be adapted to various types of datasets, from financial transactions to medical records, using the power of Amazon Bedrock for code generation and the expertise of domain knowledge for data validation and structuring.
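A minimal sketch of the assembly step might look like the following; generate_subset is a hypothetical stand-in for whatever generation function Step 2 produced, and the proportions and metadata fields are illustrative assumptions:

import json
import pandas as pd

# Sketch: combine synthetic subsets into one dataset and record how it was built.
# generate_subset() is a hypothetical stand-in for the Step 2 generation code.
prod_df = generate_subset(profile="production", n=8000)
dev_df = generate_subset(profile="development", n=2000)

full_df = pd.concat([prod_df, dev_df], ignore_index=True)
full_df = full_df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle for natural variation

metadata = {
    "records": len(full_df),
    "subset_proportions": {"production": 0.8, "development": 0.2},
    "generation": "code generated with Amazon Bedrock, combined with domain rules",
}
with open("synthetic_dataset_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)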

Importance of differential privacy in synthetic data generation

Although synthetic data offers numerous benefits for analytics and machine learning, it’s essential to recognize that privacy concerns persist even with artificially generated datasets. As we strive to create high-fidelity synthetic data, we must also maintain robust privacy protections for the original data. Although synthetic data mimics patterns in actual data, if created improperly, it risks revealing details about sensitive information in the source dataset. This is where differential privacy enters the picture. Differential privacy is a mathematical framework that provides a way to quantify and control the privacy risks associated with data analysis. It works by injecting calibrated noise into the data generation process, making it virtually impossible to infer anything about a single data point or confidential information in the source dataset.

Differential privacy protects against re-identification exploits by adversaries attempting to extract details about data. The carefully calibrated noise added to synthetic data makes sure that even if an adversary tries, it is computationally infeasible to tie an output back to specific records in the original data, while still maintaining the overall statistical properties of the dataset. This allows the synthetic data to closely reflect real-world characteristics and remain useful for analytics and modeling while protecting privacy. By incorporating differential privacy techniques into the synthetic data generation process, you can create datasets that not only maintain statistical properties of the original data but also offer strong privacy guarantees. It enables organizations to share data more freely, collaborate on sensitive projects, and develop AI models with reduced risk of privacy breaches. For instance, in healthcare, differentially private synthetic patient data can accelerate research without compromising individual patient confidentiality.
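As a generic illustration of how calibrated noise works (not the exact mechanism of any particular library), the following sketch applies the Laplace mechanism to a counting query before that statistic is used to calibrate synthetic data:

import numpy as np

def laplace_private_count(true_count, epsilon, sensitivity=1.0):
    """Return a differentially private count via the Laplace mechanism."""
    # Noise scale = sensitivity / epsilon; a smaller epsilon means stronger privacy and more noise.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: privately estimate how many volumes in the source data are underutilized
# before using that proportion to shape the synthetic dataset. The count 183 is made up.
private_estimate = laplace_private_count(true_count=183, epsilon=1.0)
print(round(private_estimate))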

As we continue to advance in the field of synthetic data generation, the incorporation of differential privacy is becoming not just a best practice, but a necessary component for responsible data science. This approach paves the way for a future where data utility and privacy protection coexist harmoniously, fostering innovation while safeguarding individual rights. However, although differential privacy offers strong theoretical guarantees, its practical implementation can be challenging. Organizations must carefully balance the trade-off between privacy and utility, because increasing privacy protection often comes at the cost of reduced data utility.

Build synthetic datasets for Trusted Advisor findings with Amazon Bedrock

In this post, we guide you through the process of creating synthetic datasets for AWS Trusted Advisor findings using Amazon Bedrock. Trusted Advisor provides real-time guidance to optimize your AWS environment, improving performance, security, and cost-efficiency through over 500 checks against AWS best practices. We demonstrate the synthetic data generation approach using the “Underutilized Amazon EBS Volumes” check (checkid: DAvU99Dc4C) as an example.

By following this post, you will gain practical knowledge on defining data rules for Trusted Advisor findings, generating data-creation code with Amazon Bedrock, assembling and scaling synthetic datasets, and validating the results.

This approach can be applied across over 500 Trusted Advisor checks, enabling you to build comprehensive, privacy-aware datasets for testing, training, and analysis. Whether you’re looking to enhance your understanding of Trusted Advisor recommendations or develop new optimization strategies, synthetic data offers powerful possibilities.

Prerequisites

To implement this approach, you must have an AWS account with the appropriate permissions.

    AWS Account Setup:
      IAM permissions for:
        Amazon Bedrock
        AWS Trusted Advisor
        Amazon EBS
    AWS Service Access:
      Access enabled for Amazon Bedrock in your Region
      Access to Anthropic's Claude model in Amazon Bedrock
      Enterprise or Business Support plan for full Trusted Advisor access
    Development Environment:
      Python 3.8 or later installed
      Required Python packages:
        pandas
        numpy
        boto3
        random (included in the Python standard library)
    Knowledge Requirements:
      Basic understanding of:
        Python programming
        AWS services (especially Amazon EBS and Trusted Advisor)
        Data analysis concepts
        JSON/YAML file formats

Define Trusted Advisor findings rules

Begin by examining real Trusted Advisor findings for the “Underutilized Amazon EBS Volumes” check. Analyze the structure and content of these findings to identify key data elements and their relationships. Pay attention to the columns each finding contains, the data type and format of each field, how fields relate to one another (for example, volumes and their snapshots), and the typical value ranges.

As you study these elements, note the typical ranges, patterns, and distributions for each attribute. For example, observe how volume sizes correlate with volume types, or how usage patterns differ between development and production environments. This analysis will help you create a set of rules that accurately reflect real-world Trusted Advisor findings.
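If your account has a Business or Enterprise Support plan, you can pull real findings programmatically to study their structure. The following is a sketch using the AWS Support API through boto3; note that the Support API is served from the us-east-1 endpoint:

import boto3

# Sketch: retrieve real Trusted Advisor findings to study their structure.
support = boto3.client("support", region_name="us-east-1")

check_id = "DAvU99Dc4C"  # Underutilized Amazon EBS Volumes

# Column names for the check come from the check metadata.
checks = support.describe_trusted_advisor_checks(language="en")["checks"]
columns = next(check["metadata"] for check in checks if check["id"] == check_id)

result = support.describe_trusted_advisor_check_result(checkId=check_id, language="en")
for resource in result["result"]["flaggedResources"][:5]:
    print(dict(zip(columns, resource["metadata"])))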

After analyzing real Trusted Advisor outputs for the “Underutilized Amazon EBS Volumes” check, we identified the crucial patterns and rules captured in the following example ruleset:

Analysis of the AWS Trusted Advisor finding for "Underutilized Amazon EBS Volumes":

1. Columns in the Trusted Advisor Finding:
- Region
- Volume ID
- Volume Name
- Volume Type
- Volume Size
- Monthly Storage Cost
- Snapshot ID
- Snapshot Name
- Snapshot Age

2. Key Columns and Their Significance:
- Region: AWS region where the EBS volume is located
- Volume ID: Unique identifier for the EBS volume
- Volume Type: Type of EBS volume (e.g., gp2, io1, st1)
- Volume Size: Size of the volume in GB
- Monthly Storage Cost: Estimated cost for storing the volume
- Snapshot ID: Identifier of the most recent snapshot (if any)
- Snapshot Age: Age of the most recent snapshot

3. Relationships and Patterns:
- Volume ID and Snapshot ID relationship: Each volume may have zero or more snapshots
- Region and cost correlation: Storage costs may vary by region
- Volume Type and Size correlation: Certain volume types have size limitations
- Volume Size and Cost correlation: Larger volumes generally cost more
- Snapshot Age and utilization: Older snapshots might indicate less active volumes

4. Data Types and Formats:
- Region: String (e.g., "us-east-1")
- Volume ID: String starting with "vol-"
- Volume Name: String (can be null)
- Volume Type: String (gp2, gp3, io1, io2, st1, sc1, standard)
- Volume Size: Integer (in GB)
- Monthly Storage Cost: Decimal number
- Snapshot ID: String starting with "snap-" (can be null)
- Snapshot Name: String (can be null)

Generate code with Amazon Bedrock

With your rules defined, you can now use Amazon Bedrock to generate Python code for creating synthetic Trusted Advisor findings.

The following is an example prompt for Amazon Bedrock:

Give me python code to create a 100 row pandas df with the following data:
<<Copy paste the ruleset from the above step>>

You can submit this prompt to the Amazon Bedrock chat playground using Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock, and receive generated Python code. Review this code carefully, verifying it meets all specifications and generates realistic data. If necessary, iterate on your prompt or make manual adjustments to the code to address any missing logic or edge cases.

The resulting code will serve as the foundation for creating varied and realistic synthetic Trusted Advisor findings that adhere to the defined parameters. By using Amazon Bedrock in this way, you can quickly develop sophisticated data generation code that would otherwise require significant manual effort and domain expertise to create.

Create data subsets

With the code generated by Amazon Bedrock and refined with your custom functions, you can now create diverse subsets of synthetic Trusted Advisor findings for the “Underutilized Amazon EBS Volumes” check. This approach allows you to simulate a wide range of real-world scenarios. In the following sample code, we have customized the volume_id and snapshot_id format to begin with vol-9999 and snap-9999, respectively:

import pandas as pd
import numpy as np
import random

def generate_volume_id():
    return f"vol-9999{''.join(random.choices('0123456789abcdef', k=17))}"

def generate_snapshot_id():
    return f"snap-9999{''.join(random.choices('0123456789abcdef', k=17))}"

def generate_volume_name():
    prefixes = ['app', 'db', 'web', 'cache', 'log']
    suffixes = ['prod', 'dev', 'test', 'staging']
    return f"{random.choice(prefixes)}-{random.choice(suffixes)}-{random.randint(1, 100)}"

def step3_generate_base_data():
    # Generate synthetic data
    num_records = 1000
    regions = ['us-east-1', 'us-west-2', 'eu-west-1', 'ap-southeast-1']
    volume_types = ['gp2', 'gp3', 'io1', 'io2', 'st1', 'sc1', 'standard']
    data = {
        'Region': np.random.choice(regions, num_records),
        'Volume ID': [generate_volume_id() for _ in range(num_records)],
        'Volume Name': [generate_volume_name() if random.random() > 0.3 else None for _ in range(num_records)],
        'Volume Type': np.random.choice(volume_types, num_records, p=[0.4, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05]),
        'Volume Size': np.random.choice(range(1, 1001), num_records),
        'Monthly Storage Cost': np.random.uniform(0.1, 100, num_records).round(2),
        'Snapshot ID': [generate_snapshot_id() if random.random() > 0.4 else None for _ in range(num_records)],
        'Snapshot Name': [f"snapshot-{i}" if random.random() > 0.6 else None for i in range(num_records)],
        'Snapshot Age': [random.randint(1, 365) if random.random() > 0.4 else None for _ in range(num_records)]
    }
    df = pd.DataFrame(data)

    # Apply some logic and constraints
    df.loc[df['Volume Type'] == 'gp2', 'Volume Size'] = df.loc[df['Volume Type'] == 'gp2', 'Volume Size'].clip(1, 16384)
    df.loc[df['Volume Type'] == 'io1', 'Volume Size'] = df.loc[df['Volume Type'] == 'io1', 'Volume Size'].clip(4, 16384)
    df.loc[df['Volume Type'] == 'st1', 'Volume Size'] = df.loc[df['Volume Type'] == 'st1', 'Volume Size'].clip(500, 16384)
    df.loc[df['Volume Type'] == 'sc1', 'Volume Size'] = df.loc[df['Volume Type'] == 'sc1', 'Volume Size'].clip(500, 16384)

    # Adjust Monthly Storage Cost based on Volume Size and Type
    df['Monthly Storage Cost'] = df.apply(
        lambda row: row['Volume Size'] * random.uniform(0.05, 0.15) * (1.5 if row['Volume Type'] in ['io1', 'io2'] else 1),
        axis=1
    ).round(2)

    # Ensure Snapshot ID, Name, and Age are consistent
    df.loc[df['Snapshot ID'].isnull(), 'Snapshot Name'] = None
    df.loc[df['Snapshot ID'].isnull(), 'Snapshot Age'] = None

    # Add some underutilized volumes
    df['Underutilized'] = np.random.choice([True, False], num_records, p=[0.7, 0.3])
    df.loc[df['Underutilized'], 'Monthly Storage Cost'] *= random.uniform(1.2, 2.0)

    return df

This code creates subsets that span multiple Regions, volume types, and size ranges, include both named and unnamed volumes, attach snapshots to only some volumes, and flag a configurable share of volumes as underutilized.
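A quick, illustrative way to exercise this function and inspect the output:

df = step3_generate_base_data()
print(df.shape)
print(df[["Region", "Volume Type", "Volume Size", "Monthly Storage Cost"]].head())
print(f"Share flagged as underutilized: {df['Underutilized'].mean():.0%}")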

Combine and scale the dataset

The process of combining and scaling synthetic data involves merging multiple generated datasets while introducing realistic anomalies to create a comprehensive and representative dataset. This step is crucial for making sure that your synthetic data reflects the complexity and variability found in real-world scenarios. Organizations typically introduce controlled anomalies at a specific rate (usually 5–10% of the dataset) to simulate various edge cases and unusual patterns that might occur in production environments. These anomalies help in testing system responses, developing monitoring solutions, and training ML models to identify potential issues.

When generating synthetic data for underutilized EBS volumes, you might introduce anomalies such as oversized volumes (5–10 times larger than needed), volumes with old snapshots (older than 365 days), or high-cost volumes with low utilization. For instance, a synthetic dataset might include a 1 TB gp2 volume that’s only using 100 GB of space, simulating a real-world scenario of overprovisioned resources. See the following code:

import pandas as pd
import numpy as np
import random

def introduce_anomalies(df, anomaly_rate=0.1):
    """
    Introduce various volume-related anomalies into the dataset.

    :param df: The input DataFrame
    :param anomaly_rate: The rate at which to introduce anomalies (default 10%)
    :return: DataFrame with anomalies introduced
    """
    num_anomalies = int(len(df) * anomaly_rate)
    anomaly_indices = np.random.choice(df.index, num_anomalies, replace=False)

    df['Anomaly'] = pd.NA  # Initialize Anomaly column with pandas NA

    for idx in anomaly_indices:
        anomaly_type = random.choice([
            'oversized_volume',
            'old_snapshot',
            'high_cost_low_size',
            'mismatched_type',
            'very_old_volume'
        ])

        if anomaly_type == 'oversized_volume':
            df.at[idx, 'Volume Size'] = int(df.at[idx, 'Volume Size'] * random.uniform(5, 10))
            df.at[idx, 'Monthly Storage Cost'] *= random.uniform(5, 10)
        elif anomaly_type == 'old_snapshot':
            df.at[idx, 'Snapshot Age'] = random.randint(365, 1000)
        elif anomaly_type == 'high_cost_low_size':
            df.at[idx, 'Volume Size'] = random.randint(1, 10)
            df.at[idx, 'Monthly Storage Cost'] *= random.uniform(10, 20)
        elif anomaly_type == 'mismatched_type':
            if df.at[idx, 'Volume Type'] in ['gp2', 'gp3']:
                df.at[idx, 'Volume Type'] = random.choice(['io1', 'io2'])
            else:
                df.at[idx, 'Volume Type'] = random.choice(['gp2', 'gp3'])
        elif anomaly_type == 'very_old_volume':
            df.at[idx, 'Volume Name'] = f"old-volume-{random.randint(1, 100)}"
            if pd.notna(df.at[idx, 'Snapshot Age']):
                df.at[idx, 'Snapshot Age'] = random.randint(1000, 2000)

        df.at[idx, 'Anomaly'] = anomaly_type

    return df
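Putting the two pieces together might look like the following sketch; the 8% anomaly rate is an arbitrary choice within the 5–10% range discussed above:

# Sketch: generate the base data, inject anomalies, and persist the combined dataset.
base_df = step3_generate_base_data()
synthetic_df = introduce_anomalies(base_df, anomaly_rate=0.08)

print(synthetic_df["Anomaly"].value_counts())  # distribution of injected anomaly types
synthetic_df.to_csv("synthetic_trusted_advisor_ebs_findings.csv", index=False)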

The following screenshot shows an example of sample rows generated.

Validate the synthetic Trusted Advisor findings

Data validation is a critical step that verifies the quality, reliability, and representativeness of your synthetic data. This process involves performing rigorous statistical analysis to verify that the generated data maintains proper distributions, relationships, and patterns that align with real-world scenarios. Validation should include both quantitative metrics (statistical measures) and qualitative assessments (pattern analysis). Organizations should implement comprehensive validation frameworks that include distribution analysis, correlation checks, pattern verification, and anomaly detection. Regular visualization of the data helps in identifying inconsistencies or unexpected patterns.

For EBS volume data, validation might include analyzing the distribution of volume sizes across different types (gp2, gp3, io1), verifying that cost correlations match expected patterns, and making sure that introduced anomalies (like underutilized volumes) maintain realistic proportions. For instance, validating that the percentage of underutilized volumes aligns with typical enterprise environments (perhaps 15–20% of total volumes) and that the cost-to-size relationships remain realistic across volume types.
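A basic validation pass over the synthetic findings could look like the following sketch; the thresholds are illustrative assumptions rather than fixed targets:

import pandas as pd

def validate_synthetic_findings(df):
    """Run basic distribution, correlation, and consistency checks on the synthetic data."""
    report = {
        "volume_type_distribution": df["Volume Type"].value_counts(normalize=True).round(3).to_dict(),
        "underutilized_share": float(df["Underutilized"].mean()),
        "cost_size_correlation": float(df["Volume Size"].corr(df["Monthly Storage Cost"])),
        # Snapshot Age should never be present when Snapshot ID is null.
        "snapshot_consistency": bool(df.loc[df["Snapshot ID"].isnull(), "Snapshot Age"].isnull().all()),
    }
    # Illustrative expectation; tune to your own environment.
    report["cost_tracks_size"] = report["cost_size_correlation"] > 0.5
    return report

print(validate_synthetic_findings(synthetic_df))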

The following figures show examples of our validation checks.

    The following screenshot shows statistics of the generated synthetic datasets.
    The following figure shows the proportion of underutilized volumes in the generated synthetic datasets.
    The following figure shows the distribution of volume sizes in the generated synthetic datasets.
    The following figure shows the distribution of volume types in the generated synthetic datasets.
    The following figure shows the distribution of snapshot ages in the generated synthetic datasets.

Enhancing synthetic data with differential privacy

After exploring the steps to create synthetic datasets for the Trusted Advisor “Underutilized Amazon EBS Volumes” check, it’s worth revisiting how differential privacy strengthens this approach. When a cloud consulting firm analyzes aggregated Trusted Advisor data across multiple clients, differential privacy through OpenDP provides the critical privacy-utility balance needed. By applying carefully calibrated noise to computations of underutilized volume statistics, consultants can generate synthetic datasets that preserve essential patterns across Regions and volume types while mathematically guaranteeing individual client confidentiality. This approach verifies that the synthetic data maintains sufficient accuracy for meaningful trend analysis and recommendations, while eliminating the risk of revealing sensitive client-specific infrastructure details or usage patterns—making it an ideal complement to our synthetic data generation pipeline.

Conclusion

In this post, we showed how to use Amazon Bedrock to create synthetic data for enterprise needs. By combining language models available in Amazon Bedrock with industry knowledge, you can build a flexible and secure way to generate test data. This approach helps create realistic datasets without using sensitive information, saving time and money. It also facilitates consistent testing across projects and avoids ethical issues of using real user data. Overall, this strategy offers a solid solution for data challenges, supporting better testing and development practices.

In part 2 of this series, we will demonstrate how to use pattern recognition for different datasets to automate rule-set generation needed for the Amazon Bedrock prompts to generate corresponding synthetic data.


About the authors

Devi Nair is a Technical Account Manager at Amazon Web Services, providing strategic guidance to enterprise customers as they build, operate, and optimize their workloads on AWS. She focuses on aligning cloud solutions with business objectives to drive long-term success and innovation.

Vishal Karlupia is a Senior Technical Account Manager/Lead at Amazon Web Services, Toronto. He specializes in generative AI applications and helps customers build and scale their AI/ML workloads on AWS. Outside of work, he enjoys being outdoors and keeping bonfires alive.

Srinivas Ganapathi is a Principal Technical Account Manager at Amazon Web Services. He is based in Toronto, Canada, and works with games customers to run efficient workloads on AWS.

Nicolas Simard is a Technical Account Manager based in Montreal. He helps organizations accelerate their AI adoption journey through technical expertise and architectural best practices, enabling them to maximize business value from AWS's generative AI capabilities.
