AWS Machine Learning Blog · February 21
Generate synthetic counterparty (CR) risk data with generative AI using Amazon Bedrock LLMs and RAG

This post explores how to use generative AI models and Retrieval Augmented Generation (RAG) to produce high-quality synthetic data for finance-domain use cases. Focusing on over-the-counter (OTC) derivatives, it shows how to generate counterparty risk (CR) data, which is essential for OTC derivatives traded directly between two parties. The solution proceeds in three steps—data indexing, data generation, and data validation—using the Amazon Titan Text Embeddings V2 model and Anthropic's Claude Haiku together with the Chroma vector database to generate synthetic data efficiently and economically, then validating the output with Q-Q plots and correlation heat maps to ensure quality and representativeness.

🔑 Accurately assessing counterparty risk is critical in OTC derivatives trading. ABC Bank wants to build a machine learning model to assess counterparty risk, but faces data that is biased, skewed, and insufficiently diverse.

💡 The solution adopts a RAG approach: representative CR data is vectorized with Amazon Titan Text Embeddings V2 and stored in a Chroma vector database for efficient retrieval.

🚀 During data generation, the user request is vectorized and matched against the Chroma database, and Anthropic's Claude Haiku generates high-quality synthetic data. Claude Haiku is known for its speed and efficiency, producing large volumes of data at a 1:5 input-to-output token ratio.

📊 The generated synthetic CR data is validated with statistical methods such as Q-Q plots and correlation heat maps, ensuring that the distributions and relationships of key attributes (such as cp_exposure, cp_replacement_cost, and cp_settlement_risk) match the real data, improving its reliability and utility.

Data is the lifeblood of modern applications, driving everything from application testing to machine learning (ML) model training and evaluation. As data demands continue to surge, the emergence of generative AI models presents an innovative solution. These large language models (LLMs), trained on expansive data corpora, possess the remarkable capability to generate new content across multiple media formats—text, audio, and video—and across various business domains, based on provided prompts and inputs.

In this post, we explore how you can use these LLMs with advanced Retrieval Augmented Generation (RAG) to generate high-quality synthetic data for a finance domain use case. You can apply the same technique to generate synthetic data for other business domains as well. For this post, we demonstrate how to generate counterparty risk (CR) data, which would be beneficial for over-the-counter (OTC) derivatives that are traded directly between two parties, without going through a formal exchange.

Solution overview

OTC derivatives are typically customized contracts between counterparties and include a variety of financial instruments, such as forwards, options, swaps, and other structured products. A counterparty is the other party involved in a financial transaction. In the context of OTC derivatives, the counterparty refers to the entity (such as a bank, financial institution, corporation, or individual) with whom a derivative contract is made.

For example, in an OTC swap or option contract, one entity agrees to terms with another party, and each entity becomes the counterparty to the other. The responsibilities, obligations, and risks (such as credit risk) are shared between these two entities according to the contract.

As financial institutions continue to navigate the complex landscape of CR, the need for accurate and reliable risk assessment models has become paramount. For our use case, ABC Bank, a fictional financial services organization, has taken on the challenge of developing an ML model to assess the risk of a given counterparty based on their exposure to OTC derivative data.

Building such a model presents numerous challenges. Although ABC Bank has gathered a large dataset from various sources and in different formats, the data may be biased, skewed, or lack the diversity needed to train a highly accurate model. The primary challenge lies in collecting and preprocessing the data to make it suitable for training an ML model. Deploying a poorly suited model could result in misinformed decisions and significant financial losses.

We propose a generative AI solution that uses the RAG approach. RAG is a widely used approach that enhances LLMs by supplying extra information from external data sources not included in their original training. The entire solution can be broadly divided into three steps: indexing, data generation, and validation.

Data indexing

In the indexing step, we parse, chunk, and convert the representative CR data into vector format using the Amazon Titan Text Embeddings V2 model and store this information in a Chroma vector database. Chroma is an open source vector database known for its ease of use, efficient similarity search, and support for multimodal data and metadata. It offers both in-memory and persistent storage options, integrates well with popular ML frameworks, and is suitable for a wide range of AI applications. It is particularly beneficial for smaller to medium-sized datasets and projects requiring local deployment or low resource usage. The following diagram illustrates this architecture.

Data generation

When the user requests data for a certain scenario, the request is converted into vector format and then looked up in the Chroma database to find matches with the stored data. The retrieved data is augmented with the user request and additional prompts to Anthropic’s Claude Haiku on Amazon Bedrock. Anthropic’s Claude Haiku was chosen primarily for its speed, processing over 21,000 tokens per second, which significantly outpaces its peers. Moreover, Anthropic’s Claude Haiku’s efficiency in data generation is remarkable, with a 1:5 input-to-output token ratio. This means it can generate a large volume of data from a relatively small amount of input or context. This capability not only enhances the model’s effectiveness, but also makes it cost-efficient for our application, where we need to generate numerous data samples from a limited set of examples. Anthropic’s Claude Haiku LLM is invoked iteratively to efficiently manage token consumption and help prevent reaching the maximum token limit. The following diagram illustrates this workflow.

Data validation

When validating the synthetic CR data generated by the LLM, we employed Q-Q plots and correlation heat maps, focusing on key attributes such as cp_exposure, cp_replacement_cost, and cp_settlement_risk. These statistical tools play a crucial role in verifying the quality and representativeness of the synthetic data. Using the Q-Q plots, we can assess whether these attributes follow a normal distribution, which is often expected of financial variables. By comparing the quantiles of our synthetic data against theoretical normal distributions, we can identify significant deviations that might indicate bias or unrealistic data generation.

Simultaneously, the correlation heat maps provide a visual representation of the relationships between these attributes and others in the dataset. This is particularly important because it helps verify that the LLM has maintained the complex interdependencies typically observed in real CR data. For instance, we would expect certain correlations between exposure and replacement cost, or between replacement cost and settlement risk. By making sure these correlations are preserved in our synthetic data, we can be more confident that analyses or models built on this data will yield insights that are applicable to real-world scenarios. This rigorous validation process helps to mitigate the risk of introducing artificial patterns or biases, thereby enhancing the reliability and utility of our synthetic CR dataset for subsequent research or modeling tasks.

We’ve created a Jupyter notebook containing three parts to implement the key components of the solution. We provide code snippets from the notebook for better understanding.

Prerequisites

To set up the solution and generate test data, you should have the following prerequisites:

Setup

Here are the steps to set up the environment. First, install the required packages from within the notebook:

import sys
!{sys.executable} -m pip install -r requirements.txt

The content of requirements.txt is given here:

boto3
langchain
langchain-community
streamlit
chromadb==0.4.15
numpy
jq
langchain-aws
seaborn
matplotlib
scipy

The following code snippet performs all the necessary imports:

from pprint import pprint
from uuid import uuid4

import chromadb
from langchain_community.document_loaders import JSONLoader
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

Index data in the Chroma database

In this section, we show how data is indexed in a Chroma database, a locally maintained open source vector store. This indexed data is used as context for data generation.

The following code snippet shows the preprocessing steps of loading the JSON data from a file and splitting it into smaller chunks:

def load_using_jsonloaer(path):
    # Load each top-level JSON element as a separate document
    loader = JSONLoader(path,
                        jq_schema=".[]",
                        text_content=False)
    documents = loader.load()
    return documents

def split_documents(documents):
    # Split the documents into chunks of at most 1,200 characters
    doc_list = [item for item in documents]
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=0)
    texts = text_splitter.split_documents(doc_list)
    return texts

The following snippet shows how an Amazon Bedrock embedding instance is created. We used the Amazon Titan Text Embeddings V2 model:

def get_bedrock_embeddings():
    aws_region = "us-east-1"
    model_id = "amazon.titan-embed-text-v2:0"  # look for latest version of model
    bedrock_embeddings = BedrockEmbeddings(model_id=model_id, region_name=aws_region)
    return bedrock_embeddings
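
As a quick sanity check, you can embed a sample string and inspect the vector size; Titan Text Embeddings V2 returns 1024-dimensional vectors by default (the dimension is configurable). This snippet is illustrative only:

embeddings = get_bedrock_embeddings()
# Embed a sample query and inspect the resulting vector
vector = embeddings.embed_query("counterparty exposure for an OTC interest rate swap")
print(len(vector))  # 1024 by default for Titan Text Embeddings V2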

The following code shows how the embeddings are created and then loaded in the Chroma database:

persistent_client = chromadb.PersistentClient(path="../data/chroma_index")
collection = persistent_client.get_or_create_collection("test_124")
print(collection)

# query the database
vector_store_with_persistent_client = Chroma(collection_name="test_124",
                                             persist_directory="../data/chroma_index",
                                             embedding_function=get_bedrock_embeddings(),
                                             client=persistent_client)
load_json_and_index(vector_store_with_persistent_client)
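
The load_json_and_index helper called above isn't shown in the post's snippets. Here is a minimal sketch of what it could look like, combining the loader and splitter defined earlier; the input path ../data/cr_data.json is a hypothetical placeholder for your representative CR dataset:

def load_json_and_index(vector_store):
    # Hypothetical source file containing the representative CR records
    documents = load_using_jsonloaer("../data/cr_data.json")
    chunks = split_documents(documents)
    # Assign a unique ID to each chunk and index it in Chroma
    ids = [str(uuid4()) for _ in chunks]
    vector_store.add_documents(documents=chunks, ids=ids)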

Generate data

The following code snippet shows the configuration used during the LLM invocation using Amazon Bedrock APIs. The LLM used is Anthropic’s Claude 3 Haiku:

import boto3
from botocore.config import Config
from langchain_aws import ChatBedrock

config = Config(
    region_name='us-east-1',
    signature_version='v4',
    retries={
        'max_attempts': 2,
        'mode': 'standard'
    }
)
bedrock_runtime = boto3.client('bedrock-runtime', config=config)

model_id = "anthropic.claude-3-haiku-20240307-v1:0"  # look for latest version of model
model_kwrgs = {
    "temperature": 0,
    "max_tokens": 8000,
    "top_p": 1.0,
    "top_k": 25,
    "stop_sequences": ["company-1000"],
}

# Initialize the language model
llm = ChatBedrock(
    model_id=model_id,
    model_kwargs=model_kwrgs,
    client=bedrock_runtime,
)

The following code shows how the context is fetched by looking up the Chroma database (where data was indexed) for matching embeddings. We use the same Amazon Titan model to generate the embeddings:

import json

def get_context(scenario):
    region_name = 'us-east-1'
    credential_profile_name = "default"
    titan_model_id = "amazon.titan-embed-text-v2:0"
    kb_context = []
    be = BedrockEmbeddings(region_name=region_name,
                           credentials_profile_name=credential_profile_name,
                           model_id=titan_model_id)
    vector_store = Chroma(collection_name="test_124",
                          persist_directory="../data/chroma_index",
                          embedding_function=be)
    # Retrieve the three closest matches for the requested scenario
    search_results = vector_store.similarity_search(scenario, k=3)
    for doc in search_results:
        kb_context.append(doc.page_content)
    return json.dumps(kb_context)

The following snippet shows how we formulated the detailed prompt that was passed to the LLM. We provided examples for the context, scenario, start index, end index, records count, and other parameters. The prompt is subjective and can be adjusted for experimentation.

from langchain_core.prompts import ChatPromptTemplate

# Create a prompt template
prompt_template = ChatPromptTemplate.from_template(
    "You are a financial data expert tasked with generating records "
    "representing company OTC derivative data and "
    "should be good enough for investor and lending ML model to take decisions "
    "and data should accurately represent the scenario: {scenario} \n "
    "and as per examples given in context: "
    "and context is {context} "
    "the examples given in context is for reference only, do not use same values while generating dataset. "
    "generate dataset with the diverse set of samples but record should be able to represent the given scenario accurately. "
    "Please ensure that the generated data meets the following criteria: "
    "The data should be diverse and realistic, reflecting various industries, "
    "company sizes, financial metrics. "
    "Ensure that the generated data follows logical relationships and correlations between features "
    "(e.g., higher revenue typically corresponds to more employees, "
    "better credit ratings, and lower risk). "
    "And Generate {count} records starting from index {start_index}. "
    "generate just JSON as per schema and do not include any text or message before or after JSON. "
    "{format_instruction} \n"
    "If continuing, start after this record: {last_record}\n"
    "If stopping, do not include this record in the output. "
    "Please ensure that the generated data is well-formatted and consistent."
)

The following code snippet shows the process for generating the synthetic data. You can call this method in an iterative manner to generate more records. The input parameters include scenario, context, count, start_index, and last_record. The response data is formatted using the instructions provided by output_parser.get_format_instructions():

def generate_records(start_index, count, scenario, context, last_record=""):
    try:
        response = chain.invoke({
            "count": count,
            "start_index": start_index,
            "scenario": scenario,
            "context": context,
            "last_record": last_record,
            "format_instruction": output_parser.get_format_instructions(),
            "data_set_class_schema": DataSet.schema_json()
        })
        return response
    except Exception as e:
        print(f"Error in generate_records: {e}")
        raise e

Parsing the output generated by the LLM and representing it in CSV was quite challenging. We used a Pydantic parser to parse the JSON output generated by the LLM, as shown in the following code snippet:

class CustomPydanticOutputParser(PydanticOutputParser):
    def parse(self, text: str) -> BaseModel:
        # Extract JSON from the text
        try:
            # Find the first occurrence of '{'
            start = text.index('{')
            # Find the last occurrence of '}'
            end = text.rindex('}') + 1
            json_str = text[start:end]
            # Parse the JSON string
            parsed_json = json.loads(json_str)
            # Use the parent class to convert to Pydantic object
            return super().parse_with_cls(parsed_json)
        except (ValueError, json.JSONDecodeError) as e:
            raise ValueError(f"Failed to parse output: {e}")
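
The chain invoked in generate_records isn't defined in the snippets above. Here is a minimal sketch of how the pieces could be wired together with LangChain's pipe syntax; the DataSet and CRRecord Pydantic models are assumptions inferred from the fields referenced elsewhere in the code (records, start_index, and the validated attributes), not the notebook's actual schema:

from typing import List
from pydantic import BaseModel

# Hypothetical schema inferred from the attributes validated later in the post
class CRRecord(BaseModel):
    start_index: int
    cp_exposure: float
    cp_replacement_cost: float
    cp_settlement_risk: float

class DataSet(BaseModel):
    records: List[CRRecord]

# Parse the LLM's JSON output into the DataSet model
output_parser = CustomPydanticOutputParser(pydantic_object=DataSet)
# Prompt -> LLM -> parsed Pydantic object
chain = prompt_template | llm | output_parser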

The following code snippet shows how the records are generated in an iterative manner with 10 records in each invocation to the LLM:

def generate_full_dataset(total_records, batch_size, scenario, context):
    dataset = []
    total_generated = 0
    last_record = ""
    batch: DataSet = generate_records(total_generated,
                                      min(batch_size, total_records - total_generated),
                                      scenario, context, last_record)
    total_generated = len(batch.records)
    dataset.extend(batch.records)
    while total_generated < total_records:
        try:
            batch = generate_records(total_generated,
                                     min(batch_size, total_records - total_generated),
                                     scenario, context, batch.records[-1].json())
            processed_batch = batch.records
            if processed_batch:
                dataset.extend(processed_batch)
                total_generated += len(processed_batch)
                last_record = processed_batch[-1].start_index
                print(f"Generated {total_generated} records.")
            else:
                print("Generated an empty or invalid batch. Retrying...")
                time.sleep(10)
        except Exception as e:
            print(f"Error occurred: {e}. Retrying...")
            time.sleep(5)
    return dataset[:total_records]  # Ensure exactly the requested number of records
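
To tie it together, a hypothetical end-to-end invocation might look like the following; the scenario text, record counts, and output path are placeholders:

import pandas as pd

# Hypothetical scenario; adjust for your experiment
scenario = "counterparties with high exposure to OTC interest rate swaps"
context = get_context(scenario)
dataset = generate_full_dataset(total_records=100, batch_size=10,
                                scenario=scenario, context=context)
# Persist the generated records as CSV for downstream validation
pd.DataFrame([record.dict() for record in dataset]).to_csv(
    "../data/synthetic_cr.csv", index=False)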

Verify the statistical properties of the generated data

We generated Q-Q plots for key attributes of the generated data: cp_exposure, cp_replacement_cost, and cp_settlement_risk, as shown in the following screenshots. The Q-Q plots compare the quantiles of the data distribution with the quantiles of a normal distribution. If the data isn’t skewed, the points should approximately follow the diagonal line.
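
For reference, Q-Q plots like these can be produced with scipy's probplot. Here is a minimal sketch, assuming the synthetic records were saved to the hypothetical ../data/synthetic_cr.csv used earlier:

import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

df = pd.read_csv("../data/synthetic_cr.csv")
for col in ["cp_exposure", "cp_replacement_cost", "cp_settlement_risk"]:
    # Compare the attribute's quantiles against a theoretical normal distribution
    stats.probplot(df[col], dist="norm", plot=plt)
    plt.title(f"Q-Q plot: {col}")
    plt.show()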

As the next step of verification, we created a correlation heat map of the following attributes: cp_exposure, cp_replacement_cost, cp_settlement_risk, and risk. The plot is symmetric, with the diagonal elements showing a value of 1, indicating that each column is perfectly correlated with itself. The following screenshot shows the correlation heat map.
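
A correlation heat map like the one described can be generated with seaborn; a minimal sketch reusing the DataFrame from the previous snippet:

import seaborn as sns
import matplotlib.pyplot as plt

cols = ["cp_exposure", "cp_replacement_cost", "cp_settlement_risk", "risk"]
# Pairwise Pearson correlations; the diagonal is 1 by construction
corr = df[cols].corr()
sns.heatmap(corr, annot=True, vmin=-1.0, vmax=1.0, cmap="coolwarm")
plt.title("Correlation heat map of synthetic CR attributes")
plt.show()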

Clean up

It’s a best practice to clean up the resources you created as part of this post to prevent unnecessary costs and potential security risks from leaving resources running. If you created the Jupyter notebook instance in SageMaker, complete the following steps:

    Save and shut down the notebook:
    # First save your work
    # Then close all open notebooks by clicking File -> Close and Halt
    Clear the output (if needed before saving):
    # Option 1: Using notebook menu
    # Kernel -> Restart & Clear Output

    # Option 2: Using code
    from IPython.display import clear_output
    clear_output()
    Stop and delete the Jupyter notebook instance created in SageMaker:
    # Option 1: Using the AWS CLI
    # Stop the notebook instance when not in use
    aws sagemaker stop-notebook-instance --notebook-instance-name <your-notebook-name>

    # If you no longer need the notebook instance
    aws sagemaker delete-notebook-instance --notebook-instance-name <your-notebook-name>

    # Option 2: Using the SageMaker console
    # Amazon SageMaker -> Notebooks
    # Select the notebook, open the Actions drop-down, and choose Stop.
    # Then open the Actions drop-down and choose Delete.

Responsible use of AI

Responsible AI use and data privacy are paramount when using AI in financial applications. Although synthetic data generation can be a powerful tool, it’s crucial to make sure that no real customer information is used without proper authorization and thorough anonymization. Organizations must prioritize data protection, implement robust security measures, and adhere to relevant regulations. Additionally, when developing and deploying AI models, it’s essential to consider ethical implications, potential biases, and the broader societal impact. Responsible AI practices include regular audits, transparency in decision-making processes, and ongoing monitoring to help prevent unintended consequences. By balancing innovation with ethical considerations, financial institutions can harness the benefits of AI while maintaining trust and protecting individual privacy.

Conclusion

In this post, we showed how to generate a well-balanced synthetic dataset representing various aspects of counterparty data, using RAG-based prompt engineering with LLMs. Counterparty data analysis is imperative before entering into OTC transactions between two parties. Because actual business data in this domain isn’t easily available, this approach lets you generate synthetic training data for your ML models at minimal cost, often within minutes. After you train the model, you can use it to make intelligent decisions before entering into an OTC derivative transaction.

About the Authors

Santosh Kulkarni is a Senior Modernization Architect with over 16 years of experience, specializing in developing serverless, container-based, and data architectures for clients across various domains. Santosh’s expertise extends to machine learning as a certified AWS ML specialist. He is currently engaged in multiple initiatives leveraging Amazon Bedrock and hosted foundation models.

Joyanta Banerjee is a Senior Modernization Architect with AWS ProServe and specializes in building secure and scalable cloud-native applications for customers from different industry domains. He has developed an interest in the AI/ML space, particularly leveraging the generative AI capabilities available on Amazon Bedrock.

Mallik Panchumarthy is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Mallik works with customers to help them architect efficient, secure, and scalable AI and machine learning applications. Mallik specializes in the generative AI services Amazon Bedrock and Amazon SageMaker.
