AWS Machine Learning Blog, August 1, 2024
Use the ApplyGuardrail API with long-context inputs and streaming outputs in Amazon Bedrock

As generative artificial intelligence (AI) applications become more prevalent, maintaining responsible AI principles becomes essential. Without proper safeguards, large language models (LLMs) can potentially generate harmful, biased, or inappropriate content, posing risks to individuals and organizations. Applying guardrails helps mitigate these risks by enforcing policies and guidelines that align with ethical principles and legal requirements. Guardrails for Amazon Bedrock evaluates user inputs and model responses based on use case-specific policies, and provides an additional layer of safeguards regardless of the underlying foundation model (FM). Guardrails can be applied across all LLMs on Amazon Bedrock, including fine-tuned models and even generative AI applications outside of Amazon Bedrock. You can create multiple guardrails, each configured with a different combination of controls, and use these guardrails across different applications and use cases. You can configure guardrails in multiple ways, including to deny topics, filter harmful content, remove sensitive information, and detect contextual grounding.

The new ApplyGuardrail API enables you to assess any text using your preconfigured guardrails in Amazon Bedrock, without invoking the FMs. In this post, we demonstrate how to use the ApplyGuardrail API with long-context inputs and streaming outputs.

ApplyGuardrail API overview

The ApplyGuardrail API offers several key features:

- Ease of use – You can integrate the API anywhere in your application flow, evaluating content before it is processed or before results are returned to the user.
- Decoupled from FMs – Because the API doesn't invoke a foundation model, you can use it with any model, including models hosted on Amazon SageMaker, self-hosted models, and models from third-party providers.
- Scalability – You can create multiple guardrails, each configured with a different combination of controls, and use them across different applications and use cases.

You can use the assessment results from the ApplyGuardrail API to design the experience on your generative AI application, making sure it adheres to your defined policies and guidelines.
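As an illustrative sketch of that design decision, the assessment outcome can be mapped to what the application surfaces to the user. The function name and preset message below are hypothetical, not part of the API; the fields ('action', the blocked flag, the alternate text) mirror the response shape used later in this post:

```python
def decide(action: str, is_blocked: bool, alternate_text: str, original_text: str) -> str:
    """Return the text the application should surface to the user."""
    if action != 'GUARDRAIL_INTERVENED':
        return original_text  # no intervention: pass the text through unchanged
    if is_blocked:
        # severe violation: show a preset message instead of the content
        return "Sorry, I can't help with that."
    # non-blocking intervention, e.g. PII anonymized by the guardrail
    return alternate_text

print(decide('NONE', False, '', 'What is a bank?'))  # prints the original text unchanged
```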

The ApplyGuardrail API request allows you to pass all the content that should be guarded using your defined guardrails. Set the source field to INPUT when the content to be evaluated comes from a user, typically the LLM prompt, and to OUTPUT when guardrails should be enforced on model output, typically an LLM response. An example request looks like the following code:

{
    "source": "INPUT" | "OUTPUT",
    "content": [{
        "text": {
            "text": "This is a sample text snippet..."
        }
    }]
}

For more information about the API structure, refer to Guardrails for Amazon Bedrock.

Streaming output

LLMs can generate text in a streaming manner, where the output is produced token by token or word by word, rather than generating the entire output at once. This streaming output capability is particularly useful in scenarios where real-time interaction or continuous generation is required, such as conversational AI assistants or live captioning. Incrementally displaying the output allows for a more natural and responsive user experience. Although it’s advantageous in terms of responsiveness, streaming output introduces challenges when it comes to applying guardrails in real time as the output is generated. Unlike the input scenario, where the entire text is available upfront, the output is generated incrementally, making it difficult to assess the complete context and potential violations.

One of the main challenges is the need to evaluate the output as it’s being generated, without waiting for the entire output to be complete. This requires a mechanism to continuously monitor the streaming output and apply guardrails in real time, while also considering the context and coherence of the generated text. Furthermore, the decision to halt or continue the generation process based on the guardrail assessment needs to be made in real time, which can impact the responsiveness and user experience of the application.

Solution overview: Use guardrails on streaming output

To address the challenges of applying guardrails on streaming output from LLMs, a strategy that combines batching and real-time assessment is required. This strategy involves collecting the streaming output into smaller batches or chunks, evaluating each batch using the ApplyGuardrail API, and then taking appropriate actions based on the assessment results.

The first step in this strategy is to batch the streaming output chunks into batches that are close to a text unit, which is approximately 1,000 characters. If a batch is smaller, such as 600 characters, you're still charged for an entire text unit (1,000 characters). For cost-effective usage of the API, it's recommended that batches be close to whole multiples of a text unit, such as 1,000 characters, 2,000 characters, and so on. This way, you minimize the risk of incurring unnecessary costs.
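The cost arithmetic above can be sketched as a small helper. The constant and function name are illustrative; the rule reflected here is that a partial text unit is billed as a full one:

```python
import math

TEXT_UNIT = 1000  # characters per text unit, as described above

def billed_text_units(text: str) -> int:
    """Number of text units billed for one ApplyGuardrail call on `text`."""
    return max(1, math.ceil(len(text) / TEXT_UNIT))

print(billed_text_units("x" * 600))   # 1 - a 600-character batch still costs a full unit
print(billed_text_units("x" * 1000))  # 1
print(billed_text_units("x" * 1500))  # 2
```

Batching close to whole multiples of 1,000 characters keeps the ratio of evaluated characters to billed units high.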

By batching the output into smaller batches, you can invoke the ApplyGuardrail API more frequently, allowing for real-time assessment and decision-making. The batching process should be designed to maintain the context and coherence of the generated text. This can be achieved by making sure the batches don’t split words or sentences, and by carrying over any necessary context from the previous batch. Though the chunking varies between use cases, for the sake of simplicity, this post showcases simple character-level chunking, but it’s recommended to explore options such as semantic chunking or hierarchical chunking while still adhering to the guidelines mentioned in this post.
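One way to avoid splitting sentences across batches — a sketch of the boundary heuristic described above, not the chunking used in this post's notebook — is to cut at the last sentence-ending punctuation inside the target window and carry the remainder over to the next batch:

```python
import re

def flush_at_sentence(buffer: str, target: int = 1000) -> tuple:
    """Split `buffer` at the last sentence boundary before `target` characters.

    Returns (batch_to_assess, carry_over). Falls back to a hard cut when no
    sentence boundary is found inside the window.
    """
    if len(buffer) <= target:
        return buffer, ""
    window = buffer[:target]
    m = None
    # find the last sentence-ending punctuation followed by whitespace
    for m in re.finditer(r'[.!?]\s', window):
        pass
    if m:
        cut = m.end()
        return buffer[:cut], buffer[cut:]
    return window, buffer[target:]  # hard cut as a last resort

batch, carry = flush_at_sentence("One. Two. Three words here.", target=10)
print(repr(batch))  # 'One. Two. '
print(repr(carry))  # 'Three words here.'
```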

After the streaming output has been batched into smaller chunks, each chunk can be passed to the API for evaluation. The API will assess the content of each chunk against the defined policies and guidelines, identifying any potential violations or sensitive information.

The assessment results from the API can then be used to determine the appropriate action for the current batch. If a severe violation is detected, the API assessment suggests halting the generation process, and a preset message or response can be displayed to the user instead. In other cases, no severe violation is detected, but the guardrail is configured to pass the request through with modifications, for example when sensitiveInformationPolicyConfig is set to anonymize detected entities instead of blocking. If such an intervention occurs, the output is masked or modified accordingly before being displayed to the user. For latency-sensitive applications, you can also consider creating multiple buffers and multiple guardrails, each with different policies, and processing them with the ApplyGuardrail API in parallel. This way, instead of waiting on one guardrail assessment at a time, you can obtain assessments from multiple guardrails and multiple batches concurrently, though this technique isn't implemented in this example.

Example use case: Apply guardrails to streaming output

In this section, we provide an example of how such a strategy could be implemented. Let’s begin with creating a guardrail. You can use the following code sample to create a guardrail in Amazon Bedrock:

import boto3

REGION_NAME = "us-east-1"

bedrock_client = boto3.client("bedrock", region_name=REGION_NAME)
bedrock_runtime = boto3.client("bedrock-runtime", region_name=REGION_NAME)

response = bedrock_client.create_guardrail(
    name="<name>",
    description="<description>",
    ...
)

# alternatively, provide the ID and version of your own guardrail
guardrail_id = response['guardrailId']
guardrail_version = response['version']

Proper assessment of the policies must be conducted to verify whether the input should later be sent to an LLM, or whether the output generated by the LLM should be displayed to the user. In the following code, we examine the assessments, which are part of the response from the ApplyGuardrail API, for potential severe violations leading to a BLOCKED intervention by the guardrail:

from typing import List, Dict

def check_severe_violations(violations: List[Dict]) -> int:
    """
    When a guardrail intervenes, the action on the request is either BLOCKED or NONE.
    This method counts the violations that lead to blocking the request.

    Args:
        violations (List[Dict]): A list of violation dictionaries, where each dictionary has an 'action' key.

    Returns:
        int: The number of severe violations (where the 'action' is 'BLOCKED').
    """
    severe_violations = [violation['action'] == 'BLOCKED' for violation in violations]
    return sum(severe_violations)

def is_policy_assessement_blocked(assessments: List[Dict]) -> bool:
    """
    While creating the guardrail, you can specify multiple types of policies.
    At assessment time, all the policies should be checked for potential violations.
    If there is even one violation that blocks the request, the entire request is blocked.
    This method checks if the policy assessment is blocked based on the given assessments.

    Args:
        assessments (List[Dict]): A list of assessment dictionaries, where each dictionary may contain 'topicPolicy', 'wordPolicy', 'sensitiveInformationPolicy', and 'contentPolicy' keys.

    Returns:
        bool: True if the policy assessment is blocked, False otherwise.
    """
    blocked = []
    for assessment in assessments:
        if 'topicPolicy' in assessment:
            blocked.append(check_severe_violations(assessment['topicPolicy']['topics']))
        if 'wordPolicy' in assessment:
            if 'customWords' in assessment['wordPolicy']:
                blocked.append(check_severe_violations(assessment['wordPolicy']['customWords']))
            if 'managedWordLists' in assessment['wordPolicy']:
                blocked.append(check_severe_violations(assessment['wordPolicy']['managedWordLists']))
        if 'sensitiveInformationPolicy' in assessment:
            if 'piiEntities' in assessment['sensitiveInformationPolicy']:
                blocked.append(check_severe_violations(assessment['sensitiveInformationPolicy']['piiEntities']))
            if 'regexes' in assessment['sensitiveInformationPolicy']:
                blocked.append(check_severe_violations(assessment['sensitiveInformationPolicy']['regexes']))
        if 'contentPolicy' in assessment:
            blocked.append(check_severe_violations(assessment['contentPolicy']['filters']))
    severe_violation_count = sum(blocked)
    print(f'::Guardrail:: {severe_violation_count} severe violations detected')
    return severe_violation_count > 0

We can then define how to apply the guardrail. If the response from the API has action == 'GUARDRAIL_INTERVENED', the guardrail has detected a potential violation. We then need to check whether the violation was severe enough to block the request, or whether the request can pass through with either the same text as the input or an alternate text modified according to the defined policies:

def apply_guardrail(text, source, guardrail_id, guardrail_version):
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source=source,
        content=[{"text": {"text": text}}]
    )
    if response['action'] == 'GUARDRAIL_INTERVENED':
        is_blocked = is_policy_assessement_blocked(response['assessments'])
        alternate_text = ' '.join([output['text'] for output in response['output']])
        return is_blocked, alternate_text, response
    else:
        # Return the default response in case of no guardrail intervention
        return False, text, response

Let's now apply our strategy to streaming output from an LLM. We maintain a buffer_text, which accumulates a batch of chunks received from the stream. As soon as len(buffer_text + new_text) > TEXT_UNIT, meaning the batch is close to a text unit (1,000 characters), it's ready to be sent to the ApplyGuardrail API. With this mechanism, we make sure we don't incur the unnecessary cost of invoking the API on smaller chunks, and also that enough context is available inside each batch for the guardrail to make meaningful assessments. Additionally, when the LLM finishes generating, the final batch must also be tested for potential violations. If at any point the API detects severe violations, further consumption of the stream is halted and the preset message defined at the time of guardrail creation is displayed to the user.
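Before the real example, the buffering mechanism can be simulated end to end with a stub in place of the ApplyGuardrail call. The stub and its forbidden-word rule are purely illustrative; the loop structure matches the strategy described above:

```python
TEXT_UNIT = 1000  # characters

def stub_apply_guardrail(text: str) -> bool:
    """Stand-in for the API: treat any batch containing a forbidden word as BLOCKED."""
    return "forbidden" in text

def stream_with_guardrail(chunks, text_unit=TEXT_UNIT):
    """Consume streamed chunks, assessing text-unit-sized batches; halt on a block."""
    emitted, buffer = [], ""
    for chunk in chunks:
        if len(buffer + chunk) > text_unit:
            if stub_apply_guardrail(buffer):
                return emitted, True  # halt: severe violation detected mid-stream
            emitted.append(buffer)
            buffer = chunk
        else:
            buffer += chunk
    if buffer:  # the final batch must also be assessed
        if stub_apply_guardrail(buffer):
            return emitted, True
        emitted.append(buffer)
    return emitted, False

print(stream_with_guardrail(["abc", "forbidden", "def"], text_unit=5))
```

The second return value signals whether consumption of the stream was halted, mirroring the break on is_blocked in the real example that follows.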

In the following example, we ask the LLM to generate three names and explain what a bank is. This generation leads to GUARDRAIL_INTERVENED but doesn't block the generation; instead, the guardrail anonymizes the text (masking the names) and generation continues.

input_message = "List 3 names of prominent CEOs and later tell me what is a bank and what are the benefits of opening a savings account?"
model_id = "anthropic.claude-3-haiku-20240307-v1:0"
text_unit = 1000  # characters

response = bedrock_runtime.converse_stream(
    modelId=model_id,
    messages=[{
        "role": "user",
        "content": [{"text": input_message}]
    }],
    system=[{"text": "You are an assistant that helps with tasks from users. Be as elaborate as possible"}],
)

stream = response.get('stream')
buffer_text = ""
if stream:
    for event in stream:
        if 'contentBlockDelta' in event:
            new_text = event['contentBlockDelta']['delta']['text']
            if len(buffer_text + new_text) > text_unit:
                is_blocked, alt_text, guardrail_response = apply_guardrail(buffer_text, "OUTPUT", guardrail_id, guardrail_version)
                print(alt_text, end="")
                if is_blocked:
                    break
                buffer_text = new_text
            else:
                buffer_text += new_text
        if 'messageStop' in event:
            # print(f"\nStop reason: {event['messageStop']['stopReason']}")
            is_blocked, alt_text, guardrail_response = apply_guardrail(buffer_text, "OUTPUT", guardrail_id, guardrail_version)
            print(alt_text)

After running the preceding code, we receive an example output with masked names:

Certainly! Here are three names of prominent CEOs:

1. {NAME} - CEO of Apple Inc.
2. {NAME} - CEO of Microsoft Corporation
3. {NAME} - CEO of Amazon

Now, let's discuss what a bank is and the benefits of opening a savings account.

A bank is a financial institution that accepts deposits, provides loans, and offers various other financial services to its customers. Banks play a crucial role in the economy by facilitating the flow of money and enabling financial transactions.

Long-context inputs

RAG is a technique that enhances LLMs by incorporating external knowledge sources. It allows LLMs to reference authoritative knowledge bases before generating responses, producing output tailored to specific contexts while providing relevance, accuracy, and efficiency. The input to the LLM in a RAG scenario can be quite long, because it includes the user’s query concatenated with the retrieved information from the knowledge base. This long-context input poses challenges when applying guardrails, because the input may exceed the character limits imposed by the ApplyGuardrail API. To learn more about the quotas applied to Guardrails for Amazon Bedrock, refer to Guardrails quotas.

In the previous section, we evaluated a strategy to mitigate risk from the model response. With inputs, the risk can exist at the query level alone, or in the combination of the query and the retrieved context for that query.

The retrieved information from the knowledge base may contain sensitive or potentially harmful content, which needs to be identified and handled appropriately, for example masking sensitive information, before being passed to the LLM for generation. Therefore, guardrails must be applied to the entire input to make sure it adheres to the defined policies and constraints.

Solution overview: Use guardrails on long-context inputs

The ApplyGuardrail API has a default limit of 25 text units (approximately 25,000 characters) per second. If the input exceeds this limit, it needs to be chunked and processed sequentially to avoid throttling. The strategy is therefore straightforward: if the input text is shorter than 25 text units (25,000 characters), it can be evaluated in a single request; otherwise, it needs to be broken down into smaller pieces. The chunk size can vary depending on application behavior and the type of context in the application; you can start with 12 text units and iterate to find the most suitable chunk size. This way, we maximize the allowed default limit while keeping most of the context intact in a single request. Even if the guardrail action is GUARDRAIL_INTERVENED, it doesn't mean the input is BLOCKED. It could also be that the input was processed and sensitive information was masked; in this case, the input text must be recompiled with the processed response from the applied guardrail.

from textwrap import wrap

text_unit = 1000  # characters
limit_text_unit = 25
max_text_units_in_chunk = 12

def apply_guardrail_with_chunking(text, guardrail_id, guardrail_version="DRAFT"):
    text_length = len(text)
    filtered_text = ''
    if text_length <= limit_text_unit * text_unit:
        return apply_guardrail(text, "INPUT", guardrail_id, guardrail_version)
    else:
        # If the text length exceeds the default text unit limits, chunk the text to avoid throttling.
        for i, chunk in enumerate(wrap(text, max_text_units_in_chunk * text_unit)):
            print(f'::Guardrail::Applying guardrails at chunk {i+1}')
            is_blocked, alternate_text, response = apply_guardrail(chunk, "INPUT", guardrail_id, guardrail_version)
            if is_blocked:
                filtered_text = alternate_text
                break
            # It could be the case that guardrails intervened and anonymized PII in the input text;
            # we can then take the output from guardrails to create the filtered text response.
            filtered_text += alternate_text
        return is_blocked, filtered_text, response

Run the full notebook to test this strategy with long-context input.

Best practices and considerations

When applying guardrails, it's essential to follow best practices to maintain efficient and effective content moderation. Regularly audit your guardrail implementation, continuously refine and adapt it as your policies evolve, and implement logging and monitoring mechanisms to capture and analyze the performance and effectiveness of your guardrails.

Clean up

The only resource we created in this example is a guardrail. To delete the guardrail, complete the following steps:

1. On the Amazon Bedrock console, under Safeguards in the navigation pane, choose Guardrails.
2. Select the guardrail you created and choose Delete.

Alternatively, you can use the SDK:

bedrock_client.delete_guardrail(guardrailIdentifier = "<your_guardrail_id>")

Key takeaways

Applying guardrails is crucial for maintaining responsible and safe content generation. With the ApplyGuardrail API from Amazon Bedrock, you can effectively moderate both inputs and outputs, protecting your generative AI application against violations and maintaining compliance with your content policies.

Key takeaways from this post include:

- The ApplyGuardrail API evaluates any text against your preconfigured guardrails without invoking a foundation model, so the same safeguards can be applied across models and applications.
- For streaming outputs, batch the generated chunks into text-unit-sized batches and assess each batch in near real time, halting generation or masking content when the guardrail intervenes.
- For long-context inputs, such as RAG prompts, chunk the input to stay within the API's text unit quota and recombine any processed text returned by the guardrail.

Benefits

By incorporating the ApplyGuardrail API into your generative AI application, you gain model-agnostic content moderation, real-time safeguards on streaming outputs, and cost-effective evaluation of long-context inputs.

Conclusion

By using the ApplyGuardrail API from Amazon Bedrock and following the best practices outlined in this post, you can make sure your generative AI application remains safe, responsible, and compliant with content moderation standards, even with long-context inputs and streaming outputs.

To further explore the capabilities of the ApplyGuardrail API and its integration with your generative AI application, consider experimenting with the API in your own use cases.



About the Author

Talha Chattha is a Generative AI Specialist Solutions Architect at Amazon Web Services, based in Stockholm. Talha helps establish practices to ease the path to production for Gen AI workloads. Talha is an expert in Amazon Bedrock and supports customers across EMEA. He is passionate about meta-agents, scalable on-demand inference, advanced RAG solutions, and cost-optimized prompt engineering with LLMs. When not shaping the future of AI, he explores the scenic European landscapes and delicious cuisines. Connect with Talha at LinkedIn using /in/talha-chattha/.
