Build safe and responsible generative AI applications with guardrails

Large language models (LLMs) enable remarkably human-like conversations, allowing builders to create novel applications. LLMs find use in chatbots for customer service, virtual assistants, content generation, and much more. However, the implementation of LLMs without proper caution can lead to the dissemination of misinformation, manipulation of individuals, and the generation of undesirable outputs such as harmful slurs or biased content. Enabling guardrails plays a crucial role in mitigating these risks by imposing constraints on LLM behaviors within predefined safety parameters.

This post aims to explain the concept of guardrails, underscore their importance, and covers best practices and considerations for their effective implementation using Guardrails for Amazon Bedrock or other tools.

Introduction to guardrails for LLMs

The following figure shows an example of a dialogue between a user and an LLM.

As demonstrated in this example, LLMs are capable of facilitating highly natural conversational experiences. However, it’s also clear that LLMs without appropriate guardrail mechanisms can be problematic. Consider the following levels of risk when building or deploying an LLM-powered application:

User-level risk

hallucination

Business-level risk

To mitigate and address these risks, various safeguarding mechanisms can be employed throughout the lifecycle of an AI application. An effective mechanism that can steer LLMs towards creating desirable outputs are guardrails. The following figure shows what the earlier example would look like with guardrails in place.

This conversation is certainly preferred to the one shown earlier.

What other risks are there? Let’s review this in the next section.

Risks in LLM-powered applications

In this section, we discuss some of the challenges and vulnerabilities to consider when implementing LLM-powered applications.

Producing toxic, biased, or hallucinated content

If your end-users submit prompts that contain inappropriate language like profanity or hate speech, this could increase the probability of your application generating a toxic or biased response. In rare situations, chatbots may produce unprovoked toxic or biased responses, and it’s important to identify, block, and report those incidents. Due to their probabilistic nature, LLMs can inadvertently generate output that is incorrect; eroding users’ trust and potentially creating a liability. This content might include the following:

Irrelevant or controversial content

Biased content

Hallucinated content

Vulnerability to adversarial attacks

Adversarial attacks (or prompt hacking) is used to describe attacks that exploit the vulnerabilities of LLMs by manipulating their inputs or prompts. An attacker will craft an input (jailbreak) to deceive your LLM application into performing unintended actions, such as revealing personally identifiable information (PII). Generally, adversarial attacks may result results in data leakage, unauthorized access, or other security breaches. Some examples of adversarial attacks include:

Prompt injection

Prompt leaking

Token smuggling

Payload splitting

These are just a few examples, and the risks can be different depending on your use case, so it’s important to think about potentially harmful events and then design guardrails to prevent these events from occurring as much as possible. For further discussion on various attacks, refer to Prompt Hacking on the Learn Prompting website. The next section will explore current practices and emerging strategies aimed at mitigating these risks.

Layering safety mechanisms for LLMs

Achieving safe and responsible deployment of LLMs is a collaborative effort between model producers (AI research labs and tech companies) and model consumers (builders and organizations deploying LLMs).

Model producers have the following responsibilities:

Data preprocessing

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

base model

Value alignment

Reinforcement Learning from Human Feedback

Direct Preference Optimization

Model cards

model cards

Claude Model Card

Titan Text Service Card

Just like model producers are taking steps to make sure LLMs are trustworthy and reliable, model consumers should also expect to take certain actions:

Choose a base model

Perform fine-tuning

Create prompt templates

prompt templates

Specify tone and domain

system prompts

Add external guardrails

The following figure illustrates the shared responsibility and layered security for LLM safety.

By working together and fulfilling their respective responsibilities, model producers and consumers can create robust, trustworthy, safe, and secure AI applications. In the next section, we look at external guardrails in more detail.

Adding external guardrails to your app architecture

Let’s first review a basic LLM application architecture without guardrails (see the following figure), comprising a user, an app microservice, and an LLM. The user sends a chat message to the app, which converts it to a payload for the LLM. Next, the LLM generates text, which the app converts into a response for the end-user.

Let’s now add external guardrails to validate both the user input and the LLM responses, either using a fully managed service such as Guardrails for Amazon Bedrock, open source Toolkits and libraries such as NeMo Guardrails, or frameworks like Guardrails AI and LLM Guard. For implementation details, check out the guardrail strategies and implementation patterns discussed later in this post.

The following figure shows the scenario with guardrails verifying user input and LLM responses. Invalid input or responses invoke an intervention flow (conversation stop) rather than continuing the conversation. Approved inputs and responses continue the standard flow.

Minimizing guardrails added latency

Minimizing latency in interactive applications like chatbots can be critical. Adding guardrails could result in increased latency if input and output validation is carried out serially as part of the LLM generation flow (see the following figure). The extra latency will depend on the input and response lengths and the guardrails’ implementation and configuration.

Reducing input validation latency

This first step in reducing latency is to overlap input validation checks and LLM response generation. The two flows are parallelized, and in the rare case the guardrails need to intervene, you can simply ignore the LLM generation result and proceed to a guardrails intervention flow. Remember that all input validation must complete before a response will be sent to the user.

Some types of input validation must still take place prior to LLM generation, for example verifying certain types of adversarial attacks (like input text that will cause the LLM to go out of memory, overflow, or be used as input for LLM tools).

The following figure shows how input validation is overlapped with response generation.

Reducing output validation latency

Many applications use response streaming with LLMs to improve perceived latency for end users. The user receives and reads the response, while it is being generated, instead of waiting for the entire response to be generated. Streaming reduces effective end-user latency to be the time-to-first-token instead of time-to-last-token, because LLMs typically generate content faster than users can read it.

A naïve implementation will wait for the entire response to be generated before starting guardrails output validation, only then sending the output to the end-user.
To allow streaming with guardrails, the output guardrails can validate the LLM’s response in chunks. Each chunk is verified as it becomes available before presenting it to the user. On each verification, guardrails are given the original input text plus all available response chunks. This provides the wider semantic context needed to evaluate appropriateness.

The following figure illustrates input validation wrapped around LLM generation and output validation of the first response chunk. The end-user doesn’t see any response until input validation completes successfully. While the first chunk is validated, the LLM generates subsequent chunks.

Validating in chunks risks some loss of context vs. validating the full response. For example, chunk 1 may contain a harmless text like “I love it so much,” which will be validated and shown to the end-user, but chunk 2 might complete that declaration with “when you are not here,” which could constitute offensive language. When the guardrails must intervene mid-response, the application UI could replace the partially displayed response text with a relevant guardrail intervention message.

External guardrail implementation options

This section presents an overview of different guardrail frameworks and a collection of methodologies and tools for implementing external guardrails, arranged by development and deployment difficulty.

Guardrails for Amazon Bedrock

Guardrails for Amazon Bedrock enables the implementation of guardrails across LLMs based on use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them on multiple LLMs, providing a consistent user experience and standardizing safety controls across generative AI applications.

Guardrails for Amazon Bedrock consists of a collection of different filtering policies that you can configure to avoid undesirable and harmful content and remove or mask sensitive information for privacy protection:

Content filters

Denied topics

Word filters

Sensitive information filters

For more information on the available options and detailed explanations, see Components of a guardrail.You can also refer to Guardrails for Amazon Bedrock with safety filters and privacy controls.

You can use Guardrails for Amazon Bedrock with all LLMs available on Amazon Bedrock, as well as with fine-tuned models and Agents for Amazon Bedrock. For more details about supported AWS Regions and models, see Supported regions and models for Guardrails for Amazon Bedrock.

Keywords, patterns, and regular expressions

The heuristic approach for external guardrails in LLM chatbots applies rule-based shortcuts to quickly manage interactions, prioritizing speed and efficiency over precision and comprehensive coverage. Key components include:

Keywords and patterns

Regular expressions

An open source framework (among many) is LLM Guard, which implements the Regex Scanner. This scanner is designed to sanitize prompts based on predefined regular expression patterns. It offers flexibility in defining patterns to identify and process desirable or undesirable content within the prompts.

Amazon Comprehend

To prevent undesirable outputs, you can use also use Amazon Comprehend to derive insights from text and classify topics or intent in the prompt a user submits (prompt classification) as well as the LLM responses (response classification). You can build such a model from scratch, use open source models, or use pre-built offerings such as Amazon Comprehend—a natural language processing (NLP) service that uses machine learning (ML) to uncover valuable insights and connections in text. Amazon Comprehend contains a user-friendly, cost-effective, fast, and customizable trust and safety feature that covers the following:

Toxicity detection

Intent classification

Privacy protection

Refer to Build trust and safety for generative AI applications with Amazon Comprehend and LangChain, in which we discuss new features powered by Amazon Comprehend that enable seamless integration to provide data privacy, content safety, and prompt safety in new and existing generative AI applications.

Additionally, refer to Llama Guard is now available in Amazon SageMaker JumpStart, where we walk through how to deploy the Llama Guard model in Amazon SageMaker JumpStart and build responsible generative AI solutions.

NVIDIA NeMo with Amazon Bedrock

NVIDIA’s NeMo is an open-source toolkit that provides programmable guardrails for conversational AI systems powered by LLMs. The following notebook demonstrates the integration of NeMo with Amazon Bedrock.

Key aspects of NeMo include:

Fact-checking rail

Hallucination rail

Jailbreaking rail

Topical rail

Moderation rail

Comparing available guardrail implementation options

The following table compares the external guardrails implementations we’ve discussed.

Implementation Option	Ease of Use	Guardrail Coverage	Latency	Cost
Guardrails for Amazon Bedrock	No code	Denied topics, harmful and toxic content, PII detection, prompt attacks, regex and word filters	Less than a second	Free for regular expressions and word filters. For other filters, see pricing per text unit.
Keywords and Patterns Approach	Python based	Custom patterns	Less than 100 milliseconds	Low
Amazon Comprehend	No code	Toxicity, intent, PII	Less than a second	Medium
NVIDIA NeMo	Python based	Jailbreak, topic, moderation	More than a second	High (LLM and vector store round trips)

Evaluating the effectiveness of guardrails in LLM chatbots

When evaluating guardrails for LLMs, several considerations come into play.

Offline vs. online (in production) evaluation

For offline evaluation, you create a set of examples that should be blocked and a set of examples that shouldn’t be blocked. Then, you use an LLM with guardrails to test the prompts and keep track of the results (blocked vs. allowed responses).

You can evaluate the results using traditional metrics for classification that compare the ground truth to the model results, such as precision, recall, or F1. Depending on the use case (whether it’s more important to block all undesirable outputs or more important to not prevent potentially good outputs), you can use the metrics to modify guardrails configurations and setup.

You can also create example datasets by different intervention criteria (types of inappropriate language, off-topic, adversarial attacks, and so on). You need to evaluate the guardrails directly and as part of the overall LLM task evaluation.

Safety performance evaluation

Firstly, it’s essential to assess the guardrails effectiveness in mitigating risks regarding the LLM behavior itself. This can involve custom metrics such as a safety score, where an output is considered to be safe for an unsafe input if it rejects to answer the input,

refutes the underlying opinion or assumptions in the input, or provides general advice with suitable disclaimers. You can also use more traditional metrics such as coverage (percentage of inappropriate content blocked). It’s also important to check whether the use of guardrails results in an over-defensive behavior. To test for this, you can use custom evaluations such as abstention vs. answering classification.

For the evaluation of risk mitigation effectiveness, datasets such as the Do-Not-Answer Dataset by Wang et al. or benchmarks such as “Safety and Over-Defensiveness Evaluation” (SODE) by Varshney et al. provide a starting point.

LLM accuracy evaluation

Certain types of guardrail implementations can modify the output and thereby impact their performance. Therefore, when implementing guardrails, it’s important to evaluate LLM performance on established benchmarks and across a variety of metrics such as coherence, fluency, and grammar. If the LLM was originally trained or fine-tuned to perform a particular task, then additional metrics like precision, recall, and F1 scores should also be used to gauge the LLM performance accurately with the guardrails in place. Guardrails may also result in a decrease of topic relevance; this is due to the fact that most LLMs have a certain context window, meaning they keep track of an ongoing conversation. If guardrails are overly restrictive, the LLM might stray off topic eventually.

Various open source and commercial libraries are available that can assist with the evaluation; for example: fmeval, deepeval, evaluate, or lm-evaluation-harness.

Latency evaluation

Depending on the implementation strategy for the guardrails, the user experience could be impacted significantly. Additional calls to other models (assuming optimal architecture) can add anywhere from a fraction of a second to several seconds to complete; meaning the conversation flow could be interrupted. Therefore, it’s crucial to also test for any changes to latency for different length user prompts (generally an LLM will take longer to respond the more text provided by the user) on different topics.

To measure latency, use Amazon SageMaker Inference Recommender, open source projects like Latency Benchmarking tools for Amazon Bedrock, FMBench, or managed services like Amazon CloudWatch.

Robustness evaluation

Furthermore, ongoing monitoring and adjustment is necessary to adapt guardrails to evolving threats and usage patterns. Over time, malicious actors might uncover new vulnerabilities, so it’s important to check for suspicious patterns on an ongoing basis. It can also be useful to keep track of the outputs that were generated and classify them according to various topics, or create alarms if the number of blocked prompts or outputs starts to exceed a certain threshold (using services such as Amazon SageMaker Model Monitor, for example).

To test for robustness, different libraries and datasets are available. For instance, PromptBench offers a range of robustness evaluation benchmarks. Similarly, ANLI evaluates LLM robustness through manually crafted sentences incorporating spelling errors and synonyms.

Conclusion

A layered security model should be adopted with shared responsibility between model producers, application developers, and end-users. Multiple guardrail implementations exist, with different features and varying levels of difficulty. When evaluating guardrails, considerations around safety performance, accuracy, latency, and ongoing robustness against new threats all come into play. Overall, guardrails enable building innovative yet responsible AI applications, balancing progress and risk through customizable controls tailored to your specific use cases and responsible AI policies.

To get started, we invite you to learn about Guardrails for Amazon Bedrock.

About the Authors

Harel Gal is a Solutions Architect at AWS, specializing in Generative AI and Machine Learning. He provides technical guidance and support across various segments, assisting customers in developing and managing AI solutions. In his spare time, Harel stays updated with the latest advancements in machine learning and AI. He is also an advocate for Responsible AI, an open-source software contributor, a pilot, and a musician.

Eitan Sela is a Generative AI and Machine Learning Specialist Solutions Architect at AWS. He works with AWS customers to provide guidance and technical assistance, helping them build and operate Generative AI and Machine Learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Gili Nachum is a Principal AI/ML Specialist Solutions Architect who works as part of the EMEA Amazon Machine Learning team. Gili is passionate about the challenges of training deep learning models, and how machine learning is changing the world as we know it. In his spare time, Gili enjoy playing table tennis.

Mia C. Mayer is an Applied Scientist and ML educator at AWS Machine Learning University; where she researches and teaches safety, explainability and fairness of Machine Learning and AI systems. Throughout her career, Mia established several university outreach programs, acted as a guest lecturer and keynote speaker, and presented at numerous large learning conferences. She also helps internal teams and AWS customers get started on their responsible AI journey.