Minimize AI hallucinations and deliver up to 99% verification accuracy with Automated Reasoning checks: Now available

AWS Blogs, the day before yesterday, 00:03


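Amazon Bedrock Guardrails now offers Automated Reasoning checks as a generally available feature. It helps you verify the accuracy of content generated by foundation models (FMs) and helps prevent errors caused by AI hallucinations. The feature uses mathematical logic and formal verification techniques, delivers up to 99% verification accuracy, and can detect ambiguity in model output. With general availability, it supports documents of up to 80,000 tokens, simplifies policy validation, generates test scenarios automatically, provides enhanced policy feedback, and offers customizable validation settings, making AI applications more reliable.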

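🎯 **What Automated Reasoning checks do**: By applying mathematical logic and formal verification techniques, the feature validates the accuracy of foundation model output against your domain knowledge. This reduces factual errors caused by AI hallucinations. It also flags ambiguity when an output is open to more than one interpretation, with up to 99% verification accuracy.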
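The following toy sketch shows what a deterministic check looks like next to probabilistic scoring. It assumes a hypothetical policy with one rule over two small finite variables. It classifies a claimed answer using the three results the post describes: valid if the claim holds in every assignment consistent with the policy and the question's premises, invalid if it holds in none, and satisfiable if it holds in some but not all. The service itself translates policies into formal logic (the post shows SMT-LIB syntax) and does not enumerate values like this. The rule, the 650 threshold, the variable names, and the extra `NO_MODEL` label are all invented for illustration.

```python
from itertools import product

# Hypothetical finite domains for two policy variables.
DOMAINS = {
    "credit_score": range(500, 851, 50),  # 500, 550, ..., 850
    "approved": (False, True),
}


def policy(a):
    """Hypothetical rule: approval requires a credit score of at least 650."""
    return (not a["approved"]) or a["credit_score"] >= 650


def check(premise, claim):
    """Classify `claim` over every assignment that satisfies the policy and `premise`."""
    names = list(DOMAINS)
    outcomes = set()
    for values in product(*(DOMAINS[n] for n in names)):
        a = dict(zip(names, values))
        if policy(a) and premise(a):
            outcomes.add(bool(claim(a)))
    if not outcomes:
        return "NO_MODEL"  # this sketch's label: the premises contradict the policy
    if outcomes == {True}:
        return "VALID"  # the claim holds in every consistent case
    if outcomes == {False}:
        return "INVALID"  # the claim fails in every consistent case
    return "SATISFIABLE"  # true or false depending on unstated assumptions


print(check(lambda a: a["credit_score"] == 600, lambda a: a["approved"]))  # INVALID
print(check(lambda a: a["credit_score"] == 700, lambda a: a["approved"]))  # SATISFIABLE
```

Claiming approval for a 600 score is invalid because the hypothetical rule forbids it outright. The same claim for a 700 score is only satisfiable, because the rule states a necessary condition for approval, not a sufficient one.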

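🚀 **New features and benefits**: General availability brings several improvements:

- Large documents of up to 80,000 tokens (up to roughly 100 pages of content) can be processed in a single build.
- Policy validation is simpler: tests can be saved and run repeatedly.
- Test scenarios can be generated automatically for more comprehensive coverage.
- You can give natural-language suggestions for policy changes.
- Confidence score thresholds can be customized to set how strict validation should be.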
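Here is a deliberately simplified, hypothetical model of what a confidence threshold does. Suppose an answer has several candidate translations into logic, each with its own verdict and a confidence score. The sketch keeps only candidates at or above the threshold and reports any disagreement among them as ambiguity. The candidate list, the scores, and the `NO_TRANSLATION` and `AMBIGUOUS` labels are assumptions made for illustration. They are not the service's API or its actual scoring method.

```python
from typing import List, Tuple

# Hypothetical candidate readings of one answer: (verdict, confidence in [0, 1]).
Candidate = Tuple[str, float]


def verdict_at_threshold(candidates: List[Candidate], threshold: float) -> str:
    """Keep readings whose confidence meets the threshold, then combine their verdicts."""
    kept = {verdict for verdict, score in candidates if score >= threshold}
    if not kept:
        return "NO_TRANSLATION"  # no reading is confident enough to check
    if len(kept) > 1:
        return "AMBIGUOUS"  # confident readings disagree with each other
    return kept.pop()  # every confident reading agrees


readings = [("VALID", 0.9), ("INVALID", 0.6)]
for t in (0.5, 0.8, 0.95):
    print(t, verdict_at_threshold(readings, t))
```

In this toy model, a threshold of 0.5 keeps both readings and reports AMBIGUOUS. Raising it to 0.8 discards the weaker reading and yields VALID, while 0.95 leaves nothing to check. This is one way a stricter or looser setting can change the outcome.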

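🔧 **How it works, step by step**: You first encode the rules of your knowledge domain into an Automated Reasoning policy. In the Amazon Bedrock console, you create a policy, upload the document that contains the rules, and review the rules, variables, and types generated to represent it in formal logic. You then validate the policy with automatically generated scenarios and manually entered tests, and refine it based on the results. Finally, you attach it to an Amazon Bedrock guardrail that checks the AI assistant's responses against those rules.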
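The policy pieces named here (rules, variables, and custom types) can be sketched in plain Python, along with the practice of saving tests and re-running them after each policy change. This is a hypothetical model, not the service's policy format. The variables, the credit-score and down-payment thresholds, and the rule IDs are invented; only the insured/conventional mortgage type comes from the post's example. Each test is a fully specified scenario marked as something that can happen (satisfiable) or not (invalid), mirroring how the post reviews generated scenarios. A failing test reports the IDs of the rules it contradicts.

```python
from dataclasses import dataclass
from enum import Enum


class MortgageType(Enum):  # custom type: the post's policy has two mortgage types
    INSURED = "insured"
    CONVENTIONAL = "conventional"


@dataclass(frozen=True)
class Scenario:  # one concrete value per (hypothetical) policy variable
    mortgage_type: MortgageType
    down_payment_pct: float
    credit_score: int
    approved: bool


# Hypothetical rules keyed by unique ID; each returns True when the scenario complies.
RULES = {
    "R1": lambda s: not s.approved or s.credit_score >= 620,
    "R2": lambda s: (
        not (s.approved and s.mortgage_type is MortgageType.CONVENTIONAL)
        or s.down_payment_pct >= 20
    ),
    "R3": lambda s: (
        not (s.approved and s.mortgage_type is MortgageType.INSURED)
        or s.down_payment_pct >= 5
    ),
}

# Saved tests: (scenario, expected result).
TESTS = [
    (Scenario(MortgageType.CONVENTIONAL, 10, 700, True), "INVALID"),
    (Scenario(MortgageType.INSURED, 10, 700, True), "SATISFIABLE"),
    (Scenario(MortgageType.CONVENTIONAL, 10, 580, False), "SATISFIABLE"),
]


def violated_rules(scenario, rules):
    """IDs of the rules the scenario contradicts; empty means it can happen."""
    return [rule_id for rule_id, rule in rules.items() if not rule(scenario)]


def validate_all_tests(tests, rules):
    """Re-run every saved test and return the failures with the rules involved."""
    failures = []
    for scenario, expected in tests:
        broken = violated_rules(scenario, rules)
        actual = "INVALID" if broken else "SATISFIABLE"
        if actual != expected:
            failures.append((scenario, expected, actual, broken))
    return failures


print(validate_all_tests(TESTS, RULES))  # []
```

Suppose a later edit tightened the hypothetical R3 to a 15% minimum. Re-running the same saved tests would then flag the insured scenario and point at R3. This is the kind of regression signal the post relies on when it re-validates tests after a policy update.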

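💡 **Use cases**: The post walks through a mortgage approval scenario to show how to create and test an Automated Reasoning policy. It also describes an outage management solution for utilities, developed in collaboration with PwC. The solution uses the feature for automated protocol generation, real-time plan validation, and structured workflow creation, improving operational efficiency and response times. PwC's comments highlight its value in highly regulated industries.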
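In the post's outage management case study, check results are used to rewrite or enhance answers that come back invalid or satisfiable. The following minimal sketch shows that control flow using two hypothetical callables:

- `check` would, in a real system, wrap a guardrail call such as the ApplyGuardrail API mentioned in the post. Request and response handling are not shown.
- `rewrite` would ask a model to revise the answer using the check's feedback.

The stub functions, the 60-minute dispatch rule, and the round limit are invented for illustration.

```python
from typing import Callable, Tuple

# check(answer) -> (result, feedback); rewrite(answer, feedback) -> revised answer
Check = Callable[[str], Tuple[str, str]]
Rewrite = Callable[[str, str], str]


def validate_and_rewrite(answer: str, check: Check, rewrite: Rewrite, max_rounds: int = 3):
    """Rewrite the answer until the check returns VALID or the rounds run out."""
    for _ in range(max_rounds):
        result, feedback = check(answer)
        if result == "VALID":
            return answer, result
        # INVALID or SATISFIABLE: feed the check's findings back into a rewrite.
        answer = rewrite(answer, feedback)
    result, _ = check(answer)  # report the status of the last rewrite
    return answer, result


def toy_check(answer: str) -> Tuple[str, str]:
    # Stand-in for a guardrail call with a hypothetical rule:
    # severity-1 outages need a crew dispatched within 60 minutes.
    if "within 60 minutes" in answer:
        return "VALID", ""
    return "INVALID", "severity-1 outages require dispatch within 60 minutes"


def toy_rewrite(answer: str, feedback: str) -> str:
    # Stand-in for a model call that revises the answer using the feedback.
    return f"Dispatch a crew within 60 minutes ({feedback})."


print(validate_and_rewrite("Dispatch a crew within 4 hours.", toy_check, toy_rewrite))
```

Capping the number of rounds prevents an endless loop when the model never produces an answer the policy accepts. The final status is returned so the caller can decide whether to show, escalate, or drop the answer.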

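🌍 **Availability and pricing**: Automated Reasoning checks is generally available in the US East (Ohio, N. Virginia), US West (Oregon), and Europe (Frankfurt, Ireland, Paris) AWS Regions. You pay based on the amount of text processed; see the Amazon Bedrock pricing page for details.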

<section class="blog-post-content lb-rtxt"><p>Today, I’m happy to share that Automated Reasoning checks, a new <a href="https://aws.amazon.com/bedrock/guardrails/?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">Amazon Bedrock Guardrails</a> policy <a href="https://aws.amazon.com/blogs/aws/prevent-factual-errors-from-llm-hallucinations-with-mathematically-sound-automated-reasoning-checks-preview/?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">that we previewed during AWS re:Invent</a>, is now generally available. Automated Reasoning checks helps you validate the accuracy of content generated by <a href="https://aws.amazon.com/what-is/foundation-models/?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">foundation models (FMs)</a> against your domain knowledge. This can help prevent factual errors due to AI hallucinations. The policy uses mathematical logic and formal verification techniques, providing definitive rules and parameters against which AI responses are checked for accuracy.</p><p>This approach is fundamentally different from probabilistic reasoning methods, which deal with uncertainty by assigning probabilities to outcomes.
In fact, Automated Reasoning checks delivers up to 99% verification accuracy, providing provable assurance in detecting AI hallucinations while also assisting with ambiguity detection when the output of a model is open to more than one interpretation.</p><p>With general availability, you get the following new features:</p><ul><li>Support for large documents in a single build, up to 80K tokens – Process extensive documentation; we found this can be up to 100 pages of content</li><li>Simplified policy validation – Save your validation tests and run them repeatedly, making it easier to maintain and verify your policies over time</li><li>Automated scenario generation – Create test scenarios automatically from your definitions, saving time and effort while helping make coverage more comprehensive</li><li>Enhanced policy feedback – Provide natural language suggestions for policy changes, making it simpler to improve your policies</li><li>Customizable validation settings – Adjust confidence score thresholds to match your specific needs, giving you more control over validation strictness</li></ul><p>Let’s see how this works in practice.</p><p><strong>Creating Automated Reasoning checks in Amazon Bedrock Guardrails<br /></strong> To use Automated Reasoning checks, you first encode rules from your knowledge domain into an Automated Reasoning policy, then use the policy to validate generated content. For this scenario, I’m going to create a mortgage approval policy to safeguard an AI assistant evaluating who can qualify for a mortgage. It is important that the predictions of the AI system do not deviate from the rules and guidelines established for mortgage approval.
These rules and guidelines are captured in a policy document written in natural language.</p><p>In the <a href="https://console.aws.amazon.com/bedrock/?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">Amazon Bedrock console</a>, I choose <strong>Automated Reasoning</strong> from the navigation pane to create a policy.</p><p>I enter the name and description of the policy and upload the policy document as a PDF. The name and description are just metadata and do not contribute to building the Automated Reasoning policy. I also describe the source content to add context on how it should be translated into formal logic. For example, I explain how I plan to use the policy in my application, including sample Q&amp;A from the AI assistant.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/22/ar-checks-create-policy-1.png"><img class="aligncenter size-full wp-image-98597" src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/22/ar-checks-create-policy-1.png" alt="Console screenshot." width="1232" height="1199" /></a></p><p>When the policy is ready, I land on the overview page, which shows the policy details and a summary of the tests and definitions. I choose <strong>Definitions</strong> from the dropdown to examine the Automated Reasoning policy, which is made up of the rules, variables, and types created to translate the natural language policy into formal logic.</p><p>The <strong>Rules</strong> describe how the variables in the policy are related and are used when evaluating the generated content. In this case, for example, they define which thresholds apply and how certain decisions are made.
For traceability, each rule has its own unique ID.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-rules.png"><img class="aligncenter size-full wp-image-98432" src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-rules.png" alt="Console screenshot." width="1051" height="819" /></a></p><p>The <strong>Variables</strong> represent the main concepts at play in the original natural language documents. Each variable is involved in one or more rules. Variables make complex structures easier to understand. For this scenario, some of the rules need to look at the down payment or at the credit score.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-variables.png"><img class="aligncenter size-full wp-image-98433" src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-variables.png" alt="Console screenshot." width="1052" height="937" /></a></p><p>Custom <strong>Types</strong> are created for variables that are neither boolean nor numeric, such as variables that can take only a limited number of values. In this case, the policy describes two types of mortgage: insured and conventional.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-types.png"><img class="aligncenter size-full wp-image-98434" src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-types.png" alt="Console screenshot." width="1056" height="508" /></a></p><p>Now we can assess the quality of the initial Automated Reasoning policy through testing. I choose <strong>Tests</strong> from the dropdown.
Here I can manually enter a test consisting of an optional input and an output, such as a customer’s question to the AI assistant and a possible answer. I then set the expected result of the Automated Reasoning check. The expected result can be valid (the answer is correct), invalid (the answer is not correct), or satisfiable (the answer could be true or false depending on specific assumptions). I can also assign a confidence threshold for the translation of the query/content pair from natural language to logic.</p><p>Before I enter tests manually, I use the option to automatically generate a scenario from the definitions. This is the easiest way to validate a policy and (unless you’re an expert in logic) should be the first step after creating the policy.</p><p>For each generated scenario, I state whether it is something that can happen (satisfiable) or not (invalid). If it cannot, I can add an annotation that can then be used to update the definitions. For a deeper look at a generated scenario, I can show the formal logic representation of a test in <a href="https://smt-lib.org/">SMT-LIB</a> syntax.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-generate.png"><img class="aligncenter size-full wp-image-98435" src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-generate.png" alt="Console screenshot." width="841" height="487" /></a></p><p>After generating scenarios, I enter a few tests manually.
For these tests, I set different expected results: some are valid because they follow the policy, some are invalid because they violate it, and some are satisfiable because their result depends on specific assumptions.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/22/ar-checks-add-tests-1.png"><img class="aligncenter size-full wp-image-98595" src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/22/ar-checks-add-tests-1.png" alt="Console screenshot." width="1419" height="652" /></a></p><p>Then, I choose <strong>Validate all tests</strong> to see the results. All tests passed in this case. Now, when I update the policy, I can use these tests to validate that the changes didn’t introduce errors.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-summary-tests.png"><img class="aligncenter size-full wp-image-98437" src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-summary-tests.png" alt="Console screenshot." width="1079" height="350" /></a></p><p>For each test, I can look at the findings. If a test doesn’t pass, I can see which rules produced the contradiction that made the result differ from the expected one. Using this information, I can decide whether to add an annotation to improve the policy or to correct the test.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-findings-test.png"><img class="aligncenter size-full wp-image-98438" src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/16/ar-checks-findings-test.png" alt="Console screenshot."
width="1075" height="459" /></a></p><p>Now that I’m satisfied with the tests, I can create a new Amazon Bedrock guardrail (or update an existing one) that uses up to two Automated Reasoning policies to check the validity of the AI assistant’s responses. All six policies offered by Guardrails are modular and can be used together or separately. For example, Automated Reasoning checks can be used with other safeguards such as content filtering and contextual grounding checks. The guardrail can be applied to models served by Amazon Bedrock or to any third-party model (such as OpenAI and Google Gemini models) via the <a href="https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_ApplyGuardrail.html?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">ApplyGuardrail</a> API. I can also <a href="https://strandsagents.com/latest/documentation/docs/user-guide/safety-security/guardrails/?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">use the guardrail with an agent framework such as Strands Agents</a>, including <a href="https://aws.amazon.com/blogs/aws/introducing-amazon-bedrock-agentcore-securely-deploy-and-operate-ai-agents-at-any-scale/?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">agents deployed using Amazon Bedrock AgentCore</a>.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/22/ar-checks-guardrails-1.png"><img class="aligncenter size-full wp-image-98596" src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2025/07/22/ar-checks-guardrails-1.png" alt="Console screenshot." width="1235" height="861" /></a></p><p>Now that we’ve seen how to set up a policy, let’s look at how Automated Reasoning checks are used in practice.</p><p><strong>Customer case study – Utility outage management systems<br /></strong> When the lights go out, every minute counts.
That’s why utility companies are turning to AI solutions to improve their outage management systems. We collaborated on a solution in this space together with <a href="https://www.pwc.com/us/en/technology/alliances/amazon-web-services/ai-solutions.html">PwC</a>. Using Automated Reasoning checks, utilities can streamline operations through:</p><ul><li>Automated protocol generation – Creates standardized procedures that meet regulatory requirements</li><li>Real-time plan validation – Ensures response plans comply with established policies</li><li>Structured workflow creation – Develops severity-based workflows with defined response targets</li></ul><p>At its core, this solution combines intelligent policy management with optimized response protocols. Automated Reasoning checks are used to assess AI-generated responses. When a response is found to be invalid or satisfiable, the result of the Automated Reasoning check is used to rewrite or enhance the answer.</p><p>This approach demonstrates how AI can transform traditional utility operations, making them more efficient, reliable, and responsive to customer needs. By combining mathematical precision with practical requirements, this solution sets a new standard for outage management in the utility sector. The result is faster response times, improved accuracy, and better outcomes for both utilities and their customers.</p><p>In the words of Matt Wood, PwC’s Global and US Commercial Technology and Innovation Officer:</p><p><em>“At PwC, we’re helping clients move from AI pilot to production with confidence—especially in highly regulated industries where the cost of a misstep is measured in more than dollars. Our collaboration with AWS on Automated Reasoning checks is a breakthrough in responsible AI: mathematically assessed safeguards, now embedded directly into Amazon Bedrock Guardrails. 
We’re proud to be AWS’s launch collaborator, bringing this innovation to life across sectors like pharma, utilities, and cloud compliance—where trust isn’t a feature, it’s a requirement.”</em></p><p><strong>Things to know<br /></strong> Automated Reasoning checks in <a href="https://aws.amazon.com/bedrock/guardrails/?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">Amazon Bedrock Guardrails</a> is generally available today in the following <a href="https://aws.amazon.com/about-aws/global-infrastructure/regions_az/?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">AWS Regions</a>: US East (Ohio, N. Virginia), US West (Oregon), and Europe (Frankfurt, Ireland, Paris).</p><p>With Automated Reasoning checks, you pay based on the amount of text processed. For more information, see <a href="https://aws.amazon.com/bedrock/pricing/?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">Amazon Bedrock pricing</a>.</p><p>To learn more, and build secure and safe AI applications, see the <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails-automated-reasoning-checks.html?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">technical documentation</a> and the <a href="https://github.com/aws-samples/amazon-bedrock-samples/tree/main/responsible_ai/bedrock-automated-reasoning-checks">GitHub code samples</a>. 
Follow <a href="https://console.aws.amazon.com/bedrock/home#/automated-reasoning/policies?trk=e61dee65-4ce8-4738-84db-75305c9cd4fe&amp;sc_channel=el">this link for direct access to the Amazon Bedrock console</a>.</p><p>The videos in this playlist include an introduction to Automated Reasoning checks, a deep dive presentation, and hands-on tutorials to create, test, and refine a policy.</p><p>— <a href="https://x.com/danilop">Danilo</a></p></section>
