Observing and evaluating AI agentic workflows with Strands Agents SDK and Arize AX

This post describes how to combine the Arize AX service with the Strands Agents SDK to strengthen the observability, evaluation, and trustworthiness of AI agent workflows. Because the underlying LLMs are probabilistic, AI agents are nondeterministic, which introduces challenges such as unpredictability, hidden failure modes, inefficient paths, tool-integration complexity, and cost volatility that traditional monitoring approaches can no longer handle. Arize AX provides a comprehensive AI engineering service, including tracing, evaluation, datasets, experiments, a playground, prompt management, monitoring and alerting, and agent visualization, enabling deep tracing of LLM operations, automated evaluation, and optimization of decision paths. By integrating Strands Agents with Arize AX, developers can trace an AI agent's execution end to end, analyze it in fine detail, and use LLM-as-a-judge evaluations to automatically identify and correct agent behavior, ultimately enabling continuous optimization and reliable deployment of AI agents.

🎯 **The nondeterminism challenge and the limits of traditional monitoring**: Because their underlying LLMs generate each token probabilistically, AI agents are nondeterministic and can produce inconsistent results. This leads to problems such as unpredictable behavior at scale, hidden failure modes (outputs that look correct but mislead decisions), inefficient or incorrect decision paths, tool-integration complexity (erroneous API calls), and cost and performance volatility (for example, infinite loops driving up token costs and latency). Traditional testing and monitoring approaches struggle to address these challenges, so a more comprehensive AI operations strategy is needed.

🚀 **Arize AX's comprehensive observability framework**: As an enterprise-grade AI engineering service, Arize AX builds on its Phoenix foundation and adds enterprise features (such as the Alyx AI assistant, online evaluations, and automatic prompt optimization) to provide an end-to-end solution for monitoring, evaluating, and debugging AI agents. Its core capabilities include full tracing through OpenTelemetry for detailed analysis of LLM calls, retrieval steps, and metadata; automated quality monitoring that evaluates production samples with LLM-as-a-judge; versioned datasets for regression testing and A/B tests; controlled experiments to measure the impact of changes; an interactive playground for debugging and optimization; and prompt management, real-time monitoring and alerting, and agent visualization for analyzing and optimizing decision paths.

🤝 **The powerful synergy of Strands Agents and Arize AX**: Strands Agents is an open source SDK that provides a low-code framework for building and running AI agents, unifying prompts, tools, LLM interactions, and integration protocols. Integrating the Strands Agents SDK with Arize AI's observability system provides deep, structured visibility into AI agent behavior and decisions. This includes end-to-end tracing of agent workflows (from user input to final output), LLM-as-a-judge evaluations that validate the correctness and trustworthiness of agent behavior, and analysis and optimization of those traces to improve agent performance, for example by identifying and correcting tool-call errors and parameter mismatches.

💡 **Automated evaluation and continuous optimization drive agent iteration**: Arize AX supports automated LLM-as-a-judge evaluations of key steps such as an agent's tool calls, path selection, and parameter extraction, automatically applying correct or incorrect labels along with the LLM's explanation. These evaluation results can feed directly into regression datasets that drive agent-improvement workflows. In addition, Arize's Prompt Playground lets developers experiment with different prompt and model combinations in an integrated environment and test them against the regression dataset to optimize agent performance. Experiment results can be integrated into CI/CD pipelines for automated regression testing and validation, ensuring the continuous improvement and reliability of AI agents.

📈 **Continuous monitoring and key metrics in production**: Continuous monitoring is essential to maintaining the reliability and performance of AI agents in production. Arize AI provides out-of-the-box monitoring that helps teams achieve early issue detection, performance tracking, cost management, and quality assurance. Key metrics include latency, token usage, error rates, tool-use efficiency, and user-satisfaction signals (such as tool call correctness and conversation length). With customizable dashboards and prebuilt templates, teams can gain deep insight into agent performance, identify bottlenecks and regressions, and ensure AI agents scale reliably while delivering a high-quality user experience.

This post is co-written with Rich Young from Arize AI.

Agentic AI applications built on agentic workflows differ from traditional workloads in one important way: they're nondeterministic. That is, they can produce different results from the same input, because the large language models (LLMs) they're based on use probabilities when generating each token. This inherent unpredictability leads AI application designers to ask questions about the correct plan of action, the optimal path for an agent, and the right set of tools with the right parameters. Organizations that want to deploy such agentic workloads need an observability system that can verify that the workloads produce correct, trustworthy results.

In this post, we present how the Arize AX service can trace and evaluate AI agent tasks initiated through Strands Agents, helping validate the correctness and trustworthiness of agentic workflows.

Challenges with generative AI applications

The path from a promising AI demo to a reliable production system is fraught with challenges that many organizations underestimate. Based on industry research and real-world deployments, teams face several critical hurdles:

- Unpredictable behavior at scale – nondeterministic outputs make results difficult to reproduce and debug
- Hidden failure modes – outputs that look correct but quietly mislead downstream decisions
- Inefficient or incorrect decision paths – agents can take wasteful routes or choose the wrong tool for a task
- Tool integration complexity – malformed API calls and mismatched parameters are easy to introduce and hard to spot
- Cost and performance volatility – runaway behavior such as infinite loops can drive up token costs and latency

These challenges mean that traditional testing and monitoring approaches are insufficient for AI systems. Success requires a more comprehensive AI operations strategy.

Arize AX delivers a comprehensive observability, evaluation, and experimentation framework

Arize AX is an enterprise-grade AI engineering service that helps teams monitor, evaluate, and debug AI applications across the lifecycle from development to production. Built on Arize's Phoenix foundation, AX adds enterprise essentials such as the Alyx AI assistant, online evaluations, automatic prompt optimization, role-based access control (RBAC), and enterprise-grade scale and support. AX offers organizations a comprehensive solution that caters to both technical and nontechnical personas, so they can manage and improve AI agents from development through production at scale. Arize AX capabilities include:

- Tracing – end-to-end instrumentation through OpenTelemetry, capturing LLM calls, retrieval steps, and metadata for detailed analysis
- Evaluation – automated quality monitoring that scores production samples with LLM-as-a-judge
- Datasets – versioned datasets for regression testing and A/B tests
- Experiments – controlled experiments that measure the impact of changes
- Playground – an interactive environment for debugging and optimizing prompts and models
- Prompt management – versioning and retrieval of prompts
- Monitoring and alerting – real-time visibility into key metrics, with alerts
- Agent visualization – graph views for analyzing and optimizing decision paths

These components form a comprehensive observability strategy that treats LLM applications as mission-critical production systems requiring continuous monitoring, evaluation, and improvement.

Arize AX and Strands Agents: A powerful combination

Strands Agents is an open source SDK: a powerful, low-code framework for building and running AI agents with minimal overhead. Designed to simplify the development of sophisticated agent workflows, Strands unifies prompts, tools, LLM interactions, and integration protocols into a single streamlined experience. It supports both Amazon Bedrock-hosted and external models, with built-in capabilities for Retrieval Augmented Generation (RAG), Model Context Protocol (MCP), and Agent2Agent (A2A) communication. In this section, we walk through building an agent with the Strands Agents SDK, instrumenting it with Arize AX for trace-based evaluation, and optimizing its behavior.

The following workflow shows how a Strands agent handles a user task end-to-end—invoking tools, retrieving context, and generating a response—while sending traces to Arize AX for evaluation and optimization.

The solution follows these high-level steps:

1. Install and configure the dependencies.
2. Instrument the agent for observability.
3. Build the agent with the Strands SDK.
4. Test the agent and generate traces.
5. Analyze traces in Arize AI.
6. Evaluate the agent's behavior.
7. Optimize the agent.
8. Continually monitor the agent.

Prerequisites

You'll need:

- An AWS account with access to Amazon Bedrock
- An Arize AX account (for your space ID and API key)
- Python installed, along with the following dependencies

Install the dependencies:

```
pip install strands opentelemetry-sdk arize-otel
```

Solution walkthrough: Using Arize AX with Strands Agents

The integration between Strands Agent SDK and Arize AI’s observability system provides deep, structured visibility into the behavior and decisions of AI agents. This setup enables end-to-end tracing of agent workflows—from user input through planning, tool invocation, and final output.

Full implementation details are available in the accompanying notebook and resources in the Openinference-Arize repository on GitHub.

Install and configure the dependencies

To install and configure the dependencies, use the following code:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from strands_to_openinference_mapping import StrandsToOpenInferenceProcessor
from arize.otel import register
import grpc
```

Instrument the agent for observability

To instrument the agent for observability, use the following code.

Use Arize’s OpenTelemetry integration to enable tracing:

```python
register(
    space_id="your-arize-space-id",
    api_key="your-arize-api-key",
    project_name="strands-project",
    processor=StrandsToOpenInferenceProcessor()
)

agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=[
        retrieve, current_time, get_booking_details,
        create_booking, delete_booking
    ],
    trace_attributes={
        "session.id": "abc-1234",
        "user.id": "user-email@example.com",
        "arize.tags": [
            "Agent-SDK",
            "Arize-Project",
            "OpenInference-Integration"
        ]
    }
)
```

Build the agent with Strands SDK

Create the Restaurant Assistant agent using Strands. This agent will help customers with restaurant information and reservations using several tools:

- retrieve – Searches the knowledge base for restaurant information
- current_time – Gets the current time for reservation scheduling
- create_booking – Creates a new restaurant reservation
- get_booking_details – Retrieves the details of an existing reservation
- delete_booking – Cancels an existing reservation

The agent uses Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock for natural language understanding and generation. Import the required tools and define the agent:

```python
import os

import boto3
from strands import Agent, tool
from strands.models.bedrock import BedrockModel
from strands_tools import retrieve, current_time

# Custom booking tools defined in the accompanying notebook
import get_booking_details, delete_booking, create_booking

system_prompt = """You are "Restaurant Helper", a restaurant assistant helping customers reserving tables in different restaurants. You can talk about the menus, create new bookings, get the details of an existing booking or delete an existing reservation. You reply always politely and mention your name in the reply (Restaurant Helper).
..."""

model = BedrockModel(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
)

# Look up the knowledge base ID stored in AWS Systems Manager Parameter Store
kb_name = 'restaurant-assistant'
smm_client = boto3.client('ssm')
kb_id = smm_client.get_parameter(
    Name=f'{kb_name}-kb-id',
    WithDecryption=False
)
os.environ["KNOWLEDGE_BASE_ID"] = kb_id["Parameter"]["Value"]

agent = Agent(
    model=model,
    system_prompt=system_prompt,
    tools=[
        retrieve, current_time, get_booking_details,
        create_booking, delete_booking
    ],
    trace_attributes={
        "session.id": "abc-1234",
        "user.id": "user-email-example@domain.com",
        "arize.tags": [
            "Agent-SDK",
            "Arize-Project",
            "OpenInference-Integration",
        ]
    }
)
```
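
The booking tools imported above are custom functions defined in the accompanying notebook. As a rough sketch of what declaring one looks like, the following hypothetical reimplementation of get_booking_details uses the Strands @tool decorator, which builds the tool specification from the function's signature and docstring; the body is an illustrative stand-in, not the notebook's actual implementation:

```python
from strands import tool

@tool
def get_booking_details(booking_id: str, restaurant_name: str) -> dict:
    """Get the details of a restaurant booking.

    Args:
        booking_id: The ID of the booking to retrieve.
        restaurant_name: The restaurant the booking was made at.
    """
    # Hypothetical stand-in: the notebook's real tool looks the booking up
    # in a persistent store rather than returning a fixed record.
    return {
        "booking_id": booking_id,
        "restaurant_name": restaurant_name,
        "status": "confirmed",
    }
```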

Test the agent and generate traces

Test the agent with a couple of queries to generate traces for Arize. Each interaction will create spans in OpenTelemetry that will be processed by the custom processor and sent to Arize AI. The first test case is a restaurant information query. Ask about restaurants in New York. This will trigger the knowledge base retrieval tool:

```python
# Test with a question about restaurants
results = agent("Hi, where can I eat in New York?")
print(results)
```

The second test case is for a restaurant reservation. Test the booking functionality by making a reservation. This will trigger the create_booking tool:

```python
# Test with a reservation request
results = agent("Make a reservation for tonight at Rice & Spice. At 8pm, for 2 people in the name of Anna")
print(results)
```

Analyze traces in Arize AI

After running the agent, you can view and analyze the traces in the Arize AI dashboard, shown in the following screenshot. The trace-level visualization confirms the path the agent took during execution. In the Arize dashboard, select the strands-project project you defined in the notebook to view your traces on the LLM Tracing tab. Arize provides powerful filtering capabilities to help you focus on specific traces; for example, you can filter by OTel attributes and metadata to analyze performance across different models.

You can also use the Alyx AI assistant to analyze your agent's behavior through natural language queries and uncover insights. In the following example, we use Alyx to reason about why the agent invoked a tool incorrectly in one of the traces, helping us identify the root cause of the misstep.

Choosing a specific trace gives detailed information about the agent’s runtime performance and decision-making process, as shown in the following screenshot.

The graph view, shown in the following screenshot, displays the hierarchical structure of your agent's execution. By selecting nodes in the graph, users can inspect specific execution paths to understand how the agent made its decisions.

You can also view session-level insights on the Sessions tab next to LLM Tracing. By tagging spans with session.id and user.id, you can group related interactions, identify where conversations break down, track user frustration, and evaluate multiturn performance across sessions.
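
For that grouping to be useful, each conversation needs a distinct session.id. The following minimal sketch reuses the model, system_prompt, and tools defined earlier and assumes one Agent instance per conversation; the uuid-based ID and the helper function are illustrative, not part of the sample notebook:

```python
import uuid

from strands import Agent

def new_conversation_agent(user_id: str) -> Agent:
    """Create an agent whose spans all carry one conversation's session.id."""
    return Agent(
        model=model,                      # defined earlier in the walkthrough
        system_prompt=system_prompt,
        tools=[retrieve, current_time, get_booking_details,
               create_booking, delete_booking],
        trace_attributes={
            "session.id": str(uuid.uuid4()),  # groups this conversation's traces
            "user.id": user_id,               # enables per-user analysis
        },
    )

agent = new_conversation_agent("user-email@example.com")
```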

Evaluate the agent’s behavior

Arize's system traces the agent's decision-making process, capturing details such as routing decisions, tool calls, and parameters. You can evaluate performance by analyzing these traces to verify that the agent selects optimal paths and provides accurate responses. For example, if the agent misinterprets a customer's request and chooses the wrong tool or uses incorrect parameters, Arize evaluators will identify when these failures occur. Arize has pre-built evaluation templates for every step of your agent process, including tool calling, path selection, and parameter extraction.

Create a new task under Evals and Tasks and choose the LLM as a judge task type. You can use a pre-built prompt template (tool calling is used in the example shown in the following screenshot), or you can ask the Alyx AI assistant to build one for you. Evals will now run automatically on your traces as they flow into Arize, using AI to label your data and identify failures at scale without human intervention.

Now, every time the agent is invoked, trace data is collected in Arize, and the tool calling evaluation automatically runs and labels the data as correct or incorrect, along with an explanation from the LLM-as-a-judge for its labeling decision. The following is an example of an evaluation label and explanation.
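
In Arize AX this evaluation runs online inside the service, but the same LLM-as-a-judge pattern can be sketched locally with the open source Phoenix evals library that AX builds on. The judge template, column names, rails, and model ID below are illustrative assumptions, not the exact template AX uses:

```python
import pandas as pd
from phoenix.evals import BedrockModel, llm_classify

# Illustrative judge prompt; Arize AX ships pre-built templates for this.
TOOL_CALLING_TEMPLATE = """You are evaluating an AI agent's tool call.
Question: {question}
Tool called: {tool_call}
Available tools: retrieve, current_time, get_booking_details,
create_booking, delete_booking.
Answer with a single word, "correct" or "incorrect", indicating whether
the tool called is the right one for the question."""

# Hypothetical trace sample; in practice these rows come from exported spans.
traces = pd.DataFrame({
    "question": ["Make a reservation for tonight at Rice & Spice."],
    "tool_call": ["create_booking"],
})

judge = BedrockModel(model_id="anthropic.claude-3-7-sonnet-20250219-v1:0")
results = llm_classify(
    dataframe=traces,
    model=judge,
    template=TOOL_CALLING_TEMPLATE,
    rails=["correct", "incorrect"],
    provide_explanation=True,  # the judge explains each label
)
print(results[["label", "explanation"]])
```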

Optimize the agent

The LLM-as-a-judge evaluations automatically identify and label failure cases where the agent didn't call the right tool. In the following screenshot, these failure cases are automatically captured and added to a regression dataset, which drives agent improvement workflows. This production data can now fuel development cycles for improving the agent.

Now, you can connect directly to Arize's prompt playground, an integrated development environment (IDE) where you can experiment with prompt changes and model choices, compare results side by side, and test across the regression dataset from the previous step. When you have an optimal prompt and model combination, you can save that version to the prompt hub for future version tracking and retrieval, as shown in the following screenshot.

Experiments from the prompt testing are automatically saved, with online evaluations run and results stored for immediate analysis and comparison, facilitating data-driven decisions about which enhancements to deploy. Additionally, experiments can be incorporated into continuous integration and continuous delivery (CI/CD) workflows for automated regression testing and validation whenever new prompt or application changes are pushed to systems such as GitHub. The following screenshot shows hallucination metrics for prompt experiments.
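
As a hedged illustration of that CI/CD integration, the following pytest sketch replays regression questions against the agent and asserts on the tool that was used. The cases and the metrics access pattern are assumptions (Strands exposes per-tool usage through the result's metrics; adjust the attribute names to your SDK version), and Arize's experiment APIs automate the equivalent at scale:

```python
import pytest

# Hypothetical regression cases; in practice, export these from the
# regression dataset that Arize builds out of failed evaluations.
REGRESSION_CASES = [
    ("Hi, where can I eat in New York?", "retrieve"),
    ("Cancel booking 1234 at Rice & Spice.", "delete_booking"),
]

@pytest.mark.parametrize("question,expected_tool", REGRESSION_CASES)
def test_agent_calls_expected_tool(question, expected_tool):
    result = agent(question)  # the agent built earlier in this walkthrough
    # Assumed access pattern for per-tool usage metrics
    tools_used = set(result.metrics.tool_metrics.keys())
    assert expected_tool in tools_used
```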

Continually monitor the agent

To maintain reliability and performance in production, it's essential to continually monitor your AI agents. Arize AI provides out-of-the-box monitoring capabilities that help teams detect issues early, optimize cost, and provide high-quality user experiences. Setting up monitors in Arize AI offers early issue detection, performance tracking, cost management, and quality assurance.

You can access and configure monitors on the Monitors tab in your Arize project. For details, refer to the Arize documentation on monitoring.

When monitoring your Strands agent in production, pay close attention to these key metrics:

- Latency – how long the agent takes to respond
- Token usage – consumption that drives cost and can reveal runaway loops
- Error rates – failed LLM calls or tool invocations
- Tool-use efficiency – whether tools are called correctly and economically
- User satisfaction signals – indicators such as tool call correctness and conversation length

By continually monitoring these metrics, teams can proactively improve agent performance, catch regressions early, and make sure the system scales reliably in real-world use. In Arize, you can create custom metrics directly from OTel trace attributes or metadata, and even from evaluation labels and metrics, such as the tool calling correctness evaluation you created previously. The following screenshot visualizes the tool call correctness ratio across agent traces, helping identify patterns in correct versus incorrect tool usage.
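
For example, the tool call correctness ratio is simply the share of evaluated spans that the judge labeled correct. The following minimal sketch computes it over exported trace data; the column name is an illustrative assumption, and in Arize itself you define the equivalent metric directly on the evaluation label without exporting anything:

```python
import pandas as pd

# Hypothetical export of evaluated spans
spans = pd.DataFrame({
    "eval.tool_calling.label": ["correct", "correct", "incorrect", "correct"],
})

correct = (spans["eval.tool_calling.label"] == "correct").sum()
ratio = correct / len(spans)
print(f"Tool call correctness: {ratio:.0%}")  # -> Tool call correctness: 75%
```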

The following screenshot illustrates how Arize provides customizable dashboards that enable deep observability into LLM agent performance, showcasing a custom monitoring dashboard that tracks core metrics such as latency, token usage, and the percentage of correct tool calls.

The following screenshot shows the prebuilt templates, which are designed to accelerate setup and offer immediate visibility into key agent behaviors.

Clean up

When you're done experimenting, you can clean up the AWS resources created by this notebook by running the cleanup script:

```
!sh cleanup.sh
```

Conclusion

The key lesson is clear: observability, automatic evaluations, experimentation and feedback loops, and proactive alerting aren't optional for production AI; they're the difference between innovation and liability. Organizations that invest in proper AI operations infrastructure can harness the transformative power of AI agents while avoiding the pitfalls that have plagued early adopters. The combination of Strands Agents and Arize AI provides a comprehensive solution that addresses these challenges.

The partnership between AWS and Arize AI offers a powerful solution for building and deploying generative AI agents. The Strands Agents framework simplifies agent development, and Arize's observability tools provide critical insights into agent performance. By addressing challenges such as nondeterminism, verifying correctness, and enabling continual monitoring, this integration helps organizations create reliable and effective AI applications. As businesses increasingly adopt agentic workflows, the combination of Amazon Bedrock and Arize AI sets a new standard for trustworthy AI deployment.

Get started

Now that you've learned how to integrate Strands Agents with the Arize observability service, you can start exploring different types of agents using the provided example. As a next step, try expanding this integration to include automated evaluations using Arize's evaluation framework to score agent performance and decision quality.

Ready to build better agents? Get started with an account at arize.com at no additional cost and begin transforming your AI agents from unpredictable experiments into reliable, production-ready solutions. The tools and knowledge are here; the only question is: what will you build?

About the Authors

Rich Young is the Director of Partner Solutions Architecture at Arize AI, focused on AI agent observability and evaluation tooling. Prior to joining Arize, Rich led technical pre-sales at WhyLabs AI. In his pre-AI life, Rich held leadership and IC roles at enterprise technology companies such as Splunk and Akamai.

Karan Singh is an Agentic AI leader at AWS, where he works with top-tier third-party foundation model and agentic framework providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise agentic AI challenges. Karan holds a BS in Electrical Engineering from Manipal University, an MS in Electrical Engineering from Northwestern University, and an MBA from the Haas School of Business at the University of California, Berkeley.

Nolan Chen is a Partner Solutions Architect at AWS, where he helps startup companies build innovative solutions using the cloud. Prior to AWS, Nolan specialized in data security and helping customers deploy high-performing wide area networks. Nolan holds a bachelor’s degree in mechanical engineering from Princeton University.

Venu Kanamatareddy is an AI/ML Solutions Architect at AWS, supporting AI-driven startups in building and scaling innovative solutions. He provides strategic and technical guidance across the AI lifecycle from model development to MLOps and generative AI. With experience across startups and large enterprises, he brings deep expertise in cloud architecture and AI solutions. Venu holds a degree in computer science and a master’s in artificial intelligence from Liverpool John Moores University.
