AWS Machine Learning Blog
Amazon Bedrock Agents observability using Arize AI

This post explores how the integration between Arize AI and Amazon Bedrock Agents improves the observability of AI applications. With the integration, developers gain deep insight into an agent's execution path, can evaluate its performance, and can perform data-driven optimization. The post introduces the Arize Phoenix system, which provides comprehensive tracing, evaluation, and monitoring tools that help developers follow every step of a user request, from the initial query to the final action execution, enabling end-to-end monitoring and debugging of AI agents and helping ensure their reliability and efficiency in production.

🔍 Amazon Bedrock Agents lets users build and configure autonomous agents in their applications. These agents complete actions based on organization data and user input, and can automatically call APIs to take actions and invoke knowledge bases.

💡 The integration between Arize AI and Amazon Bedrock Agents addresses a key challenge in AI development: observability. It provides deep insights that help developers understand agent performance, interactions, and task execution.

✅ The main benefits of the integration include comprehensive traceability, so developers can see every step of an agent's execution path; a systematic evaluation framework for measuring and understanding agent performance; and data-driven optimization for comparing different agent configurations and identifying optimal settings.

🛠️ Arize AI is available in two versions: Arize AX (the enterprise solution) and Arize Phoenix (the open source offering). This post focuses on the Arize Phoenix system and demonstrates how to use it to trace and evaluate agent performance.

⚙️ By installing the openinference-instrumentation-bedrock library, interactions with Amazon Bedrock or Amazon Bedrock Agents are automatically traced for observability, evaluation, and troubleshooting.

This post is cowritten with John Gilhuly from Arize AI.

With Amazon Bedrock Agents, you can build and configure autonomous agents in your application. An agent helps your end-users complete actions based on organization data and user input. Agents orchestrate interactions between foundation models (FMs), data sources, software applications, and user conversations. In addition, agents automatically call APIs to take actions and invoke knowledge bases to supplement information for these actions. By integrating agents, you can accelerate your development effort to deliver generative AI applications. With agents, you can automate tasks for your customers and answer questions for them. For example, you can create an agent that helps customers process insurance claims or make travel reservations. You don’t have to provision capacity, manage infrastructure, or write custom code. Amazon Bedrock manages prompt engineering, memory, monitoring, encryption, user permissions, and API invocation.

AI agents represent a fundamental shift in how applications make decisions and interact with users. Unlike traditional software systems that follow predetermined paths, AI agents employ complex reasoning that often operates as a “black box.” Monitoring AI agents presents unique challenges for organizations seeking to maintain reliability, efficiency, and optimal performance in their AI implementations.

Today, we’re excited to announce a new integration between Arize AI and Amazon Bedrock Agents that addresses one of the most significant challenges in AI development: observability. Agent observability is a crucial aspect of AI operations that provides deep insights into how your Amazon Bedrock agents perform, interact, and execute tasks. It involves tracking and analyzing hierarchical traces of agent activities, from high-level user requests down to individual API calls and tool invocations. These traces form a structured tree of events, helping developers understand the complete journey of user interactions through the agent’s decision-making process. Key metrics that demand attention include response latency, token usage, runtime exceptions, and function calling behavior. As organizations scale their AI implementations from proof of concept to production, understanding and monitoring AI agent behavior becomes increasingly critical.

The integration between Arize AI and Amazon Bedrock Agents provides developers with comprehensive observability tools for tracing, evaluating, and monitoring AI agent applications. This solution delivers three primary benefits:

- Comprehensive traceability – Visibility into every step of the agent's execution path
- Systematic evaluation – A framework for measuring and understanding agent performance
- Data-driven optimization – The ability to compare different agent configurations and identify optimal settings

The Arize AI service is available in two versions:

- Arize AX – The enterprise solution
- Arize Phoenix – The open source offering used in this post

In this post, we demonstrate the Arize Phoenix system for tracing and evaluation. Phoenix can run on your local machine, a Jupyter notebook, a containerized deployment, or in the cloud. We explore how this integration works, its key features, and how you can implement it in your Amazon Bedrock Agents applications to enhance observability and maintain production-grade reliability.
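
If you want to try Phoenix locally before connecting to Phoenix Cloud, the following is a minimal sketch of launching the Phoenix server from a notebook. It assumes the arize-phoenix package is installed; px.launch_app() is the same entry point referenced later in this post, and the local endpoint shown is the Phoenix default.

# Minimal sketch (assumes `arize-phoenix` is installed): run a local Phoenix
# collector and UI from the notebook instead of using Phoenix Cloud.
import os
import phoenix as px

phoenix_session = px.launch_app()  # starts the local server, by default at http://localhost:6006

# Point the OTel wrapper at the local server instead of Phoenix Cloud
# (the endpoint below is the Phoenix default; adjust if you changed the port).
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"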

Solution overview

Large language model (LLM) tracing records the paths taken by requests as they propagate through multiple steps or components of an LLM application. It improves the visibility of your application or system’s health and makes it possible to debug behavior that is difficult to reproduce locally. For example, when a user interacts with an LLM application, tracing can capture the sequence of operations, such as document retrieval, embedding generation, language model invocation, and response generation, to provide a detailed timeline of the request’s execution.

For an application to emit traces for analysis, the application must be instrumented. Your application can be manually instrumented or be automatically instrumented. Arize Phoenix offers a set of plugins (instrumentors) that you can add to your application's startup process to perform automatic instrumentation. These plugins collect traces for your application and export them (using an exporter) for collection and visualization. The Phoenix server is a collector and UI that helps you troubleshoot your application in real time. When you run Phoenix (for example, by calling px.launch_app() or running the Phoenix container), it starts receiving traces from any application that exports traces to it. For Phoenix, the instrumentors are managed through a single repository called OpenInference. OpenInference provides a set of instrumentations for popular machine learning (ML) SDKs and frameworks in a variety of languages. It is a set of conventions and plugins that is complementary to OpenTelemetry and the OpenTelemetry Protocol (OTLP), enabling tracing of AI applications. Phoenix currently supports OTLP over HTTP.
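
For context, manual instrumentation with the plain OpenTelemetry API looks roughly like the following sketch. The span names and the answer_question function are illustrative only; the automatic Bedrock instrumentation shown later in this post removes the need to write such code by hand.

# Illustrative sketch of manual instrumentation with the OpenTelemetry API.
# The span names and the answer_question function are hypothetical examples.
from opentelemetry import trace

tracer = trace.get_tracer("my-llm-app")

def answer_question(question: str) -> str:
    with tracer.start_as_current_span("retrieve-documents") as span:
        span.set_attribute("user.question", question)
        docs = ["..."]  # placeholder for a retrieval step
    with tracer.start_as_current_span("invoke-llm"):
        return f"Answer grounded in {len(docs)} retrieved documents"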

For AWS, Boto3 provides Python bindings to AWS services, including Amazon Bedrock, which provides access to a number of FMs. You can instrument calls to these models using OpenInference, enabling OpenTelemetry-aligned observability of applications built using these models. You can also capture traces on invocations of Amazon Bedrock agents using OpenInference and view them in Phoenix.

The following high-level architecture diagram shows an LLM application created using Amazon Bedrock Agents, which has been instrumented to send traces to the Phoenix server.

In the following sections, we demonstrate how, by installing the openinference-instrumentation-bedrock library, you can automatically instrument interactions with Amazon Bedrock or Amazon Bedrock agents for observability, evaluation, and troubleshooting purposes in Phoenix.
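
If you prefer explicit control over auto_instrument, the OpenInference Bedrock instrumentor can also be attached manually. The following is a minimal sketch assuming the package exposes a BedrockInstrumentor class, which is the usual OpenInference convention; the project name is simply the example used later in this post.

# Sketch: attach the OpenInference Bedrock instrumentor explicitly instead of
# relying on auto_instrument=True (assumes openinference-instrumentation-bedrock
# is installed and follows the standard BedrockInstrumentor pattern).
from phoenix.otel import register
from openinference.instrumentation.bedrock import BedrockInstrumentor

tracer_provider = register(project_name="Amazon Bedrock Agent Example")
BedrockInstrumentor().instrument(tracer_provider=tracer_provider)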

Prerequisites

To follow this tutorial, you must have the following:

You can also clone the GitHub repo locally to run the Jupyter notebook yourself:

git clone https://github.com/awslabs/amazon-bedrock-agent-samples.git

Install required dependencies

Begin by installing the necessary libraries:

%pip install -r requirements.txt --quiet

Next, import the required modules:

import time
import boto3
import logging
import os
import nest_asyncio
from phoenix.otel import register
from openinference.instrumentation import using_metadata

nest_asyncio.apply()

The arize-phoenix-otel package provides a lightweight wrapper around OpenTelemetry primitives with Phoenix-aware defaults. These defaults read the environment variables you set to configure Phoenix in the next steps, such as PHOENIX_COLLECTOR_ENDPOINT (the endpoint traces are sent to) and PHOENIX_CLIENT_HEADERS (which carries your Phoenix API key).

Configure the Phoenix environment

Set up the Phoenix Cloud environment for this tutorial. Phoenix can also be self-hosted on AWS instead.

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + input("Enter your Phoenix API key: ")

Connect your notebook to Phoenix with auto-instrumentation enabled:

project_name = "Amazon Bedrock Agent Example"
tracer_provider = register(project_name=project_name, auto_instrument=True)

The auto_instrument parameter automatically locates the openinference-instrumentation-bedrock library and instruments Amazon Bedrock and Amazon Bedrock Agent calls without requiring additional configuration. Configure metadata for the span:

metadata = {
    "agent": "bedrock-agent",
    "env": "development",
}

Metadata is used to filter search values in the dashboard.

Set up an Amazon Bedrock session and agent

Before using Amazon Bedrock, make sure that your AWS credentials are configured correctly. You can set them up using the AWS Command Line Interface (AWS CLI) or by setting environment variables:

session = boto3.Session()
REGION = session.region_name
bedrock_agent_runtime = session.client(service_name="bedrock-agent-runtime", region_name=REGION)

We assume you’ve already created an Amazon Bedrock agent. To configure the agent, use the following code:

agent_id = "XXXXXYYYYY"        # ← Configure your Bedrock Agent ID
agent_alias_id = "Z0ZZZZZZ0Z"  # ← Optionally set a different Alias ID if you have one

Before proceeding to the next step, you can validate that invoking the agent works correctly. The response itself is not important; we are simply testing the API call.

print(f"Trying to invoke alias {agent_alias_id} of agent {agent_id}...")agent_resp = bedrock_agent_runtime.invoke_agent(    agentAliasId=agent_alias_id,    agentId=agent_id,    inputText="Hello!",    sessionId="dummy-session",)if "completion" in agent_resp:    print("✅ Got response")else:    raise ValueError(f"No 'completion' in agent response:\n{agent_resp}")

Run your agent with tracing enabled

Create a function to run your agent and capture its output:

@using_metadata(metadata)
def run(input_text):
    session_id = f"default-session1_{int(time.time())}"
    attributes = dict(
        inputText=input_text,
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        enableTrace=True,
    )
    response = bedrock_agent_runtime.invoke_agent(**attributes)

    # Stream the response
    for _, event in enumerate(response["completion"]):
        if "chunk" in event:
            print(event)
            chunk_data = event["chunk"]
            if "bytes" in chunk_data:
                output_text = chunk_data["bytes"].decode("utf8")
                print(output_text)
        elif "trace" in event:
            print(event["trace"])

Test your agent with a few sample queries:

run ("What are the total leaves for Employee 1?")run ("If Employee 1 takes 4 vacation days off, What are the total leaves left for Employee 1?")

You should replace these queries with the queries that your application is built for. After executing these commands, you should see your agent’s responses in the notebook output. The Phoenix instrumentation is automatically capturing detailed traces of these interactions, including knowledge base lookups, orchestration steps, and tool calls.

View captured traces in Phoenix

Navigate to your Phoenix dashboard to view the captured traces. You will see a comprehensive visualization of each agent invocation, including knowledge base lookups, orchestration steps, and tool calls.

Phoenix’s tracing and span analysis capabilities are useful during the prototyping and debugging stages. By instrumenting application code with Phoenix, teams gain detailed insights into the execution flow, making it straightforward to identify and resolve issues. Developers can drill down into specific spans, analyze performance metrics, and access relevant logs and metadata to streamline debugging efforts. With Phoenix’s tracing capabilities, you can monitor metrics such as response latency, token usage, and runtime exceptions at each step of the agent’s execution.

The following screenshot shows the Phoenix dashboard for the Amazon Bedrock agent, including latency, token usage, and total traces.

You can choose one of the traces to drill down to the level of the entire orchestration.

Evaluate the agent in Phoenix

Evaluating any AI application is a challenge. Evaluating an agent is even more difficult. Agents present a unique set of evaluation pitfalls to navigate. A common evaluation metric for agents is their function calling accuracy, in other words, how well they do at choosing the right tool for the job. Agents can also take inefficient paths and still arrive at the right solution; how do you know if they took an optimal path? Additionally, bad responses upstream can lead to strange responses downstream; how do you pinpoint where a problem originated? Phoenix includes built-in LLM evaluations and code-based experiment testing to help with this. An agent is characterized by what it knows about the world, the set of actions it can perform, and the pathway it took to get there. To evaluate an agent, you must evaluate each of these components, and Phoenix provides evaluation templates for each step.

You can evaluate the individual skills and responses using standard LLM evaluation strategies, such as retrieval evaluation, classification with LLM judges, hallucination, or Q&A correctness. In this post, we demonstrate evaluation of agent function calling. You can use the Agent Function Call eval to determine how well a model selects a tool to use, extracts the right parameters from the user query, and generates the tool call code. Now that you’ve traced your agent in the previous step, the next step is to add evaluations that measure its function calling accuracy, in other words, how well it chooses the right tool for the job. Complete the following steps:

1. Up until now, you have just used the lighter-weight Phoenix OTEL tracing library. To run evals, you must install the full library:

!pip install -q arize-phoenix --quiet

2. Import the necessary evaluation components:
import re
import json
import phoenix as px
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    BedrockModel,
    llm_classify,
)
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

The following is our agent function calling prompt template:

TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.
"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

    [Tool Definitions]: {tool_definitions}
"""
3. Because we are only evaluating the inputs, outputs, and function call columns, let's extract those into a simpler-to-use dataframe. Phoenix provides a method to query your span data and directly export only the values you care about:
query = (
    SpanQuery()
    .where(
        # Filter for the `LLM` span kind.
        # The filter condition is a string containing a valid Python boolean expression.
        "span_kind == 'LLM' and 'evaluation' not in input.value"
    )
    .select(
        question="input.value",
        outputs="output.value",
    )
)
trace_df = px.Client().query_spans(query, project_name=project_name)
4. Prepare these traces into a dataframe with columns for input, tool call, and tool definitions. Parse the JSON input and output data to create these columns:
def extract_tool_calls(output_value):
    tool_calls = []
    try:
        # Look for tool calls within <function_calls> tags
        if "<function_calls>" in output_value:
            # Find all tool_name tags
            tool_name_pattern = r"<tool_name>(.*?)</tool_name>"
            tool_names = re.findall(tool_name_pattern, output_value)
            # Add each found tool name to the list
            for tool_name in tool_names:
                if tool_name:
                    tool_calls.append(tool_name)
    except Exception as e:
        print(f"Error extracting tool calls: {e}")
    return tool_calls
5. Apply the function to each row of the outputs column of trace_df:
trace_df["tool_call"] = trace_df["outputs"].apply(lambda x: extract_tool_calls(x) if isinstance(x, str) else [])# Display the tool calls foundprint("Tool calls found in traces:", trace_df["tool_call"].sum())
6. Add tool definitions for evaluation:
trace_df["tool_definitions"] = ("phoenix-traces retrieves the latest trace information from Phoenix, phoenix-experiments retrieves the latest experiment information from Phoenix, phoenix-datasets retrieves the latest dataset information from Phoenix")

Now with your dataframe prepared, you can use Phoenix’s built-in LLM-as-a-Judge template for tool calling to evaluate your application. The following method takes in the dataframe of traces to evaluate, our built-in evaluation prompt, the eval model to use, and a rails object to snap responses from our model to a set of binary classification responses. We also instruct our model to provide explanations for its responses.

7. Run the tool calling evaluation:
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())
eval_model = BedrockModel(session=session, model_id="us.anthropic.claude-3-5-haiku-20241022-v1:0")
response_classifications = llm_classify(
    data=trace_df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=eval_model,
    rails=rails,
    provide_explanation=True,
)
response_classifications["score"] = response_classifications.apply(
    lambda x: 1 if x["label"] == "correct" else 0, axis=1
)

We use the following parameters:

- data – The dataframe of traces to evaluate
- template – The built-in tool calling evaluation prompt
- model – The Amazon Bedrock model used as the LLM judge
- rails – Snaps the model's responses to a set of binary classification labels ("correct" or "incorrect")
- provide_explanation – Instructs the model to explain its classification

8. Finally, log the evaluation results to Phoenix:
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=response_classifications),
)

After running these commands, you will see your evaluation results on the Phoenix dashboard, providing insights into how effectively your agent is using its available tools.

The following screenshot shows how the tool calling evaluation attribute shows up when you run the evaluation.

When you expand an individual trace, you can observe that the tool calling evaluation adds a score of 1 if the label is correct. This means that the agent has responded correctly.
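
If you also want a quick aggregate number in the notebook, the score column added in the evaluation step makes that a one-liner. The following sketch simply averages the scores produced above.

# Quick aggregate: fraction of evaluated tool calls labeled "correct".
tool_calling_accuracy = response_classifications["score"].mean()
print(f"Tool calling accuracy: {tool_calling_accuracy:.0%}")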

Conclusion

As AI agents become increasingly prevalent in enterprise applications, effective observability is crucial for facilitating their reliability, performance, and continuous improvement. The integration of Arize AI with Amazon Bedrock Agents provides developers with the tools they need to build, monitor, and enhance AI agent applications with confidence. We’re excited to see how this integration will empower developers and organizations to push the boundaries of what’s possible with AI agents.

Stay tuned for more updates and enhancements to this integration in the coming months. To learn more about Amazon Bedrock Agents and the Arize AI integration, refer to the Phoenix documentation and Integrating Arize AI and Amazon Bedrock Agents: A Comprehensive Guide to Tracing, Evaluation, and Monitoring.


About the Authors

Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

John Gilhuly is the Head of Developer Relations at Arize AI, focused on AI agent observability and evaluation tooling. He holds an MBA from Stanford and a B.S. in C.S. from Duke. Prior to joining Arize, John led GTM activities at Slingshot AI, and served as a venture fellow at Omega Venture Partners. In his pre-AI life, John built out and ran technical go-to-market teams at Branch Metrics.

Richa Gupta is a Sr. Solutions Architect at Amazon Web Services. She is passionate about architecting end-to-end solutions for customers. Her specialization is machine learning and how it can be used to build new solutions that lead to operational excellence and drive business revenue. Prior to joining AWS, she worked in the capacity of a Software Engineer and Solutions Architect, building solutions for large telecom operators. Outside of work, she likes to explore new places and loves adventurous activities.

Aris Tsakpinis is a Specialist Solutions Architect for Generative AI, focusing on open weight models on Amazon Bedrock and the broader generative AI open source landscape. Alongside his professional role, he is pursuing a PhD in Machine Learning Engineering at the University of Regensburg, where his research focuses on applied natural language processing in scientific domains.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Musarath Rahamathullah is an AI/ML and GenAI Solutions Architect at Amazon Web Services, focusing on media and entertainment customers. She holds a Master’s degree in Analytics with a specialization in Machine Learning. She is passionate about using AI solutions in the AWS Cloud to address customer challenges and democratize technology. Her professional background includes a role as a Research Assistant at the prestigious Indian Institute of Technology, Chennai. Beyond her professional endeavors, she is interested in interior architecture, focusing on creating beautiful spaces to live.
