AWS Machine Learning Blog
Evaluate Amazon Bedrock Agents with Ragas and LLM-as-a-judge

This post describes how to use the open source tools Ragas and Langfuse to evaluate how well Amazon Bedrock Agents automate complex tasks, enhance decision-making, and streamline operations. Using the LLM-as-a-judge technique, it evaluates the RAG, text-to-SQL, and chain-of-thought capabilities of Amazon Bedrock Agents, and uses the Langfuse platform to visualize and analyze the results. The solution addresses the technical challenges of end-to-end evaluation and experiment management in AI agent development, supports evaluating both single-agent and multi-agent setups, and provides comprehensive evaluation results and trace data.

🚀 **Agent evaluation framework**: The open source Bedrock Agent Evaluation framework runs an evaluation job given an agent ID, an evaluation model, and a dataset, processes the agent invocation traces with custom parsing logic, and evaluates the agent's responses.

📊 **Multi-dimensional evaluation metrics**: The framework uses multiple evaluation metrics, covering agent goal (assessed through chain-of-thought evaluation) and task accuracy (assessed through RAG and text-to-SQL evaluation), to measure agent performance comprehensively.

🛠️ **Technical challenges and solution**: To address the end-to-end agent evaluation and experiment management challenges in AI agent development, the solution provides a systematic way to track, compare, and measure the impact of configuration changes across agent versions, making it possible to optimize agent performance effectively.

🧪 **Simulated user interactions**: Agent behavior is evaluated in different scenarios, such as RAG and text-to-SQL tasks, by simulating user-agent interaction trajectories that include a question ID, question type, question text, and ground truth answer.

AI agents are quickly becoming an integral part of customer workflows across industries by automating complex tasks, enhancing decision-making, and streamlining operations. However, the adoption of AI agents in production systems requires scalable evaluation pipelines. Robust agent evaluation enables you to gauge how well an agent is performing certain actions and gain key insights into them, enhancing AI agent safety, control, trust, transparency, and performance optimization.

Amazon Bedrock Agents uses the reasoning of foundation models (FMs) available on Amazon Bedrock, APIs, and data to break down user requests, gather relevant information, and efficiently complete tasks—freeing teams to focus on high-value work. You can enable generative AI applications to automate multistep tasks by seamlessly connecting with company systems, APIs, and data sources.

Ragas is an open source library for testing and evaluating large language model (LLM) applications across various LLM use cases, including Retrieval Augmented Generation (RAG). The framework enables quantitative measurement of the effectiveness of the RAG implementation. In this post, we use the Ragas library to evaluate the RAG capability of Amazon Bedrock Agents.
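
As an illustration (not the framework's exact code), the following sketch shows how Ragas's classic evaluate() API can score a single RAG exchange on metrics such as faithfulness, answer relevancy, and context recall. The question, answer, contexts, and ground truth values are placeholders; by default Ragas calls an LLM and an embedding model to compute scores, so in practice you would pass Bedrock-backed wrappers through the llm and embeddings arguments.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

# One evaluation sample with placeholder values - in practice these come from
# the agent's answer, the retrieved passages, and the ground truth dataset
data = {
    "question": ["Was Abraham Lincoln the sixteenth President of the United States?"],
    "answer": ["Yes, Abraham Lincoln was the sixteenth President of the United States."],
    "contexts": [["Abraham Lincoln was the 16th President of the United States."]],
    "ground_truth": ["yes"],
}

dataset = Dataset.from_dict(data)

# evaluate() also accepts llm= and embeddings= arguments to plug in your own
# models (for example, Amazon Bedrock models wrapped for LangChain); omitted here.
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(results)
```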

LLM-as-a-judge is an evaluation approach that uses LLMs to assess the quality of AI-generated outputs. This method employs an LLM to act as an impartial evaluator, to analyze and score outputs. In this post, we employ the LLM-as-a-judge technique to evaluate the text-to-SQL and chain-of-thought capabilities of Amazon Bedrock Agents.
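
A minimal LLM-as-a-judge call could look like the following sketch, which uses the Amazon Bedrock Converse API to ask an evaluator model for a numeric score and a short explanation. The model ID, rubric, and prompt wording are illustrative assumptions, not the framework's exact evaluator prompts.

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative model ID - any Bedrock model that supports the Converse API works
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"


def judge(question: str, agent_answer: str, ground_truth: str) -> dict:
    """Ask the judge model to score an answer from 0 to 1 and explain why."""
    prompt = (
        "You are an impartial evaluator. Score the candidate answer against the "
        "ground truth on a scale from 0 to 1 and explain your reasoning.\n"
        f"Question: {question}\n"
        f"Candidate answer: {agent_answer}\n"
        f"Ground truth: {ground_truth}\n"
        'Respond with JSON only: {"score": <float>, "explanation": "<string>"}'
    )
    response = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])


print(judge("What is 2 + 2?", "The answer is 4.", "4"))
```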

Langfuse is an open source LLM engineering platform, which provides features such as traces, evals, prompt management, and metrics to debug and improve your LLM application.
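
For reference, the following sketch shows how evaluation results could be recorded in Langfuse using the v2-style Python SDK's low-level client. The keys, host, trace name, and score values are placeholders; the evaluation framework automates this wiring for you.

```python
from langfuse import Langfuse

# Placeholder credentials - use your Langfuse project's keys and host
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com",  # or your self-hosted Langfuse URL
)

# Record one agent question/answer pair as a trace
trace = langfuse.trace(
    name="rag-question-0",
    input={"question": "Was Abraham Lincoln the sixteenth President of the United States?"},
    output={"answer": "Yes."},
    metadata={"question_type": "RAG", "trajectory": "Trajectory0"},
)

# Attach evaluation scores to the trace
langfuse.score(trace_id=trace.id, name="faithfulness", value=0.87)
langfuse.score(trace_id=trace.id, name="answer_relevancy", value=0.68)

langfuse.flush()  # make sure buffered events are sent before the process exits
```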

In the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents, we showcased research agents for cancer biomarker discovery for pharmaceutical companies. In this post, we extend the prior work and showcase Open Source Bedrock Agent Evaluation with the following capabilities:

- Evaluating single agents as well as multi-agent collaboration setups built on Amazon Bedrock
- Built-in evaluation logic for RAG (with Ragas), text-to-SQL, and chain-of-thought reasoning (with LLM-as-a-judge)
- Integration with Langfuse for viewing evaluation results and agent invocation traces

First, we conduct evaluations on a variety of different Amazon Bedrock Agents. These include a sample RAG agent, a sample text-to-SQL agent, and pharmaceutical research agents that use multi-agent collaboration for cancer biomarker discovery. Then, for each agent, we showcase navigating the Langfuse dashboard to view traces and evaluation results.

Technical challenges

Today, AI agent developers generally face the following technical challenges:

- End-to-end agent evaluation – Assessing not just the agent's final answer but also its reasoning and tool use across capabilities such as RAG and text-to-SQL
- Experiment management – Tracking, comparing, and measuring the impact of configuration changes across different agent versions

Solution overview

The following figure illustrates how Open Source Bedrock Agent Evaluation works at a high level. The framework runs an evaluation job that invokes your own agent in Amazon Bedrock and evaluates its response.

The workflow consists of the following steps:

1. The user specifies the agent ID, agent alias, evaluation model, and a dataset containing question and ground truth pairs.
2. The user runs the evaluation job, which invokes the specified Amazon Bedrock agent (see the sketch after this list).
3. The retrieved agent invocation traces are run through custom parsing logic in the framework.
4. The framework conducts an evaluation based on the agent invocation results and the question type:
   - Chain-of-thought – LLM-as-a-judge with Amazon Bedrock LLM calls (conducted on every evaluation run, for all question types)
   - RAG – Ragas evaluation library
   - Text-to-SQL – LLM-as-a-judge with Amazon Bedrock LLM calls
5. Evaluation results and parsed traces are gathered and sent to Langfuse for evaluation insights.
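
To make steps 2 and 3 concrete, the following is a minimal sketch of invoking an Amazon Bedrock agent with tracing enabled and collecting the completion text and raw trace events with boto3. The agent ID, alias ID, and question are placeholders, and the framework's own parsing logic is more involved than the simple collection shown here.

```python
import uuid

import boto3

# Assumed placeholders - substitute your own agent ID, alias ID, and question
AGENT_ID = "AGENT_ID"
AGENT_ALIAS_ID = "AGENT_ALIAS_ID"
QUESTION = "Was Abraham Lincoln the sixteenth President of the United States?"

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId=AGENT_ID,
    agentAliasId=AGENT_ALIAS_ID,
    sessionId=str(uuid.uuid4()),
    inputText=QUESTION,
    enableTrace=True,  # emit trace events alongside the completion
)

answer_parts, trace_events = [], []
for event in response["completion"]:  # the completion is an event stream
    if "chunk" in event:
        answer_parts.append(event["chunk"]["bytes"].decode("utf-8"))
    elif "trace" in event:
        trace_events.append(event["trace"]["trace"])

agent_answer = "".join(answer_parts)
print(agent_answer)
print(f"Collected {len(trace_events)} trace events")
```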

Prerequisites

To deploy the sample RAG and text-to-SQL agents and follow along with evaluating them using Open Source Bedrock Agent Evaluation, follow the instructions in Deploying Sample Agents for Evaluation.

To bring your own agent to evaluate with this framework, refer to the following README and follow the detailed instructions to deploy the Open Source Bedrock Agent Evaluation framework.

Overview of evaluation metrics and input data

First, we create sample Amazon Bedrock agents to demonstrate the capabilities of Open Source Bedrock Agent Evaluation. The text-to-SQL agent uses the BirdSQL Mini-Dev dataset, and the RAG agent uses the Hugging Face rag-mini-wikipedia dataset.

Evaluation metrics

The Open Source Bedrock Agent Evaluation framework conducts evaluations on two broad types of metrics:

Agent goal metrics measure how well an agent identifies and achieves the goals of the user. There are two main types: reference-based evaluation and evaluation without reference. Examples can be found in Agent Goal accuracy as defined by Ragas.

We showcase evaluation without reference using chain-of-thought evaluation, comparing the agent's reasoning against the agent's instructions. For this evaluation, we use metrics from the evaluator prompts for Amazon Bedrock LLM-as-a-judge. In this framework, chain-of-thought evaluations are run on every question that the agent is evaluated against.
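
The following sketch illustrates what such a chain-of-thought evaluation prompt could look like, scoring the reasoning captured in the agent trace against the agent's instructions on the helpfulness, faithfulness, and instruction-following dimensions reported later in this post. The rubric wording is an assumption, not the framework's exact evaluator prompt; the resulting prompt can be sent through any Bedrock judge call such as the one sketched earlier.

```python
COT_JUDGE_PROMPT = """You are evaluating an AI agent's chain-of-thought reasoning.

Agent instructions:
{instructions}

User question:
{question}

Agent reasoning (from the invocation trace):
{reasoning}

Final agent answer:
{answer}

Rate each dimension from 0 to 1:
- helpfulness: does the reasoning work toward answering the user's question?
- faithfulness: is the reasoning consistent with the information the agent retrieved?
- instruction_following: does the reasoning respect the agent instructions?

Respond with JSON only:
{{"helpfulness": <float>, "faithfulness": <float>, "instruction_following": <float>, "explanation": "<string>"}}
"""


def build_cot_prompt(instructions: str, question: str, reasoning: str, answer: str) -> str:
    """Fill the rubric template; send the result to a Bedrock judge model."""
    return COT_JUDGE_PROMPT.format(
        instructions=instructions, question=question, reasoning=reasoning, answer=answer
    )
```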

Task accuracy metrics measure how well an agent calls the required tools to complete a given task. For the two task accuracy evaluation types, RAG and text-to-SQL, evaluations are conducted by comparing the agent's actual answer against the ground truth provided in the input dataset. Task accuracy metrics are only evaluated when the corresponding tool is used to answer the question.

The following is a breakdown of the key metrics used in each evaluation type included in the framework:

- Chain-of-thought – Helpfulness, faithfulness, and instruction following
- RAG – Semantic similarity, faithfulness, answer relevancy, and context recall
- Text-to-SQL – Answer correctness and SQL semantic equivalence (an illustrative execution-based check follows this list)
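
The framework scores text-to-SQL with LLM-as-a-judge. Purely as an illustration of what answer correctness means here, the following sketch swaps in an execution-based check: it runs a generated query and the ground truth query against a local SQLite copy of the data and compares the result sets. The database path, table, and queries are hypothetical.

```python
import sqlite3


def execution_match(generated_sql: str, ground_truth_sql: str, db_path: str) -> bool:
    """Return True when both queries produce the same result set."""
    with sqlite3.connect(db_path) as conn:
        generated_rows = conn.execute(generated_sql).fetchall()
        ground_truth_rows = conn.execute(ground_truth_sql).fetchall()
    # Order-insensitive comparison; real scoring would also handle column aliases,
    # floating-point tolerance, and queries that legitimately return no rows.
    return sorted(generated_rows) == sorted(ground_truth_rows)


# Hypothetical usage against a local copy of the frpm table:
# execution_match(agent_sql, ground_truth["ground_truth_sql_query"], "bird_mini_dev.sqlite")
```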

User-agent trajectories

The input dataset is in the form of trajectories, where each trajectory consists of one or more questions to be answered by the agent. The trajectories are meant to simulate how a user might interact with the agent. Each trajectory consists of a unique question_id, question_type, question, and ground_truth information. The following are examples of actual trajectories used to evaluate each type of agent in this post.

For simpler agent setups like the RAG and text-to-SQL sample agents, we created trajectories consisting of a single question, as shown in the following examples.

The following is an example of a RAG sample agent trajectory:

{   "Trajectory0": [        {           "question_id": 0,           "question_type": "RAG",         "question": "Was Abraham Lincoln the sixteenth President of the United States?",            "ground_truth": "yes"       }   ]}

The following is an example of a text-to-SQL sample agent trajectory:

{ "Trajectory1": [        {           "question_id": 1,           "question": "What is the highest eligible free rate for K-12 students in the schools in Alameda County?",           "question_type": "TEXT2SQL",            "ground_truth": {               "ground_truth_sql_query": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1",               "ground_truth_sql_context": "[{'table_name': 'frpm', 'columns': [('cdscode', 'varchar'), ('academic year', 'varchar'), ...",                "ground_truth_query_result": "1.0",             "ground_truth_answer": "The highest eligible free rate for K-12 students in schools in Alameda County is 1.0."      }   ]}

Pharmaceutical research agent use case example

In this section, we demonstrate how you can use the Open Source Bedrock Agent Evaluation framework to evaluate the pharmaceutical research agents discussed in the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents. That post showcases a variety of specialized agents, including a biomarker database analyst, statistician, clinical evidence researcher, and medical imaging expert, collaborating with a supervisor agent.

The pharmaceutical research agent was built using the multi-agent collaboration feature of Amazon Bedrock. The following diagram shows the multi-agent setup that was evaluated using this framework.

As shown in the diagram, the RAG evaluations will be conducted on the clinical evidence researcher sub-agent. Similarly, text-to-SQL evaluations will be run on the biomarker database analyst sub-agent. The chain-of-thought evaluation evaluates the final answer of the supervisor agent to check if it properly orchestrated the sub-agents and answered the user’s question.

Research agent trajectories

For a more complex setup like the pharmaceutical research agents, we used a set of industry-relevant, pregenerated test questions. We grouped questions by topic, regardless of which sub-agents might be invoked to answer them, creating trajectories that include multiple questions spanning multiple types of tool use. With relevant questions already generated, integrating with the evaluation framework simply required formatting the ground truth data into trajectories.

We walk through evaluating this agent against a trajectory containing a RAG question and a text-to-SQL question:

{    "Trajectory1": [        {           "question_id": 3,           "question_type": "RAG",         "question": "According to the knowledge base, how did the EGF pathway associate with CT imaging features?",         "ground_truth": "The EGF pathway was significantly correlated with the presence of ground-glass opacity and irregular nodules or nodules with poorly defined margins."      },      {           "question_id": 4,           "question_type": "TEXT2SQL",            "question": "According to the database, What percentage of patients have EGFR mutations?",          "ground_truth": {               "ground_truth_sql_query": "SELECT (COUNT(CASE WHEN EGFR_mutation_status = 'Mutant' THEN 1 END) * 100.0 / COUNT(*)) AS percentage FROM clinical_genomic;",               "ground_truth_sql_context": "Table clinical_genomic: - Case_ID: VARCHAR(50) - EGFR_mutation_status: VARCHAR(50)",               "ground_truth_query_result": "14.285714",               "ground_truth_answer": "According to the query results, approximately 14.29% of patients in the clinical_genomic table have EGFR mutations."            }       }   ]}

Chain-of-thought evaluations are conducted for every question, regardless of tool use. This is illustrated in the following screenshots of agent traces and evaluations on the Langfuse dashboard.

After running the agent against the trajectory, the results are sent to Langfuse to view the metrics. The following screenshot shows the trace of the RAG question (question ID 3) evaluation on Langfuse.

The screenshot displays the parsed agent trace for this question along with its RAG evaluation scores.

The following screenshot shows the trace of the text-to-SQL question (question ID 4) evaluation on Langfuse, which evaluated the biomarker database analyst agent that generates SQL queries to run against an Amazon Redshift database containing biomarker information.

The screenshot shows the parsed agent trace for this question along with its text-to-SQL evaluation scores.

The chain-of-thought evaluation is included in both questions' evaluation traces. For both traces, LLM-as-a-judge is used to generate scores and an explanation of the Amazon Bedrock agent's reasoning on the given question.

Overall, we ran 56 questions grouped into 21 trajectories against the agent. The traces, model costs, and scores are shown in the following screenshot.

The following table contains the average evaluation scores across 56 evaluation traces.

| Metric Category | Metric Type | Metric Name | Number of Traces | Metric Avg. Value |
|---|---|---|---|---|
| Agent Goal | COT | Helpfulness | 50 | 0.77 |
| Agent Goal | COT | Faithfulness | 50 | 0.87 |
| Agent Goal | COT | Instruction following | 50 | 0.69 |
| Agent Goal | COT | Overall (average of all metrics) | 50 | 0.77 |
| Task Accuracy | TEXT2SQL | Answer correctness | 26 | 0.83 |
| Task Accuracy | TEXT2SQL | SQL semantic equivalence | 26 | 0.81 |
| Task Accuracy | RAG | Semantic similarity | 20 | 0.66 |
| Task Accuracy | RAG | Faithfulness | 20 | 0.5 |
| Task Accuracy | RAG | Answer relevancy | 20 | 0.68 |
| Task Accuracy | RAG | Context recall | 20 | 0.53 |
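
Averages like these can be produced by aggregating the per-trace scores, for example after exporting them from Langfuse. The following sketch is a minimal illustration using pandas on an assumed list of score records; the framework and the Langfuse dashboard compute these summaries for you.

```python
import pandas as pd

# Assumed export format: one record per (trace, metric) score
scores = [
    {"metric_type": "COT", "metric_name": "Helpfulness", "value": 0.80},
    {"metric_type": "COT", "metric_name": "Helpfulness", "value": 0.74},
    {"metric_type": "RAG", "metric_name": "Faithfulness", "value": 0.50},
    # ...one row per evaluated trace and metric
]

df = pd.DataFrame(scores)
summary = (
    df.groupby(["metric_type", "metric_name"])["value"]
    .agg(["count", "mean"])
    .rename(columns={"count": "Number of Traces", "mean": "Metric Avg. Value"})
)
print(summary)
```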

Security considerations

Consider the following security measures:

Clean up

If you deployed the sample agents, run the following notebooks to delete the resources created.

If you chose the self-hosted Langfuse option, follow these steps to clean up your AWS self-hosted Langfuse setup.

Conclusion

In this post, we introduced the Open Source Bedrock Agent Evaluation framework, a Langfuse-integrated solution that streamlines the agent development process. The framework comes with built-in evaluation logic for RAG, text-to-SQL, and chain-of-thought reasoning, and integrates with Langfuse for viewing evaluation metrics. With Open Source Bedrock Agent Evaluation, developers can quickly evaluate their agents and rapidly experiment with different configurations, accelerating the development cycle and improving agent performance.

We demonstrated how this evaluation framework can be integrated with pharmaceutical research agents. We used it to evaluate agent performance against biomarker questions and sent traces to Langfuse to view evaluation metrics across question types.

The Open Source Bedrock Agent Evaluation framework enables you to accelerate your generative AI application building process using Amazon Bedrock Agents. To self-host Langfuse in your AWS account, see Hosting Langfuse on Amazon ECS with Fargate using CDK Python. To explore how you can streamline your Amazon Bedrock Agents evaluation process, get started with Open Source Bedrock Agent Evaluation.

Refer to Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications from the Amazon Bedrock team to learn more about multi-agent collaboration and end-to-end agent evaluation.


About the authors

Hasan Poonawala is a Senior AI/ML Solutions Architect at AWS, working with healthcare and life sciences customers. Hasan helps design, deploy, and scale generative AI and machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development, and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Blake Shin is an Associate Specialist Solutions Architect at AWS who enjoys learning about and working with new AI/ML technologies. In his free time, Blake enjoys exploring the city and playing music.

Rishiraj Chandra is an Associate Specialist Solutions Architect at AWS, passionate about building innovative artificial intelligence and machine learning solutions. He is committed to continuously learning and implementing emerging AI/ML technologies. Outside of work, Rishiraj enjoys running, reading, and playing tennis.
