AWS Machine Learning Blog
Advanced tracing and evaluation of generative AI agents using LangChain and Amazon SageMaker AI MLFlow

This post explores how Amazon SageMaker AI with MLflow can streamline the experimentation, evaluation, and optimization of generative AI agents. It walks through building an agent with the LangGraph framework and enabling detailed tracing and evaluation. Using evaluation metrics such as RAGAS, it shows how MLflow can be customized to track custom and third-party metrics. The post highlights the key advantages of SageMaker AI with MLflow for agent development, including scalability, integrated tracing, visualization, and familiarity for existing ML teams. It also provides detailed steps for configuring tracing and emphasizes the importance of tracing for performance monitoring, debugging, explainability, optimization, compliance, cost tracking, and adaptive learning, helping developers build more reliable and efficient generative AI agents.

🔍 Developing generative AI agents presents many challenges, including unpredictable behavior, complex workflows, and intricate interactions. SageMaker AI with MLflow provides a powerful solution that streamlines the experimentation process for generative AI agents.

📈 MLflow's tracing capability is essential for understanding agent behavior: it lets you observe, record, and analyze an agent's internal execution path, making it possible to pinpoint errors, evaluate decision-making, and improve system reliability.

💡 SageMaker AI with MLflow builds on MLflow's open source foundation to provide a robust platform for managing machine learning workflows, including experiment tracking, model registry, deployment, and metrics comparison.

⚙️ Key features of SageMaker AI with MLflow include experiment tracking, agent versioning, and unified agent governance. These capabilities help developers manage and improve agents efficiently.

✅ With MLflow's tracing capabilities, developers can gain deep insight into the behavior of generative AI agents, optimize their performance, and help ensure they run reliably and securely.

Developing generative AI agents that can tackle real-world tasks is complex, and building production-grade agentic applications requires integrating agents with additional tools such as user interfaces, evaluation frameworks, and continuous improvement mechanisms. Developers often find themselves grappling with unpredictable behaviors, intricate workflows, and a web of complex interactions. The experimentation phase for agents is particularly challenging, often tedious and error prone. Without robust tracking mechanisms, developers face daunting tasks such as identifying bottlenecks, understanding agent reasoning, ensuring seamless coordination across multiple tools, and optimizing performance. These challenges make the process of creating effective and reliable AI agents a formidable undertaking, requiring innovative solutions to streamline development and enhance overall system reliability.

In this context, Amazon SageMaker AI with MLflow offers a powerful solution to streamline generative AI agent experimentation. For this post, I use LangChain’s popular open source LangGraph agent framework to build an agent and show how to enable detailed tracing and evaluation of LangGraph generative AI agents. This post explores how Amazon SageMaker AI with MLflow can help you as a developer and a machine learning (ML) practitioner efficiently experiment, evaluate generative AI agent performance, and optimize your applications for production readiness. I also show you how to introduce advanced evaluation metrics with Retrieval Augmented Generation Assessment (RAGAS), illustrating how MLflow can be customized to track custom and third-party metrics such as those from RAGAS.

The need for advanced tracing and evaluation in generative AI agent development

A crucial functionality for experimentation is the ability to observe, record, and analyze the internal execution path of an agent as it processes a request. This is essential for pinpointing errors, evaluating decision-making processes, and improving overall system reliability. Tracing workflows not only aids in debugging but also ensures that agents perform consistently across diverse scenarios.

Further complexity arises from the open-ended nature of the tasks that generative AI agents perform, such as text generation, summarization, or question answering. Unlike traditional software testing, evaluating generative AI agents requires new metrics and methodologies that go beyond basic accuracy or latency measures. You must assess multiple dimensions—such as correctness, toxicity, relevance, coherence, tool-call accuracy, and groundedness—while also tracing execution paths to identify errors or bottlenecks.

Why SageMaker AI with MLflow?

Amazon SageMaker AI, which provides a fully managed version of the popular open source MLflow, offers a robust platform for machine learning experimentation and generative AI management. This combination is particularly powerful for working with generative AI agents. SageMaker AI with MLflow builds on MLflow’s open source legacy as a tool widely adopted for managing machine learning workflows, including experiment tracking, model registry, deployment, and metrics comparison with visualization.

This evolution positions SageMaker AI with MLflow as a unified platform for both traditional ML and cutting-edge generative AI agent development.

Key features of SageMaker AI with MLflow

The capabilities of SageMaker AI with MLflow directly address the core challenges of agentic experimentation—tracing agent behavior, evaluating agent performance, and unified governance.

    Experiment tracking: Compare different runs of the LangGraph agent and track changes in performance across iterations.
    Agent versioning: Keep track of different versions of the agent throughout its development lifecycle to iteratively refine and improve agents.
    Unified agent governance: Agents registered in SageMaker AI with MLflow automatically appear in the SageMaker AI with MLflow console, enabling a collaborative approach to management, evaluation, and governance across teams.
    Scalable infrastructure: Use the managed infrastructure of SageMaker AI to run large-scale experiments without worrying about resource management.

LangGraph generative AI agents

LangGraph offers a powerful and flexible approach to designing generative AI agents tailored to your company’s specific needs. LangGraph’s controllable agent framework is engineered for production use, providing low-level customization options to craft bespoke solutions.

In this post, I show you how to create a simple finance assistant agent equipped with a tool to retrieve financial data from a datastore, as depicted in the following diagram. This post’s sample agent, along with all necessary code, is available on the GitHub repository, ready for you to replicate and adapt it for your own applications.
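
To make the structure concrete, the following is a minimal sketch of what such a finance assistant might look like in LangGraph. It assumes a hypothetical get_stock_price tool, illustrative placeholder data, and a Claude model on Amazon Bedrock through ChatBedrockConverse; the complete agent lives in the repository.

from langchain_aws import ChatBedrockConverse
from langchain_core.tools import tool
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def get_stock_price(symbol: str) -> str:
    """Look up the latest price for a ticker symbol in the datastore."""
    prices = {"AMZN": "178.25", "MSFT": "410.10"}  # placeholder datastore for illustration
    return prices.get(symbol.upper(), "unknown")

llm = ChatBedrockConverse(model="anthropic.claude-3-haiku-20240307-v1:0")  # assumed model ID
llm_with_tools = llm.bind_tools([get_stock_price])

def assistant(state: MessagesState):
    # The model either answers directly or emits a tool call
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode([get_stock_price]))
builder.add_edge(START, "assistant")
builder.add_conditional_edges("assistant", tools_condition)  # route to the tool node or end
builder.add_edge("tools", "assistant")
graph = builder.compile()

Later snippets in this post refer to a compiled graph object like this one when enabling tracing and evaluation.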

Solution code

You can follow and execute the full example code from the aws-samples GitHub repository. I use snippets from the code in the repository to illustrate evaluation and tracking approaches in the remainder of this post.

Prerequisites

To follow this post, you need an AWS account with access to Amazon SageMaker AI, a SageMaker AI domain with a running MLflow tracking server, Amazon Bedrock model access for the models used by the agent, and the sample code from the aws-samples GitHub repository.

Trace generative AI agents with SageMaker AI with MLflow

MLflow’s tracing capabilities are essential for understanding the behavior of your LangGraph agent. MLflow tracking provides an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for later visualizing the results.

MLflow tracing is a feature that enhances observability in your generative AI agent by capturing detailed information about the execution of the agent services, nodes, and tools. Tracing provides a way to record the inputs, outputs, and metadata associated with each intermediate step of a request, enabling you to easily pinpoint the source of bugs and unexpected behaviors.
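
Before any traces appear, MLflow has to be pointed at the SageMaker AI with MLflow tracking server. The following is a minimal sketch, assuming the sagemaker-mlflow plugin is installed; the tracking server ARN and experiment name are placeholders.

import mlflow

# Placeholder ARN of the SageMaker AI with MLflow tracking server
tracking_server_arn = "arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>"
mlflow.set_tracking_uri(tracking_server_arn)
mlflow.set_experiment("langgraph-agent-tracing")  # illustrative experiment name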

The MLflow tracking UI displays the traces exported under the MLflow Traces tab for the selected MLflow experiment, as shown in the following image.

Furthermore, you can see the detailed trace for an agent input or prompt invocation by choosing its Request ID, which opens a collapsible view with the results captured at each step of the invocation workflow, from input to the final output, as shown in the following image.

SageMaker AI with MLflow traces all the nodes in the LangGraph agent and displays the trace in the MLflow UI with detailed inputs, outputs, usage tokens, and multi-sequence messages with origin type (human, tool, AI) for each node. The display also captures the execution time over the entire agentic workflow, providing a per-node breakdown of time. Overall, tracing is crucial for generative AI agents for the following reasons:

    Performance monitoring: Track how long each node and tool call takes across the workflow.
    Debugging: Pinpoint where errors or unexpected outputs originate.
    Explainability: Understand how the agent reached a decision at each step.
    Optimization: Identify bottlenecks and opportunities to improve latency and quality.
    Compliance: Keep an auditable record of agent inputs, outputs, and intermediate steps.
    Cost tracking: Monitor token usage per node and per invocation.
    Adaptive learning: Use recorded traces to refine prompts, tools, and agent logic over time.

In the MLflow UI, you can choose the Task name to see details captured at any agent step as it services the input request prompt or invocation, as shown in the following image.

By implementing proper tracing, you can gain deeper insights into your generative AI agents’ behavior, optimize their performance, and make sure that they operate reliably and securely.

Configure tracing for the agent

For fine-grained control and flexibility in tracking, you can use MLflow’s tracing decorator APIs. With these APIs, you can add tracing to specific agentic nodes, functions, or code blocks with minimal modifications.

@mlflow.trace(name="assistant", attributes={"workflow": "agent_assistant"}, span_type="graph.py")
def assistant(state: GraphState):
    ...

With this approach, you can specify exactly what you want to track in your experiment. Additionally, MLflow offers out-of-the-box tracing compatibility with LangChain for basic tracing through MLflow’s autologging feature, mlflow.langchain.autolog(). With SageMaker AI with MLflow, you can gain deep insights into the LangGraph agent’s performance and behavior, facilitating easier debugging, optimization, and monitoring in both development and production environments.
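
As a quick illustration, the following sketch enables autologging and invokes the compiled agent inside an MLflow run so that the trace is captured against that run; the graph object and the prompt come from the earlier sketch and are assumptions, not the repository’s exact code.

import mlflow

# Trace LangChain and LangGraph calls automatically
mlflow.langchain.autolog()

with mlflow.start_run(run_name="agent-tracing-demo"):
    # `graph` is assumed to be the compiled LangGraph agent
    response = graph.invoke(
        {"messages": [("user", "What was the closing stock price for AMZN?")]}
    )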

Evaluate with MLflow

You can use MLflow’s evaluation capabilities to help assess the performance of the LangGraph large language model (LLM) agent and objectively measure its effectiveness in various scenarios. The important aspects of evaluation are:

    An evaluation dataset with ground truth answers to compare the agent’s responses against.
    Built-in metrics selected through the model type (for example, question answering).
    Custom or third-party metrics added through the extra_metrics parameter.

The following snippet shows how mlflow.evaluate() can be used to run an evaluation on agents. You can follow this example by running the code in the same aws-samples GitHub repository.

results = mlflow.evaluate(
    agent_responses,                    # Agent-generated answers to test queries
    targets="ground_truth",             # Reference "correct" answers for comparison
    model_type="question-answering",    # Predefined metrics for QA tasks
    extra_metrics=metrics,              # Evaluation metrics to include
)

This code snippet employs MLflow’s evaluate() function to rigorously assess the performance of a LangGraph LLM agent, comparing its responses to a predefined ground truth dataset that’s maintained in the golden_questions_answer.jsonl file in the aws-samples GitHub repository. By specifying model_type="question-answering", MLflow applies relevant evaluation metrics for question-answering tasks, such as accuracy and coherence. Additionally, the extra_metrics parameter allows you to incorporate custom, domain-specific metrics tailored to the agent’s application, enabling a comprehensive and nuanced evaluation beyond standard benchmarks. The results of this evaluation are then logged in MLflow (as shown in the following image), providing a centralized and traceable record of the agent’s performance, facilitating iterative improvement and informed deployment decisions. The MLflow evaluation is captured as part of the MLflow execution run.
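
For context, the following sketch shows one way the agent_responses DataFrame might be assembled from the golden questions file before calling mlflow.evaluate(); the column names, the prompt format, and the graph object are assumptions rather than the repository’s exact code.

import json
import pandas as pd

with open("golden_questions_answer.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]
eval_df = pd.DataFrame(records)  # assumed to contain "question" and "ground_truth" columns

# Run the agent on each golden question and keep the final answer text
answers = []
for question in eval_df["question"]:
    state = graph.invoke({"messages": [("user", question)]})
    answers.append(state["messages"][-1].content)

agent_responses = eval_df.assign(answer=answers)

Depending on how the data is passed, mlflow.evaluate() can also take a predictions argument that names the column holding the agent output (for example, predictions="answer").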

You can open the SageMaker AI with MLflow tracking server and see the list of MLflow execution runs for the specified MLflow experiment, as shown in the following image.

The evaluation metrics are captured within the MLflow execution along with model metrics and the accompanying artifacts, as shown in the following image.

Furthermore, the evaluation metrics are also displayed under the Model metrics tab within a selected MLflow execution run, as shown in the following image.

Finally, as shown in the following image, you can compare different variations and versions of the agent during the development phase by selecting the checkboxes of the MLflow execution runs you want to compare in the MLflow UI and choosing Compare. This helps you select the best-functioning agent version for deployment or inform other decision-making processes in agent development.
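
If you prefer to compare runs programmatically rather than through the UI, mlflow.search_runs() returns the runs of an experiment as a pandas DataFrame; the experiment name below is an assumption.

import mlflow

runs = mlflow.search_runs(experiment_names=["langgraph-agent-evaluation"])
metric_columns = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["run_id", "start_time"] + metric_columns].head())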

Register the LangGraph agent

You can use SageMaker AI with MLflow artifacts to register the LangGraph agent along with any other items you require or have produced. All the artifacts are stored in the Amazon Simple Storage Service (Amazon S3) bucket configured for the SageMaker AI with MLflow tracking server. Registering the LangGraph agent is crucial for governance and lifecycle management. It provides a centralized repository for tracking, versioning, and deploying the agents. Think of it as a catalog of your validated AI assets.

As shown in the following figure, you can see the artifacts captured under the Artifact tab within the MLflow execution run.

MLflow automatically captures and logs agent-related files such as the evaluation results and the consumed libraries in the requirements.txt file. Furthermore, a LangGraph agent that has been successfully logged as an MLflow model can be loaded and used for inference using mlflow.langchain.load_model(model_uri). Registering the generative AI agent after rigorous evaluation helps ensure that you’re promoting a proven and validated agent to production. This practice helps prevent the deployment of poorly performing or unreliable agents, helping to safeguard the user experience and the integrity of your applications. Post-evaluation registration is critical to make sure that the experiment with the best result is the one that gets promoted to production.
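
As a rough sketch of that flow, the following logs the agent with MLflow’s langchain flavor using the models-from-code approach, where lc_model points to a script that builds the graph and calls mlflow.models.set_model(graph), and then reloads it. The script name and registered model name are assumptions.

import mlflow

with mlflow.start_run(run_name="register-agent"):
    model_info = mlflow.langchain.log_model(
        lc_model="graph.py",                              # script that defines and sets the compiled graph
        artifact_path="langgraph_agent",
        registered_model_name="finance-assistant-agent",  # assumed registered model name
    )

# Reload the validated agent for inference
loaded_agent = mlflow.langchain.load_model(model_info.model_uri)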

Use MLflow to experiment and evaluate with external libraries (such as RAGAS)

MLflow’s flexibility allows for seamless integration with external libraries, enhancing your ability to experiment and evaluate LangChain LangGraph agents. You can extend SageMaker MLflow to include external evaluation libraries such as RAGAS for comprehensive LangGraph agent assessment. This integration enables ML practitioners to use RAGAS’s specialized LLM evaluation metrics while benefiting from MLflow’s experiment tracking and visualization capabilities. By logging RAGAS metrics directly to SageMaker AI with MLflow, you can easily compare different versions of the LangGraph agent across multiple runs, gaining deeper insights into its performance.

RAGAS is an open source library that provides tools specifically for evaluating LLM applications and generative AI agents. RAGAS includes a method, ragas.evaluate(), that runs evaluations for LLM agents with a choice of LLM models (evaluators) for scoring the evaluation and an extensive list of default metrics. To incorporate RAGAS metrics into your MLflow experiments, you can use the following approach.

You can follow this example by running the additional_evaluations_with_ragas.ipynb notebook in the GitHub repository.
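
For reference, the sketch below shows what the inputs assumed by the following snippet might look like: a few RAGAS metrics and a list of records pairing questions with agent answers, retrieved context, and reference answers. The field values are illustrative, the field names follow RAGAS’s single-turn sample schema, and metrics_final and ragas_dataset mirror the names used in the notebook.

from ragas.metrics import answer_correctness, answer_relevancy, faithfulness

metrics_final = [faithfulness, answer_relevancy, answer_correctness]

ragas_dataset = [
    {
        "user_input": "What was the closing stock price for AMZN?",            # question sent to the agent
        "response": "Amazon (AMZN) closed at $178.25.",                        # agent's answer
        "retrieved_contexts": ["AMZN closed at $178.25 on the reporting date."],
        "reference": "The stock closed at $178.25.",                           # ground truth answer
    },
]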

from ragas import EvaluationDataset
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluation_dataset = EvaluationDataset.from_list(ragas_dataset)
evaluator_llm = LangchainLLMWrapper(llm_for_evaluation)
result = evaluate(
    dataset=evaluation_dataset,
    metrics=metrics_final,
    llm=evaluator_llm,
    embeddings=bedrock_embeddings,
)
result

The evaluation results using RAGAS metrics from the above code are shown in the following figure.

Subsequently, the computed RAGAS evaluation metrics can be exported and tracked in the SageMaker AI with MLflow tracking server as part of the MLflow experimentation run. See the following code snippet for illustration; the full code can be found in the notebook in the same aws-samples GitHub repository.

with mlflow.start_run(
    experiment_id=get_experiment_id(_MLFLOW_RAGAS_EXPERIMENT_NAME),
    run_name=timestamp,
    tags={
        "project": os.getenv('PROJECT'),
        "model": os.getenv('MODELID'),
        "version": os.getenv('VERSION')
    }
):
    # Log the dataset to MLflow
    mlflow.log_input(dataset, context="ragas_eval_results")
    for ragas_metric in [faithfulness, answer_relevancy, answer_correctness]:
        print(ragas_metric.name)
        mean = ragas_result_ds[ragas_metric.name].mean()
        p90 = ragas_result_ds[ragas_metric.name].quantile(0.9)
        variance = ragas_result_ds[ragas_metric.name].var()
        print(mean, p90, variance)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/mean", mean)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/p90", p90)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/variance", variance)
mlflow.end_run()

You can view the RAGAS metrics logged by MLflow in the SageMaker AI with MLflow UI on the Model metrics tab, as shown in the following image.

From experimentation to production: Collaborative approval with SageMaker AI with MLflow tracing and evaluation

In a real-world deployment scenario, MLflow’s tracing and evaluation capabilities with LangGraph agents can significantly streamline the process of moving from experimentation to production.

Imagine a large team of data scientists and ML engineers working on an agentic platform, as shown in the following image. With MLflow, they can create sophisticated agents that can handle complex queries, process returns, and provide product recommendations. During the experimentation phase, the team can use MLflow to log different versions of the agent, tracking evaluation metrics such as response accuracy and latency. MLflow’s tracing feature allows them to analyze the agent’s decision-making process, identifying areas for improvement. The results across numerous experiments are automatically logged to SageMaker AI with MLflow. The team can use the MLflow UI to collaborate, compare, and select the best performing version of the agent and decide on a production-ready version, all informed by the diverse set of data logged in SageMaker AI with MLflow.

With this data, the team can present a clear, data-driven case to stakeholders for promoting the agent to production. Managers and compliance officers can review the agent’s performance history, examine specific interaction traces, and verify that the agent meets all necessary criteria. After being approved, the SageMaker AI with MLflow registered agent facilitates a smooth transition to deployment, helping to ensure that the exact version of the agent that passed evaluation is the one that goes live. This collaborative, traceable approach not only accelerates the development cycle but also instills confidence in the reliability and effectiveness of the generative AI agent in production.

Clean up

To avoid incurring unnecessary charges, use the following steps to clean up the resources used in this post:

    Remove the SageMaker AI with MLflow tracking server:
      In SageMaker Studio, stop and delete any running MLflow tracking server instances.
    Revoke Amazon Bedrock model access:
      Go to the Amazon Bedrock console.
      Navigate to Model access and remove access to any models you enabled for this project.
    Delete the SageMaker domain (if not needed):
      Open the SageMaker console.
      Navigate to the Domains section.
      Select the domain you created for this project.
      Choose Delete domain and confirm the action.
      Also delete any associated S3 buckets and IAM roles.

Conclusion

In this post, I showed you how to combine LangChain’s LangGraph, Amazon SageMaker AI, and MLflow to demonstrate a powerful workflow for developing, evaluating, and deploying sophisticated generative AI agents. This integration provides the tools needed to gain deep insights into the generative AI agent’s performance, iterate quickly, and maintain version control throughout the development process.

As the field of AI continues to advance, tools like these will be essential for managing the increasing complexity of generative AI agents and ensuring their effectiveness. Keep the following considerations in mind:

    Traceability is paramount: Effective tracing of agent execution paths using SageMaker MLflow is crucial for debugging, optimization, and helping to ensure consistent performance in complex generative AI workflows. Pinpoint issues, understand decision-making, examine interaction traces, and improve overall system reliability through detailed, recorded analysis of agent processes.
    Evaluation drives improvement: Standardized and customized evaluation metrics, using MLflow’s evaluate() function and integrations with external libraries like RAGAS, provide quantifiable insights into agent performance, guiding iterative refinement and informed deployment decisions.
    Collaboration and governance are essential: Unified governance facilitated by SageMaker AI with MLflow enables seamless collaboration across teams, from data scientists to compliance officers, helping to ensure responsible and reliable deployment of generative AI agents in production environments.

By embracing these principles and using the tools outlined in this post, developers and ML practitioners can confidently navigate the complexities of generative AI agent development and deployment, building robust and reliable applications that deliver real business value. Now, it’s your turn to unlock the potential of advanced tracing, evaluation, and collaboration in your agentic workflows! Dive into the aws-samples GitHub repository and start using the power of LangChain’s LangGraph, Amazon SageMaker AI, and MLflow for your generative AI projects.


About the Author

Sandeep Raveesh is a Generative AI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), generative AI agents, and scaling generative AI use-cases. He also focuses on go-to-market strategies helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.
