AWS Machine Learning Blog
Advanced tracing and evaluation of generative AI agents using LangChain and Amazon SageMaker AI MLFlow

This post explores how Amazon SageMaker AI with MLflow can streamline the experimentation, evaluation, and optimization of generative AI agents. It walks through building an agent with the LangGraph framework and enabling detailed tracing and evaluation. Using evaluation metrics such as RAGAS, it shows how MLflow can be customized to track custom and third-party metrics. The post highlights the key advantages of SageMaker AI with MLflow for agent development, including scalability, integrated tracing, visualization, and familiarity for existing ML teams. It also provides detailed steps for configuring tracing and emphasizes the importance of tracing for performance monitoring, debugging, explainability, optimization, compliance, cost tracking, and adaptive learning, helping developers build more reliable and efficient generative AI agents.

🔍 Developing generative AI agents presents many challenges, including unpredictable behavior, complex workflows, and intricate interactions. SageMaker AI with MLflow provides a powerful solution that streamlines the experimentation process for generative AI agents.

📈 MLflow's tracing capability is essential for understanding agent behavior: it lets you observe, record, and analyze an agent's internal execution path, making it possible to pinpoint errors, evaluate decision-making, and improve system reliability.

💡 SageMaker AI with MLflow builds on MLflow's open source foundation to provide a robust platform for managing machine learning workflows, including experiment tracking, model registry, deployment, and metrics comparison.

⚙️ Key features of SageMaker AI with MLflow include experiment tracking, agent versioning, and unified agent governance. These capabilities help developers manage and improve agents efficiently.

✅ With MLflow's tracing capabilities, developers can gain deep insight into the behavior of generative AI agents, optimize their performance, and help ensure they run reliably and securely.

Developing generative AI agents that can tackle real-world tasks is complex, and building production-grade agentic applications requires integrating agents with additional tools such as user interfaces, evaluation frameworks, and continuous improvement mechanisms. Developers often find themselves grappling with unpredictable behaviors, intricate workflows, and a web of complex interactions. The experimentation phase for agents is particularly challenging, often tedious and error prone. Without robust tracking mechanisms, developers face daunting tasks such as identifying bottlenecks, understanding agent reasoning, ensuring seamless coordination across multiple tools, and optimizing performance. These challenges make the process of creating effective and reliable AI agents a formidable undertaking, requiring innovative solutions to streamline development and enhance overall system reliability.

In this context, Amazon SageMaker AI with MLflow offers a powerful solution to streamline generative AI agent experimentation. For this post, I use LangChain’s popular open source LangGraph agent framework to build an agent and show how to enable detailed tracing and evaluation of LangGraph generative AI agents. This post explores how Amazon SageMaker AI with MLflow can help you as a developer and a machine learning (ML) practitioner efficiently experiment, evaluate generative AI agent performance, and optimize your applications for production readiness. I also show you how to introduce advanced evaluation metrics with Retrieval Augmented Generation Assessment (RAGAS), illustrating how MLflow can be customized to track custom and third-party metrics such as those from RAGAS.

The need for advanced tracing and evaluation in generative AI agent development

A crucial functionality for experimentation is the ability to observe, record, and analyze the internal execution path of an agent as it processes a request. This is essential for pinpointing errors, evaluating decision-making processes, and improving overall system reliability. Tracing workflows not only aids in debugging but also ensures that agents perform consistently across diverse scenarios.

Further complexity arises from the open-ended nature of the tasks that generative AI agents perform, such as text generation, summarization, or question answering. Unlike traditional software testing, evaluating generative AI agents requires new metrics and methodologies that go beyond basic accuracy or latency measures. You must assess multiple dimensions—such as correctness, toxicity, relevance, coherence, tool-call accuracy, and groundedness—while also tracing execution paths to identify errors or bottlenecks.

Why SageMaker AI with MLflow?

Amazon SageMaker AI, which provides a fully managed version of the popular open source MLflow, offers a robust platform for machine learning experimentation and generative AI management. This combination is particularly powerful for working with generative AI agents. SageMaker AI with MLflow builds on MLflow’s open source legacy as a tool widely adopted for managing machine learning workflows, including experiment tracking, model registry, deployment, and metrics comparison with visualization.

This evolution positions SageMaker AI with MLflow as a unified platform for both traditional ML and cutting-edge generative AI agent development.

Key features of SageMaker AI with MLflow

The capabilities of SageMaker AI with MLflow directly address the core challenges of agentic experimentation—tracing agent behavior, evaluating agent performance, and unified governance.

    Experiment tracking: Compare different runs of the LangGraph agent and track changes in performance across iterations.
    Agent versioning: Keep track of different versions of the agent throughout its development lifecycle to iteratively refine and improve agents.
    Unified agent governance: Agents registered in SageMaker AI with MLflow automatically appear in the SageMaker AI with MLflow console, enabling a collaborative approach to management, evaluation, and governance across teams.
    Scalable infrastructure: Use the managed infrastructure of SageMaker AI to run large-scale experiments without worrying about resource management.

LangGraph generative AI agents

LangGraph offers a powerful and flexible approach to designing generative AI agents tailored to your company’s specific needs. LangGraph’s controllable agent framework is engineered for production use, providing low-level customization options to craft bespoke solutions.

In this post, I show you how to create a simple finance assistant agent equipped with a tool to retrieve financial data from a datastore, as depicted in the following diagram. This post’s sample agent, along with all necessary code, is available on the GitHub repository, ready for you to replicate and adapt it for your own applications.
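
To make the structure concrete, the following is a minimal sketch of what such a finance assistant might look like in LangGraph. It assumes a hypothetical get_stock_price tool, illustrative placeholder data, and a Claude model on Amazon Bedrock through ChatBedrockConverse; the complete agent lives in the repository.

from langchain_aws import ChatBedrockConverse
from langchain_core.tools import tool
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def get_stock_price(symbol: str) -> str:
    """Look up the latest price for a ticker symbol in the datastore."""
    prices = {"AMZN": "178.25", "MSFT": "410.10"}  # placeholder datastore for illustration
    return prices.get(symbol.upper(), "unknown")

llm = ChatBedrockConverse(model="anthropic.claude-3-haiku-20240307-v1:0")  # assumed model ID
llm_with_tools = llm.bind_tools([get_stock_price])

def assistant(state: MessagesState):
    # The model either answers directly or emits a tool call
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode([get_stock_price]))
builder.add_edge(START, "assistant")
builder.add_conditional_edges("assistant", tools_condition)  # route to the tool node or end
builder.add_edge("tools", "assistant")
graph = builder.compile()

Later snippets in this post refer to a compiled graph object like this one when enabling tracing and evaluation.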

Solution code

You can follow and execute the full example code from the aws-samples GitHub repository. I use snippets from the code in the repository to illustrate evaluation and tracking approaches in the remainder of this post.

Prerequisites

To follow this post, you need an AWS account with access to Amazon SageMaker AI, a SageMaker AI domain with a running MLflow tracking server, Amazon Bedrock model access for the models used by the agent, and the sample code from the aws-samples GitHub repository.

Trace generative AI agents with SageMaker AI with MLflow

MLflow’s tracing capabilities are essential for understanding the behavior of your LangGraph agent. MLflow tracking provides an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for later visualizing the results.

MLflow tracing is a feature that enhances observability in your generative AI agent by capturing detailed information about the execution of the agent services, nodes, and tools. Tracing provides a way to record the inputs, outputs, and metadata associated with each intermediate step of a request, enabling you to easily pinpoint the source of bugs and unexpected behaviors.
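
Before any traces appear, MLflow has to be pointed at the SageMaker AI with MLflow tracking server. The following is a minimal sketch, assuming the sagemaker-mlflow plugin is installed; the tracking server ARN and experiment name are placeholders.

import mlflow

# Placeholder ARN of the SageMaker AI with MLflow tracking server
tracking_server_arn = "arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>"
mlflow.set_tracking_uri(tracking_server_arn)
mlflow.set_experiment("langgraph-agent-tracing")  # illustrative experiment name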

The MLflow tracking UI displays the traces exported under the MLflow Traces tab for the selected MLflow experiment, as shown in the following image.

Furthermore, you can see the detailed trace for an agent input or prompt invocation by choosing its Request ID, which opens a collapsible view with the results captured at each step of the invocation workflow, from input to the final output, as shown in the following image.

SageMaker AI with MLflow traces all the nodes in the LangGraph agent and displays the trace in the MLflow UI with detailed inputs, outputs, usage tokens, and multi-sequence messages with origin type (human, tool, AI) for each node. The display also captures the execution time over the entire agentic workflow, providing a per-node breakdown of time. Overall, tracing is crucial for generative AI agents for the following reasons:

    Performance monitoring: Track how long each node and tool call takes across the workflow.
    Debugging: Pinpoint where errors or unexpected outputs originate.
    Explainability: Understand how the agent reached a decision at each step.
    Optimization: Identify bottlenecks and opportunities to improve latency and quality.
    Compliance: Keep an auditable record of agent inputs, outputs, and intermediate steps.
    Cost tracking: Monitor token usage per node and per invocation.
    Adaptive learning: Use recorded traces to refine prompts, tools, and agent logic over time.

In the MLflow UI, you can choose the Task name to see details captured at any agent step as it services the input request prompt or invocation, as shown in the following image.

By implementing proper tracing, you can gain deeper insights into your generative AI agents’ behavior, optimize their performance, and make sure that they operate reliably and securely.

Configure tracing for the agent

For fine-grained control and flexibility in tracking, you can use MLflow’s tracing decorator APIs. With these APIs, you can add tracing to specific agentic nodes, functions, or code blocks with minimal modifications.

@mlflow.trace(name="assistant", attributes={"workflow": "agent_assistant"}, span_type="graph.py")
def assistant(state: GraphState):
    ...

With this approach, you can specify exactly what you want to track in your experiment. Additionally, MLflow offers out-of-the-box tracing compatibility with LangChain for basic tracing through MLflow’s autologging feature, mlflow.langchain.autolog(). With SageMaker AI with MLflow, you can gain deep insights into the LangGraph agent’s performance and behavior, facilitating easier debugging, optimization, and monitoring in both development and production environments.
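
As a quick illustration, the following sketch enables autologging and invokes the compiled agent inside an MLflow run so that the trace is captured against that run; the graph object and the prompt come from the earlier sketch and are assumptions, not the repository’s exact code.

import mlflow

# Trace LangChain and LangGraph calls automatically
mlflow.langchain.autolog()

with mlflow.start_run(run_name="agent-tracing-demo"):
    # `graph` is assumed to be the compiled LangGraph agent
    response = graph.invoke(
        {"messages": [("user", "What was the closing stock price for AMZN?")]}
    )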

Evaluate with MLflow

You can use MLflow’s evaluation capabilities to help assess the performance of the LangGraph large language model (LLM) agent and objectively measure its effectiveness in various scenarios. The important aspects of evaluation are:

    An evaluation dataset with ground truth answers to compare the agent’s responses against.
    Built-in metrics selected through the model type (for example, question answering).
    Custom or third-party metrics added through the extra_metrics parameter.

The following snippet shows how mlflow.evaluate() can be used to run an evaluation on agents. You can follow this example by running the code in the same aws-samples GitHub repository.

results = mlflow.evaluate(
    agent_responses,                    # Agent-generated answers to test queries
    targets="ground_truth",             # Reference "correct" answers for comparison
    model_type="question-answering",    # Predefined metrics for QA tasks
    extra_metrics=metrics,              # Evaluation metrics to include
)

This code snippet employs MLflow’s evaluate() function to rigorously assess the performance of a LangGraph LLM agent, comparing its responses to a predefined ground truth dataset that’s maintained in the golden_questions_answer.jsonl file in the aws-samples GitHub repository. By specifying model_type="question-answering", MLflow applies relevant evaluation metrics for question-answering tasks, such as accuracy and coherence. Additionally, the extra_metrics parameter allows you to incorporate custom, domain-specific metrics tailored to the agent’s application, enabling a comprehensive and nuanced evaluation beyond standard benchmarks. The results of this evaluation are then logged in MLflow (as shown in the following image), providing a centralized and traceable record of the agent’s performance, facilitating iterative improvement and informed deployment decisions. The MLflow evaluation is captured as part of the MLflow execution run.
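
For context, the following sketch shows one way the agent_responses DataFrame might be assembled from the golden questions file before calling mlflow.evaluate(); the column names, the prompt format, and the graph object are assumptions rather than the repository’s exact code.

import json
import pandas as pd

with open("golden_questions_answer.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]
eval_df = pd.DataFrame(records)  # assumed to contain "question" and "ground_truth" columns

# Run the agent on each golden question and keep the final answer text
answers = []
for question in eval_df["question"]:
    state = graph.invoke({"messages": [("user", question)]})
    answers.append(state["messages"][-1].content)

agent_responses = eval_df.assign(answer=answers)

Depending on how the data is passed, mlflow.evaluate() can also take a predictions argument that names the column holding the agent output (for example, predictions="answer").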

You can open the SageMaker AI with MLflow tracking server and see the list of MLflow execution runs for the specified MLflow experiment, as shown in the following image.

The evaluation metrics are captured within the MLflow execution along with model metrics and the accompanying artifacts, as shown in the following image.

Furthermore, the evaluation metrics are also displayed under the Model metrics tab within a selected MLflow execution run, as shown in the following image.

Finally, as shown in the following image, you can compare different variations and versions of the agent during the development phase by selecting the checkboxes of the MLflow execution runs you want to compare in the MLflow UI and choosing Compare. This helps you select the best-functioning agent version for deployment or inform other decision-making processes in agent development.
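
If you prefer to compare runs programmatically rather than through the UI, mlflow.search_runs() returns the runs of an experiment as a pandas DataFrame; the experiment name below is an assumption.

import mlflow

runs = mlflow.search_runs(experiment_names=["langgraph-agent-evaluation"])
metric_columns = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["run_id", "start_time"] + metric_columns].head())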

Register the LangGraph agent

You can use SageMaker AI with MLflow artifacts to register the LangGraph agent along with any other items you require or have produced. All the artifacts are stored in the Amazon Simple Storage Service (Amazon S3) bucket configured for the SageMaker AI with MLflow tracking server. Registering the LangGraph agent is crucial for governance and lifecycle management. It provides a centralized repository for tracking, versioning, and deploying the agents. Think of it as a catalog of your validated AI assets.

As shown in the following figure, you can see the artifacts captured under the Artifact tab within the MLflow execution run.

MLflow automatically captures and logs agent-related files such as the evaluation results and the consumed libraries in the requirements.txt file. Furthermore, a LangGraph agent that has been successfully logged as an MLflow model can be loaded and used for inference using mlflow.langchain.load_model(model_uri). Registering the generative AI agent after rigorous evaluation helps ensure that you’re promoting a proven and validated agent to production. This practice helps prevent the deployment of poorly performing or unreliable agents, helping to safeguard the user experience and the integrity of your applications. Post-evaluation registration is critical to make sure that the experiment with the best result is the one that gets promoted to production.
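
As a rough sketch of that flow, the following logs the agent with MLflow’s langchain flavor using the models-from-code approach, where lc_model points to a script that builds the graph and calls mlflow.models.set_model(graph), and then reloads it. The script name and registered model name are assumptions.

import mlflow

with mlflow.start_run(run_name="register-agent"):
    model_info = mlflow.langchain.log_model(
        lc_model="graph.py",                              # script that defines and sets the compiled graph
        artifact_path="langgraph_agent",
        registered_model_name="finance-assistant-agent",  # assumed registered model name
    )

# Reload the validated agent for inference
loaded_agent = mlflow.langchain.load_model(model_info.model_uri)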

Use MLflow to experiment and evaluate with external libraries (such as RAGAS)

MLflow’s flexibility allows for seamless integration with external libraries, enhancing your ability to experiment and evaluate LangChain LangGraph agents. You can extend SageMaker MLflow to include external evaluation libraries such as RAGAS for comprehensive LangGraph agent assessment. This integration enables ML practitioners to use RAGAS’s specialized LLM evaluation metrics while benefiting from MLflow’s experiment tracking and visualization capabilities. By logging RAGAS metrics directly to SageMaker AI with MLflow, you can easily compare different versions of the LangGraph agent across multiple runs, gaining deeper insights into its performance.

RAGAS is an open source library that provides tools specifically for evaluating LLM applications and generative AI agents. RAGAS includes a method, ragas.evaluate(), that runs evaluations for LLM agents with a choice of LLM models (evaluators) for scoring the evaluation and an extensive list of default metrics. To incorporate RAGAS metrics into your MLflow experiments, you can use the following approach.

You can follow this example by running the additional_evaluations_with_ragas.ipynb notebook in the GitHub repository.
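
For reference, the sketch below shows what the inputs assumed by the following snippet might look like: a few RAGAS metrics and a list of records pairing questions with agent answers, retrieved context, and reference answers. The field values are illustrative, the field names follow RAGAS’s single-turn sample schema, and metrics_final and ragas_dataset mirror the names used in the notebook.

from ragas.metrics import answer_correctness, answer_relevancy, faithfulness

metrics_final = [faithfulness, answer_relevancy, answer_correctness]

ragas_dataset = [
    {
        "user_input": "What was the closing stock price for AMZN?",            # question sent to the agent
        "response": "Amazon (AMZN) closed at $178.25.",                        # agent's answer
        "retrieved_contexts": ["AMZN closed at $178.25 on the reporting date."],
        "reference": "The stock closed at $178.25.",                           # ground truth answer
    },
]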

from ragas import EvaluationDataset
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluation_dataset = EvaluationDataset.from_list(ragas_dataset)
evaluator_llm = LangchainLLMWrapper(llm_for_evaluation)
result = evaluate(
    dataset=evaluation_dataset,
    metrics=metrics_final,
    llm=evaluator_llm,
    embeddings=bedrock_embeddings,
)
result

The evaluation results using RAGAS metrics from the above code are shown in the following figure.

Subsequently, the computed RAGAS evaluation metrics can be exported and tracked in the SageMaker AI with MLflow tracking server as part of the MLflow experimentation run. See the following code snippet for illustration; the full code can be found in the notebook in the same aws-samples GitHub repository.

with mlflow.start_run(
    experiment_id=get_experiment_id(_MLFLOW_RAGAS_EXPERIMENT_NAME),
    run_name=timestamp,
    tags={
        "project": os.getenv('PROJECT'),
        "model": os.getenv('MODELID'),
        "version": os.getenv('VERSION')
    }
):
    # Log the dataset to MLflow
    mlflow.log_input(dataset, context="ragas_eval_results")
    for ragas_metric in [faithfulness, answer_relevancy, answer_correctness]:
        print(ragas_metric.name)
        mean = ragas_result_ds[ragas_metric.name].mean()
        p90 = ragas_result_ds[ragas_metric.name].quantile(0.9)
        variance = ragas_result_ds[ragas_metric.name].var()
        print(mean, p90, variance)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/mean", mean)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/p90", p90)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/variance", variance)
mlflow.end_run()

You can view the RAGAS metrics logged by MLflow in the SageMaker AI with MLflow UI on the Model metrics tab, as shown in the following image.

From experimentation to production: Collaborative approval with SageMaker AI with MLflow tracing and evaluation

In a real-world deployment scenario, MLflow’s tracing and evaluation capabilities with LangGraph agents can significantly streamline the process of moving from experimentation to production.

Imagine a large team of data scientists and ML engineers working on an agentic platform, as shown in the following image. With MLflow, they can create sophisticated agents that can handle complex queries, process returns, and provide product recommendations. During the experimentation phase, the team can use MLflow to log different versions of the agent, tracking evaluation metrics such as response accuracy and latency. MLflow’s tracing feature allows them to analyze the agent’s decision-making process, identifying areas for improvement. The results across numerous experiments are automatically logged to SageMaker AI with MLflow. The team can use the MLflow UI to collaborate, compare, and select the best performing version of the agent and decide on a production-ready version, all informed by the diverse set of data logged in SageMaker AI with MLflow.

With this data, the team can present a clear, data-driven case to stakeholders for promoting the agent to production. Managers and compliance officers can review the agent’s performance history, examine specific interaction traces, and verify that the agent meets all necessary criteria. After being approved, the SageMaker AI with MLflow registered agent facilitates a smooth transition to deployment, helping to ensure that the exact version of the agent that passed evaluation is the one that goes live. This collaborative, traceable approach not only accelerates the development cycle but also instills confidence in the reliability and effectiveness of the generative AI agent in production.

Clean up

To avoid incurring unnecessary charges, use the following steps to clean up the resources used in this post:

    Remove the SageMaker AI with MLflow tracking server:
      In SageMaker Studio, stop and delete any running MLflow tracking server instances.
    Revoke Amazon Bedrock model access:
      Go to the Amazon Bedrock console.
      Navigate to Model access and remove access to any models you enabled for this project.
    Delete the SageMaker domain (if not needed):
      Open the SageMaker console.
      Navigate to the Domains section.
      Select the domain you created for this project.
      Choose Delete domain and confirm the action.
      Also delete any associated S3 buckets and IAM roles.

Conclusion

In this post, I showed you how to combine LangChain’s LangGraph, Amazon SageMaker AI, and MLflow to demonstrate a powerful workflow for developing, evaluating, and deploying sophisticated generative AI agents. This integration provides the tools needed to gain deep insights into the generative AI agent’s performance, iterate quickly, and maintain version control throughout the development process.

As the field of AI continues to advance, tools like these will be essential for managing the increasing complexity of generative AI agents and ensuring their effectiveness. Keep the following considerations in mind:

    Traceability is paramount: Effective tracing of agent execution paths using SageMaker MLflow is crucial for debugging, optimization, and helping to ensure consistent performance in complex generative AI workflows. Pinpoint issues, understand decision-making, examine interaction traces, and improve overall system reliability through detailed, recorded analysis of agent processes.
    Evaluation drives improvement: Standardized and customized evaluation metrics, using MLflow’s evaluate() function and integrations with external libraries like RAGAS, provide quantifiable insights into agent performance, guiding iterative refinement and informed deployment decisions.
    Collaboration and governance are essential: Unified governance facilitated by SageMaker AI with MLflow enables seamless collaboration across teams, from data scientists to compliance officers, helping to ensure responsible and reliable deployment of generative AI agents in production environments.

By embracing these principles and using the tools outlined in this post, developers and ML practitioners can confidently navigate the complexities of generative AI agent development and deployment, building robust and reliable applications that deliver real business value. Now, it’s your turn to unlock the potential of advanced tracing, evaluation, and collaboration in your agentic workflows! Dive into the aws-samples GitHub repository and start using the power of LangChain’s LangGraph, Amazon SageMaker AI, and MLflow for your generative AI projects.


About the Author

Sandeep Raveesh is a Generative AI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), generative AI agents, and scaling generative AI use-cases. He also focuses on go-to-market strategies helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.
