AWS Machine Learning Blog, January 29
Track LLM model evaluation using Amazon SageMaker managed MLflow and FMEval

This article takes a deep dive into how to evaluate large language models (LLMs) effectively with FMEval and Amazon SageMaker. It stresses the importance of evaluating LLM performance across different tasks and domains, and introduces FMEval, an open source library that supports evaluating LLMs for accuracy, toxicity, fairness, robustness, and efficiency. It also explains how to combine FMEval with SageMaker and MLflow to build a scalable, reproducible evaluation workflow that enables systematic evaluation, result tracking, and data-driven decision making. Finally, it walks through concrete steps and sample code for evaluating models hosted on Amazon Bedrock and SageMaker JumpStart, including a detailed explanation of the ModelRunner configuration.

🛠️ FMEval is an open source LLM evaluation library that gives data scientists and machine learning engineers a code-first evaluation experience covering accuracy, toxicity, fairness, robustness, and efficiency.

🚀 Using SageMaker together with MLflow simplifies experiment tracking, code packaging, and model deployment, providing a comprehensive and reproducible platform for managing the ML lifecycle, so you can systematically evaluate models, track results, and make data-driven decisions.

🗂️ Through FMEval’s three main components (data config, model runner, and evaluation algorithm), you can flexibly handle datasets, integrate different models, and implement custom evaluation algorithms to evaluate a wide range of LLMs.

💡 The article provides detailed code examples and steps showing how to evaluate LLMs hosted on Amazon Bedrock and SageMaker JumpStart with FMEval, adapting to different deployment options through ModelRunner configurations such as BedrockModelRunner and JumpStartModelRunner.

Evaluating large language models (LLMs) is crucial as LLM-based systems become increasingly powerful and relevant in our society. Rigorous testing allows us to understand an LLM’s capabilities, limitations, and potential biases, and provide actionable feedback to identify and mitigate risk. Furthermore, evaluation processes are important not only for LLMs, but are becoming essential for assessing prompt template quality, input data quality, and ultimately, the entire application stack. As LLMs take on more significant roles in areas like healthcare, education, and decision support, robust evaluation frameworks are vital for building trust and realizing the technology’s potential while mitigating risks.

Developers interested in using LLMs should prioritize a comprehensive evaluation process for several reasons. First, it assesses the model’s suitability for specific use cases, because performance can vary significantly across different tasks and domains. Evaluations are also a fundamental tool during application development to validate the quality of prompt templates. This process makes sure that solutions align with the company’s quality standards and policy guidelines before deploying them to production. Regular interval evaluation also allows organizations to stay informed about the latest advancements, making informed decisions about upgrading or switching models. Moreover, a thorough evaluation framework helps companies address potential risks when using LLMs, such as data privacy concerns, regulatory compliance issues, and reputational risk from inappropriate outputs. By investing in robust evaluation practices, companies can maximize the benefits of LLMs while maintaining responsible AI implementation and minimizing potential drawbacks.

To support robust generative AI application development, it’s essential to keep track of models, prompt templates, and datasets used throughout the process. This record-keeping allows developers and researchers to maintain consistency, reproduce results, and iterate on their work effectively. By documenting the specific model versions, fine-tuning parameters, and prompt engineering techniques employed, teams can better understand the factors contributing to their AI system’s performance. Similarly, maintaining detailed information about the datasets used for training and evaluation helps identify potential biases and limitations in the model’s knowledge base. This comprehensive approach to tracking key components not only facilitates collaboration among team members but also enables more accurate comparisons between different iterations of the AI application. Ultimately, this systematic approach to managing models, prompts, and datasets contributes to the development of more reliable and transparent generative AI applications.

In this post, we show how to use FMEval and Amazon SageMaker to programmatically evaluate LLMs. FMEval is an open source LLM evaluation library, designed to provide data scientists and machine learning (ML) engineers with a code-first experience to evaluate LLMs for various aspects, including accuracy, toxicity, fairness, robustness, and efficiency. In this post, we only focus on the quality and responsible aspects of model evaluation, but the same approach can be extended by using other libraries for evaluating performance and cost, such as LLMeter and FMBench, or richer quality evaluation capabilities like those provided by Amazon Bedrock Evaluations.

SageMaker is a data, analytics, and AI/ML platform, which we will use in conjunction with FMEval to streamline the evaluation process. We specifically focus on SageMaker with MLflow. MLflow is an open source platform for managing the end-to-end ML lifecycle, including experimentation, reproducibility, and deployment. The managed MLflow in SageMaker simplifies the deployment and operation of tracking servers, and offers seamless integration with other AWS services, making it straightforward to track experiments, package code into reproducible runs, and share and deploy models.
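To use the managed tracking server from your code, you install the MLflow client and the AWS MLflow plugin (pip install mlflow sagemaker-mlflow) and point MLflow at the tracking server ARN. The following is a minimal sketch; the ARN is a placeholder for your own tracking server:

import mlflow

# Placeholder ARN - replace with the ARN of your SageMaker managed MLflow tracking server.
tracking_server_arn = "arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<server-name>"
mlflow.set_tracking_uri(tracking_server_arn)

# From here on, standard MLflow APIs log to the managed tracking server.
mlflow.set_experiment("fmeval-mlflow-simple-runs")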

By combining FMEval’s evaluation capabilities with SageMaker with MLflow, you can create a robust, scalable, and reproducible workflow for assessing LLM performance. This approach can enable you to systematically evaluate models, track results, and make data-driven decisions in your generative AI development process.

Using FMEval for model evaluation

FMEval is an open source library for evaluating foundation models (FMs). It consists of three main components:

- Data config – points to the dataset to evaluate and describes its structure, such as which fields contain the model input and the target output
- Model runner – performs inference against the model under evaluation, abstracting away the details of how the model is invoked
- Evaluation algorithm – computes the evaluation metrics for a given dataset and the model’s responses

You can use the pre-built components, because FMEval provides native implementations for both Amazon Bedrock and Amazon SageMaker JumpStart, or create custom ones by inheriting from the base core components (see the sketch after this paragraph). The library supports various evaluation scenarios, including pre-computed model outputs and on-the-fly inference. FMEval offers flexibility in dataset handling, model integration, and algorithm implementation. Refer to Evaluate large language models for quality and responsibility or the Evaluating Large Language Models with fmeval paper to dive deeper into FMEval, or see the official GitHub repository.
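For example, a custom model runner only needs to implement the predict method of the base ModelRunner class. The following is a hypothetical sketch: the client and its generate call stand in for whatever inference API you use, and the exact base-class import path and return signature should be checked against the fmeval version you install.

from typing import Optional, Tuple

from fmeval.model_runners.model_runner import ModelRunner


class MyCustomModelRunner(ModelRunner):
    # Hypothetical runner wrapping an arbitrary inference client (sketch only).

    def __init__(self, client):
        # `client` is a placeholder for your own inference backend.
        self.client = client

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # Return the generated text and, if available, its log probability
        # (None when the backend does not expose log probabilities).
        response = self.client.generate(prompt)  # hypothetical client method
        return response["text"], None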

Using SageMaker with MLflow to track experiments

The fully managed MLflow capability on SageMaker is built around three core components:

The following diagram depicts the different components and where they run within AWS.

Code walkthrough

You can follow the full sample code from the GitHub repository.

Prerequisites

You must have the following prerequisites:

Refer to the documentation best practices regarding AWS Identity and Access Management (IAM) policies for SageMaker, MLflow, and Amazon Bedrock for how to set up permissions for the SageMaker execution role. Remember to always follow the principle of least privilege.

Evaluate a model and log to MLflow

We provide two sample notebooks to evaluate models hosted in Amazon Bedrock (Bedrock.ipynb) and models deployed to SageMaker Hosting using SageMaker JumpStart (JumpStart.ipynb). The workflow implemented in these two notebooks is essentially the same, although a few differences are noteworthy:

ModelRunner definition

For BedrockModelRunner, we need to find the model content_template. We can find this information conveniently on the Amazon Bedrock console, in the API request sample section, by looking at the value of the body field. The following example is the content template for Anthropic’s Claude 3 Haiku:

output_jmespath = "content[0].text"
content_template = """{
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 512,
  "temperature": 0.5,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": $prompt
        }
      ]
    }
  ]
}"""

model_runner = BedrockModelRunner(
    model_id=model_id,
    output=output_jmespath,
    content_template=content_template,
)
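Before running full evaluations, it can be useful to sanity-check the runner with a single call. The following is a minimal sketch, assuming the runner’s predict method returns the extracted model output together with an optional log probability:

# Optional smoke test of the configured runner (sketch).
model_output, log_probability = model_runner.predict("What is the capital of France?")
print(model_output)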

For JumpStartModelRunner, we need to find the model_id and model_version. This information can be retrieved directly using the get_model_info_from_endpoint(endpoint_name=endpoint_name) utility provided by the SageMaker Python SDK, where endpoint_name is the name of the SageMaker endpoint where the SageMaker JumpStart model is hosted. See the following code example:

from sagemaker.jumpstart.session_utils import get_model_info_from_endpoint

model_id, model_version, _, _, _ = get_model_info_from_endpoint(endpoint_name=endpoint_name)

model_runner = JumpStartModelRunner(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
)

DataConfig definition

For each model runner, we want to evaluate three categories: Summarization, Factual Knowledge, and Toxicity. For each of these categories, we prepare a DataConfig object for the appropriate dataset. The following example shows only the data for the Summarization category:

dataset_path = Path("datasets")
dataset_uri_summarization = dataset_path / "gigaword_sample.jsonl"
if not dataset_uri_summarization.is_file():
    print("ERROR - please make sure the file, gigaword_sample.jsonl, exists.")

data_config_summarization = DataConfig(
    dataset_name="gigaword_sample",
    dataset_uri=dataset_uri_summarization.as_posix(),
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary",
)
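For reference, each line of the JSON Lines dataset must contain the fields referenced by model_input_location and target_output_location. The following sketch shows the expected record shape with illustrative values (not actual dataset content):

import json

# Illustrative record shape: "document" holds the model input, "summary" the reference output.
example_record = {
    "document": "The quick brown fox jumped over the lazy dog near the river bank.",
    "summary": "A fox jumped over a dog.",
}
print(json.dumps(example_record))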

Evaluation sets definition

We can now create an evaluation set for each algorithm we want to use in our test. For the Summarization evaluation set, replace the prompt with your own, according to the input signature identified earlier. fmeval uses $model_input as a placeholder to get the input from your evaluation dataset. See the following code:

summarization_prompt = "Summarize the following text in one sentence: $model_input"

summarization_accuracy = SummarizationAccuracy()

evaluation_set_summarization = EvaluationSet(
    data_config_summarization,
    summarization_accuracy,
    summarization_prompt,
)
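EvaluationSet is a small helper defined in the sample code’s utils.py; conceptually, it is just a container that groups a dataset configuration, an evaluation algorithm, and a prompt template. A minimal sketch of such a container follows (the field names and types are assumptions, not necessarily the exact definition in the repository):

from typing import NamedTuple

from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.eval_algorithm import EvalAlgorithmInterface


class EvaluationSet(NamedTuple):
    # Groups everything needed for one evaluation: dataset, algorithm, and prompt.
    data_config: DataConfig
    eval_algo: EvalAlgorithmInterface
    prompt_template: str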

We are now ready to group the evaluation sets:

evaluation_list = [
    evaluation_set_summarization,
    evaluation_set_factual,
    evaluation_set_toxicity,
]

Evaluate and log to MLflow

We set up the MLflow experiment used to track the evaluations. We then create a new run for each model and run all the evaluations for that model within that run, so that the metrics all appear together. We use the model_id as the run name to make it straightforward to identify this run as part of a larger experiment, and run the evaluation using the run_evaluation_sets() function defined in utils.py. See the following code:

run_name = f"{model_id}"

experiment_name = "fmeval-mlflow-simple-runs"
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name) as run:
    run_evaluation_sets(model_runner, evaluation_list)
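run_evaluation_sets() comes from the sample’s utils.py. Conceptually, it loops over the evaluation sets, runs each algorithm against the model runner, and logs the resulting dataset-level scores to the active MLflow run. The following is a possible sketch of that logic, not the exact implementation from the repository:

import mlflow


def run_evaluation_sets(model_runner, evaluation_list):
    # Run each evaluation set against the model and log dataset-level scores
    # to the currently active MLflow run.
    for data_config, eval_algo, prompt_template in evaluation_list:
        eval_outputs = eval_algo.evaluate(
            model=model_runner,
            dataset_config=data_config,
            prompt_template=prompt_template,
            save=True,
        )
        for eval_output in eval_outputs:
            for score in eval_output.dataset_scores:
                mlflow.log_metric(f"{eval_output.dataset_name}_{score.name}", score.value)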

It is up to you to decide how best to organize the results in MLflow. In fact, a second possible approach is to use nested runs. The sample notebooks implement both approaches to help you decide which one best fits your needs.

experiment_name = "fmeval-mlflow-nested-runs"
experiment = mlflow.set_experiment(experiment_name)

with mlflow.start_run(run_name=run_name, nested=True) as run:
    run_evaluation_sets_nested(model_runner, evaluation_list)
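As a rough sketch of how the nested variant could organize results (again, an assumption rather than the exact implementation in utils.py), each evaluation set gets its own child run inside the model’s parent run:

import mlflow


def run_evaluation_sets_nested(model_runner, evaluation_list):
    # Create one nested MLflow run per evaluation set under the current parent run.
    for data_config, eval_algo, prompt_template in evaluation_list:
        with mlflow.start_run(run_name=type(eval_algo).__name__, nested=True):
            eval_outputs = eval_algo.evaluate(
                model=model_runner,
                dataset_config=data_config,
                prompt_template=prompt_template,
                save=True,
            )
            for eval_output in eval_outputs:
                for score in eval_output.dataset_scores:
                    mlflow.log_metric(score.name, score.value)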

Run evaluations

Tracking the evaluation process involves storing information about three aspects:

We provide a helper library (fmeval_mlflow) to abstract the logging of these aspects to MLflow, streamlining the interaction with the tracking server. For the information we want to store, we can refer to the following three functions:

When the evaluations are complete, we can analyze the results directly in the MLflow UI for a first visual assessment.

In the following screenshots, we show the visualization differences between logging with simple runs and with nested runs.

You might want to create your own custom visualizations. For example, spider plots are often used to make visual comparisons across multiple metrics. In the notebook compare_models.ipynb, we provide an example of how to use metrics stored in MLflow to generate such plots, which can ultimately also be stored in MLflow as part of your experiments. The following screenshots show some example visualizations.
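As one possible way to produce such a comparison, you can pull metrics with mlflow.search_runs and draw a radar chart with Matplotlib. The following is a sketch; the metric keys are placeholders for the metric names you actually logged during evaluation:

import matplotlib.pyplot as plt
import numpy as np
import mlflow

# Placeholder metric keys - replace with the metric names you logged.
metric_keys = ["meteor", "rouge", "factual_knowledge", "toxicity"]

# One row per run; metric columns are prefixed with "metrics." in the DataFrame.
runs = mlflow.search_runs(experiment_names=["fmeval-mlflow-simple-runs"])

angles = np.linspace(0, 2 * np.pi, len(metric_keys), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for _, run in runs.iterrows():
    values = [run.get(f"metrics.{key}", 0.0) for key in metric_keys]
    values += values[:1]
    ax.plot(angles, values, label=run["tags.mlflow.runName"])
    ax.fill(angles, values, alpha=0.1)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(metric_keys)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.show()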

Clean up

Once created, an MLflow tracking server will incur costs until you delete or stop it. Billing for tracking servers is based on the duration the servers have been running, the size selected, and the amount of data logged to the tracking servers. You can stop tracking servers when they are not in use to save costs, or delete them using the API or the SageMaker Studio UI. For more details on pricing, see Amazon SageMaker pricing.

Similarly, if you deployed a model using SageMaker, endpoints are priced by deployed infrastructure time rather than by requests. You can avoid unnecessary charges by deleting your endpoints when you’re done with the evaluation.
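For reference, both cleanup steps can also be scripted with boto3. The following is a sketch; the tracking server name is a placeholder, and it assumes the endpoint configuration shares the endpoint’s name:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Stop (or delete) the MLflow tracking server when it is no longer needed.
sagemaker_client.stop_mlflow_tracking_server(TrackingServerName="my-tracking-server")
# sagemaker_client.delete_mlflow_tracking_server(TrackingServerName="my-tracking-server")

# Delete the SageMaker endpoint (and its configuration) used for the JumpStart model.
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name)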

Conclusion

In this post, we demonstrated how to create an evaluation framework for LLMs by combining SageMaker managed MLflow with FMEval. This integration provides a comprehensive solution for tracking and evaluating LLM performance across different aspects including accuracy, toxicity, and factual knowledge.

To enhance your evaluation journey, you can explore the following:

By adopting these practices, you can build more reliable and trustworthy LLM applications while maintaining a clear record of your evaluation process and results.


About the authors

Paolo Di Francesco is a Senior Solutions Architect at Amazon Web Services (AWS). He holds a PhD in Telecommunications Engineering and has experience in software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.

Dr. Alessandro Cerè is a GenAI Evaluation Specialist and Solutions Architect at AWS. He assists customers across industries and regions in operationalizing and governing their generative AI systems at scale, ensuring they meet the highest standards of performance, safety, and ethical considerations. Bringing a unique perspective to the field of AI, Alessandro has a background in quantum physics and research experience in quantum communications and quantum memories. In his spare time, he pursues his passion for landscape and underwater photography.
