MarkTechPost@AI 18 hours ago
Getting Started with MLFlow for LLM Evaluation

This article shows how to use MLflow, a powerful open-source platform, to evaluate the performance of large language models (LLMs), in this case Google's Gemini model. MLflow makes it easy to track model experiments, log parameters, and manage deployments. The tutorial demonstrates how to evaluate Gemini on fact-based prompts: generating responses and assessing their quality with the various metrics MLflow supports. It also walks through the full workflow of setting up dependencies, preparing the evaluation data, fetching Gemini's responses, and running the evaluation with MLflow.

💡 MLflow is an open-source platform for managing the machine learning lifecycle, and it recently added support for evaluating large language models (LLMs).

🔑 Evaluating the Gemini model requires an evaluation dataset that pairs factual prompts with their correct answers.

💻 A gemini_completion() helper built on the Google Generative AI SDK sends each prompt to the Gemini 1.5 Flash model and returns the generated response.

📊 The Gemini responses are scored with mlflow.evaluate(), which supports metrics such as answer_similarity, exact_match, latency, and token_count.

📝 The evaluation results are printed and saved to a CSV file for later inspection or visualization, giving a closer look at the Gemini model's performance.

MLflow is a powerful open-source platform for managing the machine learning lifecycle. While it’s traditionally used for tracking model experiments, logging parameters, and managing deployments, MLflow has recently introduced support for evaluating Large Language Models (LLMs).

In this tutorial, we explore how to use MLflow to evaluate the performance of an LLM (in our case, Google's Gemini model) on a set of fact-based prompts. We'll generate responses with Gemini and assess their quality using a variety of metrics supported directly by MLflow.

Setting up the dependencies

For this tutorial, we'll be using both the OpenAI and Gemini APIs. MLflow's built-in generative AI evaluation metrics currently rely on OpenAI models (e.g., GPT-4) to act as judges for metrics like answer similarity or faithfulness, so an OpenAI API key is required. You can obtain an OpenAI API key from the OpenAI platform and a Gemini API key from Google AI Studio.

Installing the libraries

pip install mlflow openai pandas google-genai

Setting the OpenAI and Google API keys as environment variables

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key:')
os.environ["GOOGLE_API_KEY"] = getpass('Enter Google API Key:')

Preparing Evaluation Data and Fetching Outputs from Gemini

import mlflow
import openai
import os
import pandas as pd
from google import genai

Creating the evaluation data

In this step, we define a small evaluation dataset containing factual prompts along with their correct ground truth answers. These prompts span topics such as science, health, web development, and programming. This structured format allows us to objectively compare the Gemini-generated responses against known correct answers using various evaluation metrics in MLflow.

eval_data = pd.DataFrame(
    {
        "inputs": [
            "Who developed the theory of general relativity?",
            "What are the primary functions of the liver in the human body?",
            "Explain what HTTP status code 404 means.",
            "What is the boiling point of water at sea level in Celsius?",
            "Name the largest planet in our solar system.",
            "What programming language is primarily used for developing iOS apps?",
        ],
        "ground_truth": [
            "Albert Einstein developed the theory of general relativity.",
            "The liver helps in detoxification, protein synthesis, and production of biochemicals necessary for digestion.",
            "HTTP 404 means 'Not Found' -- the server can't find the requested resource.",
            "The boiling point of water at sea level is 100 degrees Celsius.",
            "Jupiter is the largest planet in our solar system.",
            "Swift is the primary programming language used for iOS app development.",
        ],
    }
)

eval_data

Getting Gemini Responses

This code block defines a helper function, gemini_completion(), that sends a prompt to the Gemini 1.5 Flash model using the Google Generative AI SDK and returns the generated response as plain text. We then apply this function to each prompt in our evaluation dataset to generate the model's predictions, storing them in a new "predictions" column. These predictions will later be evaluated against the ground truth answers.

client = genai.Client()

def gemini_completion(prompt: str) -> str:
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=prompt
    )
    return response.text.strip()

eval_data["predictions"] = eval_data["inputs"].apply(gemini_completion)
eval_data
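Individual API calls can fail transiently (rate limits, timeouts), which would abort the whole .apply() pass. The sketch below is an optional, more defensive variant of the helper that reuses the client created above; the function name, retry count, and back-off interval are illustrative choices, not part of the original tutorial.

import time

def gemini_completion_with_retry(prompt: str, max_retries: int = 3, wait_seconds: float = 2.0) -> str:
    # Same request as gemini_completion(), retried a few times on transient errors.
    for attempt in range(max_retries):
        try:
            response = client.models.generate_content(
                model="gemini-1.5-flash",
                contents=prompt
            )
            return response.text.strip()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(wait_seconds)  # brief back-off before retrying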

Evaluating Gemini Outputs with MLflow

In this step, we initiate an MLflow run to evaluate the responses generated by the Gemini model against a set of factual ground-truth answers. We use the mlflow.evaluate() method with four lightweight metrics: answer_similarity (measuring semantic similarity between the model’s output and the ground truth), exact_match (checking for word-for-word matches), latency (tracking response generation time), and token_count (logging the number of output tokens).

It’s important to note that the answer_similarity metric internally uses OpenAI’s GPT model to judge the semantic closeness between answers, which is why access to the OpenAI API is required. This setup provides an efficient way to assess LLM outputs without relying on custom evaluation logic. The final evaluation results are printed and also saved to a CSV file for later inspection or visualization.

mlflow.set_tracking_uri("mlruns")
mlflow.set_experiment("Gemini Simple Metrics Eval")

with mlflow.start_run():
    results = mlflow.evaluate(
        model_type="question-answering",
        data=eval_data,
        predictions="predictions",
        targets="ground_truth",
        extra_metrics=[
            mlflow.metrics.genai.answer_similarity(),
            mlflow.metrics.exact_match(),
            mlflow.metrics.latency(),
            mlflow.metrics.token_count()
        ]
    )

    print("Aggregated Metrics:")
    print(results.metrics)

    # Save detailed table
    results.tables["eval_results_table"].to_csv("gemini_eval_results.csv", index=False)
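As noted above, answer_similarity delegates scoring to an OpenAI judge model. If you prefer to pin the judge explicitly rather than rely on MLflow's default, the genai metrics accept a judge-model URI; the snippet below is a minimal sketch assuming MLflow's "openai:/<model>" URI convention, with gpt-4 chosen only as an example.

# Optional: explicitly name the OpenAI model that judges semantic similarity.
similarity_with_explicit_judge = mlflow.metrics.genai.answer_similarity(model="openai:/gpt-4")

# Then pass it in place of the default metric inside extra_metrics:
# extra_metrics=[similarity_with_explicit_judge, mlflow.metrics.exact_match(), ...]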

To view the detailed results of our evaluation, we load the saved CSV file into a DataFrame and adjust the display settings to ensure full visibility of each response. This allows us to inspect individual prompts, Gemini-generated predictions, ground truth answers, and the associated metric scores without truncation, which is especially helpful in notebook environments like Colab or Jupyter.

results = pd.read_csv('gemini_eval_results.csv')
pd.set_option('display.max_colwidth', None)
results
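Because the tracking URI points at a local mlruns directory, the same run, its aggregated metrics, and the logged evaluation table can also be browsed in MLflow's web UI. A typical invocation from the project directory looks like the following (the store path matches the tracking URI set earlier):

mlflow ui --backend-store-uri mlruns

By default the UI is served at http://127.0.0.1:5000, where you can open the "Gemini Simple Metrics Eval" experiment and inspect the run.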

