AIME and MATH-500 Evaluation on Ascend NPUs Based on the MindIE Service

This article walks through evaluating the DeepSeek-R1-Distill-Qwen-7B model with the LightEval framework against a MindIE inference service. It covers environment setup, model deployment, the customization needed in LightEval, and the concrete evaluation commands and results. The approach addresses the current difficulty of accurately evaluating reasoning (chain-of-thought) models such as DeepSeek-R1 and offers a workable solution.

🤔 Since the existing MindIE service cannot directly evaluate DeepSeek-R1 models on datasets such as AIME 2024, AIME 2025, MATH-500, and GPQA, this article proposes an evaluation scheme based on LightEval.

⚙️ The core of the scheme is to follow the evaluation method of the Open R1 project and run the evaluation with the lighteval framework. First, install modelscope to download the model weights, then deploy the model with the MindIE service, which involves editing configuration files and setting environment variables.

💡 To work with the MindIE service, LightEval's openai_model.py needs to be modified: adjust how base_url is read and set the model field of each request to the model name configured in the MindIE service.

💻 The evaluation runs in openai mode: environment variables specify the service address and API key, and the LightEval command is launched with the path to a custom evaluate.py to run the AIME 2024 and MATH-500 evaluations.

📈 Finally, the LightEval commands produce the performance results of the DeepSeek-R1 model on the chosen datasets, which can be compared with DeepSeek's official benchmark numbers.

Background

There is currently no sufficiently accurate way to benchmark models such as DeepSeek-R1 that emit a chain-of-thought ("think") trace, and MindIE cannot by itself evaluate models on the datasets cited in the DeepSeek report (AIME 2024, AIME 2025, MATH-500, GPQA, and so on). Open R1, published by Hugging Face, is currently the most popular fully open-source reproduction of DeepSeek-R1. We can follow the evaluation method of the Open R1 project and run the evaluation with lighteval.

Constraints

Model weight download

# Install modelscope
pip install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple/
# Download the weights with modelscope
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --local_dir ./DeepSeek-R1-Distill-Qwen-7B

MindIE service deployment

Image download

Download link: https://www.hiascend.com/developer/ascendhub/detail/af85b724a7e5469ebd7ea13c3439d48f

Start the container

# List local images
docker images
# Create and start the container
sudo docker run -it --name lighteval_test \
    --network=host --shm-size=128G \
    --privileged=true \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /home:/home \
    -v /tmp:/tmp \
    -v /data:/data \
    -v `pwd`:/workspace \
    -w /workspace \
    swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.RC1-800I-A2-py311-openeuler24.03-lts /bin/bash

Environment variables

# Set the following environment variables inside the container:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
source /usr/local/Ascend/atb-models/set_env.sh

Start the MindIE service

Modify the MindIE configuration file

cd /usr/local/Ascend/mindie/latest/mindie-service/conf
vim config.json

Modify the following items in config.json:

- Set "httpsEnabled" to false
- Set "npuDeviceIds" to [[0,1,2,3,4,5,6,7]]
- Set "modelName" to "qwen_distill_7b"
- Set "modelWeightPath" to /home/test/DeepSeek-R1-Distill-Qwen-7B
- Set "worldSize" to 8

Other reference settings in config.json:

- maxSeqLen: maximum input plus output length; 40960 is recommended here
- maxInputTokenLen: set according to the actual input length; 8192 is recommended here
- maxPrefillTokens: keeping it equal to maxSeqLen is sufficient
- maxIterTimes: since models with a "think" chain-of-thought produce long outputs, 32768 is recommended
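If you prefer to apply these edits with a script rather than by hand, a minimal sketch is shown below. The helper is illustrative and not part of MindIE; it assumes the key names listed above and replaces them wherever they occur in the nested JSON, since the exact nesting of the MindIE configuration can differ between versions.

```python
# Hypothetical helper: patch selected keys in the MindIE config.json in place.
# It replaces a key wherever it occurs in the nested JSON, so it does not
# depend on the exact structure of a particular MindIE version.
import json

CONFIG_PATH = "/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"

OVERRIDES = {
    "httpsEnabled": False,
    "npuDeviceIds": [[0, 1, 2, 3, 4, 5, 6, 7]],
    "modelName": "qwen_distill_7b",
    "modelWeightPath": "/home/test/DeepSeek-R1-Distill-Qwen-7B",
    "worldSize": 8,
    "maxSeqLen": 40960,
    "maxInputTokenLen": 8192,
    "maxPrefillTokens": 40960,
    "maxIterTimes": 32768,
}

def patch(node):
    """Recursively apply OVERRIDES to every matching key in the JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in OVERRIDES:
                node[key] = OVERRIDES[key]
            else:
                patch(value)
    elif isinstance(node, list):
        for item in node:
            patch(item)

with open(CONFIG_PATH) as f:
    config = json.load(f)

patch(config)

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=4, ensure_ascii=False)
```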

Modify the model configuration file

vim /home/test/DeepSeek-R1-Distill-Qwen-7B/config.json

Set "torch_dtype" to "float16".

Change the permissions on the model path

chmod -R 750 /home/test/DeepSeek-R1-Distill-Qwen-7B

Start the MindIE service

cd /usr/local/Ascend/mindie/latest/mindie-service/bin
./mindieservice_daemon

A successful start prints:

Daemon starts success!

Test the MindIE service

curl 127.0.0.1:1025/generate -d '{
    "prompt": "What is deep learning?",
    "max_tokens": 32,
    "stream": false,
    "do_sample": true,
    "repetition_penalty": 1.00,
    "temperature": 0.01,
    "top_p": 0.001,
    "top_k": 1,
    "model": "qwen_distill_7b"
}'

Note: the port 1025 and the model name qwen_distill_7b in the curl command above must match the port and modelName fields in the /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json configuration file.
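The same smoke test can also be issued from Python; here is a minimal sketch using the requests library, assuming the port and model name configured above.

```python
# Smoke test against the MindIE /generate endpoint, mirroring the curl command above.
# The port (1025) and model name ("qwen_distill_7b") must match the MindIE config.json.
import requests

payload = {
    "prompt": "What is deep learning?",
    "max_tokens": 32,
    "stream": False,
    "do_sample": True,
    "repetition_penalty": 1.00,
    "temperature": 0.01,
    "top_p": 0.001,
    "top_k": 1,
    "model": "qwen_distill_7b",
}

resp = requests.post("http://127.0.0.1:1025/generate", json=payload, timeout=600)
print(resp.status_code)
print(resp.text)
```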

Install lighteval

Installing via pip (not recommended)

pip install lighteval

Installing the PyPI package may conflict with the evaluation script provided by open-r1 because of version differences, so installing from source is recommended.

Installing from source (recommended)

git clone https://github.com/huggingface/lighteval.git
cd lighteval
# Commit used for this verification: ed084813e0bd12d82a06d9f913291fdbee774905
git checkout ed084813e0bd12d82a06d9f913291fdbee774905
pip install .
pip install .[math]

Write evaluate.py

Prepare an evaluate.py for the custom evaluation tasks, following src/open_r1/evaluate.py from the Open R1 project:

# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Custom evaluation tasks for LightEval."""

import random

from lighteval.metrics.dynamic_metrics import (
    ExprExtractionConfig,
    IndicesExtractionConfig,
    LatexExtractionConfig,
    multilingual_extractive_match_metric,
)
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
from lighteval.utils.language import Language


latex_gold_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    fallback_mode="first_match",
    precision=5,
    gold_extraction_target=(LatexExtractionConfig(),),
    # Match boxed first before trying other regexes
    pred_extraction_target=(ExprExtractionConfig(), LatexExtractionConfig(boxed_match_priority=0)),
    aggregation_function=max,
)

expr_gold_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    fallback_mode="first_match",
    precision=5,
    gold_extraction_target=(ExprExtractionConfig(),),
    # Match boxed first before trying other regexes
    pred_extraction_target=(ExprExtractionConfig(), LatexExtractionConfig(boxed_match_priority=0)),
    aggregation_function=max,
)

gpqa_metric = multilingual_extractive_match_metric(
    language=Language.ENGLISH,
    gold_extraction_target=[IndicesExtractionConfig(prefix_for_extraction="NativeLetters")],
    pred_extraction_target=[IndicesExtractionConfig(prefix_for_extraction="NativeLetters")],
    precision=5,
)


def prompt_fn(line, task_name: str = None):
    """Assumes the model is either prompted to emit \\boxed{answer} or does so automatically"""
    return Doc(
        task_name=task_name,
        query=line["problem"],
        choices=[line["solution"]],
        gold_index=0,
    )


def aime_prompt_fn(line, task_name: str = None):
    return Doc(
        task_name=task_name,
        query=line["problem"],
        choices=[line["answer"]],
        gold_index=0,
    )


def gpqa_prompt_fn(line, task_name: str = None):
    """Prompt template adapted from simple-evals: https://github.com/openai/simple-evals/blob/83ed7640a7d9cd26849bcb3340125002ef14abbe/common.py#L14"""
    gold_index = random.randint(0, 3)
    choices = [line["Incorrect Answer 1"], line["Incorrect Answer 2"], line["Incorrect Answer 3"]]
    choices.insert(gold_index, line["Correct Answer"])
    query_template = "Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.\n\n{Question}\n\nA) {A}\nB) {B}\nC) {C}\nD) {D}"
    query = query_template.format(A=choices[0], B=choices[1], C=choices[2], D=choices[3], Question=line["Question"])

    return Doc(
        task_name=task_name,
        query=query,
        choices=["A", "B", "C", "D"],
        gold_index=gold_index,
        instruction=query,
    )


# Define tasks
aime24 = LightevalTaskConfig(
    name="aime24",
    suite=["custom"],
    prompt_function=aime_prompt_fn,
    hf_repo="HuggingFaceH4/aime_2024",
    hf_subset="default",
    hf_avail_splits=["train"],
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,
    metric=[expr_gold_metric],
    version=1,
)
aime25 = LightevalTaskConfig(
    name="aime25",
    suite=["custom"],
    prompt_function=aime_prompt_fn,
    hf_repo="yentinglin/aime_2025",
    hf_subset="default",
    hf_avail_splits=["train"],
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,
    metric=[expr_gold_metric],
    version=1,
)
math_500 = LightevalTaskConfig(
    name="math_500",
    suite=["custom"],
    prompt_function=prompt_fn,
    hf_repo="HuggingFaceH4/MATH-500",
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,
    metric=[latex_gold_metric],
    version=1,
)
gpqa_diamond = LightevalTaskConfig(
    name="gpqa:diamond",
    suite=["custom"],
    prompt_function=gpqa_prompt_fn,
    hf_repo="Idavidrein/gpqa",
    hf_subset="gpqa_diamond",
    hf_avail_splits=["train"],
    evaluation_splits=["train"],
    few_shots_split=None,
    few_shots_select=None,
    generation_size=32768,  # needed for reasoning models like R1
    metric=[gpqa_metric],
    stop_sequence=[],  # no stop sequence, will use eos token
    trust_dataset=True,
    version=1,
)

# Add tasks to the table
TASKS_TABLE = []
TASKS_TABLE.append(aime24)
TASKS_TABLE.append(aime25)
TASKS_TABLE.append(math_500)
TASKS_TABLE.append(gpqa_diamond)

# MODULE LOGIC
if __name__ == "__main__":
    print([t["name"] for t in TASKS_TABLE])
    print(len(TASKS_TABLE))

Evaluation modes

lighteval offers several entry points for model evaluation, but the vllm and tgi modes may still require code adaptation to run on NPU devices, which leaves the accelerate and endpoint openai modes. Because the accelerate mode suffers from truncated outputs and slow inference, ==the recommended approach is the openai mode, with the model served for inference through MindIE==.

Evaluating with the openai mode (lighteval + the MindIE service)

lighteval can also score a model inference service through the openai client library. The framework is not currently adapted to locally hosted model services, but a look at the code shows that the required changes are small, so this mode is used here to evaluate the MindIE service.

Adapting lighteval to a local inference service

Locate the lighteval installation path

Find the lighteval installation path with pip show lighteval.
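Alternatively, the installed package location can be printed from Python; a small sketch, assuming lighteval is importable in the current environment:

```python
# Print where the installed lighteval package lives
# (an alternative to `pip show lighteval`).
import lighteval

print(lighteval.__file__)
```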

Modify openai_model.py

Modify the code in "lighteval/models/endpoints/openai_model.py" under the lighteval installation path:
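The exact lines to change depend on the lighteval commit, and the diff is not reproduced here. Conceptually the adaptation amounts to two things: build the OpenAI client with a base_url taken from the OPENAI_BASE_URL environment variable instead of the default api.openai.com, and send the MindIE modelName (rather than the local model path) as the model field of each request. The snippet below is a standalone sketch of that behaviour, not the literal lighteval code; attribute and variable names are assumptions, and it presumes MindIE exposes an OpenAI-compatible interface at the configured base_url.

```python
# Illustrative sketch of what the patched client must do; names here are
# assumptions, not the exact lighteval internals.
import os

from openai import OpenAI

# (1) Read the service address from OPENAI_BASE_URL so requests go to the
#     local MindIE endpoint instead of api.openai.com.
client = OpenAI(
    base_url=os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:1025/v1"),
    api_key=os.environ.get("OPENAI_API_KEY", "test"),
)

# (2) Use the model name configured in the MindIE config.json ("modelName"),
#     not the local weight path passed to lighteval on the command line.
MINDIE_MODEL_NAME = "qwen_distill_7b"

response = client.chat.completions.create(
    model=MINDIE_MODEL_NAME,
    messages=[{"role": "user", "content": "What is deep learning?"}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```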

Evaluation commands

==Note:==

Configure environment variables

MODEL="/home/test/DeepSeek-R1-Distill-Qwen-7B"MODEL_ARGS="$MODEL"OUTPUT_DIR=./data/evals/$MODELexport OPENAI_BASE_URL="http://127.0.0.1:1025/v1"export OPENAI_API_KEY="test"

AIME 2024 evaluation

TASK=aime24
lighteval endpoint openai $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks evaluate.py \
    --output-dir $OUTPUT_DIR

MATH-500 evaluation

TASK=math_500
lighteval endpoint openai $MODEL_ARGS "custom|$TASK|0|0" \
    --custom-tasks evaluate.py \
    --output-dir $OUTPUT_DIR

Evaluation results

AIME 2024 results

MATH-500 results

DeepSeek official results
