2025-06-27 21:19 Zhejiang
Strong agent tool calling and long-text capability, with a native 256K context window
01
Introduction
C3-BenchMark dataset link:
https://modelscope.cn/datasets/Tencent-Hunyuan/C3-BenchMark
02
Model Inference
[Figure: GPU memory usage during inference]
from modelscope import AutoModelForCausalLM, AutoTokenizer
import os
import re
model_name_or_path = "Tencent-Hunyuan/Hunyuan-A13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto",trust_remote_code=True) # You may want to use bfloat16 and/or move to GPU here
messages = [
{"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
    enable_thinking=True,  # Toggle thinking mode (default: True)
)
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
output_text = tokenizer.decode(outputs[0])
think_pattern = r'<think>(.*?)</think>'
think_matches = re.findall(think_pattern, output_text, re.DOTALL)
answer_pattern = r'<answer>(.*?)</answer>'
answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
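# Fall back to an empty string so a missing <think>/<answer> block does not raise IndexError
think_content = think_matches[0].strip() if think_matches else ""
answer_content = answer_matches[0].strip() if answer_matches else ""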
print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
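The parsing step above can also be wrapped in a small helper so it can be exercised without loading the model. This is a minimal sketch: the names `ParsedReply` and `parse_reply` are hypothetical, and the fallback for output that has no `<answer>` tags (treat the text outside the thinking block as the answer) is an assumption, not documented model behavior.

```python
import re
from typing import NamedTuple


class ParsedReply(NamedTuple):
    thinking: str
    answer: str


def parse_reply(output_text: str) -> ParsedReply:
    """Split decoded model output into its <think> and <answer> parts."""
    think = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    thinking_text = think.group(1).strip() if think else ""
    if answer:
        answer_text = answer.group(1).strip()
    else:
        # Assumption: with no <answer> tags, whatever lies outside the
        # thinking block is treated as the answer.
        answer_text = re.sub(
            r"<think>.*?</think>", "", output_text, flags=re.DOTALL
        ).strip()
    return ParsedReply(thinking_text, answer_text)


sample = (
    "<think>\nList the main benefits.\n</think>\n"
    "<answer>\nExercise improves health.\n</answer>"
)
print(parse_reply(sample))
```

Returning a named tuple keeps both parts together, and the helper never raises when a tag is missing.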
03
Model Fine-Tuning
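Install ms-swift from source:
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
pip install liger-kernel transformers -U
Prepare the fine-tuning dataset in the following format (the system field is optional), then pass `--dataset <dataset_path>` in the training script.
{"messages": [{"role": "user", "content": "浙江的省会在哪?"}, {"role": "assistant", "content": "<think>\nxxx\n</think>\n浙江的省会在杭州。"}]}
For a dataset without thinking content, you can additionally pass `--loss_scale ignore_empty_think` during training. This skips loss computation on text matching the regex `<think>\s*</think>\s*`, so the model does not lose its thinking ability.
{"messages": [{"role": "user", "content": "浙江的省会在哪?"}, {"role": "assistant", "content": "<think>\n\n</think>\n浙江的省会在杭州。"}]}
The script below runs a quick 60-minute self-cognition fine-tune of Hunyuan-A13B-Instruct. It can be run on the free A10 compute provided by ModelScope: https://modelscope.cn/my/mynotebook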
# Training GPU memory: 4 * 47GiB
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model Tencent-Hunyuan/Hunyuan-A13B-Instruct \
--train_type lora \
--dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT#1500' \
'swift/self-cognition:empty_think#600' \
--loss_scale ignore_empty_think \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-4 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--gradient_accumulation_steps 8 \
--load_from_cache_file false \
--eval_steps 50 \
--save_steps 50 \
--save_total_limit 2 \
--logging_steps 5 \
--max_length 2048 \
--output_dir output \
--warmup_ratio 0.05 \
--packing true \
--attn_impl flash_attn \
--dataloader_num_workers 4 \
--model_author swift \
--model_name swift-robot
[Figure: GPU memory usage during training]
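The dataset format described earlier in this section can be generated and checked with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names (`make_sample`, `has_empty_think`, `write_jsonl`) and the file name `train.jsonl` are hypothetical, and the one-record-per-line JSONL layout is assumed to be what `--dataset <dataset_path>` expects. The regex is the one quoted for `--loss_scale ignore_empty_think`.

```python
import json
import re

# Regex quoted in the text for samples whose thinking block is empty.
EMPTY_THINK = re.compile(r"<think>\s*</think>\s*")


def make_sample(question, answer, thinking="", system=None):
    """Build one record in the messages format shown in this section."""
    messages = []
    if system is not None:  # the system field is optional
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": question})
    messages.append({
        "role": "assistant",
        "content": f"<think>\n{thinking}\n</think>\n{answer}",
    })
    return {"messages": messages}


def has_empty_think(sample):
    """True if the assistant reply contains an empty <think></think> block."""
    reply = sample["messages"][-1]["content"]
    return EMPTY_THINK.search(reply) is not None


def write_jsonl(samples, path):
    """Write one JSON object per line; ensure_ascii=False keeps Chinese readable."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")


samples = [
    make_sample("浙江的省会在哪?", "浙江的省会在杭州。", thinking="xxx"),
    make_sample("浙江的省会在哪?", "浙江的省会在杭州。"),  # empty thinking block
]
write_jsonl(samples, "train.jsonl")  # hypothetical file name
print([has_empty_think(s) for s in samples])  # [False, True]
```

Only the second sample matches the regex, so only that sample's empty thinking block would be excluded from the loss when `--loss_scale ignore_empty_think` is set.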
After training completes, run inference with the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift infer \
--adapters output/vx-xxx/checkpoint-xxx \
--stream true \
--temperature 0 \
--max_new_tokens 2048
Push the model to ModelScope:
swift export \
--adapters output/vx-xxx/checkpoint-xxx \
--push_to_hub true \
--hub_model_id '<your-model-id>' \
--hub_token '<your-sdk-token>'
04
Model Deployment
- Step 1: Pull the image:
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
- Start the API server:
docker run --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
trtllm-serve \
/path/to/HunYuan-moe-A13B \
--host localhost \
--port 8000 \
--backend pytorch \
--max_batch_size 128 \
--max_num_tokens 16384 \
--tp_size 2 \
--kv_cache_free_gpu_memory_fraction 0.95 \
--extra_llm_api_options /path/to/extra-llm-api-config.yml
vLLM
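Docker image
The Hunyuan team provides a pre-built Docker image containing vLLM 0.8.5 with full support for the Hunyuan-A13B model. Support in the official vLLM release is still under development. Note: this Docker image requires CUDA 12.8.
- Get started: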
docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm
#docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
- Download the model files:
- Hugging Face: downloaded automatically by vLLM.
- ModelScope:
modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct
Alternatively, set the environment variable:
export VLLM_USE_MODELSCOPE=True
- Start the API server:
Model downloaded from Hugging Face:
docker run --privileged --user root --net=host --ipc=host \
-v ~/.cache:/root/.cache/ \
--gpus=all -it --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
-m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
--tensor-parallel-size 4 --model tencent/Hunyuan-A13B-Instruct --trust-remote-code
Model downloaded from ModelScope:
docker run --privileged --user root --net=host --ipc=host \
-v ~/.cache/modelscope:/root/.cache/modelscope \
--gpus=all -it --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
-m vllm.entrypoints.openai.api_server --host 0.0.0.0 --tensor-parallel-size 4 --port 8000 \
--model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ --trust_remote_code
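Once the vLLM server is up, it can be queried over its OpenAI-compatible HTTP API. Below is a minimal client sketch using only the Python standard library. The helper names `build_chat_request` and `chat` are hypothetical, and the defaults assume the server was started with `--port 8000` and `--model tencent/Hunyuan-A13B-Instruct` as in the first command above. If you launched from the ModelScope path, pass that path as `model` instead, since vLLM serves the model under the name given to `--model` unless told otherwise.

```python
import json
import urllib.request


def build_chat_request(prompt, model="tencent/Hunyuan-A13B-Instruct",
                       base_url="http://localhost:8000", max_tokens=512):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant message text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request(
        "Write a short summary of the benefits of regular exercise"
    )
    print(req.full_url)
    # Uncomment once the server is running:
    # print(chat("Write a short summary of the benefits of regular exercise"))
```

Separating request construction from sending makes the payload easy to inspect before a server is available.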
SGLang
Docker image
The Hunyuan team also provides a pre-built Docker image based on the latest version of SGLang.
Get started:
- Pull the Docker image:
docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang
or
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
- Start the API server:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
--ipc=host \
docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang \
-m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000