CAMEL-AI · 2025-02-14 19:05 · Beijing
A hands-on guide to distilling data from large models efficiently and at low cost.
DeepSeek-R1 has drawn wide attention for its outstanding mathematical reasoning and efficient logical processing. From basic arithmetic to complex math problems, it handles them with ease, giving developers powerful computational support.
Now, by combining the full-size DeepSeek-R1 model served efficiently, stably, and affordably on the SiliconFlow SiliconCloud platform (https://cloud.siliconflow.cn/models) with CAMEL, the multi-agent framework released by the CAMEL-AI open-source community (https://huggingface.co/datasets/camel-ai/), you can use long chain-of-thought (CoT) prompting to extract detailed reasoning processes for math problems and distill high-quality mathematical reasoning data from DeepSeek R1. These datasets have been uploaded to Hugging Face for public use, advancing research on smarter mathematical reasoning.
This tutorial walks you through how to use the CAMEL framework to efficiently distill the mathematical reasoning ability of DeepSeek R1 on the SiliconFlow SiliconCloud platform and produce valuable datasets.
Using the synthetic data generation tools from the CAMEL-AI community, CAMEL-AI has carefully built three high-quality datasets. Whether you want to explore how AI solves complex problems or dig deeper into mathematical reasoning, these datasets are excellent resources, and they are now available on Hugging Face:
AMC AIME STaR dataset: 4,000 challenging math problems with solutions, including each solution's iterative improvement history that shows how the answer is refined step by step. https://huggingface.co/datasets/camel-ai/amc_aime_star
AMC AIME distilled dataset: 4,000 challenging math problems with solutions, each with a clear step-by-step explanation. https://huggingface.co/datasets/camel-ai/amc_aime_distilled
GSM8K distilled dataset: 7,000 high-quality, linguistically diverse grade-school math word problems with solutions, each with a detailed step-by-step explanation. https://huggingface.co/datasets/camel-ai/gsm8k_distilled
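If you want a quick look at any of these datasets before following the tutorial, a minimal sketch with the datasets library (installed in the setup below) is enough; the available splits and column names are listed on each dataset's card:
from datasets import load_dataset

# Load one of the published datasets (sketch; downloads from Hugging Face)
ds = load_dataset("camel-ai/gsm8k_distilled")
print(ds)  # shows the available splits, row counts, and column names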
1. Setup
1. Install dependencies
First, install the required Python libraries by running the following commands:
pip install "git+https://github.com/camel-ai/camel.git@4210cb0849f3f13d6a46fefeb9e2c3e791c158cb#egg=camel-ai"
pip install datasets
pip install rouge
2. Set your SiliconFlow SiliconCloud API key
Go to the SiliconCloud website and create your SILICONFLOW_API_KEY (https://cloud.siliconflow.cn/account/ak). This key will be used to distill mathematical reasoning data that includes the thought process.
from getpass import getpass
import os
SILICONFLOW_API_KEY = getpass('Enter your SILICONFLOW_API_KEY: ')
os.environ["SILICONFLOW_API_KEY"] = SILICONFLOW_API_KEY
# To make DeepSeek R1 respond with its thought process content, we should set the following environment variable
os.environ["GET_REASONING_CONTENT"]="True"
3. Download data from Hugging Face
Start by preparing the raw math data from Hugging Face; the core of each item is a problem and its answer. The following uses the GSM8K dataset as an example to walk through the steps.
# Set the number of problems to download from GSM8K in huggingface
NUMBER_OF_PROBLEMS = 10

import json
from pathlib import Path
import uuid
from datasets import load_dataset

def download_gsm8k_dataset():
    try:
        # Load the dataset using the datasets library
        dataset = load_dataset("openai/gsm8k", "main")
        # Get the items from train split
        data = dataset['train'].select(range(NUMBER_OF_PROBLEMS))
        # Convert to the desired format
        formatted_data = []
        for item in data:
            # Extract the final answer from the solution
            solution = item['answer']
            if solution:
                # GSM8K solutions typically end with "#### number"
                import re
                match = re.search(r'####\s*(\d+)', solution)
                if match:
                    number = match.group(1)
                    # Replace the "#### number" with "\boxed{number}"
                    solution = re.sub(
                        r'####\s*\d+', f'\\\\boxed{{{number}}}', solution
                    )
            formatted_item = {
                "id": str(uuid.uuid4()),  # GSM8K doesn't provide IDs
                "problem": item['question'],
                "type": "openai/gsm8k",  # All problems are from GSM8K
                "solution": solution,  # Use the modified solution with \boxed
            }
            formatted_data.append(formatted_item)
        # Save to a file
        output = formatted_data
        output_file = "downloaded_gsm8k_10.json"
        with open(output_file, "w") as f:
            json.dump(output, f, indent=2)
        print(f"Successfully downloaded and saved GSM8K dataset to {output_file}")
    except Exception as e:
        print(f"Error downloading GSM8K dataset: {e}")

if __name__ == "__main__":
    download_gsm8k_dataset()
Now that we have some sample data in the target format, let's start distilling mathematical reasoning data that contains detailed thought processes.
2. Distill mathematical reasoning data with thought processes (Long CoT data)
1. Import the required libraries
import nest_asyncio
nest_asyncio.apply()
import json
import os
import time
from camel.agents import ChatAgent
from camel.datagen import STaRPipeline
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
2. Set up the reasoning model and the evaluation model
Here we call the DeepSeek R1 model through the SiliconFlow SiliconCloud platform. CAMEL's model manager automatically switches between the models in the list depending on whether requests succeed.
# Set DeepSeek R1 served by siliconflow as reason model 1
reason_model_1 = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="deepseek-ai/DeepSeek-R1",
    api_key=os.environ["SILICONFLOW_API_KEY"],
    url="https://api.siliconflow.cn/v1",
    model_config_dict={"max_tokens": 4096},  # Configure max_tokens carefully
)
# Set DeepSeek R1 served by deepseek cloud as reason model 2
reason_model_2 = ModelFactory.create(
    model_platform=ModelPlatformType.DEEPSEEK,
    model_type=ModelType.DEEPSEEK_REASONER,
)
3. Run CAMEL's self-improve data generation module
Before running it, pay attention to a few key parameters, for example:
problems_path: path to the original math problems.
output_path: path where the generated data will be saved.
max_iterations: maximum number of iterations, which controls how deep the data generation goes.
rationalization: whether to include the correct answer as a reference when generating the reasoning process.
Notes:
Some optional configuration code is commented out below; enable it as needed.
The generated data can be used directly for training or for further analysis.
After the run finishes, you will find the generated high-quality mathematical reasoning dataset at output_path.
start_time = time.time()
problems_path = "downloaded_gsm8k_10.json"
output_path = "generated_data.json"
# Load problems from JSON file
with open(problems_path, 'r') as f:
    problems = json.load(f)
# Initialize agent
reason_agent_system_message = """Answer my question and give your
final answer within \\boxed{}."""
evaluate_agent_system_message = """You are a highly critical teacher who
evaluates the student's answers with a meticulous and demanding approach.
"""
# Set up reason agent
reason_agent = ChatAgent(
    system_message=reason_agent_system_message,
    model=[reason_model_1, reason_model_2],  # add models to the list; you can also switch to other models
)
# # Set up evaluate agent(optional)
# evaluate_agent = ChatAgent(
# system_message=evaluate_agent_system_message
# )
# # Initialize reward model (optional)
# reward_model = NemotronRewardModel(
# model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
# url="https://integrate.api.nvidia.com/v1",
# api_key=os.environ.get("NVIDIA_API_KEY"),
# )
# # Set score thresholds for different dimensions (optional)
# score_threshold = {
# "correctness": 1.0,
# "clarity": 0.0,
# "completeness": 0.0,
# }
# # Or use a single threshold for all dimensions:
# score_threshold = 0.9
# Create and run pipeline
pipeline = STaRPipeline(
    reason_agent=reason_agent,
    problems=problems,  # Pass problems list directly
    output_path=output_path,
    max_iterations=0,
    batch_size=100,  # Size of batch to process the data (optional)
    # evaluate_agent=evaluate_agent,  # To use evaluate agent (optional)
    # score_threshold=score_threshold,  # Score thresholds for agent evaluation (optional)
    # reward_model=reward_model,  # To use a reward model (optional)
)
print("Start generation! May take some time, please wait..")
results = pipeline.generate(rationalization=False)
end_time = time.time()
execution_time = end_time - start_time
print(f"\nProcessed {len(results)} problems")
print(f"Results saved to: {output_path}")
print(f"Total execution time: {execution_time:.2f} seconds")
View the generated CoT data with the following code:
with open('generated_data.json', 'r') as f:
    data = json.load(f)
print(json.dumps(data, indent=2))
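The generated file stores the distilled results under a 'traces' key, and each trace carries fields such as 'problem' and 'final_trace', which are exactly what the upload code in the next section consumes. Building on the data variable loaded above, a quick way to inspect a single trace:
# Inspect the first generated trace (assumes `data` was loaded by the snippet above)
first_trace = data['traces'][0]
print(first_trace['problem'])      # the original math problem
print(first_trace['final_trace'])  # the distilled reasoning and final solution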
3. Upload the data to Hugging Face
The steps are:
Load the generated data: load the generated dataset from a local file.
Convert to Hugging Face format: convert the data into Hugging Face's Dataset format.
Generate a dataset card: create a card containing the dataset description, tags, and license information.
Log in to Hugging Face: log in to your Hugging Face account with an API token.
Upload the dataset: upload the dataset and its card to the Hugging Face platform.
# Import necessary modules and classes
from camel.datahubs.huggingface import HuggingFaceDatasetManager # Manages interactions with Hugging Face datasets
from camel.datahubs.models import Record # Represents a single record in the dataset
from datetime import datetime # Handles date and time operations
import json # For reading JSON files
def load_star_output(file_path):
    r"""Load and parse the star output JSON file.

    Args:
        file_path (str): Path to the star_output.json file.

    Returns:
        list: List of traces from the JSON file.
    """
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data['traces']

# Main function: Upload dataset to Hugging Face
def upload_to_huggingface(transformed_data, username, dataset_name=None):
    r"""Uploads transformed data to the Hugging Face dataset platform.

    Args:
        transformed_data (list): Transformed data, typically a list of dictionaries.
        username (str): Hugging Face username.
        dataset_name (str, optional): Custom dataset name.

    Returns:
        str: URL of the uploaded dataset.
    """
    # Initialize HuggingFaceDatasetManager to interact with Hugging Face datasets
    manager = HuggingFaceDatasetManager()
    # Generate or validate the dataset name
    dataset_name = generate_or_validate_dataset_name(username, dataset_name)
    # Create the dataset on Hugging Face and get the dataset URL
    dataset_url = create_dataset(manager, dataset_name)
    # Create a dataset card to add metadata
    create_dataset_card(manager, dataset_name, username)
    # Convert the transformed data into a list of Record objects
    records = create_records(transformed_data)
    # Add the Record objects to the dataset
    add_records_to_dataset(manager, dataset_name, records)
    # Return the dataset URL
    return dataset_url
# Generate or validate the dataset name
def generate_or_validate_dataset_name(username, dataset_name):
    r"""Generates a default dataset name or validates and formats a user-provided name.

    Args:
        username (str): Hugging Face username.
        dataset_name (str, optional): User-provided custom dataset name.

    Returns:
        str: Formatted dataset name.
    """
    if dataset_name is None:
        # If no dataset name is provided, generate a default name with the username and current date
        current_date = datetime.now().strftime("%Y%m%d")
        dataset_name = f"star_traces_{current_date}"
    # Format the dataset name to include the username
    return f"{username}/{dataset_name}"
# Create a dataset on Hugging Face
def create_dataset(manager, dataset_name):
    r"""Creates a new dataset on Hugging Face and returns the dataset URL.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.

    Returns:
        str: URL of the created dataset.
    """
    dataset_url = manager.create_dataset(dataset_name)
    return dataset_url

# Create a dataset card with metadata
def create_dataset_card(manager, dataset_name, username):
    r"""Creates a dataset card to add metadata.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        username (str): Hugging Face username.
    """
    manager.create_dataset_card(
        dataset_name=dataset_name,
        description="A dataset containing mathematical problem-solving traces with step-by-step solutions and improvement history. Each record includes a mathematical problem, its final solution, and the iterative improvement process.",
        license="mit",  # Using lowercase 'mit' as required by HuggingFace
        tags=["math", "problem-solving", "step-by-step", "traces"],
        authors=[username],
        language=["en"],
        task_categories=["text-generation"],
        content="This dataset contains mathematical problem-solving traces generated using the CAMEL framework. Each entry includes:\n\n"
        "- A mathematical problem statement\n"
        "- A detailed step-by-step solution\n",
    )
# Convert transformed data into Record objects
def create_records(transformed_data):
    r"""Converts transformed data into a list of Record objects.

    Args:
        transformed_data (list): List of trace dictionaries from star_output.json.

    Returns:
        list: List of Record objects.
    """
    records = []
    for trace in transformed_data:
        record = Record(
            source_type=trace['type'],
            problem=trace['problem'],
            solution=trace['final_trace'],
        )
        records.append(record)
    return records

# Add Record objects to the dataset
def add_records_to_dataset(manager, dataset_name, records):
    r"""Adds a list of Record objects to the dataset.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        records (list): List of Record objects.
    """
    manager.add_records(dataset_name, records)
Go to https://huggingface.co/settings/tokens/new?tokenType=write to create a Hugging Face API token, and make sure it has write access to repositories.
Next, create a new dataset on Hugging Face:
# Get HuggingFace token and username
HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN
username = input("Enter your HuggingFace username: ")
dataset_name = input("Enter your dataset name:")
# Load the star output data
current_dir = os.getcwd()
star_output_path = os.path.join(current_dir, './generated_data.json')
traces = load_star_output(star_output_path)
# Upload the data to HuggingFace
dataset_url = upload_to_huggingface(traces, username, dataset_name)
print(f"\nDataset uploaded successfully!")
print(f"You can view your dataset at: {dataset_url}")
Here is a preview of the final uploaded data.
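If you want to verify the upload programmatically, a minimal sketch is to pull the dataset back down with the datasets library, assuming the username and dataset_name you entered above (the token argument is only needed if the dataset repo is private):
from datasets import load_dataset

# Load the freshly uploaded dataset back from Hugging Face to verify it (sketch)
uploaded = load_dataset(
    f"{username}/{dataset_name}",
    token=os.environ["HUGGING_FACE_TOKEN"],  # only needed for private repos
)
print(uploaded)  # shows the splits and column names of the uploaded dataset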
This concludes the tutorial: we have shown in detail how to use the DeepSeek-R1 model on the SiliconFlow SiliconCloud platform together with CAMEL's synthetic data generation module to distill mathematical reasoning data and upload the resulting dataset to the Hugging Face platform.
Recent updates
• A large number of applications have integrated DeepSeek R1 & V3
• SiliconCloud launches Pro versions of DeepSeek R1 & V3
• SiliconFlow helps Huawei's Xiaoyi assistant integrate DeepSeek-R1
• SiliconCloud launches enterprise-grade DeepSeek R1 & V3 services
• SiliconCloud launches accelerated DeepSeek-R1 distilled models
• SiliconFlow and Huawei Cloud launch DeepSeek R1 & V3 services