SiliconFlow, February 15
SiliconCloud x CAMEL: How to Distill Math Reasoning Data with DeepSeek-R1

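This article shows how to use the CAMEL framework to efficiently distill the math reasoning ability of DeepSeek R1, served on the SiliconFlow platform, into a useful dataset and upload it to Hugging Face. It covers installing dependencies, setting API keys, downloading and processing data, distilling reasoning data, and uploading the results.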

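📌Install the required Python libraries and set the SiliconFlow API key

📌Download raw math data from Hugging Face and process it

📌Distill math reasoning data that includes the thought process by running CAMEL's Self-Improve data generation module

📌Upload the generated data to Hugging Face, including loading it, converting its format, and creating a dataset card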

CAMEL-AI 2025-02-14 19:05 Beijing

A hands-on guide to distilling data from large models efficiently and at low cost.


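DeepSeek-R1 has drawn wide attention for its strong math reasoning and efficient logical processing. From basic arithmetic to complex math problems, it handles them well and gives developers solid computational support.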

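Now you can combine the full-size DeepSeek-R1 model on SiliconCloud, SiliconFlow's large-model cloud service platform (https://cloud.siliconflow.cn/models), which is efficient, stable, and affordable, with CAMEL, the multi-agent framework released by the CAMEL-AI open-source community (https://huggingface.co/datasets/camel-ai/). Using long chain-of-thought (CoT), you can extract detailed reasoning for math problems and distill high-quality math reasoning data from DeepSeek R1. The resulting dataset has been uploaded to Hugging Face for public use, to support research on stronger math reasoning.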

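This tutorial walks you step by step through using the CAMEL framework to efficiently distill the math reasoning ability of DeepSeek R1 on SiliconFlow's SiliconCloud platform and generate a useful dataset.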

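Using the CAMEL-AI community's synthetic data generation tools, CAMEL-AI has built three high-quality datasets.
Whether you want to explore how AI solves complex problems or dig deeper into math reasoning, these datasets are an excellent resource, and they are now published on the Hugging Face platform.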

1. Preparation


1. Install dependencies

First, install the required Python libraries by running the following commands from the command line:

pip install "git+https://github.com/camel-ai/camel.git@4210cb0849f3f13d6a46fefeb9e2c3e791c158cb#egg=camel-ai"
pip install datasets
pip install rouge
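
Before moving on, it can help to confirm that the three packages are importable in the current environment. The snippet below is a minimal sketch, not part of the original tutorial: the helper name `missing_modules` is hypothetical, and it assumes the import names are `camel`, `datasets`, and `rouge`, matching the imports used later in this tutorial.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the three packages installed above.
    missing = missing_modules(["camel", "datasets", "rouge"])
    print("Missing modules:", missing or "none")
```

If the list is not empty, rerun the corresponding `pip install` command in the same Python environment that will run the rest of the code.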


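2. Set the SiliconFlow SiliconCloud API key

Go to the SiliconCloud website and set up your SILICONFLOW_API_KEY (https://cloud.siliconflow.cn/account/ak). The key will be used to distill math reasoning data together with the thought process.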


from getpass import getpass
import os

SILICONFLOW_API_KEY = getpass('Enter your SILICONFLOW_API_KEY: ')
os.environ["SILICONFLOW_API_KEY"] = SILICONFLOW_API_KEY

# To make DeepSeek R1 respond with its thought process content, set the following environment variable
os.environ["GET_REASONING_CONTENT"] = "True"
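
When the code runs non-interactively (for example in a scheduled job), prompting with `getpass` is not possible, so the key has to come from the environment. The following is a small sketch under that assumption; `require_env` is a hypothetical helper, not part of CAMEL, that fails early with a readable message if the variable is missing or blank.

```python
import os


def require_env(name, environ=os.environ):
    """Return the stripped value of an environment variable or raise a clear error."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the script.")
    return value


if __name__ == "__main__":
    try:
        require_env("SILICONFLOW_API_KEY")
        print("SILICONFLOW_API_KEY is available.")
    except RuntimeError as error:
        print(error)
```

Failing at this point is cheaper than discovering a missing key only after the first model request is rejected.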


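3. Download data from Hugging Face

Start by preparing the raw math data from Hugging Face. Each item consists mainly of two parts: a question and an answer. The steps below use the GSM8K dataset as an example.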

# Set the number of problems to download from GSM8K in huggingface
NUMBER_OF_PROBLEMS = 10

import json
import re
import uuid
from pathlib import Path
from datasets import load_dataset


def download_gsm8k_dataset():
    try:
        # Load the dataset using the datasets library
        dataset = load_dataset("openai/gsm8k", "main")

        # Get the items from train split
        data = dataset['train'].select(range(NUMBER_OF_PROBLEMS))

        # Convert to the desired format
        formatted_data = []
        for item in data:
            # Extract the final answer from the solution
            solution = item['answer']
            if solution:
                # GSM8K solutions typically end with "#### number"
                match = re.search(r'####\s*(\d+)', solution)
                if match:
                    number = match.group(1)
                    # Replace the "#### number" with "\boxed{number}"
                    solution = re.sub(
                        r'####\s*\d+', f'\\\\boxed{{{number}}}', solution
                    )

            formatted_item = {
                "id": str(uuid.uuid4()),  # GSM8K doesn't provide IDs
                "problem": item['question'],
                "type": "openai/gsm8k",  # All problems are from GSM8K
                "solution": solution,  # Use the modified solution with \boxed
            }
            formatted_data.append(formatted_item)

        # Save to a file
        output = formatted_data
        output_file = "downloaded_gsm8k_10.json"
        with open(output_file, "w") as f:
            json.dump(output, f, indent=2)
        print(f"Successfully downloaded and saved GSM8K dataset to {output_file}")
    except Exception as e:
        print(f"Error downloading GSM8K dataset: {e}")


if __name__ == "__main__":
    download_gsm8k_dataset()
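
The core of the conversion above is rewriting the GSM8K answer marker `#### number` as `\boxed{number}`. The sketch below isolates that step so it can be tried on a single string without downloading anything. It is a variant rather than the tutorial's exact code: the name `to_boxed` is hypothetical, and the widened pattern (an optional minus sign, thousands separators, and a decimal part) is an assumption about inputs the original `\d+` pattern would only partly match.

```python
import re

# Matches an answer marker such as "#### 72", "#### -5" or "#### 1,200".
ANSWER_PATTERN = re.compile(r"####\s*(-?[\d,]*\d(?:\.\d+)?)")


def to_boxed(solution):
    """Replace each '#### number' marker in the text with a LaTeX boxed answer."""

    def repl(match):
        # Drop thousands separators so "1,200" becomes "1200".
        number = match.group(1).replace(",", "")
        return f"\\boxed{{{number}}}"

    # A function replacement is inserted literally, so no extra
    # backslash escaping is needed for the regex engine.
    return ANSWER_PATTERN.sub(repl, solution)


if __name__ == "__main__":
    print(to_boxed("48 / 2 = 24 clips in May.\n48 + 24 = 72 clips in total.\n#### 72"))
```

Passing a function as the replacement sidesteps the four-backslash escape needed when the replacement is a plain string, which is the easiest part of the original snippet to get wrong.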


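We now have some sample data in the target format. Next, let's distill math reasoning data that includes the detailed thought process.

2. Distill math reasoning data with the thought process (Long CoT data)


1. Import the required libraries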

import nest_asyncio
nest_asyncio.apply()
import json
import os
import time
from camel.agents import ChatAgent
from camel.datagen import STaRPipeline
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType


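2. Set up the reasoning model and the evaluation model

Call the DeepSeek R1 model through SiliconFlow's SiliconCloud platform. CAMEL's model manager automatically switches between models depending on whether requests succeed.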

# Set DeepSeek R1 served by siliconflow as reason model 1
reason_model_1 = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="deepseek-ai/DeepSeek-R1",
    api_key=os.environ["SILICONFLOW_API_KEY"],
    url="https://api.siliconflow.cn/v1",
    model_config_dict={"max_tokens": 4096},  # Config the max_token carefully
)
# Set DeepSeek R1 served by deepseek cloud as reason model 2
reason_model_2 = ModelFactory.create(
    model_platform=ModelPlatformType.DEEPSEEK,
    model_type=ModelType.DEEPSEEK_REASONER,
)


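3. Run CAMEL's Self-Improve data generation module

Before running, check a few key parameters in the code below, such as problems_path, output_path, max_iterations, and batch_size.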




Once the run completes, you will find the generated math reasoning dataset at output_path.

start_time = time.time()
problems_path = "downloaded_gsm8k_10.json"
output_path = "generated_data.json"

# Load problems from JSON file
with open(problems_path, 'r') as f:
    problems = json.load(f)

# Initialize agent
reason_agent_system_message = """Answer my question and give your
final answer within \\boxed{}."""

evaluate_agent_system_message = """You are a highly critical teacher who
evaluates the student's answers with a meticulous and demanding approach.
"""

# Set up reason agent
reason_agent = ChatAgent(
    system_message=reason_agent_system_message,
    model=[reason_model_1, reason_model_2],  # Add models to the list; you can also switch to other models
)

# # Set up evaluate agent (optional)
# evaluate_agent = ChatAgent(
#     system_message=evaluate_agent_system_message
# )

# # Initialize reward model (optional)
# reward_model = NemotronRewardModel(
#     model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
#     url="https://integrate.api.nvidia.com/v1",
#     api_key=os.environ.get("NVIDIA_API_KEY"),
# )

# # Set score thresholds for different dimensions (optional)
# score_threshold = {
#     "correctness": 1.0,
#     "clarity": 0.0,
#     "completeness": 0.0,
# }
# # Or use a single threshold for all dimensions:
# score_threshold = 0.9

# Create and run pipeline
pipeline = STaRPipeline(
    reason_agent=reason_agent,
    problems=problems,  # Pass problems list directly
    output_path=output_path,
    max_iterations=0,
    batch_size=100,  # Size of batch to process the data (optional)
    # evaluate_agent=evaluate_agent,  # To use evaluate agent (optional)
    # score_threshold=score_threshold,  # Score thresholds for agent evaluation (optional)
    # reward_model=reward_model,  # To use a reward model (optional)
)

print("Start generation! May take some time, please wait..")
results = pipeline.generate(rationalization=False)

end_time = time.time()
execution_time = end_time - start_time

print(f"\nProcessed {len(results)} problems")
print(f"Results saved to: {output_path}")
print(f"Total execution time: {execution_time:.2f} seconds")
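
The commented-out `score_threshold` above can be either a per-dimension dictionary or a single number. The sketch below shows one plausible way such a check could be interpreted. It is an assumption for illustration, not the pipeline's actual logic: the function name `passes_threshold` is hypothetical, and the sample scores are made up.

```python
def passes_threshold(scores, threshold):
    """Check evaluation scores against a single float or a per-dimension dict."""
    if isinstance(threshold, dict):
        # Every listed dimension must reach its own minimum; a missing score counts as 0.
        return all(scores.get(dim, 0.0) >= minimum for dim, minimum in threshold.items())
    # A single number applies to every dimension that was scored.
    return all(value >= threshold for value in scores.values())


if __name__ == "__main__":
    sample_scores = {"correctness": 1.0, "clarity": 0.4, "completeness": 0.7}
    per_dimension = {"correctness": 1.0, "clarity": 0.0, "completeness": 0.0}
    print(passes_threshold(sample_scores, per_dimension))
    print(passes_threshold(sample_scores, 0.9))
```

Under this reading, the per-dimension form lets you insist on correctness while ignoring style, whereas a single threshold holds every dimension to the same bar.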


Use the following code to view the generated CoT data:

with open('generated_data.json', 'r') as f:
    data = json.load(f)
    print(json.dumps(data, indent=2))
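
Beyond printing the file, a quick sanity check is to compare the boxed answer in each generated trace with the boxed answer in the reference solution. The helper below is a minimal sketch that works on plain strings; `extract_boxed` and `answers_match` are hypothetical names, and applying them to `generated_data.json` assumes that each trace's text is a string containing a `\boxed{...}` answer.

```python
def extract_boxed(text):
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    content = []
    for char in text[start + len(marker):]:
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(char)
    return None  # braces never closed


def answers_match(generated, reference):
    """True when both texts contain a boxed answer and the answers are identical."""
    got, expected = extract_boxed(generated), extract_boxed(reference)
    return got is not None and expected is not None and got.strip() == expected.strip()


if __name__ == "__main__":
    print(answers_match("So the total is \\boxed{72}.", "48 + 24 = 72\n\\boxed{72}"))
```

The brace counter is there so that nested LaTeX such as `\boxed{\frac{1}{2}}` is returned whole instead of being cut at the first closing brace.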


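3. Upload the data to Hugging Face


The steps include loading the generated data, creating the dataset and its dataset card, converting the data into records, and uploading the records: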


# Import necessary modules and classes
from camel.datahubs.huggingface import HuggingFaceDatasetManager  # Manages interactions with Hugging Face datasets
from camel.datahubs.models import Record  # Represents a single record in the dataset
from datetime import datetime  # Handles date and time operations
import json  # For reading JSON files


def load_star_output(file_path):
    r"""Load and parse the star output JSON file.
    Args:
        file_path (str): Path to the star_output.json file.
    Returns:
        list: List of traces from the JSON file.
    """
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data['traces']


# Main function: Upload dataset to Hugging Face
def upload_to_huggingface(transformed_data, username, dataset_name=None):
    r"""Uploads transformed data to the Hugging Face dataset platform.
    Args:
        transformed_data (list): Transformed data, typically a list of dictionaries.
        username (str): Hugging Face username.
        dataset_name (str, optional): Custom dataset name.
    Returns:
        str: URL of the uploaded dataset.
    """
    # Initialize HuggingFaceDatasetManager to interact with Hugging Face datasets
    manager = HuggingFaceDatasetManager()

    # Generate or validate the dataset name
    dataset_name = generate_or_validate_dataset_name(username, dataset_name)

    # Create the dataset on Hugging Face and get the dataset URL
    dataset_url = create_dataset(manager, dataset_name)

    # Create a dataset card to add metadata
    create_dataset_card(manager, dataset_name, username)

    # Convert the transformed data into a list of Record objects
    records = create_records(transformed_data)

    # Add the Record objects to the dataset
    add_records_to_dataset(manager, dataset_name, records)

    # Return the dataset URL
    return dataset_url


# Generate or validate the dataset name
def generate_or_validate_dataset_name(username, dataset_name):
    r"""Generates a default dataset name or validates and formats a user-provided name.
    Args:
        username (str): Hugging Face username.
        dataset_name (str, optional): User-provided custom dataset name.
    Returns:
        str: Formatted dataset name.
    """
    if dataset_name is None:
        # If no dataset name is provided, generate a default name with the username and current date
        current_date = datetime.now().strftime("%Y%m%d")
        dataset_name = f"star_traces_{current_date}"

    # Format the dataset name to include the username
    return f"{username}/{dataset_name}"


# Create a dataset on Hugging Face
def create_dataset(manager, dataset_name):
    r"""Creates a new dataset on Hugging Face and returns the dataset URL.
    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
    Returns:
        str: URL of the created dataset.
    """
    dataset_url = manager.create_dataset(dataset_name)
    return dataset_url


# Create a dataset card with metadata
def create_dataset_card(manager, dataset_name, username):
    r"""Creates a dataset card to add metadata
    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        username (str): Hugging Face username.
    """
    manager.create_dataset_card(
        dataset_name=dataset_name,
        description="A dataset containing mathematical problem-solving traces with step-by-step solutions and improvement history. Each record includes a mathematical problem, its final solution, and the iterative improvement process.",
        license="mit",  # Using lowercase 'mit' as required by HuggingFace
        tags=["math", "problem-solving", "step-by-step", "traces"],
        authors=[username],
        language=["en"],
        task_categories=["text-generation"],
        content="This dataset contains mathematical problem-solving traces generated using the CAMEL framework. Each entry includes:\n\n"
                "- A mathematical problem statement\n"
                "- A detailed step-by-step solution\n",
    )


# Convert transformed data into Record objects
def create_records(transformed_data):
    r"""Converts transformed data into a list of Record objects.
    Args:
        transformed_data (list): List of trace dictionaries from star_output.json.
    Returns:
        list: List of Record objects.
    """
    records = []
    for trace in transformed_data:
        record = Record(
            source_type=trace['type'],
            problem=trace['problem'],
            solution=trace['final_trace'],
        )
        records.append(record)
    return records


# Add Record objects to the dataset
def add_records_to_dataset(manager, dataset_name, records):
    r"""Adds a list of Record objects to the dataset.
    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        records (list): List of Record objects.
    """
    manager.add_records(dataset_name, records)
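
Note that `create_records` indexes each trace with `trace['type']`, `trace['problem']`, and `trace['final_trace']`, so a single trace missing one of these keys would raise a `KeyError` and stop the upload. One way to guard against that is to separate complete traces from incomplete ones first. The sketch below is an optional addition under that assumption; `split_traces` is a hypothetical helper and the sample traces are made up.

```python
REQUIRED_TRACE_KEYS = ("type", "problem", "final_trace")


def split_traces(traces):
    """Split traces into those ready for upload and those missing a required field."""
    ready, skipped = [], []
    for trace in traces:
        if all(trace.get(key) for key in REQUIRED_TRACE_KEYS):
            ready.append(trace)
        else:
            skipped.append(trace)
    return ready, skipped


if __name__ == "__main__":
    sample = [
        {"type": "openai/gsm8k", "problem": "1 + 1 = ?", "final_trace": "\\boxed{2}"},
        {"type": "openai/gsm8k", "problem": "2 + 2 = ?", "final_trace": ""},
    ]
    ready, skipped = split_traces(sample)
    print(f"{len(ready)} ready, {len(skipped)} skipped")
```

Passing only the `ready` list to `create_records` keeps one failed generation from blocking the rest of the dataset.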


Go to https://huggingface.co/settings/tokens/new?tokenType=write to get a Hugging Face API token, and make sure write access to repositories is enabled.

Next, create a new dataset on Hugging Face:

# Get HuggingFace token and username
HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN
username = input("Enter your HuggingFace username: ")
dataset_name = input("Enter your dataset name:")

# Load the star output data
current_dir = os.getcwd()
star_output_path = os.path.join(current_dir, './generated_data.json')
traces = load_star_output(star_output_path)

# Upload the data to HuggingFace
dataset_url = upload_to_huggingface(traces, username, dataset_name)
print("\nDataset uploaded successfully!")
print(f"You can view your dataset at: {dataset_url}")
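
One detail worth knowing: `input()` returns an empty string, not `None`, when you simply press Enter, so the dated default name in `generate_or_validate_dataset_name` is never used and the result would be `username/` with no dataset name. The sketch below shows one way to treat blank input as "no name given". `resolve_dataset_name` is a hypothetical helper, and the `today` parameter exists only to make the example reproducible.

```python
from datetime import date


def resolve_dataset_name(username, raw_name, today=None):
    """Build 'username/name', falling back to a dated default for blank input."""
    name = (raw_name or "").strip()
    if not name:
        today = today or date.today()
        name = f"star_traces_{today.strftime('%Y%m%d')}"
    return f"{username.strip()}/{name}"


if __name__ == "__main__":
    # Blank input falls back to the dated default name.
    print(resolve_dataset_name("alice", "   ", today=date(2025, 2, 14)))
    # A provided name is used as-is after trimming whitespace.
    print(resolve_dataset_name("alice", " gsm8k-cot "))
```

An alternative with the same effect is to pass `dataset_name or None` to `upload_to_huggingface`, which lets the existing default logic run.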


A preview of the final uploaded data.

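That concludes this tutorial on using the DeepSeek-R1 model on SiliconFlow's SiliconCloud platform together with CAMEL's synthetic data generation module to distill math reasoning data and upload the resulting dataset to Hugging Face.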
