How to Build a Tool-Calling Llama Agent on a Custom MCP Server (with Code)

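I. Background and Goals: From Knowledge Isolation to a Local Agent

As AI applications become more widespread, privacy protection and data sovereignty have become major challenges. Traditional AI models rely on external services, which exposes private knowledge to the risk of leakage. This article describes in detail how to build a fully local AI agent: a custom Model Context Protocol (MCP) server provides knowledge isolation, and the lightweight Llama 3.2 model provides tool calling. With this setup, data stays entirely on the local machine, and the agent can still interact closely with a private knowledge base through tool calls, which offers a workable path for local intelligent applications.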

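(1) The Story Behind the MCP Server

In a previous article, I built a custom MCP server with three core goals:

Read-only access control

Path privacy protection

A deep understanding of the protocol

The server communicates with the outside world over standard input/output (stdio), connects to an Obsidian knowledge base, and exposes a tool-calling interface. For example, the list_knowledges tool returns the list of files in the knowledge base, and the get_knowledge_by_uri tool retrieves a specific piece of knowledge by its URI. These tools lay the foundation for the agent built below.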
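To make the read-only and path-privacy goals concrete, here is a minimal standard-library sketch of a knowledge store behind the two tools. It is not the server from the earlier article: the KnowledgeStore class, the knowledge:// URI scheme, and the metadata fields (uri, name, size) are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


class KnowledgeStore:
    """Read-only view over a folder of Markdown notes (hypothetical helper)."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)
        # Map opaque URIs to real paths so that paths never leave this class.
        self._index = {
            self._make_uri(path): path for path in sorted(self._root.rglob("*.md"))
        }

    def _make_uri(self, path: Path) -> str:
        # Hash the relative path; the "knowledge://" scheme is an assumption.
        rel = path.relative_to(self._root).as_posix()
        digest = hashlib.sha256(rel.encode("utf-8")).hexdigest()[:12]
        return f"knowledge://{digest}"

    def list_knowledges(self) -> list[dict]:
        """Return note metadata without exposing filesystem paths."""
        return [
            {"uri": uri, "name": path.stem, "size": path.stat().st_size}
            for uri, path in self._index.items()
        ]

    def get_knowledge_by_uri(self, uri: str) -> str:
        """Return the text of one note; unknown URIs are rejected."""
        if uri not in self._index:
            raise KeyError(f"Unknown knowledge URI: {uri}")
        return self._index[uri].read_text(encoding="utf-8")
```

The class offers no write method, and callers only ever see hashed URIs, so a model that receives these results cannot learn where the vault lives on disk.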

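(2) The Challenge of Moving from an External Model to a Local One

Although the existing MCP server already achieves knowledge isolation, relying on an external AI model (such as Claude) still has two problems:

Cost constraints

Privacy risks

Building a fully local agent is therefore the natural next step. The core goals are:

Implement an MCP client that communicates with the custom server.

Integrate a locally running LLM to replace the external service.

Build an agent framework that combines MCP tools with the LLM to implement the question-answering logic.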

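II. Core Technology Choices: A Lightweight Model and the Tool-Calling Mechanism

(1) Choosing the Llama 3.2 Model

In agent development, the language model is the core "brain". Because the agent must run locally, the model has to be lightweight and support tool calling. The 1B and 3B models in the Llama 3.2 series are a good fit, with the following characteristics:

Device friendliness

Tool-calling support

Balanced performance

According to Meta's official figures, the Llama 3.2 3B model stays small while still handling complex instructions and generating high-quality responses. In tool-calling scenarios, for example, the model can parse function parameters and produce a correctly formatted call, although its multi-turn conversation ability is somewhat weaker than that of large models with 70B parameters or more.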

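(2) How Tool Calling Works

Tool calling in Llama models relies on a specific prompt format and a role mechanism. The core flow is as follows:

1. Define the tools in the system prompt: declare the available tools and their parameters in JSON format in the system prompt.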

    {
      "name": "get_user_name",
      "description": "Retrieve a name for a specific user by their unique identifier",
      "parameters": {
        "type": "dict",
        "required": ["user_id"],
        "properties": {"user_id": {"type": "integer"}}
      }
    }

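2. A user question triggers the call: the user's question prompts the model to decide whether a tool is needed. For example, when asked for the name of the user whose ID is 7890, the model generates the tool-call expression [get_user_name(user_id=7890)].

3. Execute the tool and feed back the result: the application parses the call expression, executes the tool through the MCP client, and returns the result (such as {"output": "Hyunjong Lee"}) to the model under the ipython role.

4. Synthesize the response: the model combines the tool output into a final answer, such as "The name of user who has the ID is Hyunjong Lee".

Note that lightweight models (such as 3B) can be unstable when handling multi-turn tool calls. Meta recommends preferring models of 70B or larger for complex conversational scenarios, but the 3B model is still adequate for single-turn or simple multi-turn calls.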
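The four steps above can be sketched as prompt assembly. The header and end-of-turn tokens follow the Llama 3 chat format; the wording of the system instruction and the helper names render and build_prompt are assumptions for illustration, not the exact prompt used in this article.

```python
import json

GET_USER_NAME = {
    "name": "get_user_name",
    "description": "Retrieve a name for a specific user by their unique identifier",
    "parameters": {
        "type": "dict",
        "required": ["user_id"],
        "properties": {"user_id": {"type": "integer"}},
    },
}


def render(role: str, content: str) -> str:
    """Wrap one message in Llama 3 header and end-of-turn tokens."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"


def build_prompt(tools, question, tool_call="", tool_output=""):
    """Assemble the prompt for step 2 (no tool output yet) or step 4 (with it)."""
    # The instruction wording below is an assumption, not Meta's official text.
    system = (
        "You may call these functions using the format "
        "[func_name(param=value)]:\n" + json.dumps(tools, indent=2)
    )
    parts = [render("system", system), render("user", question)]
    if tool_call:
        # Step 3: replay the model's call, then the tool result as "ipython".
        parts.append(render("assistant", tool_call))
        parts.append(render("ipython", tool_output))
    # Leave the assistant header open so the model writes the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

Calling build_prompt([GET_USER_NAME], question) gives the prompt for the decision step; passing the call expression and json.dumps({"output": "Hyunjong Lee"}) as well gives the prompt for the synthesis step.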

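III. Agent Architecture: From the Client to the Conversation Logic

The agent's overall architecture has three core components: the MCP client and manager, the LLM interface, and the agent logic layer. Each is described in detail below.

(1) MCP Client and Manager

1. MCP client implementation

The client is built with the Python MCP SDK and communicates with the server process over standard input/output. The core class MCPClient connects to the server, initializes the session, and executes tool calls: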

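    from contextlib import AsyncExitStack

    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client


    class MCPClient:
        def __init__(self):
            # Owns the transport and session so they can be closed together.
            self.exit_stack = AsyncExitStack()
            self.session = None
            self.name = ""

        async def connect_to_server(self, server_script_path):
            # Launch the server as a subprocess and talk to it over stdio.
            server_params = StdioServerParameters(command="python", args=[server_script_path])
            stdio_transport = await self.exit_stack.enter_async_context(stdio_client(server_params))
            self.read, self.write = stdio_transport
            self.session = await self.exit_stack.enter_async_context(ClientSession(self.read, self.write))
            init_result = await self.session.initialize()
            self.name = f"{init_result.serverInfo.name}(v{init_result.serverInfo.version})"

        async def list_tools(self):
            # Assumed helper: the manager below calls it, but the original omits it.
            response = await self.session.list_tools()
            return response.tools

        async def call_tool(self, name, args):
            response = await self.session.call_tool(name, args)
            return response.isError, response.content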

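The client follows the MCP handshake: it first sends an initialize request and receives server information (such as the protocol version and capabilities), then enumerates what is available through methods such as list_tools and list_resources, and finally executes a specific tool through call_tool.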
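For readers curious about what travels over stdio during that handshake, the sketch below builds the JSON-RPC 2.0 messages by hand. The SDK normally does this internally, so this is only an illustration. The method names come from the MCP specification; the protocol version string, the client name, and the helper functions request and notification are assumptions.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC request ids, increasing from 1


def request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request as one line for the stdio transport."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def notification(method: str) -> str:
    """Notifications carry no id, so the server sends no reply."""
    return json.dumps({"jsonrpc": "2.0", "method": method})


handshake = [
    # 1. Announce the client and negotiate the protocol version.
    request("initialize", {
        "protocolVersion": "2024-11-05",  # assumed version string
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    }),
    # 2. Confirm that initialization finished.
    notification("notifications/initialized"),
    # 3. Enumerate tools, then call one of them.
    request("tools/list"),
    request("tools/call", {"name": "list_knowledges", "arguments": {}}),
]
```

Each entry is a single line of JSON, which matches the newline-delimited framing used by the stdio transport.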

2. MCP manager

To support multiple servers, the MCPClientManager class handles creating and destroying client instances and mapping tools to clients:

    class MCPClientManager:
        def __init__(self):
            self.clients = []
            self.tool_map = {}  # maps a tool name to the index of its client

        async def init_mcp_client(self, server_paths):
            for path in server_paths:
                client = MCPClient()
                await client.connect_to_server(path)
                self.clients.append(client)
                # Register every tool of this server under the new client's index.
                tools = await client.list_tools()
                for tool in tools:
                    self.tool_map[tool.name] = len(self.clients) - 1

        async def call_tool(self, name, params):
            idx = self.tool_map.get(name, -1)
            if idx == -1:
                raise Exception(f"Tool {name} not found")
            return await self.clients[idx].call_tool(name, params)

The manager maintains the mapping between tools and clients, so each tool-call request is routed to the correct server instance.
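The routing idea can be tried without any real server. The following sketch uses a stub in place of MCPClient; StubClient, build_tool_map, and route are hypothetical names. Unlike the manager in this article, where a later server overwrites an earlier mapping, this sketch keeps the first registration for a given tool name, which is a deliberate variation.

```python
import asyncio


class StubClient:
    """Stands in for MCPClient; a real client would talk to a server process."""

    def __init__(self, name: str, tools: list[str]) -> None:
        self.name = name
        self.tools = tools

    async def call_tool(self, tool: str, params: dict) -> str:
        return f"{self.name} handled {tool} with {params}"


def build_tool_map(clients: list[StubClient]) -> dict[str, int]:
    """Map each tool name to the index of the client that serves it."""
    tool_map: dict[str, int] = {}
    for idx, client in enumerate(clients):
        for tool in client.tools:
            # First registration wins, so a later server cannot shadow a tool.
            tool_map.setdefault(tool, idx)
    return tool_map


async def route(clients, tool_map, tool, params):
    """Forward a call to the client that registered the tool."""
    if tool not in tool_map:
        raise KeyError(f"Tool {tool} not found")
    return await clients[tool_map[tool]].call_tool(tool, params)
```

Whether the first or the last registration should win is a policy decision; with a single knowledge-base server, as in this article, the two behave the same.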

(2) LLM Integration

1. Local deployment with Llama.cpp

The Llama.cpp library is used to run the Llama model locally, in the following steps:

Model download

    from huggingface_hub import snapshot_download

    # Download into the folder that the conversion step reads from.
    snapshot_download("meta-llama/Llama-3.2-3B-Instruct", local_dir="./models/Llama-3.2-3B-Instruct")

Format conversion

    python convert_hf_to_gguf.py ./models/Llama-3.2-3B-Instruct --outfile model.gguf --outtype f16

Python interface wrapper (the LlamaCPP class)

    from llama_cpp import Llama


    class LlamaCPP:
        def __init__(self, model_path):
            # n_ctx caps the context window (prompt plus generated tokens).
            self.model = Llama(model_path=model_path, n_ctx=1024)

        def generate(self, prompt, max_tokens=512):
            output = self.model(prompt, max_tokens=max_tokens)
            return output["choices"][0]["text"].strip()

2. Prompt engineering and conversation history

To match Llama's prompt format, the LlamaMessage and LlamaPrompt classes handle message formatting and conversation history:

    class LlamaMessage:
        def __init__(self, role, content="", tool_scheme=""):
            self.role = role
            self.content = content
            self.tool_scheme = tool_scheme

        def template(self, tool_enabled=False):
            # The Llama 3 chat format puts a blank line after the role header.
            prompt = f"<|start_header_id|>{self.role}<|end_header_id|>\n\n"
            if tool_enabled and self.tool_scheme:
                prompt += self.tool_scheme
            if self.content:
                prompt += f"{self.content}<|eot_id|>"
            return prompt


    class LlamaPrompt:
        def __init__(self):
            self.system_prompt = LlamaMessage("system", "You are a helpful assistant.")
            self.history = History()  # chat-history container, not shown here

        def get_generation_prompt(self, tool_enabled=False, last=50):
            # End with an empty assistant message so the model writes the next turn.
            prompt = [self.system_prompt] + self.history.get_chat_history(last) + [LlamaMessage("assistant")]
            return ''.join([msg.template(tool_enabled) for msg in prompt])

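The LlamaPrompt class supports dynamically adding messages for roles such as user, assistant, and tool result, and uses the tool_enabled parameter to decide whether the prompt includes the tool-calling instructions, so that a lightweight model is not confused by seeing tool instructions all the time.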
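The History container used by LlamaPrompt is not shown in this article. One possible shape, offered purely as an assumption, is a bounded list that returns the most recent messages; the max_messages parameter and the append method are hypothetical.

```python
from collections import deque


class History:
    """Bounded chat history (assumed shape; the original class is not shown)."""

    def __init__(self, max_messages: int = 200) -> None:
        # A deque with maxlen drops the oldest entry once the limit is reached.
        self._messages = deque(maxlen=max_messages)

    def append(self, message) -> None:
        self._messages.append(message)

    def get_chat_history(self, last: int = 50) -> list:
        """Return up to `last` most recent messages, oldest first."""
        if last <= 0:
            return []
        return list(self._messages)[-last:]
```

Limiting the number of messages also helps a small context window: with n_ctx set to 1024, a long history would otherwise push the prompt past what the model can read.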

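(3) Core Agent Logic

1. Tool-call decision and result handling

The agent uses a regular expression to detect the tool-call pattern, parses the function names and parameters, and calls the corresponding MCP tools: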

    import json
    import re


    class Agent:
        def __init__(self, model, prompt, mcp_manager):
            self.llm = model
            self.prompt = prompt
            self.mcp_manager = mcp_manager
            # Matches expressions such as [tool_a(x=1), tool_b(y="z")].
            self.tool_pattern = re.compile(r'\[([A-Za-z0-9_]+\(.*?\),?\s?)+\]')

        def _is_tool_required(self, response):
            return bool(self.tool_pattern.match(response))

        async def get_result_tool(self, response):
            results = []
            # parse_func_params (not shown) yields (tool name, parameters) pairs.
            for name, params in self.parse_func_params(response):
                is_error, content = await self.mcp_manager.call_tool(name, params)
                results.append({"name": name, "output": [c.text for c in content]})
            return json.dumps(results)
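The parse_func_params helper is not shown above. One way to write it, sketched here as an assumption rather than the author's implementation, is to parse the expression with Python's ast module instead of a second regular expression. This variant accepts keyword arguments with literal values only.

```python
import ast


def parse_func_params(response: str) -> list[tuple[str, dict]]:
    """Parse '[f(a=1), g(b="x")]' into [('f', {'a': 1}), ('g', {'b': 'x'})]."""
    tree = ast.parse(response.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("Expected a bracketed list of calls")
    calls = []
    for node in tree.body.elts:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            raise ValueError("Each element must be a plain function call")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("Only keyword arguments are supported")
        # literal_eval accepts only literals, so no model output is executed.
        params = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, params))
    return calls
```

Because the values go through ast.literal_eval, an expression such as [f(x=__import__('os'))] is rejected with an error instead of being evaluated, which matters when the text comes from a model.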

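2. Conversation flow control

The agent's chat method implements the complete conversation flow:

User question: the question is wrapped together with the tool instructions and appended to the history.

Model response: the model generates a reply with the tool instructions enabled.

Tool execution: if the reply is a tool call, the tool is executed and its result is appended to the history.

Result synthesis: the model generates the final answer with the tool instructions disabled.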

    async def chat(self, question):
        tool_scheme = TOOL_CALL_PROMPT.format(function_scheme=self.mcp_manager.get_func_scheme())
        user_msg = self.prompt.get_user_prompt(question, tool_scheme)
        self.prompt.append_history(user_msg)
        response = self.llm.generate(self.prompt.get_generation_prompt(tool_enabled=True))
        if self._is_tool_required(response):
            tool_result = await self.get_result_tool(response)
            tool_msg = self.prompt.get_tool_result_prompt(tool_result)
            self.prompt.append_history(tool_msg)
            response = self.llm.generate(self.prompt.get_generation_prompt(tool_enabled=False))
        return response

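By switching the tool_enabled parameter, the agent includes the tool instructions during the tool-call decision stage and removes them during the result-synthesis stage. This keeps the model from fixating on tool calls and makes the answers more coherent.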
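The two-pass switch can be simulated without a model. In the sketch below, stub_llm is a fake model that requests a tool only while the tool instructions are visible; the "AVAILABLE TOOLS" marker, the fixed tool output, and the function names are all assumptions made for illustration.

```python
import re

TOOL_CALL = re.compile(r"\[[A-Za-z0-9_]+\(.*?\)\]")


def stub_llm(prompt: str) -> str:
    """Fake model: asks for a tool only while tool instructions are visible."""
    if "AVAILABLE TOOLS" in prompt:
        return "[list_knowledges()]"
    return "You have 2 notes."


def chat(question: str) -> tuple[str, list[str]]:
    """Run the decision pass, then the synthesis pass if a tool was requested."""
    history = [f"user: {question}"]
    trace = []
    # Pass 1: tool instructions are included so the model can request a call.
    response = stub_llm("AVAILABLE TOOLS: list_knowledges\n" + "\n".join(history))
    trace.append(response)
    if TOOL_CALL.fullmatch(response):
        # A real agent would execute the tool here; the output is hard-coded.
        history.append('ipython: ["a.md", "b.md"]')
        # Pass 2: instructions are dropped so the model only writes the answer.
        response = stub_llm("\n".join(history))
        trace.append(response)
    return response, trace
```

The trace returned alongside the answer records one tool-call expression followed by one plain answer, which is the behavior the switch is meant to produce.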

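IV. Experiments and Optimization: From Finding Problems to Improving Results

(1) Comparing Ways of Injecting Tool Instructions

1. System prompt injection (tool instructions always exposed)

Observed problems

Empty responses or irrelevant tool calls appeared when synthesizing the answer.

The model over-relied on tools and repeated calls even when it already had the result.

Example: after the user queried specific knowledge content, the model wrongly called the list_knowledges tool while answering.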

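2. User prompt injection (exposed only when needed)

Optimization strategy

Include the tool instructions only when generating the tool-call decision.

Remove the instructions in the result-synthesis stage so the model focuses on integrating the content.

Improvement

Answer relevance improved markedly, and tool calls became more precise.

The model made effective use of tool results, for example by correctly parsing knowledge content and generating a summary.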

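(2) Practical Use Cases

1. Knowledge summary generation

Query

Process

The model calls the get_knowledge_by_uri tool to fetch the note content.

It automatically formats the output as a table depending on the content length. Despite minor parameter errors, the result is highly structured.

2. Empty note detection

Query

Process

The model calls list_knowledges to get the metadata of all notes.

It identifies empty notes by file size in bytes. The results are only partly correct, and the filtering logic needs further work.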
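One possible way to tighten the filtering, suggested here and not taken from the experiments above, is to apply the size check in code after list_knowledges returns instead of leaving the comparison to the model. The metadata fields name and size, and the function find_empty_notes, are assumptions.

```python
def find_empty_notes(notes: list[dict], max_bytes: int = 0) -> list[str]:
    """Return the names of notes whose size is at most `max_bytes`."""
    return [
        note["name"]
        for note in notes
        # Entries without a size field are skipped, not treated as empty.
        if "size" in note and note["size"] <= max_bytes
    ]
```

Raising max_bytes slightly would also catch notes that contain only whitespace or an empty front-matter block, at the cost of occasional false positives.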

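3. Question generation

Query

Process

The agent calls a tool to fetch the note content.

The model analyzes the structure of the content and generates questions that meet the request, such as "What is supervised learning?"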

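(3) Performance and Limitations

Model performance

Functional limitations

Multi-turn tool calling is weak, which makes complex reasoning tasks hard to handle.

Vague queries are poorly understood; the model needs explicit parameters (such as the correct URI) to execute a tool effectively.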