The model and weight files were already downloaded, so those steps are skipped. open-webui was also installed yesterday and is likewise skipped. This is a no-fluff walkthrough.
Test environment
The tests were run on a rented AutoDL GPU server.
•Software: PyTorch 2.5.1, Python 3.12 (Ubuntu 22.04), CUDA 12.4
•Hardware
○GPU: RTX 4090 (24GB) × 4 (only one GPU is actually used)
○CPU: 64 vCPU Intel(R) Xeon(R) Gold 6430
○RAM: 480GB (at least 382GB is required)
○Disk: 1.8TB (about 380GB is actually used)
1. Create the environment
Create a virtual environment
conda create --prefix=/root/autodl-tmp/jacky/envs/deepseekr1-671b python==3.12.3
conda activate /root/autodl-tmp/jacky/envs/deepseekr1-671b
Install PyTorch, packaging, and ninja
pip install torch packaging ninja cpufeature numpy
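Optionally, verify that PyTorch sees the GPU and the expected CUDA version before building anything (a quick sanity check, not part of the original flow):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"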
Install flash-attn
pip install flash-attn
Install libstdcxx-ng
conda install -c conda-forge libstdcxx-ng
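libstdcxx-ng supplies a newer libstdc++ inside the conda env. If you want to confirm which GLIBCXX symbols it exposes, a quick check (assuming the strings tool from binutils is installed, and that the env is activated so $CONDA_PREFIX is set):
strings $CONDA_PREFIX/lib/libstdc++.so.6 | grep GLIBCXX | tail -n 5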
2. Build and install ktransformers
cd /root/autodl-tmp/jacky
cp -r ktransformers ktransformers-new
cd ktransformers-new
export TORCH_CUDA_ARCH_LIST="8.9"
pip install -r requirements-local_chat.txt
pip install setuptools wheel packaging
Edit ./install.sh and add:
export MAX_JOBS=64
export CMAKE_BUILD_PARALLEL_LEVEL=64
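The value 64 here matches the 64 vCPUs of this server; on another machine you could instead derive it from the core count, e.g.:
export MAX_JOBS=$(nproc)
export CMAKE_BUILD_PARALLEL_LEVEL=$(nproc)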
sh install.sh
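After the script finishes, a quick sanity check (not part of the original flow) that the package built and imports cleanly:
pip show ktransformers
python -c "import ktransformers; print('ktransformers import OK')"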
3. Run
Run ktransformers
export TORCH_CUDA_ARCH_LIST="8.9"
Start the command-line chat
export TORCH_CUDA_ARCH_LIST="8.9"
python ./ktransformers/local_chat.py --model_path /root/autodl-tmp/DeepSeek-R1 --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF --cpu_infer 64 --max_new_tokens 1000 --force_think true | tee runlog1.log
Start the local chat API endpoint
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py \
  --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ \
  --model_path /root/autodl-tmp/DeepSeek-R1 \
  --model_name deepseek-r1-new \
  --cpu_infer 64 \
  --max_new_tokens 8192 \
  --cache_lens 32768 \
  --total_context 32768 \
  --cache_q4 true \
  --temperature 0.6 \
  --top_p 0.95 \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
  --force_think \
  --use_cuda_graph \
  --host 127.0.0.1 \
  --port 12345
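Once the server is up, you can sanity-check it from the server itself. This is a minimal sketch assuming the ktransformers server exposes an OpenAI-compatible /v1/chat/completions route (the interface open-webui connects to); adjust the path if your version differs:
curl http://127.0.0.1:12345/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1-new", "messages": [{"role": "user", "content": "Hello"}], "stream": false}'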
Run open-webui
cd /root/autodl-tmp/jacky/open-webui
sh start.sh
Set up SSH forwarding
Once both the webui and the API endpoint are up on the server, create an SSH forwarding rule on the local PC:
ssh -CNg -L 3000:127.0.0.1:3000 root@connect.nmb1.seetacloud.com -p 22305
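Optionally, if you also want to call the API endpoint directly from the local PC, forward its port the same way (12345 matches the server launch above):
ssh -CNg -L 12345:127.0.0.1:12345 root@connect.nmb1.seetacloud.com -p 22305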
Open http://localhost:3000 in a browser to test.
4. Parameter tuning
Lower cpu_infer and observe the change in TPS
Conclusion first; the data follows:
Guess before testing: in theory, a smaller cpu_infer should be equivalent to pushing more work onto the GPU, so with enough GPU headroom things should improve (in the earlier tests GPU utilization was under 50%).
After testing and checking the ktransformers documentation, that guess turned out to be wrong. cpu_infer is usually better set somewhat larger (while staying below the actual number of CPU cores). It does not work by shifting more tasks to the GPU as cpu_infer shrinks; which task runs where is specified in the YAML optimize config and is independent of the cpu_infer value.
cpu_infer = 64
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 64 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12344
Performance(T/s): prefill 47.41532433167817, decode 5.593497795879592. Time(s): tokenize 0.08410024642944336, prefill 21.40658140182495, decode 84.56247186660767
Performance(T/s): prefill 44.39721927042498, decode 5.727537880501856. Time(s): tokenize 0.015021562576293945, prefill 15.856848955154419, decode 290.0024468898773
cpu_infer = 32
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 32 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12344
Performance(T/s): prefill 53.283252444225695, decode 7.717662862281071. Time(s): tokenize 0.07866573333740234, prefill 34.41982078552246, decode 69.19193148612976
Performance(T/s): prefill 46.742691185571395, decode 7.326065169900766. Time(s): tokenize 0.02002429962158203, prefill 39.407230377197266, decode 285.0097496509552
cpu_infer = 16
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ --model_path /root/autodl-tmp/DeepSeek-R1 --model_name deepseek-r1-new --cpu_infer 16 --max_new_tokens 8192 --cache_lens 32768 --total_context 32768 --cache_q4 true --temperature 0.6 --top_p 0.95 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --force_think --use_cuda_graph --host 127.0.0.1 --port 12344
Enable multiple GPUs
See kvcache-ai.github.io/ktransforme… Instead of ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml, use DeepSeek-V3-Chat-multi-gpu-4.yaml, or write a custom rule file.
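For example, only the optimize config needs to be swapped in the server launch used earlier; everything else stays the same (a sketch assuming the 4-GPU rule file ships with your ktransformers checkout):
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py \
  --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ \
  --model_path /root/autodl-tmp/DeepSeek-R1 \
  --model_name deepseek-r1-new \
  --cpu_infer 64 \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml \
  --use_cuda_graph \
  --host 127.0.0.1 \
  --port 12345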
Increase the context window
Command-line mode: local_chat.py
You can increase the context window size by setting --max_new_tokens to a larger value.
Server mode: server
Increase `--cache_lens` to a larger value.
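As an illustration (the numbers below are arbitrary assumptions and must fit within your VRAM and KV-cache budget):
# command-line mode: larger generation budget
python ./ktransformers/local_chat.py --model_path /root/autodl-tmp/DeepSeek-R1 --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF --cpu_infer 64 --max_new_tokens 4096
# server mode: larger KV cache and context
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 ktransformers/server/main.py \
  --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF/ \
  --model_path /root/autodl-tmp/DeepSeek-R1 \
  --cpu_infer 64 \
  --cache_lens 65536 \
  --total_context 65536 \
  --host 127.0.0.1 \
  --port 12345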
Move more weights to the GPU
Refer to ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-4.yaml:
- match:
    name: "^model\\.layers\\.([4-10])\\.mlp\\.experts$"
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cuda:0"
      generate_op: "KExpertsMarlin"
  recursive: False
You can modify the layer range as you want, e.g. name: "^model\.layers\.([4-10])\.mlp\.experts" to move more weights to the GPU.
Note: The first matched rule in the YAML will be applied. For example, if you have two rules that match the same layer, only the first rule's replacement will be valid.
Note: Currently, executing experts on the GPU conflicts with CUDA Graph. Without CUDA Graph, there will be a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6GB of VRAM), we do not recommend enabling this feature. We are actively working on optimization.
Note: KExpertsTorch is untested.
Upgrade to ktransformers 0.3.0
ktransformers v0.2.2 does not enable AMX. With AMX enabled, performance should improve by about 40%, but the previously released 0.3 preview had installation problems, so this is shelved for now. github.com/kvcache-ai/…
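Before bothering with the 0.3 preview, you can check whether the CPU exposes AMX at all; on supporting CPUs the kernel lists flags such as amx_tile, amx_bf16 and amx_int8:
grep -o 'amx_[a-z0-9]*' /proc/cpuinfo | sort -u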
5. Issues
The web search feature fails
File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/open_webui/routers/retrieval.py", line 1400, in process_web_search web_results = search_web( └ <function search_web at 0x7f3336425bc0> File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/open_webui/routers/retrieval.py", line 1329, in search_web return search_duckduckgo( └ <function search_duckduckgo at 0x7f3337487060> File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/open_webui/retrieval/web/duckduckgo.py", line 27, in search_duckduckgo ddgs_gen = ddgs.text( │ └ <function DDGS.text at 0x7f3336be84a0> └ <duckduckgo_search.duckduckgo_search.DDGS object at 0x7f32a0cf21e0> File "/root/autodl-tmp/jacky/ds2/lib/python3.12/site-packages/duckduckgo_search/duckduckgo_search.py", line 252, in text raise DuckDuckGoSearchException(err) │ └ TimeoutException('https://lite.duckduckgo.com/lite/ RuntimeError: error sending request for url (https://lite.duckduckgo.com/... └ <class 'duckduckgo_search.exceptions.DuckDuckGoSearchException'>duckduckgo_search.exceptions.DuckDuckGoSearchException: https://lite.duckduckgo.com/lite/ RuntimeError: error sending request for url (https://lite.duckduckgo.com/lite/): operation timed outCaused by: operation timed out
6. Preliminary conclusions
Combining these results with earlier write-ups, hardware guidance for deploying the KTransformers project:
Impact of single vs. multiple GPUs on real-world throughput
The GPU contributes little to real-world throughput; a single 3090, a single 4090, or a multi-GPU server makes little difference. Just reserve at least 20GB of VRAM (14GB is enough for a minimum-viability experiment);
Impact of weight-offload rules on TPS
On a multi-GPU server you can go further and hand-write weight-offload rules so that more GPUs take part in inference. This reduces RAM requirements somewhat but does little for throughput. The most economical setup is still a single GPU plus a large amount of RAM;
Impact of the KTransformers version on performance
KTransformers currently has V0.2.0, V0.2.1, V0.2.2 and V0.3.0. V0.3.0 is only available as a preview and is distributed and installed as a binary only, whereas V0.2.0 and V0.2.1 support all kinds of CPUs. Starting from V0.3.0, only AMX-capable CPUs (i.e. the latest few generations of Intel CPUs) are supported. The deployment flow and commands are identical across these versions; if your CPU supports AMX, consider experimenting with V0.3.0, which should improve inference speed by roughly 40%.