vLLM Local Deployment
DeepSeek deployment test for the AI training class
AI training: China Unicom cloud desktop
| Username | Password |
|---|---|
| cuichaoyang | xxxxxx |
Training VM delivery sheet
VM details

| VM Name | Note | Root Password |
|---|---|---|
| cuichaoyang | Initial root password: | xxxxxx |
| cuichaoyang | Root password after change: | xxxxxx |
Port mapping table

| Item | IP Address |
|---|---|
| Public address: | x.x.x.x |

| VM Name | VM ID | Mapped Port | VM Port |
|---|---|---|---|
| cuichaoyang | 8600037278844887040 | 12380 | 22 |
| cuichaoyang | 8600037278844887040 | 12396 | 1088 |
| cuichaoyang | 8600037278844887040 | 12412 | 8080 |
| cuichaoyang | 8600037278844887040 | 12428 | 11434 |
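Per this table, each mapped public port forwards to the corresponding VM port. For example, the VM's SSH service (VM port 22) is reached through public port 12380, with x.x.x.x standing for the public address above:

ssh -p 12380 root@x.x.x.x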
Local deployment
Check the sizes of the files in the current directory:
root@cuichaoyang:~# ls -lh
Output:
total 2.0G
drwxr-xr-x 2 root root 4.0K Apr 16 10:29 install_env
drwxr-xr-x 11 root root 4.0K Sep 20 2022 NVIDIA_CUDA-11.0_Samples
-rw-r--r-- 1 root root 348M Oct 2 2022 NVIDIA-Linux-x86_64-515.76.run
-rw-r--r-- 1 root root 1.7G Apr 12 23:02 ollama-linux-amd64.tgz
-rw-r--r-- 1 root root 1.3K Apr 11 15:11 run_cluster.sh
drwx------ 3 root root 4.0K Sep 19 2022 snap
-rw-r--r-- 1 root root 770 Apr 16 10:57 start-vllm.sh
1. Ollama Local Deployment
1.1 Ollama deployment on Windows
TBD
1.2 Ollama deployment on Linux
TBD
1.3 Running Ollama with Docker
TBD
2. vLLM Local Deployment
2.1 Single-node vLLM deployment with Docker
What is vLLM?
vLLM (Virtual Large Language Model) is an efficient inference and serving engine built for large language models (LLMs). Its core goal is to substantially raise inference throughput and resource utilization while lowering latency, using innovative memory management and parallel computing techniques, so that it can serve real-time applications under high concurrency. It has since grown into a community-driven project with contributors from both academia and industry.
Pull Docker image 1 (the vLLM OpenAI-compatible server):
root@cuichaoyang:~# docker pull vllm/vllm-openai:v0.7.3
Output:
v0.7.3: Pulling from vllm/vllm-openai
Digest: sha256:4f4037303e8c7b69439db1077bb849a0823517c0f785b894dc8e96d58ef3a0c2
Status: Image is up to date for vllm/vllm-openai:v0.7.3
docker.io/vllm/vllm-openai:v0.7.3
Pull Docker image 2 (Open WebUI, CUDA build):
root@cuichaoyang:~# docker pull ghcr.io/open-webui/open-webui:cuda
Output:
cuda: Pulling from open-webui/open-webui
Digest: sha256:b768633c723f01a2d566d090710a853f39c569c362b913476dfe4977addd613e
Status: Image is up to date for ghcr.io/open-webui/open-webui:cuda
ghcr.io/open-webui/open-webui:cuda
Pull Docker image 3 (GPUStack):
root@cuichaoyang:~# docker pull gpustack/gpustack
Output:
Using default tag: latest
latest: Pulling from gpustack/gpustack
Digest: sha256:8c0521de972f1f16494a87e3de696d4c06cc3d2a08716aa990656132bb9d2eb0
Status: Image is up to date for gpustack/gpustack:latest
docker.io/gpustack/gpustack:latest
List the Docker images:
root@cuichaoyang:~# docker images
Output:
REPOSITORY TAG IMAGE ID CREATED SIZE
ghcr.io/open-webui/open-webui cuda b768633c723f 47 hours ago 13.4GB
gpustack/gpustack 0.6.0rc1 292a17d3032b 4 days ago 23.7GB
vllm/vllm-openai v0.7.3 4f4037303e8c 7 weeks ago 25.4GB
gpustack/gpustack latest 8c0521de972f 2 months ago 21.3GB
Download the 1.5B reasoning model
The script 01_model_download-1.5B.sh downloads the DeepSeek-R1 1.5B distill model from Alibaba's ModelScope community:
#!/bin/bash
# Install dependencies
apt install python3-pip -y
pip3 install modelscope
# Create the model storage directory
#mkdir -p /data01/snapshot_download/deepseekr1_Qwen_32B
mkdir -p /data01/snapshot_download/deepseekr1_Qwen_1.5B
# Run the Python download code
python3 - <<EOF
from modelscope import snapshot_download
model_dir = snapshot_download(
    'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B',
    cache_dir='/data01/snapshot_download/deepseekr1_Qwen_1.5B',
    revision='master'
)
print(f"Model downloaded to: {model_dir}")
EOF
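Recent modelscope releases also ship a command-line downloader that can replace the Python snippet above. A hedged one-liner (flag names as I recall them from the modelscope CLI, so confirm with modelscope download --help; note that --local_dir writes straight into the target directory rather than producing the cache_dir layout used above):

modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --local_dir /data01/snapshot_download/deepseekr1_Qwen_1.5B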
List the downloaded 1.5B model directory:
root@cuichaoyang:~# ls /data01/snapshot_download/deepseekr1_Qwen_1.5B/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/ -lh
Output:
total 3.4G
-rw-r--r-- 1 root root 679 Apr 16 10:29 config.json
-rw-r--r-- 1 root root 73 Apr 16 10:29 configuration.json
drwxr-xr-x 2 root root 4.0K Apr 16 10:29 figures
-rw-r--r-- 1 root root 181 Apr 16 10:29 generation_config.json
-rw-r--r-- 1 root root 1.1K Apr 16 10:29 LICENSE
-rw-r--r-- 1 root root 3.4G Apr 16 10:32 model.safetensors
-rw-r--r-- 1 root root 16K Apr 16 10:29 README.md
-rw-r--r-- 1 root root 3.0K Apr 16 10:29 tokenizer_config.json
-rw-r--r-- 1 root root 6.8M Apr 16 10:29 tokenizer.json
Install the NVIDIA driver
Check the status of the NVIDIA GPU. vLLM requires CUDA 12.1 or later; the installed version is shown by nvidia-smi. If the version requirement is already met, this step can be skipped.
root@cuichaoyang:~# nvidia-smi
Output:
Wed Apr 16 16:12:49 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:07.0 Off | 0 |
| N/A 72C P0 34W / 70W | 12571MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1236 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 15817 C /usr/bin/python3 12550MiB |
+-----------------------------------------------------------------------------------------+
If no suitable driver is installed, download the appropriate driver from the NVIDIA website and install it.
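The docker run step below also relies on --runtime nvidia and --gpus all, which require the NVIDIA Container Toolkit on the host. A minimal install sketch, assuming the NVIDIA apt repository is already configured (otherwise follow NVIDIA's installation guide first):

apt-get update && apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker   # registers the nvidia runtime in /etc/docker/daemon.json
systemctl restart docker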
Start the container with Docker
To match the port-mapping table, publish 12412:8000 instead of the default 8000:8000:
root@cuichaoyang:~# docker run -d --runtime nvidia --gpus all \
-v /data01/snapshot_download/deepseekr1_Qwen_1.5B/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B:/home/llm_deploy \
--env "HF_HUB_OFFLINE=1" \
-p 12412:8000 \
--ipc=host \
vllm/vllm-openai:v0.7.3 \
--model /home/llm_deploy \
--served_model_name DeepSeek-R1-1.5B \
--max_model_len 2048 \
--tensor-parallel-size 1 \
--dtype=half \
--gpu-memory-utilization=0.9 \
--api_key OPENWEBUI123
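Before sending requests, confirm that the server came up. A quick sanity check (container ID from docker ps; the key is the one set via --api_key above):

docker logs -f <container_id>   # wait for the API server's startup message
curl http://localhost:12412/v1/models -H "Authorization: Bearer OPENWEBUI123"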
Check the local IP address with ifconfig:
root@cuichaoyang:~# ifconfig
Output:
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
inet6 fe80::389a:d5ff:fe36:188e prefixlen 64 scopeid 0x20<link>
ether 3a:9a:d5:36:18:8e txqueuelen 0 (Ethernet)
RX packets 34 bytes 3299 (3.2 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 40 bytes 10068 (10.0 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.3.229 netmask 255.255.255.0 broadcast 192.168.3.255
inet6 fe80::f816:3eff:fe8b:25b9 prefixlen 64 scopeid 0x20<link>
ether fa:16:3e:8b:25:b9 txqueuelen 1000 (Ethernet)
RX packets 3313814 bytes 11559062108 (11.5 GB)
RX errors 0 dropped 2 overruns 0 frame 0
TX packets 2134191 bytes 136495740 (136.4 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 4538 bytes 538920 (538.9 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 4538 bytes 538920 (538.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth9e2d49d: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::8c:26ff:fe5e:6325 prefixlen 64 scopeid 0x20<link>
ether 02:8c:26:5e:63:25 txqueuelen 0 (Ethernet)
RX packets 19 bytes 3145 (3.1 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 38 bytes 8764 (8.7 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vethb0eb23a: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::b0ab:b9ff:fe2c:30f prefixlen 64 scopeid 0x20<link>
ether b2:ab:b9:2c:03:0f txqueuelen 0 (Ethernet)
RX packets 3 bytes 126 (126.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 13 bytes 1777 (1.7 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Ask a question (example request; this request targets VM port 8000 directly, and with the 12412:8000 mapping above the same request can also be sent to port 12412):
root@cuichaoyang:~# curl http://192.168.3.229:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer OPENWEBUI123" \
-d '{
"model": "DeepSeek-R1-1.5B",
"messages": [
{"role": "system", "content": "世界上最高的山峰是哪座山?"},
{"role": "user", "content": "你好"}
]
}'
Output:
{"id":"chatcmpl-0b7f8076e98d4178a8309a9641049e04","object":"chat.completion","created":1744788362,"model":"DeepSeek-R1-1.5B","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"好,现在我要回答用户的问题:“世界上最高的山峰是哪座山?”首先,我想确认一下用户是在寻找这个问题的答案,可能他们想知道答案的来源,或者对火星的名称产生兴趣。\n\n接下来,我需要思考这个问题在历史上和科学上的相关性。我认为这个问题涉及到多个方面,包括已知山峰的高度,以及科学史上的故事。我应该先整理一下已知的最高的山峰,比如珠穆朗玛峰,然后我记得(lineshape尔峰的高度了。但为了准确回答,最好在正式上下文中补充相关信息,但这样可能会让用户有点困惑。\n\n我还应该考虑用户可能关心的是这座山是否有特别的意义,比如向地球、人类或者其他生命形式对的一种认同。这也是一个值得探讨的话题,可以增加回答的深度和趣味性。\n\n最后,我应该保持语言简洁明了,避免使用过于专业或晦涩的词汇,让用户容易理解。同时,考虑到用户可能没有直接回答问题的意图,我需要确保回答不仅满足他们的需求,还能为他们提供更多有用的信息,比如更多关于珠穆朗玛峰等的背景知识。\n\n总结一下,我的回答应该是:\n世界上最高的山峰是珠穆朗玛峰,高度超过8848米,秋季达到最高峰,年均气温36.2摄氏度。它以其独特的山顶king角色而闻名,被用来向地球、人类与其他生命形式呈现一种认同感。此外, Lineshape尔峰也是世界上的山峰,高度超过8811米,位于引发这个词意义的地方。\n</think>\n\n世界上最高的山峰之一是珠穆朗玛峰,其海拔为8848米,是山Tán的 Namesake之一。 blenders of geothermal fragile and engineering campaigns have made supporter the revered geothermal wafer. spectral properties and members ofING качество.德意志联合档案机构保留了老 (^(speed这个部分似乎有些疑问。来自_lineshape尔峰。.md attribute: Lineshape尔峰. 此山也是世界上的山峰,当我们将它视为这一概念时,这反映了它对为何能成为某个误会或误解的组成部分。 Dallas & lifelong Andrea T. Crow. 🐔讽刺了Lineshape尔峰与其他山峰生长之间的关系。图片体现Lineshape尔峰。青灰色 ,来自_lineshape尔峰。的click以外的颜色的 Purchase Conflict Dogxt;.graphs主题图片 file sailing logo elifaces it无直接的使用可能要展示意图。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":15,"total_tokens":526,"completion_tokens":511,"prompt_tokens_details":null}
List the Docker containers:
root@cuichaoyang:~# docker ps -a
Output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
734a3ff89d8c vllm/vllm-openai:v0.7.3 "python3 -m vllm.ent…" 39 minutes ago Exited (1) 39 minutes ago clever_hermann
5bc72edf47f9 vllm/vllm-openai:v0.7.3 "python3 -m vllm.ent…" 43 minutes ago Up 43 minutes 0.0.0.0:8000->8000/tcp, [::]:8000->8000/tcp dreamy_pascal
01eb5b671c82 vllm/vllm-openai:v0.7.3 "python3 -m vllm.ent…" 5 hours ago Created exciting_cannon
453430ead584 vllm/vllm-openai:v0.7.3 "python3 -m vllm.ent…" 5 hours ago Exited (1) 5 hours ago stoic_golick
8afccd73a918 vllm/vllm-openai:v0.7.3 "python3 -m vllm.ent…" 5 hours ago Created keen_noether
5df6ef72b29a vllm/vllm-openai:v0.7.3 "python3 -m vllm.ent…" 5 hours ago Exited (1) 5 hours ago interesting_knuth
5148204b5c14 vllm/vllm-openai:v0.7.3 "python3 -m vllm.ent…" 5 hours ago Created practical_lehmann
e9a538621323 vllm/vllm-openai:v0.7.3 "python3 -m vllm.ent…" 5 hours ago Exited (1) 5 hours ago mystifying_tharp
592e2e324e11 vllm/vllm-openai:v0.7.3 "python3 -m vllm.ent…" 3 days ago Exited (1) 3 days ago strange_bhabha
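The listing shows several Exited and Created leftovers from earlier attempts. They can be cleared in one step; docker container prune deletes only non-running containers, so the serving container keeps running:

docker container prune -f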
2.2 vLLM Cluster Deployment
TBD
3. GPUStack Local Deployment
What is GPUStack?
GPUStack turns heterogeneous GPUs into a single, centrally managed compute cluster: whether the target GPUs run on Apple Macs, Windows PCs, or Linux servers, GPUStack can bring them all under unified management as one compute pool. A GPUStack administrator can easily deploy any LLM from popular model repositories such as Hugging Face and ModelScope, and developers can then call the deployed private LLMs through an OpenAI-compatible API, just as easily as they would call public LLM services from providers such as OpenAI or Microsoft Azure.
Deployment steps: TBD
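Although the deployment steps are still TBD, the gpustack/gpustack image pulled earlier can be started on a single node. A hedged sketch of the quickstart as I recall it from GPUStack's documentation (the flags and the /var/lib/gpustack data path should be verified against the official docs before use):

docker run -d --name gpustack \
  --restart=unless-stopped \
  --gpus all \
  --network=host \
  --ipc=host \
  -v gpustack-data:/var/lib/gpustack \
  gpustack/gpustack
# The initial admin password is written inside the container:
docker exec gpustack cat /var/lib/gpustack/initial_admin_password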
3.1 GPUStack cluster mode
TBD