vLLM CPU Inference: Qwen2-0.5B as an Example

Hardware:

BCC instance: 2 vCPUs, 8 GB RAM

  1. Set up the inference environment

Reference links:

https://docs.vllm.ai/en/stable/getting_started/cpu-installation.html

https://github.com/Williamyzd/vllm/blob/main/Dockerfile.cpu

https://github.com/vllm-project/vllm/blob/main/requirements-cpu.txt

https://www.cnblogs.com/obullxl/p/18353447/NTopic2024081101

Prepare the image

```bash
# Clone the source
# If git is not installed, install it first: yum install git
git clone https://github.com/vllm-project/vllm.git vllm-project
cd vllm-project
# Build the image
# The commands below were tested on Docker version 26.1.4
# You can edit Dockerfile.cpu to optimize the build:
# 1. Change the base image to registry.cn-hangzhou.aliyuncs.com/reg_pub/ubuntu:22.04
# 2. Point the pip index and apt sources at domestic (China) mirrors
docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
```
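The CPU backend's optimized kernels rely on AVX-512, so before building it is worth checking whether the host CPU supports it (a quick sanity check added here, not from the original post):

```bash
# List the AVX-512 feature flags the CPU reports; if nothing prints,
# build with --build-arg VLLM_CPU_DISABLE_AVX512="true" (see the Dockerfile below)
lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u
```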
  • If the build runs slowly, consider switching to domestic (China) mirrors, e.g. with the script below:
```bash
#!/bin/bash
# Set the Tsinghua pip mirror
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Back up the original sources.list
sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak

# Write a new sources.list pointing at the Aliyun mirrors
# (sudo tee is used so the redirect works for non-root users)
sudo tee /etc/apt/sources.list > /dev/null <<EOF
# deb-src entries are included below; comment them out to speed up apt update if you don't need source packages
deb https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse

# deb https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
# deb-src https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
EOF
```
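To apply it, save the script (the name set-mirrors.sh is just an example) and run it once on the build host before building:

```bash
# Apply the mirror settings, then confirm the pip index took effect
bash set-mirrors.sh
pip config list
```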
  • Sample Dockerfile:
```dockerfile
# https://github.com/Williamyzd/vllm/blob/main/Dockerfile.cpu
# This vLLM Dockerfile is used to construct an image that can build and run vLLM on the x86 CPU platform.
# >>>>>>> Base image switched to a domestic (China) mirror
FROM registry.cn-hangzhou.aliyuncs.com/reg_pub/ubuntu:22.04 AS cpu-test-1
ENV CCACHE_DIR=/root/.cache/ccache
ENV CMAKE_CXX_COMPILER_LAUNCHER=ccache

# >>>>>>> Add domestic apt mirrors
RUN cp /etc/apt/sources.list /etc/apt/sources.list.bak \
    && echo "deb https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse" >> /etc/apt/sources.list \
    && echo "deb-src https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse" >> /etc/apt/sources.list \
    && echo "deb https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list \
    && echo "deb-src https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list \
    && echo "deb https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list \
    && echo "deb-src https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list \
    && echo "deb https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list \
    && echo "deb-src https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list

# >>>>>>> Tsinghua pip mirror added here
RUN --mount=type=cache,target=/var/cache/apt \
    apt-get update -y \
    && apt-get install -y curl ccache git wget vim numactl gcc-12 g++-12 python3 python3-pip libtcmalloc-minimal4 libnuma-dev \
    && apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 \
    && pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/performance_tuning/tuning_guide.html
# intel-openmp provides additional performance improvement vs. openmp
# tcmalloc provides better memory allocation efficiency, e.g., holding memory in caches to speed up access of commonly-used objects.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install intel-openmp

ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/usr/local/lib/libiomp5.so"

RUN echo 'ulimit -c 0' >> ~/.bashrc

RUN pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/cpu/intel_extension_for_pytorch-2.4.0%2Bgitfbaa4bc-cp310-cp310-linux_x86_64.whl

ENV PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cpu
RUN --mount=type=cache,target=/root/.cache/pip \
    --mount=type=bind,src=requirements-build.txt,target=requirements-build.txt \
    pip install --upgrade pip && \
    pip install -r requirements-build.txt

# install oneDNN
RUN git clone -b rls-v3.5 https://github.com/oneapi-src/oneDNN.git

# (trailing spaces after the backslashes removed; they break line continuation)
RUN --mount=type=cache,target=/root/.cache/ccache \
    cmake -B ./oneDNN/build -S ./oneDNN -G Ninja -DONEDNN_LIBRARY_TYPE=STATIC \
    -DONEDNN_BUILD_DOC=OFF \
    -DONEDNN_BUILD_EXAMPLES=OFF \
    -DONEDNN_BUILD_TESTS=OFF \
    -DONEDNN_BUILD_GRAPH=OFF \
    -DONEDNN_ENABLE_WORKLOAD=INFERENCE \
    -DONEDNN_ENABLE_PRIMITIVE=MATMUL && \
    cmake --build ./oneDNN/build --target install --config Release

FROM cpu-test-1 AS build

WORKDIR /workspace/vllm

RUN --mount=type=cache,target=/root/.cache/pip \
    --mount=type=bind,src=requirements-common.txt,target=requirements-common.txt \
    --mount=type=bind,src=requirements-cpu.txt,target=requirements-cpu.txt \
    pip install -v -r requirements-cpu.txt

COPY ./ ./

# Support for building with non-AVX512 vLLM: docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" ...
ARG VLLM_CPU_DISABLE_AVX512
ENV VLLM_CPU_DISABLE_AVX512=${VLLM_CPU_DISABLE_AVX512}

RUN --mount=type=cache,target=/root/.cache/pip \
    --mount=type=cache,target=/root/.cache/ccache \
    VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel && \
    pip install dist/*.whl

WORKDIR /workspace/

RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
```
  2. Download the model files
  • Method 1: use git
```bash
# Download with git. The files are large, so use git-lfs for resumable downloads.
# Installation reference: https://github.com/git-lfs/git-lfs/blob/main/INSTALLING.md
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash

yum install git-lfs
mkdir -p ~/ModelSpace && cd ~/ModelSpace
git lfs install
git clone https://www.modelscope.cn/qwen/qwen2-0.5b.git Qwen2-0.5B
```
  • Method 2: use the modelscope CLI directly
```bash
pip install modelscope
modelscope download --model qwen/qwen2-0.5b --local_dir ~/ModelSpace/
# Run modelscope -h for usage details
# https://www.modelscope.cn/docs/%E6%A8%A1%E5%9E%8B%E7%9A%84%E4%B8%8B%E8%BD%BD
```
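Whichever method you use, it's worth confirming that the actual weights (not just git-lfs pointer files) landed on disk; a quick check:

```bash
# Adjust the path to match where you downloaded the model;
# the weight files should be several hundred MB for Qwen2-0.5B
ls -lh ~/ModelSpace/Qwen2-0.5B
```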
  3. Run a test

Running the image as-is errors out on startup, so write a small wrapper Dockerfile to make testing easier:

```dockerfile
# Note: the base image here is the one built in the previous step
FROM vllm-cpu-env
# Override the old entrypoint so the container idles, which makes testing easier
ENTRYPOINT tail -f /dev/null
```
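The run command below refers to an image tagged llms:v0.2, so build this wrapper Dockerfile under that tag first (a sketch assuming you saved it as Dockerfile.test; that filename is just an example):

```bash
# Build the test image under the tag used by the run command below
docker build -f Dockerfile.test -t llms:v0.2 .
```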

Run the following command to test: it publishes port 8000, mounts the downloaded model into the container, and sets shared memory to 4 GB.

```bash
docker run -d --name llms -v /root/llmvs/ModelSpace:/workspace/ModelSpace -p 8000:8000 --shm-size 4g llms:v0.2
```
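Because the entrypoint was replaced with tail -f /dev/null, the container just idles; you can exec in and start the server by hand to debug (a sketch using the mount path from the command above):

```bash
# Enter the idling test container
docker exec -it llms bash
# Inside the container: start the OpenAI-compatible API server manually
python3 -m vllm.entrypoints.openai.api_server --model /workspace/ModelSpace/Qwen2-0.5B
```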

  4. Run the server
```bash
docker run -d --name llms -v /root/llmvs/ModelSpace:/workspace/ModelSpace -p 8000:8000 --shm-size 4g vllm-cpu-env:latest --model /workspace/ModelSpace/Qwen2-0.5B
```
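Once the container is up, a quick way to confirm the server is serving is to list the loaded models via the standard OpenAI-compatible endpoint that vLLM exposes:

```bash
# Should return a JSON model list containing /workspace/ModelSpace/Qwen2-0.5B
curl http://localhost:8000/v1/models
```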

Reference: https://qwen.readthedocs.io/en/latest/deployment/vllm.html

Test the chat completions endpoint:

```bash
# Note: the container publishes port 8000, so query localhost:8000 (not 8080)
curl -i http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/workspace/ModelSpace/Qwen2-0.5B",
  "messages": [
    {"role": "system", "content": "You are an expert in the field of artificial intelligence"},
    {"role": "user", "content": "How should I go about learning to build large-model applications?"}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 300
}'
```
  5. If you'd rather skip all of the above, you can pull my prebuilt CPU image directly: registry.cn-hangzhou.aliyuncs.com/reg_pub/vllm-cpu-env:latest