CrazyAirhead

The crazy fool, and the fool gone crazy: only a fool perseveres, only the crazy stay focused!


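Developing Ragflow on Mac

Overview

In the article "Compiling Ragflow on Mac" I only briefly mentioned that you can enter the container through Remote Development, and that you can develop by modifying the Dockerfile and adding a volume mapping. Why map a volume? It keeps ragflow's original directory layout and keeps the code under git, and because the files live in the volume, code changes are not lost when the container is destroyed.

Today I walked through the whole development process end to end and rewrote the development part of the documentation. For completeness, I kept part of the Remote Development description. Note: this assumes you have already built ragflow's Dockerfile on your Mac and that it runs normally.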

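Development

Remote Development

Install the Remote Development extension pack in VS Code. It bundles several extensions; one of them, Dev Containers, can attach to a Docker container and use it as the development environment, so the development environment matches the deployment environment.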

img

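Once Remote Development is installed, the existing containers show up in Remote Explorer. Click the arrow next to the target container (the part marked with the red box in the figure below) to attach to it.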

img

Once inside the container, choose the folder to open, /ragflow, and you can see the actual code.

img

Modify the Dockerfile

The main changes are to comment out the build steps and the production stage, and to add a volume for /ragflow.

# base stage
FROM docker.m.daocloud.io/ubuntu:22.04 AS base
USER root
SHELL ["/bin/bash", "-c"]

ARG NEED_MIRROR=0
ARG LIGHTEN=0
ENV LIGHTEN=${LIGHTEN}

WORKDIR /ragflow

# Copy models downloaded via download_deps.py
RUN mkdir -p /ragflow/rag/res/deepdoc /root/.ragflow
RUN --mount=type=bind,from=infiniflow/ragflow_deps:latest,source=/huggingface.co,target=/huggingface.co \
cp /huggingface.co/InfiniFlow/huqie/huqie.txt.trie /ragflow/rag/res/ && \
tar --exclude='.*' -cf - \
/huggingface.co/InfiniFlow/text_concat_xgb_v1.0 \
/huggingface.co/InfiniFlow/deepdoc \
| tar -xf - --strip-components=3 -C /ragflow/rag/res/deepdoc
RUN --mount=type=bind,from=infiniflow/ragflow_deps:latest,source=/huggingface.co,target=/huggingface.co \
if [ "$LIGHTEN" != "1" ]; then \
(tar -cf - \
/huggingface.co/BAAI/bge-large-zh-v1.5 \
/huggingface.co/BAAI/bge-reranker-v2-m3 \
/huggingface.co/maidalun1020/bce-embedding-base_v1 \
/huggingface.co/maidalun1020/bce-reranker-base_v1 \
| tar -xf - --strip-components=2 -C /root/.ragflow) \
fi

# https://github.com/chrismattmann/tika-python
# This is the only way to run python-tika without internet access. Without this set, the default is to check the tika version and pull latest every time from Apache.
RUN --mount=type=bind,from=infiniflow/ragflow_deps:latest,source=/,target=/deps \
cp -r /deps/nltk_data /root/ && \
cp /deps/tika-server-standard-3.0.0.jar /deps/tika-server-standard-3.0.0.jar.md5 /ragflow/ && \
cp /deps/cl100k_base.tiktoken /ragflow/9b5ad71b2ce5302211f9c61530b329a4922fc6a4

ENV TIKA_SERVER_JAR="file:///ragflow/tika-server-standard-3.0.0.jar"
ENV DEBIAN_FRONTEND=noninteractive

# Setup apt
# Python package and implicit dependencies:
# opencv-python: libglib2.0-0 libglx-mesa0 libgl1
# aspose-slides: pkg-config libicu-dev libgdiplus libssl1.1_1.1.1f-1ubuntu2_amd64.deb
# python-pptx: default-jdk tika-server-standard-3.0.0.jar
# selenium: libatk-bridge2.0-0 chrome-linux64-121-0-6167-85
# Building C extensions: libpython3-dev libgtk-4-1 libnss3 xdg-utils libgbm-dev
RUN --mount=type=cache,id=ragflow_apt,target=/var/cache/apt,sharing=locked \
if [ "$NEED_MIRROR" == "1" ]; then \
sed -i 's|http://archive.ubuntu.com|https://mirrors.tuna.tsinghua.edu.cn|g' /etc/apt/sources.list; \
fi; \
rm -f /etc/apt/apt.conf.d/docker-clean && \
echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' > /etc/apt/apt.conf.d/keep-cache && \
chmod 1777 /tmp && \
apt update && \
apt --no-install-recommends install -y ca-certificates && \
apt update && \
apt install -y libglib2.0-0 libglx-mesa0 libgl1 && \
apt install -y pkg-config libicu-dev libgdiplus && \
apt install -y default-jdk && \
apt install -y libatk-bridge2.0-0 && \
apt install -y libpython3-dev libgtk-4-1 libnss3 xdg-utils libgbm-dev && \
apt install -y libjemalloc-dev && \
apt install -y python3-pip pipx nginx unzip curl wget git vim less

RUN if [ "$NEED_MIRROR" == "1" ]; then \
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple && \
pip3 config set global.trusted-host mirrors.aliyun.com; \
mkdir -p /etc/uv && \
echo "[[index]]" > /etc/uv/uv.toml && \
echo 'url = "https://mirrors.aliyun.com/pypi/simple"' >> /etc/uv/uv.toml && \
echo "default = true" >> /etc/uv/uv.toml; \
fi; \
pipx install uv

ENV PYTHONDONTWRITEBYTECODE=1 DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=1
ENV PATH=/root/.local/bin:$PATH

# nodejs 12.22 on Ubuntu 22.04 is too old
RUN --mount=type=cache,id=ragflow_apt,target=/var/cache/apt,sharing=locked \
curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
apt purge -y nodejs npm cargo && \
apt autoremove -y && \
apt update && \
apt install -y nodejs

# A modern version of cargo is needed for the latest version of the Rust compiler.
RUN apt update && apt install -y curl build-essential \
&& if [ "$NEED_MIRROR" == "1" ]; then \
# Use TUNA mirrors for rustup/rust dist files
export RUSTUP_DIST_SERVER="https://mirrors.tuna.tsinghua.edu.cn/rustup"; \
export RUSTUP_UPDATE_ROOT="https://mirrors.tuna.tsinghua.edu.cn/rustup/rustup"; \
echo "Using TUNA mirrors for Rustup."; \
fi; \
# Force curl to use HTTP/1.1
curl --proto '=https' --tlsv1.2 --http1.1 -sSf https://sh.rustup.rs | bash -s -- -y --profile minimal \
&& echo 'export PATH="/root/.cargo/bin:${PATH}"' >> /root/.bashrc

ENV PATH="/root/.cargo/bin:${PATH}"

RUN cargo --version && rustc --version

# Add msssql ODBC driver
# macOS ARM64 environment, install msodbcsql18.
# general x86_64 environment, install msodbcsql17.
RUN --mount=type=cache,id=ragflow_apt,target=/var/cache/apt,sharing=locked \
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - && \
curl https://packages.microsoft.com/config/ubuntu/22.04/prod.list > /etc/apt/sources.list.d/mssql-release.list && \
apt update && \
arch="$(uname -m)"; \
if [ "$arch" = "arm64" ] || [ "$arch" = "aarch64" ]; then \
# ARM64 (macOS/Apple Silicon or Linux aarch64)
ACCEPT_EULA=Y apt install -y unixodbc-dev msodbcsql18; \
else \
# x86_64 or others
ACCEPT_EULA=Y apt install -y unixodbc-dev msodbcsql17; \
fi || \
{ echo "Failed to install ODBC driver"; exit 1; }

# Add dependencies of selenium
RUN --mount=type=bind,from=infiniflow/ragflow_deps:latest,source=/chrome-linux64-121-0-6167-85,target=/chrome-linux64.zip \
unzip /chrome-linux64.zip && \
mv chrome-linux64 /opt/chrome && \
ln -s /opt/chrome/chrome /usr/local/bin/
RUN --mount=type=bind,from=infiniflow/ragflow_deps:latest,source=/chromedriver-linux64-121-0-6167-85,target=/chromedriver-linux64.zip \
unzip -j /chromedriver-linux64.zip chromedriver-linux64/chromedriver && \
mv chromedriver /usr/local/bin/ && \
rm -f /usr/bin/google-chrome

# https://forum.aspose.com/t/aspose-slides-for-net-no-usable-version-of-libssl-found-with-linux-server/271344/13
# aspose-slides on linux/arm64 is unavailable
RUN --mount=type=bind,from=infiniflow/ragflow_deps:latest,source=/,target=/deps \
if [ "$(uname -m)" = "x86_64" ]; then \
dpkg -i /deps/libssl1.1_1.1.1f-1ubuntu2_amd64.deb; \
elif [ "$(uname -m)" = "aarch64" ]; then \
dpkg -i /deps/libssl1.1_1.1.1f-1ubuntu2_arm64.deb; \
fi

# builder stage
FROM base AS builder
USER root

WORKDIR /ragflow

# install dependencies from uv.lock file
# COPY pyproject.toml uv.lock ./

# https://github.com/astral-sh/uv/issues/10462
# uv records index url into uv.lock but doesn't failover among multiple indexes
# RUN --mount=type=cache,id=ragflow_uv,target=/root/.cache/uv,sharing=locked \
# if [ "$NEED_MIRROR" == "1" ]; then \
# sed -i 's|pypi.org|mirrors.aliyun.com/pypi|g' uv.lock; \
# else \
# sed -i 's|mirrors.aliyun.com/pypi|pypi.org|g' uv.lock; \
# fi; \
# if [ "$LIGHTEN" == "1" ]; then \
# uv sync --python 3.10 --frozen; \
# else \
# uv sync --python 3.10 --frozen --all-extras; \
# fi

# COPY web web
# COPY docs docs
# RUN --mount=type=cache,id=ragflow_npm,target=/root/.npm,sharing=locked \
# cd web && npm install && npm run build

COPY .git /ragflow/.git

RUN version_info=$(git describe --tags --match=v* --first-parent --always); \
if [ "$LIGHTEN" == "1" ]; then \
version_info="$version_info slim"; \
else \
version_info="$version_info full"; \
fi; \
echo "RAGFlow version: $version_info"; \
echo $version_info > /ragflow/VERSION

# production stage
# FROM base AS production
# USER root

# WORKDIR /ragflow

# Copy Python environment and packages
ENV VIRTUAL_ENV=/ragflow/.venv
# COPY --from=builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}
ENV PATH="${VIRTUAL_ENV}/bin:${PATH}"

ENV PYTHONPATH=/ragflow/

# COPY web web
# COPY api api
# COPY conf conf
# COPY deepdoc deepdoc
# COPY rag rag
# COPY agent agent
# COPY graphrag graphrag
# COPY agentic_reasoning agentic_reasoning
# COPY pyproject.toml uv.lock ./

# COPY docker/service_conf.yaml.template ./conf/service_conf.yaml.template
# COPY docker/entrypoint.sh docker/entrypoint-parser.sh ./
# RUN chmod +x ./entrypoint*.sh

# Copy compiled web pages
# COPY --from=builder /ragflow/web/dist /ragflow/web/dist

# COPY --from=builder /ragflow/VERSION /ragflow/VERSION

VOLUME /ragflow

ENTRYPOINT ["./entrypoint.sh"]

Rebuild the image

docker build --build-arg LIGHTEN=1 --build-arg NEED_MIRROR=1 -f Dockerfile -t infiniflow/ragflow:nightly-slim .
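
The two build arguments map to the ARG lines at the top of the Dockerfile: in the modified file, LIGHTEN=1 skips copying the BAAI and maidalun1020 models into /root/.ragflow and marks the version as "slim", while NEED_MIRROR=1 switches apt, pip, uv and rustup to the mirrors configured there. As a rough sketch, a dry-run helper can make the combinations explicit. The name print_build_cmd is hypothetical, and the nightly tag for the non-slim build is an assumption, not something taken from this article.

```shell
# print_build_cmd is a hypothetical dry-run helper: it only prints the command,
# it does not run docker.
print_build_cmd() {
    lighten="${1:-1}"       # 1 = slim image (embedding/reranker models not copied)
    need_mirror="${2:-1}"   # 1 = use the mirrors configured in the Dockerfile
    # Assumption: the full image is tagged "nightly"; the article only uses "nightly-slim".
    if [ "$lighten" = "1" ]; then
        tag="nightly-slim"
    else
        tag="nightly"
    fi
    echo "docker build --build-arg LIGHTEN=$lighten --build-arg NEED_MIRROR=$need_mirror -f Dockerfile -t infiniflow/ragflow:$tag ."
}
```

Calling print_build_cmd 1 1 prints the same command as above, which can then be copied into the terminal from the repository root.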

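Modify docker-compose-macos.yml

Here the local ragflow checkout is mapped onto the container's /ragflow volume (the ../:/ragflow entry under volumes). The benefit is that the code stays under git.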

include:
  - ./docker-compose-base.yml

services:
  ragflow:
    depends_on:
      mysql:
        condition: service_healthy
    build:
      context: ../
      dockerfile: Dockerfile
    container_name: ragflow-server
    ports:
      - ${SVR_HTTP_PORT}:9380
      - 80:80
      - 443:443
    volumes:
      - ./ragflow-logs:/ragflow/logs
      - ./nginx/ragflow.conf:/etc/nginx/conf.d/ragflow.conf
      - ./nginx/proxy.conf:/etc/nginx/proxy.conf
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
      - ../:/ragflow
    env_file: .env
    environment:
      - TZ=${TIMEZONE}
      - HF_ENDPOINT=${HF_ENDPOINT}
      - MACOS=${MACOS:-1}
      - LIGHTEN=${LIGHTEN:-1}
    networks:
      - ragflow
    restart: on-failure
    # https://docs.docker.com/engine/daemon/prometheus/#create-a-prometheus-configuration
    # If you're using Docker Desktop, the --add-host flag is optional. This flag makes sure that the host's internal IP gets exposed to the Prometheus container.
    extra_hosts:
      - "host.docker.internal:host-gateway"

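Because the /ragflow volume mapping was added and the Dockerfile no longer copies these files into the image, remember to copy docker/service_conf.yaml.template into ragflow/conf, and docker/entrypoint.sh and docker/entrypoint-parser.sh into the ragflow directory; otherwise the program cannot start.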
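
As a minimal sketch, the copy step could be scripted as follows. The function name copy_dev_files is hypothetical; the paths are the ones named above, and the chmod mirrors the chmod +x ./entrypoint*.sh line that is commented out in the Dockerfile.

```shell
# copy_dev_files is a hypothetical helper; $1 is the root of the ragflow checkout.
copy_dev_files() {
    repo_root="${1:-.}"
    # entrypoint.sh renders conf/service_conf.yaml from this template at start-up.
    cp "$repo_root/docker/service_conf.yaml.template" "$repo_root/conf/" || return 1
    # The entrypoint scripts must sit in the directory that is mounted at /ragflow.
    cp "$repo_root/docker/entrypoint.sh" "$repo_root/docker/entrypoint-parser.sh" "$repo_root/" || return 1
    # Same effect as the commented-out `chmod +x ./entrypoint*.sh` in the Dockerfile.
    chmod +x "$repo_root"/entrypoint*.sh
}
```

Run it from the repository root, for example copy_dev_files ., and then start the containers: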

cd docker
docker compose -f docker-compose-macos.yml up -d

Modify entrypoint.sh

Add the --debug flag to the ragflow_server.py invocation to enable debug output.

#!/bin/bash

# replace env variables in the service_conf.yaml file
rm -rf /ragflow/conf/service_conf.yaml
while IFS= read -r line || [[ -n "$line" ]]; do
    # Use eval to interpret the variable with default values
    eval "echo \"$line\"" >> /ragflow/conf/service_conf.yaml
done < /ragflow/conf/service_conf.yaml.template

/usr/sbin/nginx

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/

PY=python3
if [[ -z "$WS" || $WS -lt 1 ]]; then
    WS=1
fi

function task_exe(){
    JEMALLOC_PATH=$(pkg-config --variable=libdir jemalloc)/libjemalloc.so
    while [ 1 -eq 1 ];do
        LD_PRELOAD=$JEMALLOC_PATH $PY rag/svr/task_executor.py $1;
    done
}

for ((i=0;i<WS;i++))
do
    task_exe $i &
done

while [ 1 -eq 1 ];do
    $PY api/ragflow_server.py --debug
done

wait;
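
The first loop in the script is the reason service_conf.yaml.template has to be present in /ragflow/conf: every line of the template is passed through eval so that ${VAR:-default} references are filled in from the environment. A small sketch of the same loop with the paths turned into parameters makes it easy to try outside the container. The function name render_template and the DEMO_* variables mentioned below are illustrative only.

```shell
# render_template is a hypothetical name; the loop body is the one from entrypoint.sh,
# with the template and output paths passed as $1 and $2.
render_template() {
    template="$1"
    output="$2"
    rm -f "$output"
    # `|| [ -n "$line" ]` keeps a last line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # eval expands ${VAR:-default} inside the line before it is written out.
        eval "echo \"$line\"" >> "$output"
    done < "$template"
}
```

For instance, a template line such as port: ${DEMO_PORT:-3306} is written out as port: 3306 when DEMO_PORT is unset, and with the variable's value otherwise. Because each line goes through eval, the template should only ever come from a trusted source.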

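Build

At this point the service is normally not reachable yet, but the container is running, so you can attach to it with Dev Containers. Once inside the container, build the frontend and reinstall ragflow's Python dependencies.

Frontend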

cd web
npm install
npm run build

Backend

Re-sync the dependencies

uv sync --python 3.10 --frozen
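
Both steps leave their results inside the mounted source tree: npm run build writes web/dist, and uv sync creates .venv, which is where the Dockerfile points VIRTUAL_ENV. A quick check along these lines can confirm both exist before restarting the container. The helper name check_build_outputs is hypothetical.

```shell
# check_build_outputs is a hypothetical helper; $1 is the source root
# (/ragflow inside the container).
check_build_outputs() {
    root="${1:-/ragflow}"
    missing=0
    # web/dist comes from `npm run build`, .venv from `uv sync`.
    for path in web/dist .venv; do
        if [ -d "$root/$path" ]; then
            echo "ok: $path"
        else
            echo "missing: $path"
            missing=1
        fi
    done
    # Non-zero exit status if at least one directory is absent.
    return "$missing"
}
```

Inside the container, check_build_outputs /ragflow returns a non-zero status if either directory is missing.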

Verification

The directory structure matches the original layout.

img

Change some code, for example api/apps/user_app.py.

img

We can see that the modified code has been executed.

img

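Summary

Many programming problems are really problems of setting up the development environment. With Docker, development and deployment are already much simpler. But for reasons everyone knows, you still need a reliable way to reach sites that are otherwise blocked.

Setting up the ragflow environment this time, I found that a small disk is a real problem; keep track of which models and images are no longer needed and delete them promptly.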
