Commit 71fe886

Replaced TGI with vLLM for guardrail serving (#1815)
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
1 parent 1a6f821 commit 71fe886

3 files changed

Lines changed: 36 additions & 22 deletions

ChatQnA/docker_compose/intel/hpu/gaudi/README.md

Lines changed: 8 additions & 4 deletions
@@ -214,13 +214,13 @@ This setup might allow for more Gaudi devices to be dedicated to the `vllm-servi
 
 ### compose_guardrails.yaml - Guardrails Deployment
 
-The _compose_guardrails.yaml_ Docker Compose file introduces enhancements over the default deployment by incorporating additional services focused on safety and ChatQnA response control. Notably, it includes the `tgi-guardrails-service` and `guardrails` services. The `tgi-guardrails-service` uses the `ghcr.io/huggingface/tgi-gaudi:2.3.1` image and is configured to run on Gaudi hardware, providing functionality to manage input constraints and ensure safe operations within defined limits. The guardrails service, using the `opea/guardrails:latest` image, acts as a safety layer that interfaces with the `tgi-guardrails-service` to enforce safety protocols and manage interactions with the large language model (LLM). This backend server now depends on the `tgi-guardrails-service` and `guardrails`, alongside existing dependencies like `redis-vector-db`, `tei-embedding-service`, `retriever`, `tei-reranking-service`, and `vllm-service`. The environment configurations for the backend are also updated to include settings for the guardrail services.
+The _compose_guardrails.yaml_ Docker Compose file introduces enhancements over the default deployment by incorporating additional services focused on safety and ChatQnA response control. Notably, it includes the `vllm-guardrails-service` and `guardrails` services. The `vllm-guardrails-service` uses the `opea/vllm-gaudi:latest` image and is configured to run on Gaudi hardware, providing functionality to manage input constraints and ensure safe operations within defined limits. The guardrails service, using the `opea/guardrails:latest` image, acts as a safety layer that interfaces with the `vllm-guardrails-service` to enforce safety protocols and manage interactions with the large language model (LLM). This backend server now depends on the `vllm-guardrails-service` and `guardrails`, alongside existing dependencies like `redis-vector-db`, `tei-embedding-service`, `retriever`, `tei-reranking-service`, and `vllm-service`. The environment configurations for the backend are also updated to include settings for the guardrail services.
 
 | Service Name | Image Name | Gaudi Specific | Uses LLM |
 | ---------------------------- | ----------------------------------------------------- | -------------- | -------- |
 | redis-vector-db | redis/redis-stack:7.2.0-v9 | No | No |
 | dataprep-redis-service | opea/dataprep:latest | No | No |
-| _tgi-guardrails-service_ | ghcr.io/huggingface/tgi-gaudi:2.3.1 | 1 card | Yes |
+| _vllm-guardrails-service_ | opea/vllm-gaudi:latest | 1 card | Yes |
 | _guardrails_ | opea/guardrails:latest | No | No |
 | tei-embedding-service | ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 | No | No |
 | retriever | opea/retriever:latest | No | No |
@@ -230,7 +230,7 @@ The _compose_guardrails.yaml_ Docker Compose file introduces enhancements over t
 | chatqna-gaudi-ui-server | opea/chatqna-ui:latest | No | No |
 | chatqna-gaudi-nginx-server | opea/nginx:latest | No | No |
 
-The deployment with guardrails introduces additional Gaudi-specific services, such as the `tgi-guardrails-service`, which necessitates careful consideration of Gaudi allocation. This deployment aims to balance safety and performance, potentially requiring a strategic distribution of Gaudi devices between the guardrail services and the LLM tasks to maintain both operational safety and efficiency.
+The deployment with guardrails introduces additional Gaudi-specific services, such as the `vllm-guardrails-service`, which necessitates careful consideration of Gaudi allocation. This deployment aims to balance safety and performance, potentially requiring a strategic distribution of Gaudi devices between the guardrail services and the LLM tasks to maintain both operational safety and efficiency.
 
 ### Telemetry Enablement - compose.telemetry.yaml and compose_tgi.telemetry.yaml
 
@@ -290,9 +290,13 @@ The `ghcr.io/huggingface/text-embeddings-inference:cpu-1.6` image supporting `te
 
 The `tgi-guardrails-service` uses the `GUARDRAILS_MODEL_ID` parameter to select a [supported model](https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#tested-models-and-configurations) for the associated `ghcr.io/huggingface/tgi-gaudi:2.3.1` image. Like the `tei-embedding-service` and `tei-reranking-service` services, it doesn't use the `NUM_CARDS` parameter.
 
+### vllm-guardrails-service
+
+The `vllm-guardrails-service` uses the `GUARDRAILS_MODEL_ID` parameter to select a [supported model](https://docs.vllm.ai/en/latest/models/supported_models.html) for the associated `opea/vllm-gaudi:latest` image. It uses the `NUM_CARDS` parameter.
+
 ## Conclusion
 
-In examining the various services and configurations across different deployments, developers should gain a comprehensive understanding of how each component contributes to the overall functionality and performance of a ChatQnA pipeline on an Intel® Gaudi® platform. Key services such as the `vllm-service`, `tei-embedding-service`, `tei-reranking-service`, and `tgi-guardrails-service` each consume Gaudi accelerators, leveraging specific models and hardware resources to optimize their respective tasks. The `LLM_MODEL_ID`, `EMBEDDING_MODEL_ID`, `RERANK_MODEL_ID`, and `GUARDRAILS_MODEL_ID` parameters specify the models used, directly impacting the quality and effectiveness of language processing, embedding, reranking, and safety operations.
+In examining the various services and configurations across different deployments, developers should gain a comprehensive understanding of how each component contributes to the overall functionality and performance of a ChatQnA pipeline on an Intel® Gaudi® platform. Key services such as the `vllm-service`, `tei-embedding-service`, `tei-reranking-service`, `tgi-guardrails-service`, and `vllm-guardrails-service` each consume Gaudi accelerators, leveraging specific models and hardware resources to optimize their respective tasks. The `LLM_MODEL_ID`, `EMBEDDING_MODEL_ID`, `RERANK_MODEL_ID`, and `GUARDRAILS_MODEL_ID` parameters specify the models used, directly impacting the quality and effectiveness of language processing, embedding, reranking, and safety operations.
 
 The allocation of Gaudi devices, affected by the Gaudi dependent services and the `NUM_CARDS` parameter supporting the `vllm-service` or `tgi-service`, determines where computational power is utilized to enhance performance.
 

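The README text in this diff says that `vllm-guardrails-service` reads both a guardrails model ID and `NUM_CARDS`. As a rough sketch of how those values might be supplied before starting the stack, the helper below exports the variables that this commit's compose file and test script reference. The function name `set_guardrails_env` is hypothetical and not part of the commit. The default values are the ones exported in the commit's test script, and the `hostname -I` fallback for `host_ip` is an assumption. The compose file spells the variable `GURADRAILS_MODEL_ID`, so the sketch uses that spelling rather than the README's `GUARDRAILS_MODEL_ID`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): export the variables that
# compose_guardrails.yaml interpolates for vllm-guardrails-service.
set_guardrails_env() {
    # Value exported by the commit's test script; the compose file passes it
    # to --tensor-parallel-size for the vLLM services.
    export NUM_CARDS="${NUM_CARDS:-1}"
    # Spelled as in the compose file (GURADRAILS, not GUARDRAILS).
    export GURADRAILS_MODEL_ID="${GURADRAILS_MODEL_ID:-meta-llama/Meta-Llama-Guard-2-8B}"
    # The new healthcheck curls http://$host_ip:8088/health, so host_ip must be set.
    # Assumption: fall back to the first address reported by `hostname -I` (Linux).
    export host_ip="${host_ip:-$(hostname -I | awk '{print $1}')}"
}
```

After calling `set_guardrails_env` and exporting `HUGGINGFACEHUB_API_TOKEN`, the stack would typically be started with `docker compose -f compose_guardrails.yaml up -d`. Because both `vllm-service` and `vllm-guardrails-service` pass `${NUM_CARDS}` to `--tensor-parallel-size` in this compose file, raising `NUM_CARDS` appears to raise the card count for both services at once.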
ChatQnA/docker_compose/intel/hpu/gaudi/compose_guardrails.yaml

Lines changed: 23 additions & 13 deletions
@@ -25,9 +25,9 @@ services:
       INDEX_NAME: ${INDEX_NAME}
       TEI_ENDPOINT: http://tei-embedding-service:80
       HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-  tgi-guardrails-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.3.1
-    container_name: tgi-guardrails-server
+  vllm-guardrails-service:
+    image: ${REGISTRY:-opea}/vllm-gaudi:${TAG:-latest}
+    container_name: vllm-guardrails-server
     ports:
       - "8088:80"
     volumes:
@@ -36,32 +36,37 @@ services:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
-      HUGGING_FACE_HUB_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-      HF_HUB_DISABLE_PROGRESS_BARS: 1
-      HF_HUB_ENABLE_HF_TRANSFER: 0
+      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
       HABANA_VISIBLE_DEVICES: all
       OMPI_MCA_btl_vader_single_copy_mechanism: none
-      ENABLE_HPU_GRAPH: true
-      LIMIT_HPU_GRAPH: true
-      USE_FLASH_ATTENTION: true
-      FLASH_ATTENTION_RECOMPUTE: true
+      GURADRAILS_MODEL_ID: ${GURADRAILS_MODEL_ID}
+      NUM_CARDS: ${NUM_CARDS}
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
     runtime: habana
     cap_add:
       - SYS_NICE
     ipc: host
-    command: --model-id ${GURADRAILS_MODEL_ID} --max-input-length 1024 --max-total-tokens 2048
+    command: --model ${GURADRAILS_MODEL_ID} --tensor-parallel-size ${NUM_CARDS} --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq-len-to-capture 2048
+    healthcheck:
+      test: ["CMD-SHELL", "curl -f http://$host_ip:8088/health || exit 1"]
+      interval: 10s
+      timeout: 10s
+      retries: 150
   guardrails:
     image: ${REGISTRY:-opea}/guardrails:${TAG:-latest}
     container_name: guardrails-gaudi-server
     ports:
       - "9090:9090"
     ipc: host
+    depends_on:
+      vllm-guardrails-service:
+        condition: service_healthy
     environment:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
       SAFETY_GUARD_MODEL_ID: ${GURADRAILS_MODEL_ID}
-      SAFETY_GUARD_ENDPOINT: http://tgi-guardrails-service:80
+      SAFETY_GUARD_ENDPOINT: http://vllm-guardrails-service:80
       HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
     restart: unless-stopped
   tei-embedding-service:
@@ -140,12 +145,17 @@ services:
       - SYS_NICE
     ipc: host
     command: --model ${LLM_MODEL_ID} --tensor-parallel-size ${NUM_CARDS} --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq-len-to-capture 2048
+    healthcheck:
+      test: ["CMD-SHELL", "curl -f http://$host_ip:8008/health || exit 1"]
+      interval: 10s
+      timeout: 10s
+      retries: 150
   chatqna-gaudi-backend-server:
     image: ${REGISTRY:-opea}/chatqna:${TAG:-latest}
     container_name: chatqna-gaudi-guardrails-server
     depends_on:
       - redis-vector-db
-      - tgi-guardrails-service
+      - vllm-guardrails-service
       - guardrails
       - tei-embedding-service
       - retriever

ChatQnA/tests/test_compose_guardrails_on_gaudi.sh

Lines changed: 5 additions & 5 deletions
@@ -31,7 +31,6 @@ function build_docker_images() {
     service_list="chatqna chatqna-ui dataprep retriever vllm-gaudi guardrails nginx"
     docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log
 
-    docker pull ghcr.io/huggingface/tgi-gaudi:2.3.1
     docker pull ghcr.io/huggingface/text-embeddings-inference:cpu-1.6
     docker pull ghcr.io/huggingface/tei-gaudi:1.5.0
 
@@ -46,6 +45,7 @@ function start_services() {
     export NUM_CARDS=1
     export INDEX_NAME="rag-redis"
     export HUGGINGFACEHUB_API_TOKEN=${HUGGINGFACEHUB_API_TOKEN}
+    export host_ip=${ip_address}
     export GURADRAILS_MODEL_ID="meta-llama/Meta-Llama-Guard-2-8B"
 
     # Start Docker Containers
@@ -61,12 +61,12 @@ function start_services() {
         n=$((n+1))
     done
 
-    # Make sure tgi guardrails service is ready
+    # Make sure vllm guardrails service is ready
     m=0
-    until [[ "$m" -ge 160 ]]; do
+    until [[ "$m" -ge 200 ]]; do
         echo "m=$m"
-        docker logs tgi-guardrails-server > tgi_guardrails_service_start.log
-        if grep -q Connected tgi_guardrails_service_start.log; then
+        docker logs vllm-guardrails-server > vllm_guardrails_service_start.log
+        if grep -q "Warmup finished" vllm_guardrails_service_start.log; then
             break
         fi
         sleep 5s

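The test script above waits for readiness by grepping the container logs for "Warmup finished", while the compose file in the same commit adds a `curl -f .../health` healthcheck. As an illustrative alternative, the sketch below polls an HTTP health endpoint with a bounded number of attempts. The function name `wait_for_health` and its argument order are hypothetical and not part of the commit. The defaults of 200 attempts and a 5-second sleep mirror the loop in the test script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): poll an HTTP health endpoint
# until it answers successfully or the attempt budget runs out.
# Usage: wait_for_health URL [MAX_ATTEMPTS] [SLEEP_SECONDS]
wait_for_health() {
    local url="$1" max_attempts="${2:-200}" sleep_seconds="${3:-5}"
    local attempt=0
    until [[ "$attempt" -ge "$max_attempts" ]]; do
        # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
        if curl -sf "$url" > /dev/null; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$sleep_seconds"
    done
    echo "timed out waiting for $url" >&2
    return 1
}
```

A call such as `wait_for_health "http://${host_ip}:8088/health" 200 5` would target the port the compose file publishes for `vllm-guardrails-service`. The commit does not say whether `/health` turns healthy at the same moment the "Warmup finished" line is logged, so treat the two checks as related rather than interchangeable.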