Commit 1f06dca

Release v1.3 ChatQnA OOB benchmark data (#2041)
Signed-off-by: chensuyue <suyue.chen@intel.com>
1 parent b4ad636 commit 1f06dca

2 files changed

Lines changed: 165 additions & 0 deletions

ChatQnA/benchmark_results.md

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
# ChatQnA Benchmark Results

## Overview

ChatQnA is deployed on a single node, with ICX cores as the head node and 8x Gaudi2 cards attached.
The deployment is based on the OPEA v1.3 release Helm charts and images, with vLLM as the inference engine.

## Methodology

Tests scale the number of concurrent users from 1 to 256, and each user sends 4 queries. Three averages are measured for each run: end-to-end (E2E) latency per query, time to first token (TTFT), and time per output token (TPOT).
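
To make the load pattern concrete, the sketch below enumerates the test matrix this description implies: one run per concurrency level, four queries per user. The helper name `build_test_matrix` is hypothetical and not part of the OPEA tooling, and the doubling of users between runs is an assumption read off the Users column of the results table.

```python
# Illustrative only: `build_test_matrix` is a hypothetical helper, not OPEA code.
QUERIES_PER_USER = 4  # each user sends 4 queries


def build_test_matrix(max_users: int = 256) -> list[tuple[int, int]]:
    """Return (concurrent_users, total_queries) pairs, doubling users up to max_users."""
    matrix = []
    users = 1
    while users <= max_users:
        matrix.append((users, users * QUERIES_PER_USER))
        users *= 2  # assumption: concurrency doubles between runs
    return matrix


for users, queries in build_test_matrix():
    print(f"{users:>3} users -> {queries:>4} queries")
```

The resulting totals run from 4 to 1024 queries, which lines up with the `user_queries` and `concurrency` lists in the benchmark config.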

## Hardware and Software Configuration

| **Category** | **Details** |
| --------------------------------- | ----------- |
| **System Summary** | 1-node, 2x Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, 40 cores, 270W TDP, HT On, Turbo On, NUMA 2, Integrated Accelerators Available [used]: DLB 0 [0], DSA 0 [0], IAA 0 [0], QAT 0 [0], Total Memory 1024GB (32x32GB DDR4 3200 MT/s [3200 MT/s]), BIOS ETM02, microcode 0xd0003b9, 8x Habana Labs Ltd., 4x MT28800 Family [ConnectX-5 Ex], 4x 7T INTEL SSDPF2KX076TZ, 2x 894.3G SAMSUNG MZ1L2960HCJR-00A07, Ubuntu 22.04.3 LTS, 5.15.0-92-generic. Software: WORKLOAD+VERSION, COMPILER, LIBRARIES, OTHER_SW. |
| **Framework** | langchain, vLLM, habana framework |
| **Orchestration** | k8s/docker |
| **Containers and Virtualization** | Kubernetes v1.29.9 |
| **Drivers** | habana driver 1.20.1-366eb9c |
| **VM vcpu, Memory** | 160 vCPUs, 1T memory |
| **OPEA Release Version** | v1.3 |
| **Dataset** | pubmed_10.txt |
| **Embedding Model** | BAAI/bge-base-en-v1.5 |
| **Database** | redis |
| **LLM Model** | meta-llama/Llama-3.1-8B-Instruct |
| **Precision** | bf16 |
| **Output Length** | 1024 |
| **Command Line Parameters** | python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1 --test-mode oob |
| **Batch Size** | 256 |

## Benchmark Results

| Users | E2E Latency Avg (ms) | TTFT Avg (ms) | TPOT Avg (ms) |
| ----- | -------------------- | ------------- | ------------- |
| 256   | 35,034.7             | 1,042.8       | 33.1          |
| 128   | 20,996.0             | 529.8         | 19.9          |
| 64    | 16,602.1             | 404.9         | 15.8          |
| 32    | 14,646.5             | 260.1         | 14.0          |
| 16    | 13,669.3             | 193.7         | 13.1          |
| 8     | 13,275.2             | 157.3         | 12.8          |
| 4     | 13,038.8             | 127.7         | 12.5          |
| 2     | 13,059.0             | 129.4         | 12.6          |
| 1     | 12,906.5             | 126.8         | 12.5          |
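
As a rough sanity check on how the three metrics relate, the sketch below estimates E2E latency as TTFT plus one TPOT per remaining output token. This model is an assumption, not something the report states: it presumes every response produces the full 1024 output tokens and that nothing besides prefill and decode contributes to latency. The function names are hypothetical.

```python
# Reported averages copied from the results table: (users, E2E ms, TTFT ms, TPOT ms).
RESULTS = [
    (256, 35034.7, 1042.8, 33.1),
    (128, 20996.0, 529.8, 19.9),
    (64, 16602.1, 404.9, 15.8),
    (32, 14646.5, 260.1, 14.0),
    (16, 13669.3, 193.7, 13.1),
    (8, 13275.2, 157.3, 12.8),
    (4, 13038.8, 127.7, 12.5),
    (2, 13059.0, 129.4, 12.6),
    (1, 12906.5, 126.8, 12.5),
]
OUTPUT_TOKENS = 1024  # "Output Length" from the configuration table


def estimate_e2e_ms(ttft_ms: float, tpot_ms: float, output_tokens: int = OUTPUT_TOKENS) -> float:
    """Assumed model: first token arrives after TTFT, then one TPOT per remaining token."""
    return ttft_ms + (output_tokens - 1) * tpot_ms


def relative_error(reported: float, estimated: float) -> float:
    """Absolute difference between estimate and report, as a fraction of the report."""
    return abs(estimated - reported) / reported


for users, e2e, ttft, tpot in RESULTS:
    est = estimate_e2e_ms(ttft, tpot)
    print(f"{users:>3} users: reported {e2e:9.1f} ms, estimated {est:9.1f} ms, "
          f"error {relative_error(e2e, est):.2%}")
```

Under these assumptions the estimate stays within about 1% of the reported E2E average at every concurrency level, which suggests that decode time dominates E2E latency in this run.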

## Benchmark Config YAML

<details>
<summary>Click to view the benchmark config YAML</summary>

```yaml
deploy:
  device: gaudi
  version: 1.3.0
  modelUseHostPath: /home/sdp/opea_benchmark/model
  HUGGINGFACEHUB_API_TOKEN: xxx
  node: [1]
  namespace: default
  timeout: 1000 # timeout in seconds for services to be ready, default 30 minutes
  interval: 5 # interval in seconds between service ready checks, default 5 seconds

  services:
    backend:
      resources:
        enabled: False
        cores_per_instance: "16"
        memory_capacity: "8000Mi"
      replicaCount: [1, 2, 4, 8]

    teirerank:
      enabled: False
      model_id: ""
      resources:
        enabled: False
        cards_per_instance: 1
      replicaCount: [1, 1, 1, 1]

    tei:
      model_id: ""
      resources:
        enabled: False
        cores_per_instance: "80"
        memory_capacity: "20000Mi"
      replicaCount: [1, 2, 4, 8]

    llm:
      engine: vllm
      model_id: "meta-llama/Llama-3.1-8B-Instruct" # mandatory
      replicaCount:
        with_teirerank: [7, 15, 31, 63] # When teirerank.enabled is True
        without_teirerank: [8, 16, 32, 64] # When teirerank.enabled is False
      resources:
        enabled: False
        cards_per_instance: 1
      model_params:
        vllm: # VLLM specific parameters
          batch_params:
            enabled: True
            max_num_seqs: [256]
          token_params:
            enabled: False
            max_input_length: ""
            max_total_tokens: ""
            max_batch_total_tokens: ""
            max_batch_prefill_tokens: ""
        tgi: # TGI specific parameters
          batch_params:
            enabled: True
            max_batch_size: [1, 2, 4, 8] # Each value triggers an LLM service upgrade
          token_params:
            enabled: False
            max_input_length: "1280"
            max_total_tokens: "2048"
            max_batch_total_tokens: "65536"
            max_batch_prefill_tokens: "4096"

    data-prep:
      resources:
        enabled: False
        cores_per_instance: ""
        memory_capacity: ""
      replicaCount: [1, 1, 1, 1]

    retriever-usvc:
      resources:
        enabled: False
        cores_per_instance: "8"
        memory_capacity: "8000Mi"
      replicaCount: [1, 2, 4, 8]

    redis-vector-db:
      resources:
        enabled: False
        cores_per_instance: ""
        memory_capacity: ""
      replicaCount: [1, 1, 1, 1]

    chatqna-ui:
      replicaCount: [1, 1, 1, 1]

    nginx:
      replicaCount: [1, 1, 1, 1]

benchmark:
  # http request behavior related fields
  user_queries: [4, 8, 16, 32, 64, 128, 256, 512, 1024]
  concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256]
  load_shape_type: "constant" # "constant" or "poisson"
  poisson_arrival_rate: 1.0 # only used when load_shape_type is "poisson"
  warmup_iterations: 10
  seed: 1024

  # workload, all of the test cases will run for benchmark
  bench_target: [chatqna_qlist_pubmed]
  dataset: ["/home/sdp/opea_benchmark/pubmed_10.txt"]
  prompt: [10]

  llm:
    # specify the llm output token size
    max_token_size: [1024]
```
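
The `llm.replicaCount` lists in the config above can be read as a lookup keyed by node count and by whether the reranker is enabled. The sketch below illustrates that reading. It is an assumption: the config does not say that list positions correspond to 1, 2, 4, and 8 nodes, and `llm_replica_count` is a hypothetical helper rather than a function from the deployment scripts.

```python
# Hypothetical helper; the node-count-to-index mapping is an assumption.
NODE_COUNTS = [1, 2, 4, 8]
LLM_REPLICAS = {
    True: [7, 15, 31, 63],   # with_teirerank: teirerank.enabled is True
    False: [8, 16, 32, 64],  # without_teirerank: teirerank.enabled is False
}


def llm_replica_count(nodes: int, teirerank_enabled: bool) -> int:
    """Pick the LLM replica count for a node count, by list position."""
    index = NODE_COUNTS.index(nodes)  # raises ValueError for unlisted node counts
    return LLM_REPLICAS[teirerank_enabled][index]


# This run: one node, reranker disabled.
print(llm_replica_count(nodes=1, teirerank_enabled=False))
```

For this run (`node: [1]`, `teirerank.enabled: False`) the lookup yields 8 LLM replicas, which with `cards_per_instance: 1` would correspond to one replica per Gaudi2 card.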

README-deploy-benchmark.md

Lines changed: 4 additions & 0 deletions
@@ -192,3 +192,7 @@ Choose "oob" mode when you want to selectively enable optimizations, or "tune" m
- After cleaning up the directory, try running the deployment again

Note: Always ensure there are no leftover Helm chart directories from previous failed runs before starting a new deployment.

## ChatQnA Release Data

The ChatQnA benchmark results are available in [ChatQnA/benchmark_results.md](./ChatQnA/benchmark_results.md).
