Release v1.3 ChatQnA OOB benchmark data (#2041)

chensuyue · web-flow · commit 1f06dcad7276 · 2025-06-11T09:40:38.000+08:00
Signed-off-by: chensuyue &lt;suyue.chen@intel.com&gt;
diff --git a/ChatQnA/benchmark_results.md b/ChatQnA/benchmark_results.md
@@ -0,0 +1,161 @@
+# ChatQnA Benchmark Results
+
+## Overview
+
+ChatQnA deployed on a single node with ICX cores as the head node and supporting 8x Gaudi2 cards.
+This is based on OPEA v1.3 release helm charts and images using vLLM inferencing platform.
+
+## Methodology
+
+Tests scale concurrent users from 1 to 256, and each user send 4 queries. Measuring end to end (E2E) latency average for each query, time to first token (TTFT) average and time per output token (TPOT) average.
+
+## Hardware and Software Configuration
+
+| **Category**                      | **Details**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+| --------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **System Summary**                | 1-node, 2x Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, 40 cores, 270W TDP, HT On, Turbo On, NUMA 2, Integrated Accelerators Available [used]: DLB 0 [0], DSA 0 [0], IAA 0 [0], QAT 0 [0], Total Memory 1024GB (32x32GB DDR4 3200 MT/s [3200 MT/s]), BIOS ETM02, microcode 0xd0003b9, 8x Habana Labs Ltd., 4x MT28800 Family [ConnectX-5 Ex], 4x 7T INTEL SSDPF2KX076TZ, 2x 894.3G SAMSUNG MZ1L2960HCJR-00A07, Ubuntu 22.04.3 LTS, 5.15.0-92-generic. Software: WORKLOAD+VERSION, COMPILER, LIBRARIES, OTHER_SW. |
+| **Framework**                     | langchain, vLLM, habana framework                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+| **Orchestration**                 | k8s/docker                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+| **Containers and Virtualization** | Kubernetes v1.29.9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+| **Drivers**                       | habana driver 1.20.1-366eb9c                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+| **VM vcpu, Memory**               | 160 vCPUs, 1T memory                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+| **OPEA Release Version**          | v1.3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+| **Dataset**                       | pubmed_10.txt                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+| **Embedding Model**               | BAAI/bge-base-en-v1.5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+| **Database**                      | redis                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+| **LLM Model**                     | meta-llama/Llama-3.1-8B-Instruct                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+| **Precision**                     | bf16                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+| **Output Length**                 | 1024                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+| **Command Line Parameters**       | python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1 --test-mode oob                                                                                                                                                                                                                                                                                                                                                                                                                   |
+| **Batch Size**                    | 256                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+
+## Benchmark Results
+
+| Users | E2E Latency Avg (ms) | TTFT Avg (ms) | TPOT Avg (ms) |
+| ----- | -------------------- | ------------- | ------------- |
+| 256   | 35,034.7             | 1,042.8       | 33.1          |
+| 128   | 20,996.0             | 529.8         | 19.9          |
+| 64    | 16,602.1             | 404.9         | 15.8          |
+| 32    | 14,646.5             | 260.1         | 14.0          |
+| 16    | 13,669.3             | 193.7         | 13.1          |
+| 8     | 13,275.2             | 157.3         | 12.8          |
+| 4     | 13,038.8             | 127.7         | 12.5          |
+| 2     | 13,059.0             | 129.4         | 12.6          |
+| 1     | 12,906.5             | 126.8         | 12.5          |
+
+## Benchmark Config Yaml
+
+<details>
+<summary>Click to Check Benchmark Config Yaml</summary>
+
+```yaml
+deploy:
+  device: gaudi
+  version: 1.3.0
+  modelUseHostPath: /home/sdp/opea_benchmark/model
+  HUGGINGFACEHUB_API_TOKEN: xxx
+  node: [1]
+  namespace: default
+  timeout: 1000 # timeout in seconds for services to be ready, default 30 minutes
+  interval: 5 # interval in seconds between service ready checks, default 5 seconds
+
+  services:
+    backend:
+      resources:
+        enabled: False
+        cores_per_instance: "16"
+        memory_capacity: "8000Mi"
+      replicaCount: [1, 2, 4, 8]
+
+    teirerank:
+      enabled: False
+      model_id: ""
+      resources:
+        enabled: False
+        cards_per_instance: 1
+      replicaCount: [1, 1, 1, 1]
+
+    tei:
+      model_id: ""
+      resources:
+        enabled: False
+        cores_per_instance: "80"
+        memory_capacity: "20000Mi"
+      replicaCount: [1, 2, 4, 8]
+
+    llm:
+      engine: vllm
+      model_id: "meta-llama/Llama-3.1-8B-Instruct" # mandatory
+      replicaCount:
+        with_teirerank: [7, 15, 31, 63] # When teirerank.enabled is True
+        without_teirerank: [8, 16, 32, 64] # When teirerank.enabled is False
+      resources:
+        enabled: False
+        cards_per_instance: 1
+      model_params:
+        vllm: # VLLM specific parameters
+          batch_params:
+            enabled: True
+            max_num_seqs: [256]
+          token_params:
+            enabled: False
+            max_input_length: ""
+            max_total_tokens: ""
+            max_batch_total_tokens: ""
+            max_batch_prefill_tokens: ""
+        tgi: # TGI specific parameters
+          batch_params:
+            enabled: True
+            max_batch_size: [1, 2, 4, 8] # Each value triggers an LLM service upgrade
+          token_params:
+            enabled: False
+            max_input_length: "1280"
+            max_total_tokens: "2048"
+            max_batch_total_tokens: "65536"
+            max_batch_prefill_tokens: "4096"
+
+    data-prep:
+      resources:
+        enabled: False
+        cores_per_instance: ""
+        memory_capacity: ""
+      replicaCount: [1, 1, 1, 1]
+
+    retriever-usvc:
+      resources:
+        enabled: False
+        cores_per_instance: "8"
+        memory_capacity: "8000Mi"
+      replicaCount: [1, 2, 4, 8]
+
+    redis-vector-db:
+      resources:
+        enabled: False
+        cores_per_instance: ""
+        memory_capacity: ""
+      replicaCount: [1, 1, 1, 1]
+
+    chatqna-ui:
+      replicaCount: [1, 1, 1, 1]
+
+    nginx:
+      replicaCount: [1, 1, 1, 1]
+
+benchmark:
+  # http request behavior related fields
+  user_queries: [4, 8, 16, 32, 64, 128, 256, 512, 1024]
+  concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256]
+  load_shape_type: "constant" # "constant" or "poisson"
+  poisson_arrival_rate: 1.0 # only used when load_shape_type is "poisson"
+  warmup_iterations: 10
+  seed: 1024
+
+  # workload, all of the test cases will run for benchmark
+  bench_target: [chatqna_qlist_pubmed]
+  dataset: ["/home/sdp/opea_benchmark/pubmed_10.txt"]
+  prompt: [10]
+
+  llm:
+    # specify the llm output token size
+    max_token_size: [1024]
+```
diff --git a/README-deploy-benchmark.md b/README-deploy-benchmark.md
@@ -192,3 +192,7 @@ Choose "oob" mode when you want to selectively enable optimizations, or "tune" m
 - After cleaning up the directory, try running the deployment again
 
 Note: Always ensure there are no leftover Helm chart directories from previous failed runs before starting a new deployment.
+
+## ChatQnA Release Data
+
+The ChatQnA benchmark results are available in the [ChatQnA/benchmark_results.md](./ChatQnA/benchmark_results.md).