| 1 | +# ChatQnA Benchmark Results |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +ChatQnA was deployed on a single node with an ICX (Intel Xeon Platinum 8380) host and 8x Gaudi2 cards. |
| 6 | +The deployment uses the Helm charts and images from the OPEA v1.3 release, with vLLM as the inference engine. |
| 7 | + |
| 8 | +## Methodology |
| 9 | + |
| 10 | +Tests scale the number of concurrent users from 1 to 256, with each user sending 4 queries. The reported metrics for each concurrency level are the average end-to-end (E2E) latency per query, the average time to first token (TTFT), and the average time per output token (TPOT). |
| 11 | + |
| 12 | +## Hardware and Software Configuration |
| 13 | + |
| 14 | +| **Category** | **Details** | |
| 15 | +| --------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | |
| 16 | +| **System Summary** | 1-node, 2x Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, 40 cores, 270W TDP, HT On, Turbo On, NUMA 2, Integrated Accelerators Available [used]: DLB 0 [0], DSA 0 [0], IAA 0 [0], QAT 0 [0], Total Memory 1024GB (32x32GB DDR4 3200 MT/s [3200 MT/s]), BIOS ETM02, microcode 0xd0003b9, 8x Habana Labs Ltd., 4x MT28800 Family [ConnectX-5 Ex], 4x 7T INTEL SSDPF2KX076TZ, 2x 894.3G SAMSUNG MZ1L2960HCJR-00A07, Ubuntu 22.04.3 LTS, 5.15.0-92-generic. Software: WORKLOAD+VERSION, COMPILER, LIBRARIES, OTHER_SW. | |
| 17 | +| **Framework** | langchain, vLLM, habana framework | |
| 18 | +| **Orchestration** | k8s/docker | |
| 19 | +| **Containers and Virtualization** | Kubernetes v1.29.9 | |
| 20 | +| **Drivers** | habana driver 1.20.1-366eb9c | |
| 21 | +| **VM vcpu, Memory** | 160 vCPUs, 1T memory | |
| 22 | +| **OPEA Release Version** | v1.3 | |
| 23 | +| **Dataset** | pubmed_10.txt | |
| 24 | +| **Embedding Model** | BAAI/bge-base-en-v1.5 | |
| 25 | +| **Database** | redis | |
| 26 | +| **LLM Model** | meta-llama/Llama-3.1-8B-Instruct | |
| 27 | +| **Precision** | bf16 | |
| 28 | +| **Output Length** | 1024 | |
| 29 | +| **Command Line Parameters** | python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1 --test-mode oob | |
| 30 | +| **Batch Size** | 256 | |
| 31 | + |
| 32 | +## Benchmark Results |
| 33 | + |
| 34 | +| Users | E2E Latency Avg (ms) | TTFT Avg (ms) | TPOT Avg (ms) | |
| 35 | +| ----- | -------------------- | ------------- | ------------- | |
| 36 | +| 256 | 35,034.7 | 1,042.8 | 33.1 | |
| 37 | +| 128 | 20,996.0 | 529.8 | 19.9 | |
| 38 | +| 64 | 16,602.1 | 404.9 | 15.8 | |
| 39 | +| 32 | 14,646.5 | 260.1 | 14.0 | |
| 40 | +| 16 | 13,669.3 | 193.7 | 13.1 | |
| 41 | +| 8 | 13,275.2 | 157.3 | 12.8 | |
| 42 | +| 4 | 13,038.8 | 127.7 | 12.5 | |
| 43 | +| 2 | 13,059.0 | 129.4 | 12.6 | |
| 44 | +| 1 | 12,906.5 | 126.8 | 12.5 | |
| 45 | + |
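As a rough consistency check on the table above, E2E latency can be approximated as TTFT plus TPOT times the number of remaining output tokens. The sketch below assumes every response generates exactly 1024 tokens (the configured output length) and that TPOT is averaged over the tokens after the first one; neither assumption is stated in the report, and the helper names are illustrative only.

```python
# Rows copied from the results table: users -> (E2E avg ms, TTFT avg ms, TPOT avg ms)
RESULTS = {
    256: (35034.7, 1042.8, 33.1),
    128: (20996.0, 529.8, 19.9),
    64: (16602.1, 404.9, 15.8),
    32: (14646.5, 260.1, 14.0),
    16: (13669.3, 193.7, 13.1),
    8: (13275.2, 157.3, 12.8),
    4: (13038.8, 127.7, 12.5),
    2: (13059.0, 129.4, 12.6),
    1: (12906.5, 126.8, 12.5),
}

# Assumption: each response is exactly the configured output length.
OUTPUT_TOKENS = 1024


def estimate_e2e_ms(ttft_ms, tpot_ms, output_tokens=OUTPUT_TOKENS):
    """Approximate E2E latency as first-token time plus per-token time for the rest."""
    return ttft_ms + tpot_ms * (output_tokens - 1)


def relative_gap(measured_ms, estimated_ms):
    """Absolute difference between measured and estimated latency, relative to measured."""
    return abs(measured_ms - estimated_ms) / measured_ms


for users, (e2e_ms, ttft_ms, tpot_ms) in RESULTS.items():
    estimate = estimate_e2e_ms(ttft_ms, tpot_ms)
    gap = relative_gap(e2e_ms, estimate)
    print(f"{users:>3} users: measured {e2e_ms:9.1f} ms, "
          f"estimated {estimate:9.1f} ms, gap {gap:.2%}")
```

Under these assumptions the estimate stays within about 1% of the measured E2E latency at every concurrency level, which suggests the three reported columns are mutually consistent.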
| 46 | +## Benchmark Config YAML
| 47 | + |
| 48 | +<details> |
| 49 | +<summary>Click to view the benchmark config YAML</summary>
| 50 | + |
| 51 | +```yaml |
| 52 | +deploy: |
| 53 | + device: gaudi |
| 54 | + version: 1.3.0 |
| 55 | + modelUseHostPath: /home/sdp/opea_benchmark/model |
| 56 | + HUGGINGFACEHUB_API_TOKEN: xxx |
| 57 | + node: [1] |
| 58 | + namespace: default |
| 59 | + timeout: 1000 # timeout in seconds for services to be ready, default 30 minutes |
| 60 | + interval: 5 # interval in seconds between service ready checks, default 5 seconds |
| 61 | + |
| 62 | + services: |
| 63 | + backend: |
| 64 | + resources: |
| 65 | + enabled: False |
| 66 | + cores_per_instance: "16" |
| 67 | + memory_capacity: "8000Mi" |
| 68 | + replicaCount: [1, 2, 4, 8] |
| 69 | + |
| 70 | + teirerank: |
| 71 | + enabled: False |
| 72 | + model_id: "" |
| 73 | + resources: |
| 74 | + enabled: False |
| 75 | + cards_per_instance: 1 |
| 76 | + replicaCount: [1, 1, 1, 1] |
| 77 | + |
| 78 | + tei: |
| 79 | + model_id: "" |
| 80 | + resources: |
| 81 | + enabled: False |
| 82 | + cores_per_instance: "80" |
| 83 | + memory_capacity: "20000Mi" |
| 84 | + replicaCount: [1, 2, 4, 8] |
| 85 | + |
| 86 | + llm: |
| 87 | + engine: vllm |
| 88 | + model_id: "meta-llama/Llama-3.1-8B-Instruct" # mandatory |
| 89 | + replicaCount: |
| 90 | + with_teirerank: [7, 15, 31, 63] # When teirerank.enabled is True |
| 91 | + without_teirerank: [8, 16, 32, 64] # When teirerank.enabled is False |
| 92 | + resources: |
| 93 | + enabled: False |
| 94 | + cards_per_instance: 1 |
| 95 | + model_params: |
| 96 | + vllm: # VLLM specific parameters |
| 97 | + batch_params: |
| 98 | + enabled: True |
| 99 | + max_num_seqs: [256] |
| 100 | + token_params: |
| 101 | + enabled: False |
| 102 | + max_input_length: "" |
| 103 | + max_total_tokens: "" |
| 104 | + max_batch_total_tokens: "" |
| 105 | + max_batch_prefill_tokens: "" |
| 106 | + tgi: # TGI specific parameters |
| 107 | + batch_params: |
| 108 | + enabled: True |
| 109 | + max_batch_size: [1, 2, 4, 8] # Each value triggers an LLM service upgrade |
| 110 | + token_params: |
| 111 | + enabled: False |
| 112 | + max_input_length: "1280" |
| 113 | + max_total_tokens: "2048" |
| 114 | + max_batch_total_tokens: "65536" |
| 115 | + max_batch_prefill_tokens: "4096" |
| 116 | + |
| 117 | + data-prep: |
| 118 | + resources: |
| 119 | + enabled: False |
| 120 | + cores_per_instance: "" |
| 121 | + memory_capacity: "" |
| 122 | + replicaCount: [1, 1, 1, 1] |
| 123 | + |
| 124 | + retriever-usvc: |
| 125 | + resources: |
| 126 | + enabled: False |
| 127 | + cores_per_instance: "8" |
| 128 | + memory_capacity: "8000Mi" |
| 129 | + replicaCount: [1, 2, 4, 8] |
| 130 | + |
| 131 | + redis-vector-db: |
| 132 | + resources: |
| 133 | + enabled: False |
| 134 | + cores_per_instance: "" |
| 135 | + memory_capacity: "" |
| 136 | + replicaCount: [1, 1, 1, 1] |
| 137 | + |
| 138 | + chatqna-ui: |
| 139 | + replicaCount: [1, 1, 1, 1] |
| 140 | + |
| 141 | + nginx: |
| 142 | + replicaCount: [1, 1, 1, 1] |
| 143 | + |
| 144 | +benchmark: |
| 145 | + # http request behavior related fields |
| 146 | + user_queries: [4, 8, 16, 32, 64, 128, 256, 512, 1024] |
| 147 | + concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256] |
| 148 | + load_shape_type: "constant" # "constant" or "poisson" |
| 149 | + poisson_arrival_rate: 1.0 # only used when load_shape_type is "poisson" |
| 150 | + warmup_iterations: 10 |
| 151 | + seed: 1024 |
| 152 | + |
| 153 | + # workload, all of the test cases will run for benchmark |
| 154 | + bench_target: [chatqna_qlist_pubmed] |
| 155 | + dataset: ["/home/sdp/opea_benchmark/pubmed_10.txt"] |
| 156 | + prompt: [10] |
| 157 | + |
| 158 | + llm: |
| 159 | + # specify the llm output token size |
| 160 | + max_token_size: [1024] |
| 161 | +``` |
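The `user_queries` and `concurrency` lists in the `benchmark` section line up with the methodology's "4 queries per user" if they are read as index-wise pairs. The sketch below checks that reading; the pairing rule itself is an assumption about how the benchmark script consumes the two lists, and `queries_per_user` is a hypothetical helper, not part of the OPEA tooling.

```python
# Values copied from the benchmark section of the config.
user_queries = [4, 8, 16, 32, 64, 128, 256, 512, 1024]
concurrency = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def queries_per_user(total_queries, users):
    """Pair the two lists index-wise (assumed) and return the queries sent by each user."""
    if len(total_queries) != len(users):
        raise ValueError("user_queries and concurrency must have the same length")
    per_user = []
    for total, count in zip(total_queries, users):
        if total % count != 0:
            raise ValueError(f"{total} queries do not divide evenly across {count} users")
        per_user.append(total // count)
    return per_user


for users, per_user in zip(concurrency, queries_per_user(user_queries, concurrency)):
    print(f"{users:>3} concurrent users -> {per_user} queries each")
```

With the values from this config, every test case works out to 4 queries per user, matching the methodology section.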