Added llama3.1-70b Benchmarking recipe on A3-Mega nodes #246
krishnakanthankam-qt wants to merge 8 commits into
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
| This recipe supports the following models. Running TRTLLM inference benchmarking on these models are only tested and validated on A3-Mega GKE nodes with certain combination of TP, PP, EP, number of GPU chips, input & output sequence length, precision, etc. | ||
| Example model configuration YAML files included in this repo only show a certain combination of parallelism hyperparameters and configs for benchmarking purposes. Input and output length in `/home/akrishnakanth/gpu-recipes/inference/a3mega/llama3.1-70b/trtllm-gke/values.yaml` need to be adjusted according to the model and its configs. |
| rm -rf $engine_dir | ||
| rm -f $dataset_file | ||
| rm -rf $engine_dir || true | ||
| rm -f $dataset_file || true |
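A note on the cleanup change above: `rm -f` and `rm -rf` already return success when the target does not exist, so `|| true` only matters for other failures (for example, a permissions error) in a script that exits on error. The larger risk in both versions is the unquoted variables. The following is a minimal sketch of a guarded cleanup; the helper name `cleanup_artifacts` and the test paths are hypothetical and not part of the PR.

```shell
# Hypothetical helper (not from the PR): removes the engine directory and
# dataset file passed as arguments.
cleanup_artifacts() {
  # "${N:?msg}" aborts if the argument is unset or empty, so rm is never
  # called with an empty path. Quoting keeps paths with spaces intact.
  rm -rf -- "${1:?engine dir not set}"
  rm -f -- "${2:?dataset file not set}"
}

# Illustrative paths under a throwaway directory.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/engine dir"
touch "$work_dir/dataset file.json"

cleanup_artifacts "$work_dir/engine dir" "$work_dir/dataset file.json"
```

Whether the benchmark script needs the `|| true` at all depends on whether it runs under `set -e`, which is not visible in this diff.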
| --backend "pytorch" \ | ||
| --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction \ | ||
| $extra_args $vl_args > $output_file | ||
| $extra_args $vl_args | tee "$output_file" |
`| tee`: This change can be reverted to the original.
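For context on the two forms being compared, here is a minimal sketch of how a plain redirect and a pipe into `tee` differ. `fake_benchmark` is a hypothetical stand-in for the real benchmark command, which is not reproduced here. The sketch assumes the shell's default settings (no `pipefail`).

```shell
# Hypothetical stand-in for the benchmark command: prints one line,
# then fails with exit status 3.
fake_benchmark() {
  echo "throughput: 123"
  return 3
}

out_dir=$(mktemp -d)

# Plain redirect: output goes only to the file, and the reported status
# is the command's own.
redirect_status=0
fake_benchmark > "$out_dir/redirect.txt" || redirect_status=$?

# Pipe into tee: output goes to both stdout and the file, but without
# pipefail the pipeline's status is tee's, so the failure is masked.
tee_status=0
fake_benchmark | tee "$out_dir/tee.txt" || tee_status=$?

echo "redirect_status=$redirect_status tee_status=$tee_status"
```

With the redirect, a later `cat` of the output file is what surfaces the results in the pod logs. With `tee`, the output is already streamed, but a failing benchmark can look successful unless the script enables `set -o pipefail` in bash.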
| --dataset $dataset_file \ | ||
| --engine_dir $engine_dir \ | ||
| --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args >$output_file | ||
| --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args | tee $output_file |
`| tee`: This can also be reverted to the original.
| --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args | tee $output_file | ||
| fi | ||
| cat $output_file |
Add this back to the file.
| serverArgs: | ||
| max-model-len: 32768 | ||
| max-num-seqs: 128 | ||
| gpu-memory-utilization: 0.90 No newline at end of file |
Please remove `gpu-memory-utilization: 0.90` from here; this value is already passed from trtllm-configs.
| helm install -f values.yaml \ | ||
| --set workload.benchmarks.experiments[0].isl=128 \ | ||
| --set workload.benchmarks.experiments[0].osl=128 \ | ||
| --set workload.benchmarks.experiments[0].num_requests=1000 \ |
Lines 235 to 237 can be removed; these values are already passed from values.yaml. We don't usually hardcode values in the README.
| $REPO_ROOT/src/helm-charts/a3mega/trtllm-inference/single-node | ||
| ``` | ||
| > [!NOTE] | ||
| > You can modify the benchmark configuration at runtime by changing the values for `isl`, `osl`, and `num_requests` (number of prompts) in the Helm command to test different scenarios. |
Please check the other recipes to update this line.
| =========================================================== | ||
| DATASET DETAILS | ||
| =========================================================== | ||
| Dataset Path: /ssd/token-norm-dist_llama3.1-70b_128_128_tp4.json |
| PYTORCH BACKEND | ||
| =========================================================== | ||
| Model: nvidia/Llama3.1-70b | ||
| Model Path: /ssd/nvidia/Llama3.1-70b |
Description
Title
Add Llama 3.1 70B Recipe and Optimized Sequential Benchmarking
Summary
Introduces a high-performance recipe for serving and benchmarking Llama 3.1 70B on A3mega GKE node pools.