Problem
The concurrent load testing playbook (llm-benchmark-concurrent-load.yml) runs 3 phases:
- Phase 1: Baseline (Fixed Tokens, No Caching)
- Phase 2: Realistic (Variable Tokens, No Caching)
- Phase 3: Production (Variable Tokens, With Caching)
However, Phases 2 and 3 only execute for workloads with a _var variant. Currently, only chat_var and code_var exist.
The summarization workload has no _var variant, so it only runs in Phase 1.
Current Behavior
Lines 112-113 and 151-152 in llm-benchmark-concurrent-load.yml:
when:
  - not (skip_phase_2 | default(false) | bool)
  - (base_workload + '_var') in ['chat_var', 'code_var']  # ❌ summarization_var missing
This means:
- base_workload=summarization → Only Phase 1 runs
- base_workload=chat → All 3 phases run
- base_workload=code → All 3 phases run
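The gating described above can be sketched in Python to check which phases a given workload reaches. This only mirrors the `when` condition quoted from the playbook; the function name `phases_for` and the `skip_phase_3` flag are hypothetical (the issue only shows `skip_phase_2`), and Phase 1 is assumed to run unconditionally.

```python
# Workloads the playbook currently accepts for Phases 2 and 3
# (copied from the `when` condition quoted above).
CURRENT_VAR_WORKLOADS = ["chat_var", "code_var"]


def phases_for(base_workload, var_workloads=CURRENT_VAR_WORKLOADS,
               skip_phase_2=False, skip_phase_3=False):
    """Return the list of phases that would run for a base workload.

    Assumption: Phase 1 always runs; Phases 2 and 3 share the same
    `_var` membership check. `skip_phase_3` is a hypothetical flag.
    """
    phases = [1]
    has_var_variant = (base_workload + "_var") in var_workloads
    if has_var_variant and not skip_phase_2:
        phases.append(2)
    if has_var_variant and not skip_phase_3:
        phases.append(3)
    return phases


if __name__ == "__main__":
    for workload in ("summarization", "chat", "code"):
        print(workload, phases_for(workload))
    # With the proposed fix, summarization reaches every phase.
    fixed = CURRENT_VAR_WORKLOADS + ["summarization_var"]
    print("summarization (fixed)", phases_for("summarization", fixed))
```

Passing the extended list as `var_workloads` shows the effect of the proposed change without touching the playbook.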
Impact
- Incomplete testing: Summarization workload cannot be tested with realistic variable token distributions
- No prefix caching evaluation: Cannot measure caching benefits for summarization use cases
- Test matrix gap: Models with summarization as a default workload (e.g., facebook/opt-125m) miss two of the three test phases
Proposed Solution
Add a summarization_var workload definition to test-workloads.yml.
Recommended Configuration
# Summarization workload with variability (realistic traffic simulation)
summarization_var:
  workload_type: "summarization_var"
  isl: 1024          # Mean input length (tokens)
  isl_stdev: 256     # Input length std dev (~25% of mean)
  isl_min: 512       # Minimum input (short articles)
  isl_max: 2048      # Maximum input (long articles)
  osl: 256           # Mean output length (tokens)
  osl_stdev: 64      # Output length std dev (~25% of mean)
  osl_min: 128       # Minimum output (brief summaries)
  osl_max: 512       # Maximum output (detailed summaries)
  variability: true  # Enable statistical distribution
  backend: "openai-completions"
  vllm_args:
    - "--dtype=bfloat16"
    - "--no-enable-prefix-caching"  # Prefix caching disabled
  kv_cache_space: "40GiB"  # ~1280 avg tokens * ~32 concurrent
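To sanity-check the numbers in this configuration, the following is a minimal sketch that samples request lengths and reproduces the "~1280 avg tokens * ~32 concurrent" estimate from the kv_cache_space comment. It assumes a normal distribution clamped to the min/max bounds; how the benchmark tool actually samples lengths is not stated in this issue, and `sample_length` is a hypothetical helper.

```python
import random

# Values copied from the proposed summarization_var definition.
ISL = {"mean": 1024, "stdev": 256, "lo": 512, "hi": 2048}
OSL = {"mean": 256, "stdev": 64, "lo": 128, "hi": 512}
CONCURRENCY = 32  # from the kv_cache_space comment


def sample_length(mean, stdev, lo, hi, rng):
    """Draw one token length from a normal distribution, clamped to [lo, hi].

    Assumption: clamped normal sampling; the real tool may differ.
    """
    return max(lo, min(hi, round(rng.gauss(mean, stdev))))


rng = random.Random(0)  # fixed seed so runs are repeatable
requests = [
    (sample_length(rng=rng, **ISL), sample_length(rng=rng, **OSL))
    for _ in range(1000)
]

# Average tokens per request and tokens held across concurrent requests.
avg_tokens = ISL["mean"] + OSL["mean"]
tokens_in_flight = avg_tokens * CONCURRENCY

if __name__ == "__main__":
    print("first requests (isl, osl):", requests[:3])
    print("avg tokens per request:", avg_tokens)
    print("tokens in flight at concurrency 32:", tokens_in_flight)
```

The token count alone does not determine the memory footprint; bytes per token depend on the model and dtype, so the 40GiB figure should be checked against the target models.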
Changes Required
- File: automation/test-execution/ansible/inventory/group_vars/all/test-workloads.yml
  - Add summarization_var workload definition (as shown above)
- File: automation/test-execution/ansible/llm-benchmark-concurrent-load.yml
  - Update line 113:
    - (base_workload + '_var') in ['chat_var', 'code_var', 'summarization_var']
  - Update line 152:
    - (base_workload + '_var') in ['chat_var', 'code_var', 'summarization_var']
- File: tests/concurrent-load/concurrent-load.md (documentation)
  - Update Phase 2/3 sections to mention summarization_var support
  - Add test IDs like CONC-OPT125M-SUMM-VAR, CONC-LLAMA32-SUMM-VAR, etc.
Additional Context
Recommended Datasets for Phase 3 (Production)
For realistic summarization testing with prefix caching benefits:
Affected Models
Models with summarization in their default_workloads (from model-matrix.yaml):
- facebook/opt-125m (Test ID: CONC-OPT125M-SUMM)
- Potentially others if added in the future
References
- model-matrix.yaml
- concurrent-load.md
- test-workloads.yml