Add summarization_var workload for Phase 2/3 concurrent load testing #52

@maryamtahhan

Problem

The concurrent load testing playbook (llm-benchmark-concurrent-load.yml) runs 3 phases:

  • Phase 1: Baseline (Fixed Tokens, No Caching)
  • Phase 2: Realistic (Variable Tokens, No Caching)
  • Phase 3: Production (Variable Tokens, With Caching)

However, Phases 2 and 3 only execute for workloads with a _var variant. Currently, only chat_var and code_var exist.

The summarization workload has no _var variant, so it only runs in Phase 1.

Current Behavior

Lines 112-113 and 151-152 in llm-benchmark-concurrent-load.yml:

when:
  - not (skip_phase_2 | default(false) | bool)
  - (base_workload + '_var') in ['chat_var', 'code_var']  # ❌ summarization_var missing

This means:

  • base_workload=summarization → Only Phase 1 runs
  • base_workload=chat → All 3 phases run
  • base_workload=code → All 3 phases run
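The gating behaviour above can be reproduced outside Ansible with a small sketch. This is only an illustration of the membership check: `phases_for` and `VAR_WORKLOADS` are hypothetical names, the skip flags are left out, and Phase 1 is assumed to run unconditionally.

```python
# Hypothetical mirror of the playbook's `when` gate. Only the workload
# identifiers come from the issue; every other name is illustrative.
VAR_WORKLOADS = ["chat_var", "code_var"]  # list hard-coded in the playbook today


def phases_for(base_workload, var_workloads=VAR_WORKLOADS):
    """Return the phase numbers that would run for a base workload."""
    phases = [1]  # assumption: Phase 1 has no _var requirement
    if (base_workload + "_var") in var_workloads:
        # Phases 2 and 3 share the same membership check
        phases.extend([2, 3])
    return phases


if __name__ == "__main__":
    for name in ("summarization", "chat", "code"):
        print(name, phases_for(name))
    # With the proposed entry added, summarization reaches all phases
    print(phases_for("summarization", VAR_WORKLOADS + ["summarization_var"]))
```

Passing the extended list shows the effect of the proposed change without touching the playbook.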

Impact

  1. Incomplete testing: Summarization workload cannot be tested with realistic variable token distributions
  2. No prefix caching evaluation: Cannot measure caching benefits for summarization use cases
  3. Test matrix gap: Models with summarization as a default workload (e.g., facebook/opt-125m) miss two of the three test phases

Proposed Solution

Add summarization_var workload definition to test-workloads.yml

Recommended Configuration

# Summarization workload with variability (Realistic traffic simulation)
summarization_var:
  workload_type: "summarization_var"
  isl: 1024                           # Mean input length
  isl_stdev: 256                      # Input length std dev (~25% of the mean)
  isl_min: 512                        # Minimum input (short articles)
  isl_max: 2048                       # Maximum input (long articles)
  osl: 256                            # Mean output length
  osl_stdev: 64                       # Output length std dev (~25% of the mean)
  osl_min: 128                        # Minimum output (brief summaries)
  osl_max: 512                        # Maximum output (detailed summaries)
  variability: true                   # Enable statistical distribution
  backend: "openai-completions"
  vllm_args:
    - "--dtype=bfloat16"
    - "--no-enable-prefix-caching"    # Baseline mode: no prefix caching
  kv_cache_space: "40GiB"             # ~1280 avg tokens * ~32 concurrent
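To get a feel for what these numbers imply, here is a rough sketch of the token lengths they would produce. It assumes lengths are drawn from a normal distribution and clamped to the min/max bounds; the benchmark tool's actual sampler may differ, and `sample_length` is a hypothetical helper, not part of the repository.

```python
import random


def sample_length(mean, stdev, lo, hi, rng):
    """Draw one token count from a normal distribution, clamped to [lo, hi].

    Assumption: clamped-normal sampling. The real load generator may
    resample or use another distribution.
    """
    value = round(rng.gauss(mean, stdev))
    return max(lo, min(hi, value))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # isl / isl_stdev / isl_min / isl_max from the proposed config
    inputs = [sample_length(1024, 256, 512, 2048, rng) for _ in range(10_000)]
    # osl / osl_stdev / osl_min / osl_max from the proposed config
    outputs = [sample_length(256, 64, 128, 512, rng) for _ in range(10_000)]
    print("input  min/max/mean:", min(inputs), max(inputs), sum(inputs) / len(inputs))
    print("output min/max/mean:", min(outputs), max(outputs), sum(outputs) / len(outputs))
```

Under this assumption both minimums sit two standard deviations below the mean, so roughly 2% of draws would be clamped to the lower bound, while the maximums (four standard deviations above) are almost never reached.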

Changes Required

  1. File: automation/test-execution/ansible/inventory/group_vars/all/test-workloads.yml

    • Add summarization_var workload definition (as shown above)
  2. File: automation/test-execution/ansible/llm-benchmark-concurrent-load.yml

    • Update line 113: - (base_workload + '_var') in ['chat_var', 'code_var', 'summarization_var']
    • Update line 152: - (base_workload + '_var') in ['chat_var', 'code_var', 'summarization_var']
  3. File: tests/concurrent-load/concurrent-load.md (documentation)

    • Update Phase 2/3 sections to mention summarization_var support
    • Add test IDs like CONC-OPT125M-SUMM-VAR, CONC-LLAMA32-SUMM-VAR, etc.

Additional Context

Recommended Datasets for Phase 3 (Production)

For realistic summarization testing with prefix caching benefits:

Affected Models

Models with summarization in their default_workloads (from model-matrix.yaml):

  • facebook/opt-125m (Test ID: CONC-OPT125M-SUMM)
  • Potentially others if added in future

References
