Commit 64af5a7

cquil11, Oseltamivir, dependabot[bot], functionstackx, and github-actions[bot] authored
Adding evals after throughput benchmarks (#258)
* initial poc
* remove -d flag when launching docker container
* syntax error
* compatibility fixes
* add correct endpoint prefix
* remove reference env var
* run vllm serve in background
* unescape sequences
* stop vllm to stdout after it stops
* stop vllm to stdout after it stops pt 2
* get rid of docker stop as no longer in detatched
* clone bench serving to tmp dir
* clone bench serving to tmp dir pt 2
* add explanatory comment
* cleaning up
* cleaning up
* adding mi355x refactor
* adding h200 initial refactor
* different way to see server logs
* cleanup
* now fail if server fails
* starting on b200
* doign b200
* reverting erroneous change
* fixing b200
* fixing b200 pt 2
* updating mi300
* updating mi300 pt 2
* updating mi300 pt 3 -- remove detached mode
* cleaning up mi355x
* fixing mi300x and updating 325x
* reverting max conc to 512 on gptoss fp4 b200 docker
* mi325x debug
* add back correct launch script for new mi325x slurm cluster (#231)
* fixing mi300x and updating 325x
* cleanng up
* add wait for h200 slurm dsr1
* max num seqs back to 512 for gptoss fpr b200 docker
* fix port issue for dsr1 mi300x docker
* fix mi355x docker NUM_PROMPTS
* adding prop of failure for server logs
* add utils function for benchmark
* add utils function for benchmark
* function-ize the waiting for server to start
* dont show arg parsing set -x
* dont show arg parsing set +x oops
* dont show arg parsing set +x oops
* capture server pid
* Squash-merge bryan/eval into refactor-docker-runner-launch
* evals h100-cr
* evals h100-cw
* evals h200-nb
* move eval script here
* evals mi300x-amd
* evals mi325x-amd
* evals mi300x-tw
* evals mi300x-oci
* evals mi325x-tw
* evals mi325x-tw summary
* evals mi325x-tw summary
* evals mi355x-amd
* evals mi325x-tw summary
* evals mi325x-tw summary
* evals mi325x-tw summary
* all summary
* evals b200-nvd
* evals b200-nvd 2
* evals b200-nvd 3
* evals h100-cr
* evals b200-nvd 1
* evals h200-trt-cw
* evals h200-trt-cw 2
* evals h200-trt-cw 3
* evals h100-cr 2
* evals h200-trt-cw 4
* evals h200-trt-cw 5 (EP/TP HARD)
* evals h200-trt-cw 6 (EP/TP HARD)
* evals h200-trt-cw 6 (EP/TP HARD)
* evals h200-cw dsr1
* evals mi300x-cr dsr1
* evals mi300x-cr dsr1 2
* evals mi325x-cr dsr1
* evals mi325x-cr dsr1 2
* evals mi355x-amd dsr1
* evals mi355x-amd dsr1 2
* evals mi355x-amd dsr1 3
* evals mi355x-amd dsr1 4
* evals b200-nvd dsr1
* evals b200-nvd fp8 dsr1
* Lighteval 1
* Lighteval 1.75
* Lighteval Mi325x
* Lighteval Mi300x CR
* Lighteval Mi355x amd
* Lighteval b200_nvd
* Lighteval h200_cr0
* Lighteval h200-nb_1
* Lighteval h100-cw_1
* Error reproduction
* Error file removal
* error reproducibility
* should NOT error reproduce
* should NOT error reproduce
* should NOT error reproduce
* should NOT error reproduce
* Double check other runner
* Cleanup MI300x_AMD
* Cleanup MI300x_AMD
* Cleanup MI300x_AMD
* Cleanup MI300x_AMD MUST WORK
* works
* Working lighteval
* lightevel fix
* lighteval test h100-cw_1
* lighteval test h100-cr_1 + parsing
* lighteval test b200_nvd
* lighteval test b200_nvd
* lighteval test mi300x-amd_0
* lighteval test h100-cw_1
* lighteval test mi300x-cr_0
* lighteval test mi325x-tw_1
* lighteval test mi355x-amd_4
* lighteval test b200-nvd_3
* lighteval test h100-cw_1 sudo test
* b200 fix check
* b200 fix check
* b200 fix check
* b200 fix check
* b200 fix check
* b200 fix check
* b200 fix check
* b200 fix check
* b200 fix check
* Prelimary lighteval for all
* Prelimary lighteval for all 2 - fixed TP
* Prelimary lighteval for all 3
* Fix lighteval 1
* Check both
* lm-eval check
* lm-eval check
* lm-eval check
* lm-eva l optimization
* mi325x test
* mi325x test
* all change, test deepseek
* all change, test deepseek
* retest mi325x
* test b200
* clean b200
* test h200
* H200 test
* B200-nvd2 sleep
* B200-nvd2 sleep
* B200-nvd2 sleep
* mi325x test
* mi325x test, no text, no empty fix
* h100, tmp eval_out
* h100, tmp eval_out, sweep integration
* touch up sweep naming, remove funny triton error
* touch up sweep summary
* touch up run name
* Missing eval env var docker
* Typo
* Add proper coverage
* Add evals
* Cam's solution
* b200 scancel fix
* Change to 2 fewshot, forgot eval env var in b200
* Resolve issues
* Resolve issues/nits
* fix summary table hardware
* fix summary table hardware
* fix summary table hardware 2
* final touches
* Cleanup comments, ammend lighteval
* pt 1 manual merge conflict fixes
* pt 2 manual merge conflict fixes
* use double quotes for gha parsing
* getting rid of full sweep sched changes
* add back spec decoding and disagg env vars
* add an option to ONLY run evals
* remove full-sweep-test workflow and add collect-evals job to run sweep and e2e test
* add run-eval to e2e tests
* math500 prompt and h200 trt evals
* remove run prefix
* add result-prefix to benchmark tmpl uploaded artifacts
* Evals summary refactor
* Evals summary refactor 2
* Evals summary aesthetics
* TRT package fix, trt testing
* trt testing 2
* max_num_tokens
* unbounded gen len
* Fix tmpl args, add isl/osl to table
* add isl/osl
* set max tokens
* remove nvd
* In case of multiple evals
* diagnostic
* test dp_attn
* DP_ATTENTION back
* REMOVE LIGHTEVAL
* Add evals for atom, trt_mtp
* remove tokenizer from benchmarkserving
* remove model_name
* More evals for spec decode
* claude pr comments
* chore(deps): bump the github-actions group with 2 updates (#488)
* fix: update ep metadata in gb200 dynamo sglang configs to match comments (#486)

  Update ep values to use the formula: EP = (NODES × 4 GPUs) / num-workers for both dsr1-fp8-gb200-dynamo-sglang and dsr1-fp4-gb200-dynamo-sglang configurations. The metadata isn't used by sglang dynamo scripts (values are hardcoded), but the frontend uses these values.

  Fixes #485

  Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
  Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
* Experimental folder (increasing researcher/developer velocity) (#489)
* summary table
* Remove git installation and repository cloning

  Removed git installation check and cloning of bench_serving repository.
* evals final
* more retries, lower conc, for stability

---------

Co-authored-by: Oseltamivir <bryansg2013@gmail.com>
Co-authored-by: Bryan Shan <58582368+Oseltamivir@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
1 parent cb23cd9 commit 64af5a7

53 files changed

Lines changed: 1196 additions & 119 deletions

.github/workflows/benchmark-tmpl.yml

Lines changed: 33 additions & 1 deletion
@@ -50,6 +50,14 @@ on:
       disagg:
         required: true
         type: string
+      run-eval:
+        type: boolean
+        required: true
+        default: false
+      random-range-ratio:
+        required: false
+        type: string
+        default: '0.8'
       ref:
         description: "Git ref (branch/sha) to checkout"
         required: false
@@ -74,6 +82,7 @@ env:
   CONC: ${{ inputs.conc }}
   SPEC_DECODING: ${{ inputs.spec-decoding }}
   DISAGG: ${{ inputs.disagg }}
+  RUN_EVAL: ${{ inputs.run-eval }}
 
 permissions:
   contents: read
@@ -82,7 +91,7 @@ jobs:
   benchmark:
     runs-on: ${{ inputs.runner }}
     timeout-minutes: 180
-    name: '${{ inputs.exp-name }} ${{ inputs.runner }} ${{ inputs.framework }} ${{ inputs.precision }} tp=${{ inputs.tp }} ep=${{ inputs.ep }} dpa=${{ inputs.dp-attn }} conc=${{ inputs.conc }} spec=${{ inputs.spec-decoding }}'
+    name: "${{ inputs.exp-name }} ${{ inputs.runner }} ${{ inputs.framework }} ${{ inputs.precision }} ${{ inputs.run-eval && 'eval ' || '' }}tp=${{ inputs.tp }} ep=${{ inputs.ep }} dpa=${{ inputs.dp-attn }} conc=${{ inputs.conc }} spec=${{ inputs.spec-decoding }}"
     steps:
       - name: Resource cleanup
         run: |
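The new job name inserts an `eval ` tag only when `run-eval` is true, using the `cond && a || b` idiom of GitHub Actions expressions. The following is a minimal Python sketch of that string construction, not code from the repository; the function name `job_name` and all input values are hypothetical and chosen only for illustration.

```python
def job_name(inputs: dict) -> str:
    """Build the job display name the way the workflow expression does."""
    # `inputs.run-eval && 'eval ' || ''` behaves like a ternary here because
    # the middle operand 'eval ' is a non-empty (truthy) string.
    eval_tag = "eval " if inputs["run-eval"] else ""
    return (
        f"{inputs['exp-name']} {inputs['runner']} {inputs['framework']} "
        f"{inputs['precision']} {eval_tag}tp={inputs['tp']} ep={inputs['ep']} "
        f"dpa={inputs['dp-attn']} conc={inputs['conc']} "
        f"spec={inputs['spec-decoding']}"
    )


# Hypothetical input values, chosen only for illustration.
example = {
    "exp-name": "dsr1_1k1k", "runner": "h200", "framework": "vllm",
    "precision": "fp8", "run-eval": True, "tp": 8, "ep": 1,
    "dp-attn": "false", "conc": 64, "spec-decoding": "none",
}
print(job_name(example))
print(job_name({**example, "run-eval": False}))
```

With `run-eval` false, the tag collapses to an empty string, so the name matches the pre-change format.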
@@ -113,7 +122,11 @@ jobs:
       - name: Launch job script
         env:
           RUNNER_NAME: ${{ runner.name }}
+          RUNNER_TYPE: ${{ inputs.runner }}
           RESULT_FILENAME: ${{ env.EXP_NAME }}_${{ env.PRECISION }}_${{ env.FRAMEWORK }}_tp${{ env.TP }}_ep${{ env.EP_SIZE }}_dpa_${{ env.DP_ATTENTION }}_conc${{ env.CONC }}_specdecode_${{ env.SPEC_DECODING }}_${{ runner.name }}
+          # Suppress per-job eval markdown from being appended to the step summary.
+          # We'll publish a single combined eval table in the collection job instead.
+          GITHUB_STEP_SUMMARY: ''
         run: |
           bash ./runners/launch_${RUNNER_NAME%%_*}.sh
           FOUND_RESULT_FILE=
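The launch step picks its script with the shell expansion `${RUNNER_NAME%%_*}`, which removes the longest suffix starting at an underscore. A rough Python equivalent is sketched below. The helper name `launch_script` is hypothetical, and the sample runner names are assumed from labels that appear in the commit message (for example `h100-cw_1`).

```python
def launch_script(runner_name: str) -> str:
    """Mirror `bash ./runners/launch_${RUNNER_NAME%%_*}.sh` path selection."""
    # `%%_*` strips everything from the FIRST underscore onward; splitting once
    # on "_" and keeping the head gives the same result. A name with no
    # underscore is left unchanged in both cases.
    base = runner_name.split("_", 1)[0]
    return f"./runners/launch_{base}.sh"


# Sample names assumed from the commit message, for illustration only.
for name in ("h100-cw_1", "mi300x-amd_0", "b200-nvd_3"):
    print(name, "->", launch_script(name))
```

Under this scheme, all runners that share a prefix before the first underscore use the same launch script.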
@@ -137,8 +150,27 @@ jobs:
           RUNNER_TYPE: ${{ inputs.runner }}
         run: |
           python3 utils/process_result.py
+
       - name: Upload result
         uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
         with:
           name: bmk_${{ env.RESULT_FILENAME }}
           path: agg_${{ env.RESULT_FILENAME }}.json
+
+      - name: Upload eval results (if any)
+        if: ${{ env.RUN_EVAL == 'true' }}
+        uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
+        with:
+          name: eval_${{ env.EXP_NAME }}_${{ env.RESULT_FILENAME }}
+          path: |
+            meta_env.json
+            results*.json
+            sample*.jsonl
+          if-no-files-found: ignore
+
+      - name: Cleanup eval outputs (post-upload)
+        if: ${{ env.RUN_EVAL == 'true' }}
+        run: |
+          rm -f meta_env.json || true
+          # Remove any eval results JSONs that were moved into workspace
+          rm -f results*.json || true
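Benchmark artifacts are uploaded as `bmk_<RESULT_FILENAME>` and eval artifacts as `eval_<EXP_NAME>_<RESULT_FILENAME>`. The collect-evals workflow in this commit downloads by the pattern `eval_*`, or `eval_<result-prefix>_*` when a prefix is given. The sketch below shows how the two artifact families separate under those patterns. It uses Python's `fnmatch` as an approximation of the action's glob matching, and the helper names and all sample values are hypothetical.

```python
from fnmatch import fnmatch


def result_filename(exp, precision, framework, tp, ep, dpa, conc, spec, runner_name):
    """Mirror the RESULT_FILENAME template from the workflow env block."""
    return (f"{exp}_{precision}_{framework}_tp{tp}_ep{ep}_dpa_{dpa}"
            f"_conc{conc}_specdecode_{spec}_{runner_name}")


def download_pattern(result_prefix: str = "") -> str:
    """Mirror the `pattern:` expression used when downloading eval artifacts."""
    return f"eval_{result_prefix}_*" if result_prefix else "eval_*"


# Hypothetical values, for illustration only.
rf = result_filename("dsr1_1k1k", "fp8", "vllm", 8, 1, "false", 64, "none", "h200-nb_1")
artifacts = [f"bmk_{rf}", f"eval_dsr1_1k1k_{rf}"]

# Only the eval artifact matches the default pattern.
selected = [a for a in artifacts if fnmatch(a, download_pattern())]
print(selected)
```

Note that the eval artifact name contains the experiment name twice, because `RESULT_FILENAME` already begins with `EXP_NAME`.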
.github/workflows/collect-evals.yml

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+name: Template - Collect Evals
+
+on:
+  workflow_call:
+    inputs:
+      result-prefix:
+        required: false
+        type: string
+        default: ''
+
+permissions:
+  contents: read
+
+jobs:
+  collect-evals:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
+        with:
+          token: ${{ secrets.REPO_PAT }}
+          fetch-depth: 0
+
+      - name: Download eval artifacts
+        uses: actions/download-artifact@018cc2cf5baa6db3ef3c5f8a56943fffe632ef53 # v6.0.0
+        with:
+          path: eval_results/
+          pattern: ${{ inputs.result-prefix && format('eval_{0}_*', inputs.result-prefix) || 'eval_*' }}
+
+      - name: Summarize evals
+        run: |
+          pip install tabulate
+          echo "## Eval Summary" >> $GITHUB_STEP_SUMMARY
+          echo "" >> $GITHUB_STEP_SUMMARY
+          python3 utils/collect_eval_results.py eval_results/ ${{ inputs.result-prefix || 'all' }} >> $GITHUB_STEP_SUMMARY
+
+      - name: Upload aggregated evals
+        uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
+        with:
+          name: eval_results_${{ inputs.result-prefix || 'all' }}
+          path: agg_eval_${{ inputs.result-prefix || 'all' }}.json
+
+      - name: Cleanup downloaded eval artifacts
+        if: ${{ always() }}
+        run: |
+          rm -rf eval_results/ || true
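The summarize step calls `utils/collect_eval_results.py` with a results directory and a prefix, redirects its stdout into the step summary, and then uploads `agg_eval_<prefix>.json`. That script's source is not shown in this diff, so the following is only a sketch of what such an aggregator might look like. The function names, the table columns, and the assumed JSON shape (`{"results": {task: {metric: value}}}`) are all hypothetical, and the sketch uses only the standard library rather than `tabulate`.

```python
import json
from pathlib import Path


def collect(root: Path) -> list:
    """Load every results*.json under root, one row per file."""
    rows = []
    for path in sorted(root.rglob("results*.json")):
        data = json.loads(path.read_text())
        rows.append({
            # Assumes each artifact was extracted into its own directory.
            "artifact": path.parent.name,
            # Assumed schema: a top-level "results" mapping task -> metrics.
            "results": data.get("results", {}),
        })
    return rows


def to_markdown(rows: list) -> str:
    """Render the collected rows as a Markdown table."""
    lines = ["| Artifact | Task | Metric | Value |", "| --- | --- | --- | --- |"]
    for row in rows:
        for task, metrics in row["results"].items():
            for metric, value in metrics.items():
                lines.append(f"| {row['artifact']} | {task} | {metric} | {value} |")
    return "\n".join(lines)


def aggregate(root_dir: str, prefix: str, out_dir: str = ".") -> str:
    """Write agg_eval_<prefix>.json and return the Markdown summary."""
    rows = collect(Path(root_dir))
    out_path = Path(out_dir) / f"agg_eval_{prefix}.json"
    out_path.write_text(json.dumps(rows, indent=2))
    return to_markdown(rows)
```

Calling `print(aggregate("eval_results/", "all"))` would print the table to stdout for redirection into `$GITHUB_STEP_SUMMARY` and leave `agg_eval_all.json` in the working directory for the upload step.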

.github/workflows/collect-results.yml

Lines changed: 3 additions & 1 deletion
@@ -34,7 +34,9 @@ jobs:
           python3 utils/summarize.py results/ >> $GITHUB_STEP_SUMMARY
 
       - name: Aggregate results
-        run: python3 utils/collect_results.py results/ ${{ inputs.result-prefix || 'all' }}
+        run: |
+          pip install tabulate
+          python3 utils/collect_results.py results/ ${{ inputs.result-prefix || 'all' }}
 
       - name: Upload aggregated results
         uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0

.github/workflows/e2e-tests.yml

Lines changed: 10 additions & 1 deletion
@@ -122,16 +122,25 @@ jobs:
       conc: ${{ matrix.config.conc }}
       spec-decoding: ${{ matrix.config.spec-decoding }}
       disagg: ${{ matrix.config.disagg }}
+      run-eval: ${{ matrix.config.run-eval }}
       ref: ${{ inputs.ref }}
 
   collect-results:
     needs: [test-sweep-multi-node, test-sweep-single-node]
     if: ${{ always() }}
     uses: ./.github/workflows/collect-results.yml
     secrets: inherit
+    with:
+      result-prefix: "bmk"
+
+  collect-evals:
+    needs: [test-sweep-multi-node, test-sweep-single-node]
+    if: ${{ always() }}
+    uses: ./.github/workflows/collect-evals.yml
+    secrets: inherit
 
   calc-success-rate:
-    needs: collect-results
+    needs: [collect-results, collect-evals]
     if: ${{ always() }}
     runs-on: ubuntu-latest
 

.github/workflows/run-sweep.yml

Lines changed: 16 additions & 0 deletions
@@ -142,6 +142,7 @@ jobs:
       conc: ${{ matrix.config.conc }}
       spec-decoding: ${{ matrix.config.spec-decoding }}
       disagg: ${{ matrix.config.disagg }}
+      run-eval: ${{ matrix.config.run-eval }}
 
   sweep-single-node-1k8k:
     needs: setup
@@ -184,6 +185,21 @@ jobs:
     with:
       result-prefix: "bmk"
 
+  collect-evals:
+    needs:
+      [
+        sweep-single-node-1k1k,
+        sweep-single-node-1k8k,
+        sweep-single-node-8k1k,
+        sweep-multi-node-1k1k,
+        sweep-multi-node-1k8k,
+        sweep-multi-node-8k1k,
+        setup,
+      ]
+    if: ${{ always() && needs.setup.result != 'skipped' }}
+    uses: ./.github/workflows/collect-evals.yml
+    secrets: inherit
+
   upload-changelog-metadata:
     needs: [setup, collect-results]
     if: ${{ always() && needs.setup.result != 'skipped' }}
