Commit 72a4c8c

cquil11 and claude committed
test: DSv4-Pro B300 - retry-only subset (7 jobs)
Run 25831988077 hit a transient cudaErrorDevicesUnavailable at vLLM worker startup on 7 of 36 configs (low-conc tp=8 jobs that landed on runners while a previous job was still releasing GPUs). The other 29 configs verified clean. This commit narrows the sweep to just those 7 for a quick retry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent a6821dc commit 72a4c8c

1 file changed

Lines changed: 5 additions & 4 deletions

.github/configs/nvidia-master.yaml

@@ -2844,10 +2844,11 @@ dsv4-fp4-b300-vllm:
   agentic-coding:
   - duration: 1800
     search-space:
-    - { tp: 4, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 48, 64, 70] }
-    - { tp: 8, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 48, 64, 70] }
-    - { tp: 4, ep: 4, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 48, 64, 70] }
-    - { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 48, 64, 70] }
+    # Retry-only subset: re-running the 7 jobs that hit a transient
+    # cudaErrorDevicesUnavailable at vLLM worker startup in run 25831988077.
+    # The other 29 configs verified clean from that run.
+    - { tp: 8, offloading: none, conc-list: [1, 2, 8, 16] }
+    - { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 4, 8] }

 dsv4-fp4-b300-trt:
   image: ghcr.io#semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-9aa3715
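
The counts in the commit message (7 of 36 retried, 29 clean) can be checked against the search-space entries in the diff. The following is a minimal sketch, not the repository's actual sweep generator: it assumes each (search-space entry, concurrency) pair expands to exactly one job, which the commit does not show. The helper name `expand_jobs` is hypothetical.

```python
# Sketch only: assumes one job per (search-space entry, concurrency) pair.
# `expand_jobs` is a hypothetical helper, not part of the repository.

CONC_FULL = [1, 2, 4, 8, 16, 32, 48, 64, 70]

# The four entries removed by this commit (the full sweep).
full_sweep = [
    {"tp": 4, "offloading": "none", "conc-list": CONC_FULL},
    {"tp": 8, "offloading": "none", "conc-list": CONC_FULL},
    {"tp": 4, "ep": 4, "dp-attn": True, "offloading": "none", "conc-list": CONC_FULL},
    {"tp": 8, "ep": 8, "dp-attn": True, "offloading": "none", "conc-list": CONC_FULL},
]

# The two entries added by this commit (the retry-only subset).
retry_subset = [
    {"tp": 8, "offloading": "none", "conc-list": [1, 2, 8, 16]},
    {"tp": 8, "ep": 8, "dp-attn": True, "offloading": "none", "conc-list": [1, 4, 8]},
]


def expand_jobs(search_space):
    """Return the set of (parallelism params, concurrency) jobs for a search space."""
    jobs = set()
    for entry in search_space:
        # Everything except conc-list identifies the parallelism config.
        params = tuple(sorted((k, v) for k, v in entry.items() if k != "conc-list"))
        for conc in entry["conc-list"]:
            jobs.add((params, conc))
    return jobs


full_jobs = expand_jobs(full_sweep)
retry_jobs = expand_jobs(retry_subset)

print(len(full_jobs))               # total configs in the original sweep
print(len(retry_jobs))              # configs being retried
print(retry_jobs <= full_jobs)      # every retried job was in the original sweep
print(len(full_jobs - retry_jobs))  # configs not retried
```

Under that one-job-per-pair assumption, the numbers line up with the commit message: 4 entries × 9 concurrencies = 36, the retry subset covers 4 + 3 = 7, and 29 remain.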
