Commit eb2fa8e

fix(launcher): recurse subdirectories when injecting monitoring: into recipes
The previous glob `$SRT_RECIPE_DST/*.yaml` only matched top-level YAMLs, but recipes live under workload subdirectories (e.g. 8k1k/*.yaml). The loop iterated zero times, no recipe got the monitoring: block, perfmon never spawned, no perf_samples_*.csv were written, aggregate_power silently skipped patching the agg JSON, and the dashboard had no power data.

Sweep #26548110246 burned hours of GB300 time and shipped "success" with zero power keys in every agg artifact: exactly the silent-failure chain we should have caught earlier.

Fix: recurse via `find -type f -name '*.yaml'`. Add a loud WARNING when zero recipes get the injection so future regressions surface immediately instead of waiting for missing dashboard data to be noticed.
1 parent a9339df commit eb2fa8e
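
The failure mode described in the commit message can be reproduced in isolation. The sketch below is illustrative only: `demo_dst` is a hypothetical temporary directory standing in for `$SRT_RECIPE_DST`, and the `8k1k` subdirectory mirrors the example layout named above. It assumes bash with default glob settings (no `nullglob`, no `globstar`).

```shell
#!/usr/bin/env bash
# Illustration only: demo_dst is a throwaway stand-in for $SRT_RECIPE_DST.
set -euo pipefail

demo_dst="$(mktemp -d)"
mkdir -p "$demo_dst/8k1k"
: > "$demo_dst/8k1k/a.yaml"
: > "$demo_dst/8k1k/b.yaml"

# Flat glob: there are no top-level YAMLs, so the pattern stays literal,
# the `[ -f ... ] || continue` guard skips it, and the body never runs.
flat_count=0
for recipe in "$demo_dst"/*.yaml; do
    [ -f "$recipe" ] || continue
    flat_count=$((flat_count + 1))
done

# Recursive find: descends into the workload subdirectory.
find_count=0
while IFS= read -r recipe; do
    find_count=$((find_count + 1))
done < <(find "$demo_dst" -type f -name '*.yaml')

echo "flat glob matched: $flat_count"
echo "find matched: $find_count"
rm -rf "$demo_dst"
```

With this layout the flat glob counts 0 files and `find` counts 2, which is why the old loop exited cleanly without touching any recipe.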

1 file changed

Lines changed: 14 additions & 3 deletions

runners/launch_gb300-cw.sh

@@ -116,13 +116,24 @@ cp -rT "$SRT_RECIPE_SRC" "$SRT_RECIPE_DST"
 # orchestrator's _start_perf_monitor short-circuits and no perf_samples_*.csv
 # are ever written — multinode measured-power aggregation would silently
 # skip. Idempotent: skips recipes that already declare `monitoring:`.
-for recipe in "$SRT_RECIPE_DST"/*.yaml; do
-    [ -f "$recipe" ] || continue
+#
+# CRITICAL: use `find` recursively, not a flat `*.yaml` glob. Recipes live
+# in $SRT_RECIPE_DST/<workload>/*.yaml (e.g. .../8k1k/*.yaml) — a flat glob
+# matches zero files, the loop runs zero times, no recipe gets monitoring,
+# and perfmon never spawns. PR #1574's first real sweep (#26548110246) hit
+# exactly this: completed "success" with no power data because the glob
+# matched nothing and the failure was silent end-to-end.
+INJECTED_COUNT=0
+while IFS= read -r recipe; do
     if ! grep -q '^monitoring:' "$recipe"; then
         printf '\nmonitoring:\n enabled: true\n sample_interval: 1.0\n' >> "$recipe"
         echo "[perfmon] enabled monitoring in recipe: $recipe"
+        INJECTED_COUNT=$((INJECTED_COUNT + 1))
     fi
-done
+done < <(find "$SRT_RECIPE_DST" -type f -name '*.yaml')
+if [ "$INJECTED_COUNT" -eq 0 ]; then
+    echo "[perfmon] WARNING: zero recipes received monitoring injection under $SRT_RECIPE_DST. Either every recipe already had it, or the directory layout changed — power data will be MISSING from this run." >&2
+fi
 
 echo "Installing srtctl..."
 # CRITICAL — uv install location.
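
The new warning covers two different situations with one message: no recipes were found at all, or every recipe already declared `monitoring:`. One possible refinement, sketched here and not part of this commit, is to count discovered recipes separately from injected ones so that only the first case is treated as an error. `inject_monitoring`, `demo_dst`, and `empty_dst` are hypothetical names introduced for illustration. The `-print0` / `read -d ''` pairing is an additional assumption that tolerates unusual file names.

```shell
#!/usr/bin/env bash
# Sketch only: inject_monitoring is a hypothetical helper, not the launcher's code.
set -euo pipefail

inject_monitoring() {
    local dst="$1" found=0 injected=0 recipe
    # NUL-delimited paths survive spaces and newlines in file names.
    while IFS= read -r -d '' recipe; do
        found=$((found + 1))
        # Idempotent: only append when no top-level `monitoring:` key exists.
        if ! grep -q '^monitoring:' "$recipe"; then
            printf '\nmonitoring:\n enabled: true\n sample_interval: 1.0\n' >> "$recipe"
            injected=$((injected + 1))
        fi
    done < <(find "$dst" -type f -name '*.yaml' -print0)

    # Zero recipes found means the layout is wrong; fail instead of warning.
    if [ "$found" -eq 0 ]; then
        echo "[perfmon] WARNING: no *.yaml recipes found under $dst" >&2
        return 1
    fi
    echo "[perfmon] found=$found injected=$injected already_enabled=$((found - injected))"
}

# Demo on a throwaway tree: one recipe lacks monitoring, one already has it.
demo_dst="$(mktemp -d)"
mkdir -p "$demo_dst/8k1k"
printf 'name: a\n' > "$demo_dst/8k1k/a.yaml"
printf 'name: b\nmonitoring:\n enabled: true\n' > "$demo_dst/8k1k/b.yaml"

first_run="$(inject_monitoring "$demo_dst")"
second_run="$(inject_monitoring "$demo_dst")"
echo "$first_run"
echo "$second_run"

# An empty directory should be reported as a failure.
empty_dst="$(mktemp -d)"
if inject_monitoring "$empty_dst" 2>/dev/null; then empty_status=0; else empty_status=1; fi
echo "empty directory status: $empty_status"

rm -rf "$demo_dst" "$empty_dst"
```

In this demo the first pass injects into one recipe, the second pass injects into none while still finding both, and the empty directory returns a non-zero status. Whether a rerun with everything already enabled should stay quiet or still warn is a judgment call for the launcher's maintainers.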
