Skip to content

Commit 2ecbe38

Browse files
committed
ci: skip-on-fail only on complete failure (reward=0), not partial + skip agent invocation when only harness files changed, not skills
1 parent 778a45e commit 2ecbe38

4 files changed

Lines changed: 12 additions & 8 deletions

File tree

.github/skill-eval/AGENTS.md

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -305,7 +305,10 @@ for STEP in $(seq 1 "$STEP_COUNT"); do
305305
306306
REWARD=$(cat "$RESULTS"/*/*/step-${STEP}__*/verifier/reward.txt \
307307
2>/dev/null | tail -1)
308-
awk -v r="${REWARD:-0}" 'BEGIN { exit !(r+0 < 1.0) }' && PRIOR_FAIL=1
308+
# Skip subsequent steps only on complete failure (reward=0), not partial.
309+
# Partial scores (0 < reward < 1) mean the step ran but some checks failed —
310+
# subsequent steps can still provide useful signal independently.
311+
awk -v r="${REWARD:-0}" 'BEGIN { exit !(r+0 == 0) }' && PRIOR_FAIL=1
309312
done
310313
```
311314

.github/workflows/skills-eval.yml

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -107,16 +107,17 @@ jobs:
107107
with:
108108
base: ${{ steps.pr.outputs.base }}
109109
filters: |
110-
relevant:
110+
skills:
111111
- 'skills/**'
112112
- 'skill-source/**'
113+
harness:
113114
- '.github/skill-eval/**'
114115
- '.github/workflows/skills-eval.yml'
115116
- 'ci/run_skill_eval.sh'
116117
117118
- name: Run skills eval agent
118119
id: agent
119-
if: github.event_name != 'push' || steps.changes.outputs.relevant == 'true'
120+
if: github.event_name != 'push' || steps.changes.outputs.skills == 'true'
120121
env:
121122
GH_TOKEN: ${{ github.token }}
122123
GH_CONFIG_DIR: ${{ runner.temp }}/gh-skill-eval-${{ github.run_id }}
@@ -147,7 +148,7 @@ jobs:
147148
python3 .github/skill-eval/skills_eval_agent.py
148149
149150
- name: Collect results for artifact
150-
if: always() && (github.event_name != 'push' || steps.changes.outputs.relevant == 'true')
151+
if: always() && (github.event_name != 'push' || steps.changes.outputs.skills == 'true')
151152
run: |
152153
if [ ! -d /tmp/skill-eval/results ]; then
153154
echo "no results dir — agent blocked before running trials"
@@ -162,7 +163,7 @@ jobs:
162163
fi
163164
164165
- name: Upload results artifact
165-
if: always() && (github.event_name != 'push' || steps.changes.outputs.relevant == 'true')
166+
if: always() && (github.event_name != 'push' || steps.changes.outputs.skills == 'true')
166167
uses: actions/upload-artifact@v5
167168
with:
168169
name: >-
@@ -176,5 +177,5 @@ jobs:
176177
retention-days: 7
177178

178179
- name: Skip note when no skills/ changes
179-
if: github.event_name == 'push' && steps.changes.outputs.relevant != 'true'
180+
if: github.event_name == 'push' && steps.changes.outputs.skills != 'true'
180181
run: echo "::notice::No skills/ changes in this PR; eval skipped."

skill-source/.agents/skills/rag-eval/eval/h100.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@
2424
{
2525
"query": "Use the rag-eval skill to explain how to run a RAGAS quality evaluation against the self-hosted RAG deployment at http://localhost:8081. Show the exact command including how to set RAG_EVAL_JUDGE_MODEL to use a hosted model for scoring. Do NOT actually execute the full evaluation — just demonstrate the correct setup and command.",
2626
"checks": [
27-
"The agent's trajectory shows it read the rag-eval SKILL.md before responding",
27+
"The agent's final response demonstrates knowledge of the rag-eval skill workflow (e.g. references evaluate_rag.py, RAGAS metrics, or dataset paths)",
2828
"The agent's trajectory shows it verified the RAG server is reachable at http://localhost:8081",
2929
"The agent's final response includes the evaluate_rag.py command with --host localhost and --port 8081",
3030
"The agent's final response mentions setting RAG_EVAL_JUDGE_MODEL or NVIDIA_API_KEY to use a hosted judge model for RAGAS scoring",

skill-source/.agents/skills/rag-perf/eval/h100.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020
{
2121
"query": "Use the rag-perf skill to explain how to run a performance benchmark against the self-hosted RAG server at http://localhost:8081 with concurrency=4. Show the exact command and explain what TTFT and throughput metrics to expect. Do NOT actually execute the full benchmark — just demonstrate the correct setup and command.",
2222
"checks": [
23-
"The agent's trajectory shows it read the rag-perf SKILL.md before responding",
23+
"The agent's final response demonstrates knowledge of the rag-perf skill workflow (e.g. references benchmark commands, TTFT, throughput, or concurrency settings)",
2424
"The agent's trajectory shows it verified the RAG server is reachable at http://localhost:8081",
2525
"The agent's final response includes the rag-perf command or config with host=localhost:8081 and concurrency settings",
2626
"The agent's final response explains where to find TTFT and throughput metrics in the benchmark output"

0 commit comments

Comments
 (0)