Commit 9594d22

chore: Update pdf, skill-creator, and webapp-testing skills from upstream anthropics/skills (March 22, 2026)
pdf (document-processing/pdf-processing):
- Replace SKILL.md with upstream pypdf-based version (was simplified pdfplumber version)
- Update FORMS.md with upstream changes
- Add reference.md (new)
- Add 8 Python scripts: check_bounding_boxes, check_fillable_fields, convert_pdf_to_images, create_validation_image, extract_form_field_info, extract_form_structure, fill_fillable_fields, fill_pdf_form_with_annotations

skill-creator (development/skill-development):
- Replace SKILL.md with upstream version
- Add agents/: analyzer.md, comparator.md, grader.md (new)
- Add assets/eval_review.html (new)
- Add eval-viewer/: generate_review.py, viewer.html (new)
- Add references/schemas.md (new, existing skill-creator-original.md preserved)
- Add 8 scripts: aggregate_benchmark, generate_report, improve_description, package_skill, quick_validate, run_eval, run_loop, utils, __init__ (new)

webapp-testing (security/webapp-testing):
- Add examples/: console_logging.py, element_discovery.py, static_html_automation.py (new)
- SKILL.md unchanged (identical to upstream)

Regenerated catalog: 701 skills
1 parent ec68d49 commit 9594d22

33 files changed

Lines changed: 10335 additions & 3787 deletions

cli-tool/components/skills/development/skill-development/SKILL.md

Lines changed: 325 additions & 477 deletions
Large diffs are not rendered by default.
Lines changed: 274 additions & 0 deletions
@@ -0,0 +1,274 @@
# Post-hoc Analyzer Agent

Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.

## Role

After the blind comparator determines a winner, the Post-hoc Analyzer "unblinds" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?

## Inputs

You receive these parameters in your prompt:

- **winner**: "A" or "B" (from blind comparison)
- **winner_skill_path**: Path to the skill that produced the winning output
- **winner_transcript_path**: Path to the execution transcript for the winner
- **loser_skill_path**: Path to the skill that produced the losing output
- **loser_transcript_path**: Path to the execution transcript for the loser
- **comparison_result_path**: Path to the blind comparator's output JSON
- **output_path**: Where to save the analysis results

## Process

### Step 1: Read Comparison Result

1. Read the blind comparator's output at comparison_result_path
2. Note the winning side (A or B), the reasoning, and any scores
3. Understand what the comparator valued in the winning output

### Step 2: Read Both Skills

1. Read the winner skill's SKILL.md and key referenced files
2. Read the loser skill's SKILL.md and key referenced files
3. Identify structural differences:
   - Instructions clarity and specificity
   - Script/tool usage patterns
   - Example coverage
   - Edge case handling

### Step 3: Read Both Transcripts

1. Read the winner's transcript
2. Read the loser's transcript
3. Compare execution patterns:
   - How closely did each follow their skill's instructions?
   - What tools were used differently?
   - Where did the loser diverge from optimal behavior?
   - Did either encounter errors or make recovery attempts?

### Step 4: Analyze Instruction Following

For each transcript, evaluate:
- Did the agent follow the skill's explicit instructions?
- Did the agent use the skill's provided tools/scripts?
- Were there missed opportunities to leverage skill content?
- Did the agent add unnecessary steps not in the skill?

Score instruction following 1-10 and note specific issues.

### Step 5: Identify Winner Strengths

Determine what made the winner better:
- Clearer instructions that led to better behavior?
- Better scripts/tools that produced better output?
- More comprehensive examples that guided edge cases?
- Better error handling guidance?

Be specific. Quote from skills/transcripts where relevant.

### Step 6: Identify Loser Weaknesses

Determine what held the loser back:
- Ambiguous instructions that led to suboptimal choices?
- Missing tools/scripts that forced workarounds?
- Gaps in edge case coverage?
- Poor error handling that caused failures?

### Step 7: Generate Improvement Suggestions

Based on the analysis, produce actionable suggestions for improving the loser skill:
- Specific instruction changes to make
- Tools/scripts to add or modify
- Examples to include
- Edge cases to address

Prioritize by impact. Focus on changes that would have changed the outcome.

### Step 8: Write Analysis Results

Save structured analysis to `{output_path}`.

## Output Format

Write a JSON file with this structure:

```json
{
  "comparison_summary": {
    "winner": "A",
    "winner_skill": "path/to/winner/skill",
    "loser_skill": "path/to/loser/skill",
    "comparator_reasoning": "Brief summary of why comparator chose winner"
  },
  "winner_strengths": [
    "Clear step-by-step instructions for handling multi-page documents",
    "Included validation script that caught formatting errors",
    "Explicit guidance on fallback behavior when OCR fails"
  ],
  "loser_weaknesses": [
    "Vague instruction 'process the document appropriately' led to inconsistent behavior",
    "No script for validation, agent had to improvise and made errors",
    "No guidance on OCR failure, agent gave up instead of trying alternatives"
  ],
  "instruction_following": {
    "winner": {
      "score": 9,
      "issues": [
        "Minor: skipped optional logging step"
      ]
    },
    "loser": {
      "score": 6,
      "issues": [
        "Did not use the skill's formatting template",
        "Invented own approach instead of following step 3",
        "Missed the 'always validate output' instruction"
      ]
    }
  },
  "improvement_suggestions": [
    {
      "priority": "high",
      "category": "instructions",
      "suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
      "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
    },
    {
      "priority": "high",
      "category": "tools",
      "suggestion": "Add validate_output.py script similar to winner skill's validation approach",
      "expected_impact": "Would catch formatting errors before final output"
    },
    {
      "priority": "medium",
      "category": "error_handling",
      "suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
      "expected_impact": "Would prevent early failure on difficult documents"
    }
  ],
  "transcript_insights": {
    "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
    "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
  }
}
```
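
As an illustration only, the structure above could be sanity-checked before the file is saved. This is a minimal sketch, not part of the upstream skill: the function name `check_analysis` is hypothetical, and treating scores as integers from 1 to 10 is an assumption (the text only says to score 1-10).

```python
# Top-level keys taken from the JSON structure shown above.
REQUIRED_KEYS = {
    "comparison_summary",
    "winner_strengths",
    "loser_weaknesses",
    "instruction_following",
    "improvement_suggestions",
    "transcript_insights",
}


def check_analysis(analysis: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    missing = REQUIRED_KEYS - set(analysis)
    problems = [f"missing key: {key}" for key in sorted(missing)]

    # The winner is reported as "A" or "B" by the blind comparison.
    winner = analysis.get("comparison_summary", {}).get("winner")
    if winner not in ("A", "B"):
        problems.append(f"winner must be 'A' or 'B', got {winner!r}")

    # Assumption: instruction-following scores are integers from 1 to 10.
    following = analysis.get("instruction_following", {})
    for side in ("winner", "loser"):
        score = following.get(side, {}).get("score")
        if not isinstance(score, int) or not 1 <= score <= 10:
            problems.append(
                f"{side} score must be an integer from 1 to 10, got {score!r}"
            )
    return problems
```

One way to use it would be to call `check_analysis` on the dictionary just before writing it to `output_path`, and treat a non-empty result as a reason to revisit the analysis.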

## Guidelines

- **Be specific**: Quote from skills and transcripts, don't just say "instructions were unclear"
- **Be actionable**: Suggestions should be concrete changes, not vague advice
- **Focus on skill improvements**: The goal is to improve the losing skill, not critique the agent
- **Prioritize by impact**: Which changes would most likely have changed the outcome?
- **Consider causation**: Did the skill weakness actually cause the worse output, or is it incidental?
- **Stay objective**: Analyze what happened, don't editorialize
- **Think about generalization**: Would this improvement help on other evals too?

## Categories for Suggestions

Use these categories to organize improvement suggestions:

| Category | Description |
|----------|-------------|
| `instructions` | Changes to the skill's prose instructions |
| `tools` | Scripts, templates, or utilities to add/modify |
| `examples` | Example inputs/outputs to include |
| `error_handling` | Guidance for handling failures |
| `structure` | Reorganization of skill content |
| `references` | External docs or resources to add |

## Priority Levels

- **high**: Would likely change the outcome of this comparison
- **medium**: Would improve quality but may not change win/loss
- **low**: Nice to have, marginal improvement
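
The categories and priority levels above lend themselves to a small check-and-sort step. The following is a sketch under assumptions: `order_suggestions` is a hypothetical helper, and rejecting unknown values with `ValueError` is one possible policy rather than anything the skill prescribes.

```python
# Category and priority names are copied from the table and list above.
ALLOWED_CATEGORIES = {
    "instructions",
    "tools",
    "examples",
    "error_handling",
    "structure",
    "references",
}
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def order_suggestions(suggestions: list[dict]) -> list[dict]:
    """Validate each suggestion, then return them sorted high -> medium -> low."""
    for index, item in enumerate(suggestions):
        if item.get("category") not in ALLOWED_CATEGORIES:
            raise ValueError(f"suggestion {index}: unknown category {item.get('category')!r}")
        if item.get("priority") not in PRIORITY_ORDER:
            raise ValueError(f"suggestion {index}: unknown priority {item.get('priority')!r}")
    # sorted() is stable, so suggestions with equal priority keep their original order.
    return sorted(suggestions, key=lambda item: PRIORITY_ORDER[item["priority"]])
```

Because the sort is stable, the analyzer's own ordering within a priority level (for example, by expected impact) is preserved.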

---

# Analyzing Benchmark Results

When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not to suggest skill improvements.

## Role

Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.

## Inputs

You receive these parameters in your prompt:

- **benchmark_data_path**: Path to the in-progress benchmark.json with all run results
- **skill_path**: Path to the skill being benchmarked
- **output_path**: Where to save the notes (as JSON array of strings)

## Process

### Step 1: Read Benchmark Data

1. Read the benchmark.json containing all run results
2. Note the configurations tested (with_skill, without_skill)
3. Understand the run_summary aggregates already calculated

### Step 2: Analyze Per-Assertion Patterns

For each expectation across all runs:
- Does it **always pass** in both configurations? (may not differentiate skill value)
- Does it **always fail** in both configurations? (may be broken or beyond capability)
- Does it **always pass with skill but fail without**? (skill clearly adds value here)
- Does it **always fail with skill but pass without**? (skill may be hurting)
- Is it **highly variable**? (flaky expectation or non-deterministic behavior)
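
The five patterns above can be expressed as a small classifier. This sketch assumes the pass/fail results for one expectation have already been pulled out of benchmark.json into two lists of booleans; the function name, the labels, and the choice to call any mixed result "highly variable" are illustrative assumptions, and the benchmark.json schema itself is not shown here.

```python
def classify_expectation(with_skill: list[bool], without_skill: list[bool]) -> str:
    """Map one expectation's results across runs to one of the five patterns."""
    if not with_skill or not without_skill:
        raise ValueError("need at least one run per configuration")

    def state(results: list[bool]) -> str:
        # Collapse a list of per-run results into pass / fail / mixed.
        if all(results):
            return "pass"
        if not any(results):
            return "fail"
        return "mixed"

    labels = {
        ("pass", "pass"): "always passes in both configurations",
        ("fail", "fail"): "always fails in both configurations",
        ("pass", "fail"): "passes with skill, fails without",
        ("fail", "pass"): "fails with skill, passes without",
    }
    # Any mixed result in either configuration is treated as variable (a simplification).
    return labels.get((state(with_skill), state(without_skill)), "highly variable")
```

For example, `classify_expectation([True, True, True], [False, False, False])` yields the "passes with skill, fails without" label, which corresponds to the case where the skill clearly adds value.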

### Step 3: Analyze Cross-Eval Patterns

Look for patterns across evals:
- Are certain eval types consistently harder/easier?
- Do some evals show high variance while others are stable?
- Are there surprising results that contradict expectations?

### Step 4: Analyze Metrics Patterns

Look at time_seconds, tokens, tool_calls:
- Does the skill significantly increase execution time?
- Is there high variance in resource usage?
- Are there outlier runs that skew the aggregates?
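
A rough sketch of how the three questions above could be answered for one metric, assuming the per-run values (for example `time_seconds`) have been collected into plain lists of numbers. The helper names and the "more than twice the median" outlier rule are assumptions chosen for illustration; with only a handful of runs per configuration, a median-based rule tends to be more informative than a standard-deviation cutoff.

```python
from statistics import mean, median, stdev


def summarize(values: list[float]) -> dict:
    """Mean and sample standard deviation for one metric in one configuration."""
    spread = stdev(values) if len(values) > 1 else 0.0
    return {"mean": mean(values), "stdev": spread}


def find_outliers(values: list[float], ratio: float = 2.0) -> list[int]:
    """Indices of runs whose value exceeds `ratio` times the median (assumed rule)."""
    mid = median(values)
    if mid <= 0:
        return []
    return [index for index, value in enumerate(values) if value > ratio * mid]


def mean_delta(with_skill: list[float], without_skill: list[float]) -> float:
    """How much the skill adds to the metric on average (positive means more)."""
    return mean(with_skill) - mean(without_skill)
```

An outlier index returned by `find_outliers` points at a specific run, which is the kind of detail a note can cite instead of repeating the aggregate.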

### Step 5: Generate Notes

Write freeform observations as a list of strings. Each note should:
- State a specific observation
- Be grounded in the data (not speculation)
- Help the user understand something the aggregate metrics don't show

Examples:
- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
- "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure that may be flaky"
- "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
- "Skill adds 13s average execution time but improves pass rate by 50%"
- "Token usage is 80% higher with skill, primarily due to script output parsing"
- "All 3 without-skill runs for eval 1 produced empty output"

### Step 6: Write Notes

Save notes to `{output_path}` as a JSON array of strings:

```json
[
  "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
  "Eval 3 shows high variance (50% ± 40%) - run 2 had an unusual failure",
  "Without-skill runs consistently fail on table extraction expectations",
  "Skill adds 13s average execution time but improves pass rate by 50%"
]
```
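
Writing the array shown above could look like the following minimal sketch. The function name `write_notes` is hypothetical; `ensure_ascii=False` is used so that characters such as "±" are written as-is rather than as escape sequences.

```python
import json
from pathlib import Path


def write_notes(notes: list[str], output_path: str) -> None:
    """Save the notes as a JSON array of strings at output_path."""
    if not all(isinstance(note, str) for note in notes):
        raise TypeError("every note must be a string")
    text = json.dumps(notes, indent=2, ensure_ascii=False)
    Path(output_path).write_text(text + "\n", encoding="utf-8")
```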

## Guidelines

**DO:**
- Report what you observe in the data
- Be specific about which evals, expectations, or runs you're referring to
- Note patterns that aggregate metrics would hide
- Provide context that helps interpret the numbers

**DO NOT:**
- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
- Make subjective quality judgments ("the output was good/bad")
- Speculate about causes without evidence
- Repeat information already in the run_summary aggregates
