---
layout: default
title: "Chapter 2: Skill Categories"
nav_order: 2
parent: Anthropic Skills Tutorial
---
Welcome to Chapter 2: Skill Categories. In this part of Anthropic Skills Tutorial: Reusable AI Agent Capabilities, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Category design controls maintainability. Categories that are too broad bundle conflicting responsibilities, and the resulting skills become brittle and hard to trust.
| Category | Typical Inputs | Typical Outputs | Typical Risk |
|---|---|---|---|
| Document Workflows | Notes, policy docs, datasets | Structured docs/slides/sheets | Formatting drift |
| Creative and Brand | Briefs, tone rules, examples | On-brand copy or concepts | Brand inconsistency |
| Engineering and Ops | Codebase context, tickets, logs | Patches, runbooks, plans | Incorrect assumptions |
| Enterprise Process | Internal standards and controls | Audit artifacts, compliance actions | Governance gaps |
Use one outcome per skill. If two outcomes have different acceptance criteria, split the skill.
Good split:

- `incident-triage`
- `postmortem-draft`
- `stakeholder-update`

Bad split:

- `incident-everything`
A single giant skill creates unclear prompts, conflicting priorities, and harder testing.
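The one-outcome rule can be sketched as data. The schema and field names below are illustrative only, not a real skill format; the point is that each skill declares exactly one outcome with its own acceptance criteria.

```python
# Illustrative only: a minimal skill registry. Field names are hypothetical.
SKILLS = {
    "incident-triage": {
        "outcome": "triage decision recorded",
        "acceptance": ["severity in P1..P4", "owner is on the on-call roster"],
    },
    "postmortem-draft": {
        "outcome": "draft postmortem document",
        "acceptance": ["timeline section present", "action items have owners"],
    },
    "stakeholder-update": {
        "outcome": "status message for stakeholders",
        "acceptance": ["under 200 words", "no internal-only details"],
    },
}

def validate_registry(skills: dict) -> list[str]:
    """Flag skills whose outcome looks like several bundled results,
    a sign the skill should be split."""
    problems = []
    for name, spec in skills.items():
        if ";" in spec["outcome"] or " and " in spec["outcome"]:
            problems.append(f"{name}: outcome may bundle multiple results")
    return problems
```

Running `validate_registry` over a merged `incident-everything` skill would flag it, while the three focused skills pass cleanly.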
| Question | If "Yes" | If "No" |
|---|---|---|
| Is the output contract identical across requests? | Keep in same skill | Split into separate skills |
| Do tasks share the same references and policies? | Keep shared references | Isolate by domain |
| Can one test suite verify quality for all use cases? | Keep grouped | Split for clearer quality gates |
| Are escalation paths identical? | Keep grouped | Split by risk/approval path |
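The decision table above can be encoded as a small checklist function. This is a sketch, not a real API; any "No" answer becomes an explicit reason to split.

```python
def should_split(*, same_output_contract: bool, shared_references: bool,
                 one_test_suite: bool, same_escalation_path: bool) -> list[str]:
    """Sketch of the split-decision table: returns the reasons to split,
    or an empty list if the skills can stay grouped."""
    reasons = []
    if not same_output_contract:
        reasons.append("output contracts differ: split into separate skills")
    if not shared_references:
        reasons.append("references/policies differ: isolate by domain")
    if not one_test_suite:
        reasons.append("quality gates differ: split for clearer testing")
    if not same_escalation_path:
        reasons.append("escalation paths differ: split by risk/approval path")
    return reasons
```

An empty return value means all four questions came back "Yes" and the grouping is safe to keep.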
Per-category priorities:

- Document skills: prioritize template fidelity and deterministic section ordering.
- Creative skills: define what variation is allowed and what must stay fixed.
- Technical skills: enforce constraints on tools, files, and unsafe operations.
- Enterprise skills: include explicit policy references and audit fields.
Anti-patterns to avoid:

- Category names that describe team structure instead of behavior
- Mixing high-stakes and low-stakes actions in one skill
- Using skills as a substitute for missing source documentation
- Requiring hidden tribal knowledge to run the skill
You can now define category boundaries that keep skills focused, testable, and easier to operate.
Next: Chapter 3: Advanced Skill Design
Most teams struggle here because the hard part is not writing more code but drawing clear boundaries around this chapter's core abstractions so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 2: Skill Categories as an operating subsystem inside Anthropic Skills Tutorial: Reusable AI Agent Capabilities, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around execution and reliability details as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 2: Skill Categories usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for the core component.
- Input normalization: shape incoming data so the execution layer receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through the state model.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
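A minimal sketch of that control path as explicit, named stages. The `Stage` and `run_pipeline` names are hypothetical, not part of any upstream API; the point is that every stage reports an explicit success or failure, so a debugger can walk the sequence in order.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Stage:
    # Hypothetical: one step of the control path above.
    name: str
    run: Callable[[Any], Any]

def run_pipeline(stages: list[Stage], payload: Any) -> dict:
    """Walk stages in order; stop at the first failure and report which
    stage failed, so debugging can follow the sequence stage by stage."""
    for stage in stages:
        try:
            payload = stage.run(payload)
        except Exception as exc:
            return {"ok": False, "failed_stage": stage.name, "error": str(exc)}
    return {"ok": True, "result": payload}

stages = [
    Stage("context_bootstrap", lambda p: {**p, "config": "loaded"}),
    Stage("input_normalization", lambda p: {**p, "input": "normalized"}),
    Stage("core_execution", lambda p: {**p, "output": "done"}),
]
result = run_pipeline(stages, {})
```

A failing stage (say, a policy check that raises) surfaces its own name in `failed_stage` instead of an anonymous traceback.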
Use the following upstream sources to verify implementation details while reading this chapter:
- anthropics/skills repository (github.com): the authoritative reference for the skill patterns discussed in this chapter.

Suggested trace strategy:

- search upstream code for `Skill` and `Categories` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production
- Tutorial Index
- Previous Chapter: Chapter 1: Getting Started
- Next Chapter: Chapter 3: Advanced Skill Design
- Main Catalog
- A-Z Tutorial Directory
The main function in skills/skill-creator/scripts/run_eval.py handles a key part of this chapter's functionality: it reads subprocess output incrementally and parses newline-delimited JSON events.
```python
while time.time() - start_time < timeout:
    if process.poll() is not None:
        remaining = process.stdout.read()
        if remaining:
            buffer += remaining.decode("utf-8", errors="replace")
        break
    ready, _, _ = select.select([process.stdout], [], [], 1.0)
    if not ready:
        continue
    chunk = os.read(process.stdout.fileno(), 8192)
    if not chunk:
        break
    buffer += chunk.decode("utf-8", errors="replace")
    while "\n" in buffer:
        line, buffer = buffer.split("\n", 1)
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        # Early detection via stream events
        if event.get("type") == "stream_event":
            se = event.get("event", {})
            se_type = se.get("type", "")
```

This function is important because it defines how Anthropic Skills Tutorial: Reusable AI Agent Capabilities implements the patterns covered in this chapter.
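The buffering idiom in that loop can be exercised in isolation. The sketch below replays the same split-and-parse logic on a static string: complete lines become JSON events, malformed lines are skipped, and the trailing partial line is kept for the next read.

```python
import json

def parse_events(buffer: str) -> tuple[list, str]:
    """Replay of the line-splitting pattern above: parse complete
    newline-terminated JSON lines, keep the unfinished remainder."""
    events = []
    while "\n" in buffer:
        line, buffer = buffer.split("\n", 1)
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return events, buffer

events, rest = parse_events('{"type": "stream_event"}\nnot json\n{"type": "done"}\n{"partial')
# events holds the two valid JSON objects; rest == '{"partial'
```

Keeping the partial tail in `rest` is what makes the pattern safe against chunk boundaries landing mid-line.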
The calculate_stats function in skills/skill-creator/scripts/aggregate_benchmark.py handles a key part of this chapter's functionality:
```python
def calculate_stats(values: list[float]) -> dict:
    """Calculate mean, stddev, min, max for a list of values."""
    if not values:
        return {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0}
    n = len(values)
    mean = sum(values) / n
    if n > 1:
        variance = sum((x - mean) ** 2 for x in values) / (n - 1)
        stddev = math.sqrt(variance)
    else:
        stddev = 0.0
    return {
        "mean": round(mean, 4),
        "stddev": round(stddev, 4),
        "min": round(min(values), 4),
        "max": round(max(values), 4)
    }
```
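As a quick sanity check of `calculate_stats` (reproduced here so the example runs standalone): for three evenly spaced values the sample standard deviation is exactly 1.0.

```python
import math

def calculate_stats(values: list[float]) -> dict:
    # Same logic as the excerpt above, repeated so this example is self-contained.
    if not values:
        return {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0}
    n = len(values)
    mean = sum(values) / n
    stddev = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1)) if n > 1 else 0.0
    return {"mean": round(mean, 4), "stddev": round(stddev, 4),
            "min": round(min(values), 4), "max": round(max(values), 4)}

print(calculate_stats([1.0, 2.0, 3.0]))
# {'mean': 2.0, 'stddev': 1.0, 'min': 1.0, 'max': 3.0}
```

Note the `n - 1` divisor (sample variance), which is why a single-value list reports a stddev of 0.0 rather than dividing by zero.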
The load_run_results function in skills/skill-creator/scripts/aggregate_benchmark.py handles a key part of this chapter's functionality:
```python
def load_run_results(benchmark_dir: Path) -> dict:
    """
    Load all run results from a benchmark directory.
    Returns dict keyed by config name (e.g. "with_skill"/"without_skill",
    or "new_skill"/"old_skill"), each containing a list of run results.
    """
    # Support both layouts: eval dirs directly under benchmark_dir, or under runs/
    runs_dir = benchmark_dir / "runs"
    if runs_dir.exists():
        search_dir = runs_dir
    elif list(benchmark_dir.glob("eval-*")):
        search_dir = benchmark_dir
    else:
        print(f"No eval directories found in {benchmark_dir} or {benchmark_dir / 'runs'}")
        return {}
    results: dict[str, list] = {}
    for eval_idx, eval_dir in enumerate(sorted(search_dir.glob("eval-*"))):
        metadata_path = eval_dir / "eval_metadata.json"
        if metadata_path.exists():
            try:
                with open(metadata_path) as mf:
                    eval_id = json.load(mf).get("eval_id", eval_idx)
            except (json.JSONDecodeError, OSError):
                eval_id = eval_idx
        else:
            try:
                eval_id = int(eval_dir.name.split("-")[1])
```
The aggregate_results function in skills/skill-creator/scripts/aggregate_benchmark.py handles a key part of this chapter's functionality:
```python
def aggregate_results(results: dict) -> dict:
    """
    Aggregate run results into summary statistics.
    Returns run_summary with stats for each configuration and delta.
    """
    run_summary = {}
    configs = list(results.keys())
    for config in configs:
        runs = results.get(config, [])
        if not runs:
            run_summary[config] = {
                "pass_rate": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
                "time_seconds": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
                "tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0}
            }
            continue
        pass_rates = [r["pass_rate"] for r in runs]
        times = [r["time_seconds"] for r in runs]
        tokens = [r.get("tokens", 0) for r in runs]
        run_summary[config] = {
            "pass_rate": calculate_stats(pass_rates),
            "time_seconds": calculate_stats(times),
            "tokens": calculate_stats(tokens)
        }
```
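The docstring mentions a delta between configurations, which the excerpt does not show. A hypothetical `compute_delta` (not in the upstream script) over the mean values might look like:

```python
def compute_delta(run_summary: dict, baseline: str, candidate: str) -> dict:
    """Hypothetical helper: difference in mean pass rate and mean runtime
    between two configurations of a run_summary. Positive pass_rate_delta
    means the candidate config passed more often."""
    b, c = run_summary[baseline], run_summary[candidate]
    return {
        "pass_rate_delta": round(c["pass_rate"]["mean"] - b["pass_rate"]["mean"], 4),
        "time_seconds_delta": round(c["time_seconds"]["mean"] - b["time_seconds"]["mean"], 4),
    }
```

With a with_skill/without_skill summary, a positive pass-rate delta and negative time delta would indicate the skill both improves quality and speeds up runs.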
```mermaid
flowchart TD
    A[main]
    B[load_run_results]
    C[aggregate_results]
    D[calculate_stats]
    E[generate_benchmark]
    A --> B
    A --> C
    C --> D
    A --> E
```