---
layout: default
title: "Chapter 2: Skill Categories"
nav_order: 2
parent: Anthropic Skills Tutorial
---

# Chapter 2: Skill Categories

Welcome to Chapter 2: Skill Categories. In this part of *Anthropic Skills Tutorial: Reusable AI Agent Capabilities*, you will first build an intuitive mental model, then move into concrete implementation details and practical production tradeoffs.

Category design controls maintainability: if categories are too broad, individual skills accumulate conflicting responsibilities and become brittle and hard to trust.

## Four Practical Categories

| Category | Typical Inputs | Typical Outputs | Typical Risk |
|---|---|---|---|
| Document Workflows | Notes, policy docs, datasets | Structured docs/slides/sheets | Formatting drift |
| Creative and Brand | Briefs, tone rules, examples | On-brand copy or concepts | Brand inconsistency |
| Engineering and Ops | Codebase context, tickets, logs | Patches, runbooks, plans | Incorrect assumptions |
| Enterprise Process | Internal standards and controls | Audit artifacts, compliance actions | Governance gaps |

## How to Choose Category Boundaries

Use one outcome per skill. If two outcomes have different acceptance criteria, split the skill.

Good split:

- `incident-triage`
- `postmortem-draft`
- `stakeholder-update`

Bad split:

- `incident-everything`

A single giant skill creates unclear prompts, conflicting priorities, and harder testing.
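One way to make the one-outcome rule concrete is to keep each skill's acceptance criteria in its own registry entry. This is an illustrative sketch, not code from the repository; the skill names and fields are hypothetical.

```python
# Hypothetical skill registry: one outcome and one acceptance list per
# skill, so each skill can be prompted and tested independently.
SKILLS = {
    "incident-triage": {
        "outcome": "severity assignment and owner routing",
        "acceptance": ["severity set", "owning team assigned"],
    },
    "postmortem-draft": {
        "outcome": "structured postmortem document",
        "acceptance": ["timeline present", "action items have owners"],
    },
    "stakeholder-update": {
        "outcome": "status summary for non-engineering readers",
        "acceptance": ["no internal jargon", "next update time stated"],
    },
}

# An entry that needs several unrelated acceptance lists is the signal
# to split it, as a hypothetical "incident-everything" skill would.
for name, skill in SKILLS.items():
    print(name, "->", skill["outcome"])
```

If two of these entries started sharing acceptance criteria, that would be a cue to merge them back; the registry makes the boundary explicit either way.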

## Decision Matrix

| Question | If "Yes" | If "No" |
|---|---|---|
| Is the output contract identical across requests? | Keep in same skill | Split into separate skills |
| Do tasks share the same references and policies? | Keep shared references | Isolate by domain |
| Can one test suite verify quality for all use cases? | Keep grouped | Split for clearer quality gates |
| Are escalation paths identical? | Keep grouped | Split by risk/approval path |
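The matrix above reads as a checklist: any "No" answer is a reason to split. A minimal sketch of that rule (the function name and argument names are ours, for illustration):

```python
# Sketch of the decision matrix: each argument is the answer to one
# question in the table; any "no" answer suggests splitting the skill.
def should_split(same_output_contract: bool,
                 shared_references: bool,
                 one_test_suite: bool,
                 same_escalation_path: bool) -> bool:
    return not all([same_output_contract, shared_references,
                    one_test_suite, same_escalation_path])

# Different escalation paths -> split by risk/approval path.
print(should_split(True, True, True, False))  # True
```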

## Category-Specific Design Tips

- Document skills: prioritize template fidelity and deterministic section ordering.
- Creative skills: define what variation is allowed and what must stay fixed.
- Technical skills: enforce constraints on tools, files, and unsafe operations.
- Enterprise skills: include explicit policy references and audit fields.

## Anti-Patterns

- Category names that describe team structure instead of behavior
- Mixing high-stakes and low-stakes actions in one skill
- Using skills as a substitute for missing source documentation
- Requiring hidden tribal knowledge to run the skill

## Summary

You can now define category boundaries that keep skills focused, testable, and easier to operate.

Next: Chapter 3: Advanced Skill Design

## What Problem Does This Solve?

Most teams struggle here not because they need more code, but because drawing clear boundaries around core abstractions is what keeps behavior predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without a clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 2: Skill Categories as an operating subsystem inside Anthropic Skills Tutorial: Reusable AI Agent Capabilities, with explicit contracts for inputs, state transitions, and outputs.

Use the implementation notes on execution and reliability below as a checklist when adapting these patterns to your own repository.

## How it Works Under the Hood

Under the hood, Chapter 2: Skill Categories usually follows a repeatable control path:

1. Context bootstrap: initialize runtime config and prerequisites for the core component.
2. Input normalization: shape incoming data so the execution layer receives stable contracts.
3. Core execution: run the main logic branch and propagate intermediate state through the state model.
4. Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
5. Output composition: return canonical result payloads for downstream consumers.
6. Operational telemetry: emit the logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
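The six stages above can be sketched as a linear pipeline. This is an illustrative skeleton, not code from the repository; every stage body is a placeholder.

```python
# Hypothetical skeleton of the six-stage control path. Each stage body
# is a stand-in; real skills would plug domain logic into each slot.
def run_skill(raw_input, config):
    ctx = {"config": config, "log": []}              # 1. context bootstrap
    data = str(raw_input).strip()                    # 2. input normalization
    result = data.upper()                            # 3. core execution (placeholder)
    if len(result) > config.get("max_len", 100):     # 4. policy and safety checks
        raise ValueError("output exceeds policy limit")
    payload = {"result": result}                     # 5. output composition
    ctx["log"].append(f"ok len={len(result)}")       # 6. operational telemetry
    return payload, ctx["log"]

payload, log = run_skill("  hello  ", {"max_len": 10})
print(payload)  # {'result': 'HELLO'}
```

Because each stage has a single explicit success/failure condition, the debugging advice above maps directly onto the code: walk the stages top to bottom and check each one.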

## Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

Suggested trace strategy:

- search upstream code for `Skill` and `Categories` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production

## Chapter Connections

## Depth Expansion Playbook

## Source Code Walkthrough

### `skills/skill-creator/scripts/run_eval.py`

The `main` function in `skills/skill-creator/scripts/run_eval.py` handles a key part of this chapter's functionality:

```python
            while time.time() - start_time < timeout:
                if process.poll() is not None:
                    remaining = process.stdout.read()
                    if remaining:
                        buffer += remaining.decode("utf-8", errors="replace")
                    break

                ready, _, _ = select.select([process.stdout], [], [], 1.0)
                if not ready:
                    continue

                chunk = os.read(process.stdout.fileno(), 8192)
                if not chunk:
                    break
                buffer += chunk.decode("utf-8", errors="replace")

                while "\n" in buffer:
                    line, buffer = buffer.split("\n", 1)
                    line = line.strip()
                    if not line:
                        continue

                    try:
                        event = json.loads(line)
                    except json.JSONDecodeError:
                        continue

                    # Early detection via stream events
                    if event.get("type") == "stream_event":
                        se = event.get("event", {})
                        se_type = se.get("type", "")
```

This loop matters because it shows how the eval harness incrementally reads subprocess output and parses newline-delimited JSON events, enabling early detection of progress without waiting for the process to exit.
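The loop above mixes subprocess plumbing with parsing. The parsing half can be isolated into a small, testable sketch; `parse_events` is our name for the extracted logic, not a function in the repository.

```python
import json

def parse_events(chunks):
    """Yield parsed JSON events from a stream of byte chunks,
    buffering partial lines until a newline arrives."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk.decode("utf-8", errors="replace")
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partial/non-JSON lines, as run_eval.py does

# A JSON object split across two chunks is reassembled correctly.
events = list(parse_events([b'{"type": "stream_', b'event"}\n{"type": "done"}\n']))
print(events)  # [{'type': 'stream_event'}, {'type': 'done'}]
```

Separating the parser from the `select`/`os.read` plumbing is what makes this stage unit-testable without spawning a subprocess.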

### `skills/skill-creator/scripts/aggregate_benchmark.py`

The `calculate_stats` function in `skills/skill-creator/scripts/aggregate_benchmark.py` handles a key part of this chapter's functionality:

```python
import math

def calculate_stats(values: list[float]) -> dict:
    """Calculate mean, stddev, min, max for a list of values."""
    if not values:
        return {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0}

    n = len(values)
    mean = sum(values) / n

    # Sample standard deviation (n - 1 denominator); 0.0 for a single value.
    if n > 1:
        variance = sum((x - mean) ** 2 for x in values) / (n - 1)
        stddev = math.sqrt(variance)
    else:
        stddev = 0.0

    return {
        "mean": round(mean, 4),
        "stddev": round(stddev, 4),
        "min": round(min(values), 4),
        "max": round(max(values), 4)
    }
```
`calculate_stats` matters because it supplies the summary statistics (mean, sample standard deviation, min, max) that the rest of the benchmark aggregation builds on.
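As a quick sanity check of the statistics (the function is restated here so the snippet runs standalone), the sample standard deviation of `[1, 2, 3]` is exactly 1:

```python
import math

def calculate_stats(values):
    """Mirror of calculate_stats above: mean, sample stddev, min, max."""
    if not values:
        return {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0}
    n = len(values)
    mean = sum(values) / n
    stddev = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1)) if n > 1 else 0.0
    return {"mean": round(mean, 4), "stddev": round(stddev, 4),
            "min": round(min(values), 4), "max": round(max(values), 4)}

print(calculate_stats([1.0, 2.0, 3.0]))
# {'mean': 2.0, 'stddev': 1.0, 'min': 1.0, 'max': 3.0}
```

The `n - 1` denominator gives the unbiased sample estimate, which is the right choice here since benchmark runs are a sample of possible executions, not the full population.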

### `skills/skill-creator/scripts/aggregate_benchmark.py`

The `load_run_results` function in `skills/skill-creator/scripts/aggregate_benchmark.py` handles a key part of this chapter's functionality:

```python
def load_run_results(benchmark_dir: Path) -> dict:
    """
    Load all run results from a benchmark directory.

    Returns dict keyed by config name (e.g. "with_skill"/"without_skill",
    or "new_skill"/"old_skill"), each containing a list of run results.
    """
    # Support both layouts: eval dirs directly under benchmark_dir, or under runs/
    runs_dir = benchmark_dir / "runs"
    if runs_dir.exists():
        search_dir = runs_dir
    elif list(benchmark_dir.glob("eval-*")):
        search_dir = benchmark_dir
    else:
        print(f"No eval directories found in {benchmark_dir} or {benchmark_dir / 'runs'}")
        return {}

    results: dict[str, list] = {}

    for eval_idx, eval_dir in enumerate(sorted(search_dir.glob("eval-*"))):
        metadata_path = eval_dir / "eval_metadata.json"
        if metadata_path.exists():
            try:
                with open(metadata_path) as mf:
                    eval_id = json.load(mf).get("eval_id", eval_idx)
            except (json.JSONDecodeError, OSError):
                eval_id = eval_idx
        else:
            try:
                eval_id = int(eval_dir.name.split("-")[1])
```

`load_run_results` matters because it defines how benchmark results are discovered on disk and keyed by configuration name for later aggregation.
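The layout-resolution rule at the top of `load_run_results` can be exercised in isolation; `resolve_search_dir` is our name for the extracted logic, assuming the same two supported layouts (a `runs/` subdirectory, or `eval-*` directories at the top level).

```python
from pathlib import Path
import tempfile

def resolve_search_dir(benchmark_dir: Path):
    """Mirror of the layout check: prefer runs/, else eval-* dirs, else None."""
    runs_dir = benchmark_dir / "runs"
    if runs_dir.exists():
        return runs_dir
    if list(benchmark_dir.glob("eval-*")):
        return benchmark_dir
    return None

with tempfile.TemporaryDirectory() as d:
    base = Path(d)
    (base / "eval-0").mkdir()
    print(resolve_search_dir(base) == base)           # eval-* at top level
    (base / "runs").mkdir()
    print(resolve_search_dir(base) == base / "runs")  # runs/ takes priority
```

Note that `runs/` wins even when `eval-*` directories also exist at the top level, so mixing the two layouts silently ignores the top-level results.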

### `skills/skill-creator/scripts/aggregate_benchmark.py`

The `aggregate_results` function in `skills/skill-creator/scripts/aggregate_benchmark.py` handles a key part of this chapter's functionality:

```python
def aggregate_results(results: dict) -> dict:
    """
    Aggregate run results into summary statistics.

    Returns run_summary with stats for each configuration and delta.
    """
    run_summary = {}
    configs = list(results.keys())

    for config in configs:
        runs = results.get(config, [])

        if not runs:
            run_summary[config] = {
                "pass_rate": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
                "time_seconds": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
                "tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0}
            }
            continue

        pass_rates = [r["pass_rate"] for r in runs]
        times = [r["time_seconds"] for r in runs]
        tokens = [r.get("tokens", 0) for r in runs]

        run_summary[config] = {
            "pass_rate": calculate_stats(pass_rates),
            "time_seconds": calculate_stats(times),
            "tokens": calculate_stats(tokens)
        }
```
`aggregate_results` matters because it ties the pieces together, turning raw per-configuration run lists into the summary statistics consumed downstream.
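A compact, self-contained sketch of that aggregation shape, assuming the run-result fields shown above (`pass_rate`, `time_seconds`, `tokens`) and restating the statistics helper inline:

```python
import math

def stats(values):
    """Same shape as calculate_stats above: mean, sample stddev, min, max."""
    n = len(values)
    mean = sum(values) / n
    stddev = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1)) if n > 1 else 0.0
    return {"mean": round(mean, 4), "stddev": round(stddev, 4),
            "min": round(min(values), 4), "max": round(max(values), 4)}

# Two hypothetical runs for one configuration.
runs = {
    "with_skill": [
        {"pass_rate": 0.9, "time_seconds": 30.0, "tokens": 1200},
        {"pass_rate": 0.8, "time_seconds": 34.0, "tokens": 1400},
    ],
}

# Per-configuration summary, mirroring aggregate_results' run_summary shape.
summary = {
    config: {metric: stats([r[metric] for r in rs])
             for metric in ("pass_rate", "time_seconds", "tokens")}
    for config, rs in runs.items()
}
print(summary["with_skill"]["pass_rate"])
# {'mean': 0.85, 'stddev': 0.0707, 'min': 0.8, 'max': 0.9}
```

Seeing the nonzero `stddev` even for two runs is the point: the aggregation exposes run-to-run variance rather than reporting a single misleading average.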

## How These Components Connect

```mermaid
flowchart TD
    A[main]
    B[calculate_stats]
    C[load_run_results]
    D[aggregate_results]
    E[generate_benchmark]
    A --> B
    B --> C
    C --> D
    D --> E
```