---
layout: default
title: "Chapter 2: Skill Categories"
nav_order: 2
parent: Anthropic Skills Tutorial
---

# Chapter 2: Skill Categories

Welcome to Chapter 2: Skill Categories. In this part of *Anthropic Skills Tutorial: Reusable AI Agent Capabilities*, you will first build an intuitive mental model, then move into concrete implementation details and practical production tradeoffs.

Category design controls maintainability: if categories are too broad, individual skills accumulate conflicting responsibilities and become brittle and hard to trust.

## Four Practical Categories

| Category | Typical Inputs | Typical Outputs | Typical Risk |
|---|---|---|---|
| Document Workflows | Notes, policy docs, datasets | Structured docs/slides/sheets | Formatting drift |
| Creative and Brand | Briefs, tone rules, examples | On-brand copy or concepts | Brand inconsistency |
| Engineering and Ops | Codebase context, tickets, logs | Patches, runbooks, plans | Incorrect assumptions |
| Enterprise Process | Internal standards and controls | Audit artifacts, compliance actions | Governance gaps |

## How to Choose Category Boundaries

Use one outcome per skill. If two outcomes have different acceptance criteria, split the skill.

Good split:

- `incident-triage`
- `postmortem-draft`
- `stakeholder-update`

Bad split:

- `incident-everything`

A single giant skill creates unclear prompts, conflicting priorities, and harder testing.
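One way to make the one-outcome rule concrete is to keep each skill's acceptance criteria in its own registry entry. This is an illustrative sketch, not code from the repository; the skill names and fields are hypothetical.

```python
# Hypothetical skill registry: one outcome and one acceptance list per
# skill, so each skill can be prompted and tested independently.
SKILLS = {
    "incident-triage": {
        "outcome": "severity assignment and owner routing",
        "acceptance": ["severity set", "owning team assigned"],
    },
    "postmortem-draft": {
        "outcome": "structured postmortem document",
        "acceptance": ["timeline present", "action items have owners"],
    },
    "stakeholder-update": {
        "outcome": "status summary for non-engineering readers",
        "acceptance": ["no internal jargon", "next update time stated"],
    },
}

# An entry that needs several unrelated acceptance lists is the signal
# to split it, as a hypothetical "incident-everything" skill would.
for name, skill in SKILLS.items():
    print(name, "->", skill["outcome"])
```

If two of these entries started sharing acceptance criteria, that would be a cue to merge them back; the registry makes the boundary explicit either way.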

## Decision Matrix

| Question | If "Yes" | If "No" |
|---|---|---|
| Is the output contract identical across requests? | Keep in same skill | Split into separate skills |
| Do tasks share the same references and policies? | Keep shared references | Isolate by domain |
| Can one test suite verify quality for all use cases? | Keep grouped | Split for clearer quality gates |
| Are escalation paths identical? | Keep grouped | Split by risk/approval path |
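The matrix above reads as a checklist: any "No" answer is a reason to split. A minimal sketch of that rule (the function name and argument names are ours, for illustration):

```python
# Sketch of the decision matrix: each argument is the answer to one
# question in the table; any "no" answer suggests splitting the skill.
def should_split(same_output_contract: bool,
                 shared_references: bool,
                 one_test_suite: bool,
                 same_escalation_path: bool) -> bool:
    return not all([same_output_contract, shared_references,
                    one_test_suite, same_escalation_path])

# Different escalation paths -> split by risk/approval path.
print(should_split(True, True, True, False))  # True
```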

## Category-Specific Design Tips

- Document skills: prioritize template fidelity and deterministic section ordering.
- Creative skills: define what variation is allowed and what must stay fixed.
- Technical skills: enforce constraints on tools, files, and unsafe operations.
- Enterprise skills: include explicit policy references and audit fields.

## Anti-Patterns

- Category names that describe team structure instead of behavior
- Mixing high-stakes and low-stakes actions in one skill
- Using skills as a substitute for missing source documentation
- Requiring hidden tribal knowledge to run the skill

## Summary

You can now define category boundaries that keep skills focused, testable, and easier to operate.

Next: Chapter 3: Advanced Skill Design

## What Problem Does This Solve?

Most teams struggle here not because they need more code, but because drawing clear boundaries around core abstractions is what keeps behavior predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without a clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 2: Skill Categories as an operating subsystem inside Anthropic Skills Tutorial: Reusable AI Agent Capabilities, with explicit contracts for inputs, state transitions, and outputs.

Use the implementation notes on execution and reliability below as a checklist when adapting these patterns to your own repository.

## How it Works Under the Hood

Under the hood, Chapter 2: Skill Categories usually follows a repeatable control path:

1. Context bootstrap: initialize runtime config and prerequisites for the core component.
2. Input normalization: shape incoming data so the execution layer receives stable contracts.
3. Core execution: run the main logic branch and propagate intermediate state through the state model.
4. Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
5. Output composition: return canonical result payloads for downstream consumers.
6. Operational telemetry: emit the logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
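The six stages above can be sketched as a linear pipeline. This is an illustrative skeleton, not code from the repository; every stage body is a placeholder.

```python
# Hypothetical skeleton of the six-stage control path. Each stage body
# is a stand-in; real skills would plug domain logic into each slot.
def run_skill(raw_input, config):
    ctx = {"config": config, "log": []}              # 1. context bootstrap
    data = str(raw_input).strip()                    # 2. input normalization
    result = data.upper()                            # 3. core execution (placeholder)
    if len(result) > config.get("max_len", 100):     # 4. policy and safety checks
        raise ValueError("output exceeds policy limit")
    payload = {"result": result}                     # 5. output composition
    ctx["log"].append(f"ok len={len(result)}")       # 6. operational telemetry
    return payload, ctx["log"]

payload, log = run_skill("  hello  ", {"max_len": 10})
print(payload)  # {'result': 'HELLO'}
```

Because each stage has a single explicit success/failure condition, the debugging advice above maps directly onto the code: walk the stages top to bottom and check each one.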

## Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

Suggested trace strategy:

- search upstream code for `Skill` and `Categories` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production

## Chapter Connections

## Depth Expansion Playbook

## Source Code Walkthrough

### `skills/skill-creator/scripts/run_eval.py`

The `main` function in `skills/skill-creator/scripts/run_eval.py` handles a key part of this chapter's functionality:

```python
            while time.time() - start_time < timeout:
                if process.poll() is not None:
                    remaining = process.stdout.read()
                    if remaining:
                        buffer += remaining.decode("utf-8", errors="replace")
                    break

                ready, _, _ = select.select([process.stdout], [], [], 1.0)
                if not ready:
                    continue

                chunk = os.read(process.stdout.fileno(), 8192)
                if not chunk:
                    break
                buffer += chunk.decode("utf-8", errors="replace")

                while "\n" in buffer:
                    line, buffer = buffer.split("\n", 1)
                    line = line.strip()
                    if not line:
                        continue

                    try:
                        event = json.loads(line)
                    except json.JSONDecodeError:
                        continue

                    # Early detection via stream events
                    if event.get("type") == "stream_event":
                        se = event.get("event", {})
                        se_type = se.get("type", "")
```

This loop matters because it shows how the eval harness incrementally reads subprocess output and parses newline-delimited JSON events, enabling early detection of progress without waiting for the process to exit.
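The loop above mixes subprocess plumbing with parsing. The parsing half can be isolated into a small, testable sketch; `parse_events` is our name for the extracted logic, not a function in the repository.

```python
import json

def parse_events(chunks):
    """Yield parsed JSON events from a stream of byte chunks,
    buffering partial lines until a newline arrives."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk.decode("utf-8", errors="replace")
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partial/non-JSON lines, as run_eval.py does

# A JSON object split across two chunks is reassembled correctly.
events = list(parse_events([b'{"type": "stream_', b'event"}\n{"type": "done"}\n']))
print(events)  # [{'type': 'stream_event'}, {'type': 'done'}]
```

Separating the parser from the `select`/`os.read` plumbing is what makes this stage unit-testable without spawning a subprocess.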

### `skills/skill-creator/scripts/aggregate_benchmark.py`

The `calculate_stats` function in `skills/skill-creator/scripts/aggregate_benchmark.py` handles a key part of this chapter's functionality:

```python
import math

def calculate_stats(values: list[float]) -> dict:
    """Calculate mean, stddev, min, max for a list of values."""
    if not values:
        return {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0}

    n = len(values)
    mean = sum(values) / n

    # Sample standard deviation (n - 1 denominator); 0.0 for a single value.
    if n > 1:
        variance = sum((x - mean) ** 2 for x in values) / (n - 1)
        stddev = math.sqrt(variance)
    else:
        stddev = 0.0

    return {
        "mean": round(mean, 4),
        "stddev": round(stddev, 4),
        "min": round(min(values), 4),
        "max": round(max(values), 4)
    }
```
`calculate_stats` matters because it supplies the summary statistics (mean, sample standard deviation, min, max) that the rest of the benchmark aggregation builds on.
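As a quick sanity check of the statistics (the function is restated here so the snippet runs standalone), the sample standard deviation of `[1, 2, 3]` is exactly 1:

```python
import math

def calculate_stats(values):
    """Mirror of calculate_stats above: mean, sample stddev, min, max."""
    if not values:
        return {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0}
    n = len(values)
    mean = sum(values) / n
    stddev = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1)) if n > 1 else 0.0
    return {"mean": round(mean, 4), "stddev": round(stddev, 4),
            "min": round(min(values), 4), "max": round(max(values), 4)}

print(calculate_stats([1.0, 2.0, 3.0]))
# {'mean': 2.0, 'stddev': 1.0, 'min': 1.0, 'max': 3.0}
```

The `n - 1` denominator gives the unbiased sample estimate, which is the right choice here since benchmark runs are a sample of possible executions, not the full population.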

### `skills/skill-creator/scripts/aggregate_benchmark.py`

The `load_run_results` function in `skills/skill-creator/scripts/aggregate_benchmark.py` handles a key part of this chapter's functionality:

```python
def load_run_results(benchmark_dir: Path) -> dict:
    """
    Load all run results from a benchmark directory.

    Returns dict keyed by config name (e.g. "with_skill"/"without_skill",
    or "new_skill"/"old_skill"), each containing a list of run results.
    """
    # Support both layouts: eval dirs directly under benchmark_dir, or under runs/
    runs_dir = benchmark_dir / "runs"
    if runs_dir.exists():
        search_dir = runs_dir
    elif list(benchmark_dir.glob("eval-*")):
        search_dir = benchmark_dir
    else:
        print(f"No eval directories found in {benchmark_dir} or {benchmark_dir / 'runs'}")
        return {}

    results: dict[str, list] = {}

    for eval_idx, eval_dir in enumerate(sorted(search_dir.glob("eval-*"))):
        metadata_path = eval_dir / "eval_metadata.json"
        if metadata_path.exists():
            try:
                with open(metadata_path) as mf:
                    eval_id = json.load(mf).get("eval_id", eval_idx)
            except (json.JSONDecodeError, OSError):
                eval_id = eval_idx
        else:
            try:
                eval_id = int(eval_dir.name.split("-")[1])
```

`load_run_results` matters because it defines how benchmark results are discovered on disk and keyed by configuration name for later aggregation.
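The layout-resolution rule at the top of `load_run_results` can be exercised in isolation; `resolve_search_dir` is our name for the extracted logic, assuming the same two supported layouts (a `runs/` subdirectory, or `eval-*` directories at the top level).

```python
from pathlib import Path
import tempfile

def resolve_search_dir(benchmark_dir: Path):
    """Mirror of the layout check: prefer runs/, else eval-* dirs, else None."""
    runs_dir = benchmark_dir / "runs"
    if runs_dir.exists():
        return runs_dir
    if list(benchmark_dir.glob("eval-*")):
        return benchmark_dir
    return None

with tempfile.TemporaryDirectory() as d:
    base = Path(d)
    (base / "eval-0").mkdir()
    print(resolve_search_dir(base) == base)           # eval-* at top level
    (base / "runs").mkdir()
    print(resolve_search_dir(base) == base / "runs")  # runs/ takes priority
```

Note that `runs/` wins even when `eval-*` directories also exist at the top level, so mixing the two layouts silently ignores the top-level results.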

### `skills/skill-creator/scripts/aggregate_benchmark.py`

The `aggregate_results` function in `skills/skill-creator/scripts/aggregate_benchmark.py` handles a key part of this chapter's functionality:

```python
def aggregate_results(results: dict) -> dict:
    """
    Aggregate run results into summary statistics.

    Returns run_summary with stats for each configuration and delta.
    """
    run_summary = {}
    configs = list(results.keys())

    for config in configs:
        runs = results.get(config, [])

        if not runs:
            run_summary[config] = {
                "pass_rate": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
                "time_seconds": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
                "tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0}
            }
            continue

        pass_rates = [r["pass_rate"] for r in runs]
        times = [r["time_seconds"] for r in runs]
        tokens = [r.get("tokens", 0) for r in runs]

        run_summary[config] = {
            "pass_rate": calculate_stats(pass_rates),
            "time_seconds": calculate_stats(times),
            "tokens": calculate_stats(tokens)
        }
```
`aggregate_results` matters because it ties the pieces together, turning raw per-configuration run lists into the summary statistics consumed downstream.
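A compact, self-contained sketch of that aggregation shape, assuming the run-result fields shown above (`pass_rate`, `time_seconds`, `tokens`) and restating the statistics helper inline:

```python
import math

def stats(values):
    """Same shape as calculate_stats above: mean, sample stddev, min, max."""
    n = len(values)
    mean = sum(values) / n
    stddev = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1)) if n > 1 else 0.0
    return {"mean": round(mean, 4), "stddev": round(stddev, 4),
            "min": round(min(values), 4), "max": round(max(values), 4)}

# Two hypothetical runs for one configuration.
runs = {
    "with_skill": [
        {"pass_rate": 0.9, "time_seconds": 30.0, "tokens": 1200},
        {"pass_rate": 0.8, "time_seconds": 34.0, "tokens": 1400},
    ],
}

# Per-configuration summary, mirroring aggregate_results' run_summary shape.
summary = {
    config: {metric: stats([r[metric] for r in rs])
             for metric in ("pass_rate", "time_seconds", "tokens")}
    for config, rs in runs.items()
}
print(summary["with_skill"]["pass_rate"])
# {'mean': 0.85, 'stddev': 0.0707, 'min': 0.8, 'max': 0.9}
```

Seeing the nonzero `stddev` even for two runs is the point: the aggregation exposes run-to-run variance rather than reporting a single misleading average.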

## How These Components Connect

```mermaid
flowchart TD
    A[main]
    B[calculate_stats]
    C[load_run_results]
    D[aggregate_results]
    E[generate_benchmark]
    A --> B
    B --> C
    C --> D
    D --> E
```