Prose instructions are insufficient for agentic development. Skills and hooks enforce discipline through the tool layer, making correct behavior the path of least resistance.
A note on skills frameworks: The skills referenced throughout this document (brainstorming, test-driven-development, verification-before-completion, etc.) are primarily drawn from the superpowers framework by Jesse Vincent (obra). However, the patterns described here are framework-agnostic. Equivalent skill sets exist in gstack, OpenSpec, and other agent skill frameworks. The important thing is the pattern — overlaying project-specific constraints on base agent capabilities — not which specific framework provides the skills.
L1 established testing discipline. L2 automates enforcement of that discipline through behavioral guardrails. These guardrails operate at the tool layer rather than the instruction layer, ensuring consistent behavior regardless of context or prompt complexity.
The guardrail framework consists of:
- Skills: Overlay patterns that extend base agent capabilities with project-specific rules
- Hooks: Automated triggers that block, advise, or transform operations
- Constitutional rules: Hard constraints that never relax
- Zero-defect tolerance: The discipline that makes agentic development work at scale
Claude Code provides base capabilities through skills like brainstorming and test-driven-development. However, production projects need project-specific constraints layered on top of these generic capabilities. Writing all constraints inline in prompts is verbose and inconsistent.
Skills extend base capabilities through an overlay architecture:
base capability (e.g., superpowers:test-driven-development)
+ project-specific rules (constitutional rules, conventions)
+ hook activations (when to fire which guards)
+ integration points (how this skill connects to others)
= skill overlay
The overlay architecture works with any structured skill set, not just Claude Code's built-in capabilities. OpenSpec, custom agent frameworks, or any system that drives agent behavior through declarative configuration can be wrapped and extended with project-specific rules. The examples in this repo reference superpowers because it's the framework in active use, but the pattern is universal: identify the base skill, then overlay project constraints.
A skill is a markdown file that declares:
- Frontmatter: name, description, base reference
- Purpose: what this skill achieves
- Rules: project-specific constraints this skill enforces
- Hook activations: what guards this skill triggers
- Integration points: what other skills compose with this
- Checklist: steps the agent follows when using the skill
Skills are the mechanism by which a project's constitution (@L0-foundation.md#pattern-04-claude-md-as-project-constitution) becomes executable. CLAUDE.md declares the rules; skills enforce them.
A TDD skill for an ecommerce project might extend superpowers:test-driven-development and add:
- Real dependencies in E2E/integration and stack tests — no mocking database drivers in stack tests (constitutional rule)
- Real dependencies in E2E/integration and stack tests — no mocking payment processor libraries in stack tests (constitutional rule)
- Full-loop assertion requirements (project convention)
- Hook to track test file changes (integration with workflow)
When the agent invokes this skill, it inherits the base TDD workflow plus all project-specific constraints automatically.
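The skill file structure and the TDD example above can be sketched as a small data model plus a renderer. This is a minimal illustration only: the `SkillOverlay` interface, its field names, the `renderSkill` helper, and the sample checklist are all assumptions made for this sketch, not a schema defined by superpowers or any other framework.

```typescript
// Hypothetical shape for a skill overlay. Field names mirror the list of
// declarations above but are assumptions, not a framework-defined schema.
interface SkillOverlay {
  name: string;
  description: string;
  base: string;                // base skill being extended
  purpose: string;
  rules: string[];             // project-specific constraints
  hookActivations: string[];   // guards this skill triggers
  integrationPoints: string[]; // skills this one composes with
  checklist: string[];         // steps the agent follows
}

// Render one markdown section as a heading followed by a bullet list.
function section(title: string, items: string[]): string {
  return [`## ${title}`, ...items.map((item) => `- ${item}`)].join("\n");
}

// Produce the markdown skill file: frontmatter first, then each section.
function renderSkill(skill: SkillOverlay): string {
  const frontmatter = [
    "---",
    `name: ${skill.name}`,
    `description: ${skill.description}`,
    `base: ${skill.base}`,
    "---",
  ].join("\n");
  return [
    frontmatter,
    `## Purpose\n${skill.purpose}`,
    section("Rules", skill.rules),
    section("Hook activations", skill.hookActivations),
    section("Integration points", skill.integrationPoints),
    section("Checklist", skill.checklist),
  ].join("\n\n");
}

// The ecommerce TDD overlay from the example, expressed as data.
const tddPlus: SkillOverlay = {
  name: "tdd+",
  description: "TDD with ecommerce project constraints",
  base: "superpowers:test-driven-development",
  purpose: "Apply RED-GREEN-REFACTOR under the project's constitutional rules.",
  rules: [
    "No mocking database drivers in stack tests (constitutional rule)",
    "No mocking payment processor libraries in stack tests (constitutional rule)",
    "Full-loop assertions required (project convention)",
  ],
  hookActivations: ["Track test file changes"],
  integrationPoints: ["plan+", "verify+"],
  checklist: ["Write a failing test", "Make it pass", "Refactor"],
};

const tddPlusMarkdown = renderSkill(tddPlus);
```

Keeping the overlay as data makes the base reference explicit, so changing the underlying framework would mean changing one field rather than rewriting prompts.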
Anti-pattern: copying the same rules into every prompt. This creates drift: prompts get updated inconsistently, and some code paths miss critical constraints.
- @L0-foundation.md#pattern-04-claude-md-as-project-constitution — CLAUDE.md as rule source
Individual skills are useful, but development requires a complete lifecycle. Isolated skills miss dependencies between phases—code written without tests, tests written without verification, changes submitted without review.
Skills compose into a workflow pipeline where each skill's output becomes the next skill's input:
brain+ (design with stack-first considerations)
+-> plan+ (create plan with testing strategy)
+-> tdd+ (RED-GREEN-REFACTOR with full-loop)
+-> verify+ (evidence before claims)
+-> review+ (checklist-based compliance review)
The chain enforces a complete development lifecycle. Each skill validates its input and produces structured output for the next skill. Skipping a link means losing a guardrail.
Chain phases:
- brain+: Design with stack testing in mind. What needs Docker? What are the full-loop assertions?
- plan+: Create implementation plan with explicit testing strategy. Which tests cover which requirements?
- tdd+: RED-GREEN-REFACTOR with full-loop assertions. (@L1-feedback-loops.md#pattern-12-full-loop-assertion-layering)
- verify+: Evidence-based claims. Run commands, show output, then claim done.
- review+: Checklist-based compliance review. Tests pass, docs updated, constitutional rules followed.
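The idea that each phase validates its input before producing output for the next can be sketched as follows. The `Phase` and `Artifact` types, `runChain`, and `makePhase` are hypothetical names for this illustration; real skills exchange much richer artifacts (designs, plans, test output) than the summary strings used here.

```typescript
// Hypothetical model of the skill chain: each phase consumes the artifact
// produced by the previous phase and refuses to run without it.
type PhaseName = "brain+" | "plan+" | "tdd+" | "verify+" | "review+";

interface Artifact {
  producedBy: PhaseName;
  summary: string;
}

interface Phase {
  name: PhaseName;
  requires: PhaseName | null; // phase whose output this phase needs
  run: (input: Artifact | null) => string; // returns a summary of its output
}

// Run phases in order, checking that each one receives the input it expects.
function runChain(phases: Phase[]): Artifact[] {
  const artifacts: Artifact[] = [];
  let previous: Artifact | null = null;
  for (const phase of phases) {
    const got = previous ? previous.producedBy : null;
    if (phase.requires !== got) {
      throw new Error(
        `${phase.name} requires output from ${phase.requires ?? "nothing"}, got ${got ?? "nothing"}`
      );
    }
    previous = { producedBy: phase.name, summary: phase.run(previous) };
    artifacts.push(previous);
  }
  return artifacts;
}

// Build a placeholder phase whose output just records what it built on.
const makePhase = (name: PhaseName, requires: PhaseName | null): Phase => ({
  name,
  requires,
  run: (input) =>
    input ? `${name} built on ${input.producedBy}` : `${name} started the chain`,
});

const completed = runChain([
  makePhase("brain+", null),
  makePhase("plan+", "brain+"),
  makePhase("tdd+", "plan+"),
  makePhase("verify+", "tdd+"),
  makePhase("review+", "verify+"),
]);
```

Passing a chain that omits `plan+` makes `runChain` throw when it reaches `tdd+`, which is the "skipping a link means losing a guardrail" behavior expressed in code.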
A typical workflow session:
- Agent activates `brain+` skill, designs feature with stack testing considerations
- Agent activates `plan+` skill, produces plan with test coverage matrix
- Agent activates `tdd+` skill, implements following RED-GREEN-REFACTOR
- Agent activates `verify+` skill, runs tests and shows output before claiming success
- Agent activates `review+` skill, confirms all checklist items complete
The review+ skill also integrates test-integrity for test file review, ensuring tests follow @L1-feedback-loops.md#pattern-16-test-integrity-rules.
Superpowers framework integration: The obra/superpowers library provides the base skills that the chain builds on:
| Skill | Role in Chain |
|---|---|
| brainstorming | Design and requirements exploration — feeds into planning |
| writing-plans | Creates bite-sized implementation plans with testing strategy |
| executing-plans | Task-by-task execution with review checkpoints |
| test-driven-development | RED-GREEN-REFACTOR with full-loop assertions |
| verification-before-completion | Evidence-based claims — commands run, output shown, then claim |
| requesting-code-review | Checklist-based review against plan requirements |
| finishing-a-development-branch | Integration decisions (merge, PR, cleanup) when work is complete |
Project-specific skill overlays (like tdd+, verify+, review+) wrap these base skills with project constraints such as constitutional rules, mock policies, and full-loop assertion requirements.
Anti-pattern: jumping directly to implementation without planning or testing. This produces code that may work but cannot be verified systematically. The chain only works if all links are present.
- @L1-feedback-loops.md#pattern-11-stack-tests — Stack testing foundation
- @L1-feedback-loops.md#pattern-12-full-loop-assertion-layering — Assertion requirements
- @docs/cross-cutting/glossary.md — Skill chain terminology
The rig repo implements this pattern with SkillPhaseTracker for tracking which phases the agent has visited and validating transitions between skill phases.
Agents forget rules. Instructions written in prose are unreliable enforcement mechanisms. When context shifts or prompts get complex, rules get dropped.
Hooks are automated triggers that fire before or after tool operations. They enforce rules through the tool layer, independent of prompt content.
Three hook types:
PostToolUse hooks: Fire after a tool completes. Examples:
- Track source file edits in `pending-tests.json`
- Emit "TEST TASK REQUIRED" when code changes without corresponding test changes
- Add modified files to a change-set for review
PreToolUse hooks: Fire before a tool executes. Examples:
- Block `sed -i` (always redirect to Edit tool)
- Block `docker system prune` in shared environments
- Require confirmation before deleting files matching a pattern
Context-aware hooks: Detect environment state and adjust behavior. Examples:
- If `./test-logs/` has files modified within 5 minutes, block `docker logs` and redirect to log files
- If pending tests exist, block implementation tasks until test task is created
- If working in a git worktree, ensure commands respect worktree boundaries
The automation pattern:
hook fires
+-> check state (what files changed? what's on disk? what's the context?)
+-> allow (operation proceeds)
+-> advise (warn but allow, with explanation)
+-> block (stop operation, require different approach)
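The allow / advise / block outcomes can be modeled as a small decision function. This is a sketch under stated assumptions: the `Decision` and `HookState` types and the `decideBash` function are hypothetical names for illustration and do not correspond to any particular agent's hook API.

```typescript
// Hypothetical decision type for the three outcomes in the diagram above.
type Decision =
  | { action: "allow" }
  | { action: "advise"; reason: string }
  | { action: "block"; reason: string };

// State the hook inspects before deciding. Fields are illustrative.
interface HookState {
  pendingTests: string[];     // source files edited without test changes
  sharedEnvironment: boolean; // true when other users share the Docker host
}

// Decide what to do with a Bash command given the current state.
function decideBash(command: string, state: HookState): Decision {
  // Hard rule: in-place sed edits are always redirected to the Edit tool.
  if (/\bsed\s+-i\b/.test(command)) {
    return { action: "block", reason: "Use the Edit tool instead of sed -i" };
  }
  // Context-dependent rule: pruning is only blocked on shared hosts.
  if (state.sharedEnvironment && /\bdocker\s+system\s+prune\b/.test(command)) {
    return {
      action: "block",
      reason: "docker system prune is not allowed in shared environments",
    };
  }
  // Soft rule: warn about outstanding test work but let the command run.
  if (state.pendingTests.length > 0) {
    return {
      action: "advise",
      reason: `Tests pending for: ${state.pendingTests.join(", ")}`,
    };
  }
  return { action: "allow" };
}
```

Returning a value instead of throwing keeps the decision testable in isolation; the surrounding hook runner can then translate it into whatever blocking mechanism the tool layer provides.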
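A PostToolUse hook tracking test coverage:

    // Simplified pseudocode: hook, isSourceFile, isTestFile, and sourceFileFor
    // are placeholders; pendingTests is assumed to be a Set<string>.
    hook.on('PostToolUse', (tool, args) => {
      // A source edit creates an obligation to touch the matching tests
      if (tool === 'Edit' && isSourceFile(args.file_path)) {
        pendingTests.add(args.file_path);
        console.warn('TEST TASK REQUIRED: Source file modified');
      }
      // Editing the matching test file clears the obligation
      if (tool === 'Edit' && isTestFile(args.file_path)) {
        pendingTests.delete(sourceFileFor(args.file_path));
      }
    });

A PreToolUse hook blocking destructive operations: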

    hook.on('PreToolUse', (tool, args) => {
      // Reject in-place sed edits and point the agent at the Edit tool
      if (tool === 'Bash' && args.command.includes('sed -i')) {
        throw new Error('Use Edit tool instead of sed -i');
      }
    });

A context-aware hook:

    hook.on('PreToolUse', (tool, args) => {
      if (tool === 'Bash' && args.command.startsWith('docker logs')) {
        const recentLogs = findRecentTestLogs(5 * 60 * 1000); // 5 minutes
        if (recentLogs.length > 0) {
          // Block by throwing, consistent with the sed -i hook
          throw new Error(
            `Use test logs instead of docker logs. Recent test logs: ${recentLogs.join(', ')}`
          );
        }
      }
    });

Anti-pattern: writing hooks that are too permissive or too restrictive. Permissive hooks fail to catch problems; restrictive hooks block legitimate work. Hooks should have clear, narrow purposes with documented behavior.
- @L3-optimization.md#pattern-31-smart-routing--tool-selection — Command routing patterns
- @docs/examples/guardrails/ — Hook implementation examples
The rig repo implements composable hooks in src/enforcement/ with a handlePostToolUse orchestrator that runs multiple checks and resolves to the most severe enforcement level.
Projects have rules that should never be violated—core constraints that define the project's approach. When these rules are scattered across documentation or written as soft suggestions, they get forgotten or worked around.
Constitutional rules are hard constraints declared in CLAUDE.md that never relax. They are the foundation of the project's guardrail system.
Example constitutional rules from a production ecommerce platform:
- Real dependencies in E2E/integration and stack tests — logger, payment processor libraries (Stripe, PayPal), database drivers, HTTP clients for first-party services. Use real components in stack tests. Mocks are appropriate in unit tests.
- Full accounting for every state change — every inventory change, every order, every transaction fee must be logged and queryable.
- Evidence-based claims only — show command output before claiming done. "Tests pass" is not evidence; show the test output.
- Docker-first development — no local OS execution. Everything runs in containers.
- No conditional test assertions — tests must be able to fail. (@L1-feedback-loops.md#pattern-16-test-integrity-rules)
The rules flow:
CLAUDE.md declares constitutional rules
+-> plan+ includes rules in plan template (what must this plan respect?)
+-> tdd+ rejects mocked components in stack test generation
+-> review+ checks constitutional compliance (did this violate any rules?)
Constitutional rule enforcement in the skill chain:
- plan+ reads CLAUDE.md and adds a "Constitutional compliance" section to each plan
- tdd+ checks that proposed stack tests don't mock protected components
- review+ runs a checklist that includes "No constitutional rules violated"
When the agent attempts to mock a database driver in a stack test, the tdd+ skill blocks it with a reference to the constitutional rule.
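A regex-based check for mocked protected components might look like the sketch below. The protected module list, the `.stack.test.ts` naming convention, and the `findProtectedMocks` function are assumptions for illustration; only `jest.mock(...)` and `vi.mock(...)` are real Jest and Vitest calls. This is not the rig repo's `checkConstitutional()` implementation.

```typescript
// Hypothetical list of protected modules. A real project would derive this
// from the constitutional rules declared in CLAUDE.md.
const PROTECTED_MODULES = ["pg", "stripe"];

// Assumed naming convention: stack tests live in files ending in .stack.test.ts.
function isStackTest(filePath: string): boolean {
  return filePath.endsWith(".stack.test.ts");
}

// Return the protected modules that a stack test file mocks through
// jest.mock("...") or vi.mock("..."). Unit tests are exempt, because the
// rule only covers E2E/integration and stack tests.
function findProtectedMocks(filePath: string, content: string): string[] {
  if (!isStackTest(filePath)) return [];
  const mockCall = /\b(?:jest|vi)\.mock\(\s*['"]([^'"]+)['"]/g;
  const violations: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = mockCall.exec(content)) !== null) {
    const moduleName = match[1];
    if (PROTECTED_MODULES.includes(moduleName) && !violations.includes(moduleName)) {
      violations.push(moduleName);
    }
  }
  return violations;
}

const stackTestSource = [
  'jest.mock("pg");',
  "vi.mock('stripe');",
  'jest.mock("./helpers");',
].join("\n");

const violations = findProtectedMocks("orders.stack.test.ts", stackTestSource);
```

A non-empty result would map to a blocking response that cites the constitutional rule. Regex detection is deliberately coarse: it can miss mocks created by other means, so it complements the review+ checklist rather than replacing it.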
Anti-pattern: writing "soft" rules with exceptions. Constitutional rules must have no escape hatches. If a rule needs exceptions, it is not constitutional; it is a guideline.
- @L0-foundation.md#pattern-04-claude-md-as-project-constitution — CLAUDE.md format
- @L1-patterns/1.5-no-mock-philosophy.md — Real dependencies in E2E/integration and stack tests
The rig repo implements constitutional rule checking with checkConstitutional() which uses regex detection to identify mock patterns in edited files and blocks violations.
Agents don't have intuition to work around known issues. When developers tolerate "unrelated" failures or "pre-existing" warnings, agents lose the ability to self-diagnose systematically. A tolerated defect becomes an undiagnosed regression later.
Zero-defect tolerance: every error, warning, and failure must be addressed. Not just "relevant" errors—ALL of them. This applies to both unit tests and stack tests — dismissing a unit test failure while trusting stack test results means relying on partial feedback.
What zero-defect means:
- "This failure is unrelated" is never acceptable
- "This warning was pre-existing" means fix it now
- Tests that fail for any reason must either be fixed until they pass or be explicitly skipped with documentation
- Compiler warnings must be resolved, not suppressed
- Linter errors must be fixed, not waived
Why this matters for agents: Agents process feedback systematically. When output contains errors, agents must be able to assume those errors are relevant. If some errors are "okay to ignore," the agent cannot distinguish which errors require action and which do not. This ambiguity breaks the feedback loop.
A test run produces output:
FAIL test/ecommerce/order.test.ts
Order processing
+ should process order
PASS test/ecommerce/order.test.ts
Order processing
+ should calculate fees
Error: Cannot find module './utils/config.ts'
Zero-defect response:
- Fix the missing module import first
- Re-run tests
- If the order processing test still fails, investigate that
- Only when ALL errors and warnings are resolved is the task complete
Not zero-defect (wrong approach):
- "The module error is unrelated to the order test"
- Focus only on the order test failure
- Leave the module error unfixed
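The zero-defect stance can be expressed as a scanner that treats every reported failure, error, and warning as blocking. The line patterns below are assumptions about what a test runner prints, and `findDefects` and `isTaskComplete` are hypothetical names; this is not the rig repo's `checkZeroDefect()`.

```typescript
// A minimal defect scanner. The line patterns are assumptions about the
// runner's output format, not a parser for any specific test runner.
interface Defect {
  kind: "failure" | "error" | "warning";
  line: string;
}

function findDefects(output: string): Defect[] {
  const defects: Defect[] = [];
  for (const raw of output.split("\n")) {
    const line = raw.trim();
    if (/^FAIL\b/.test(line)) {
      defects.push({ kind: "failure", line });
    } else if (/^Error:/.test(line)) {
      defects.push({ kind: "error", line });
    } else if (/\bwarning\b/i.test(line)) {
      defects.push({ kind: "warning", line });
    }
  }
  return defects;
}

// The task is complete only when nothing at all is reported. There is no
// parameter for marking a defect as "unrelated".
function isTaskComplete(output: string): boolean {
  return findDefects(output).length === 0;
}

const sampleOutput = [
  "FAIL test/ecommerce/order.test.ts",
  "  Order processing",
  "    should process order",
  "Error: Cannot find module './utils/config.ts'",
].join("\n");

const sampleDefects = findDefects(sampleOutput);
```

The design choice worth noting is what the function lacks: an allowlist. Adding one would reintroduce the ambiguity about which errors require action.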
Anti-pattern: classifying errors as "relevant" or "irrelevant" without evidence. Unless you can prove an error is truly unrelated (different subsystem, proven isolation), assume it is relevant.
- @L1-feedback-loops.md#pattern-13-sequential--additive-test-design — Sequential test ordering
- @L4-standards-measurement.md#pattern-41--evidence-based-claims — Evidence standards
The rig repo implements zero-defect enforcement with checkZeroDefect() which parses test runner output for failures, errors, and warnings and surfaces them as enforcement results.
Individual enforcement checks (stale tests, zero-defect, constitutional rules) are useful in isolation, but a production guardrail system needs to compose multiple checks into a single hook response. Without a composition pattern, each check runs independently with no unified severity resolution.
A composable enforcement pipeline where each check is an independent function returning { level, message }. The pipeline runs all checks and resolves to the most severe level (block > advise > silent). Checks are configurable per-rule via a config file (.harness.yaml).
Key concepts:
- Each check has one responsibility (stale test detection, scope control, mock detection, failure parsing)
- Checks compose through a single orchestrator (`handlePostToolUse`)
- Severity resolution: most severe level wins
- Configurable: each rule can be block/advise/silent independently

    // Each check returns a result with a severity level
    type Level = 'block' | 'advise' | 'silent';
    type CheckResult = { level: Level; message: string };
    type EnforcementResponse = { level: Level; messages: string[] };

    // Higher number means more severe: block > advise > silent
    const severityOrder = (level: Level): number =>
      ({ silent: 0, advise: 1, block: 2 })[level];

    // The pipeline runs all checks and resolves to the most severe level.
    // checkStaleTests and the other check functions are defined elsewhere.
    async function handlePostToolUse(tool: string, args: unknown): Promise<EnforcementResponse> {
      const checks = [
        checkStaleTests(tool, args),
        checkConstitutional(tool, args),
        checkZeroDefect(tool, args),
        checkScope(tool, args),
      ];
      const results: CheckResult[] = await Promise.all(checks);
      const maxSeverity = results.reduce((max, r) =>
        severityOrder(r.level) > severityOrder(max.level) ? r : max
      );
      return { level: maxSeverity.level, messages: results.map(r => r.message) };
    }

The rig repo implements this in src/enforcement/ with composable checks and a handlePostToolUse orchestrator that resolves severity.
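Per-rule configuration can be sketched by separating what a check found from how severely the project treats it. Everything here is an assumption for illustration: the `RuleConfig` object stands in for a parsed `.harness.yaml` (whose real format is not specified in this document), and the rule names, the `Finding` type, the default level, and `resolveFindings` are hypothetical.

```typescript
type Level = "block" | "advise" | "silent";

// A finding is a detected violation; its severity comes from configuration.
interface Finding {
  rule: string;
  message: string;
}

// Hypothetical in-memory form of the per-rule settings.
type RuleConfig = { [rule: string]: Level | undefined };

const SEVERITY: Record<Level, number> = { silent: 0, advise: 1, block: 2 };

// Assumed default for rules that have no config entry.
const DEFAULT_LEVEL: Level = "block";

// Apply the configured level to each finding and resolve to the most severe.
// Findings for rules configured as "silent" are dropped entirely.
function resolveFindings(
  findings: Finding[],
  config: RuleConfig
): { level: Level; messages: string[] } {
  let level: Level = "silent";
  const messages: string[] = [];
  for (const finding of findings) {
    const configured = config[finding.rule] ?? DEFAULT_LEVEL;
    if (configured === "silent") continue;
    messages.push(`[${configured}] ${finding.message}`);
    if (SEVERITY[configured] > SEVERITY[level]) level = configured;
  }
  return { level, messages };
}

const exampleConfig: RuleConfig = {
  "stale-tests": "advise",
  "mock-detection": "block",
  "scope-control": "silent",
};

const resolved = resolveFindings(
  [
    { rule: "stale-tests", message: "src/order.ts changed without test changes" },
    { rule: "mock-detection", message: "pg is mocked in a stack test" },
    { rule: "scope-control", message: "edit outside the planned file set" },
  ],
  exampleConfig
);
```

Starting the resolution from "silent" also covers the case where no check reports anything, so the caller always receives a single level.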
Anti-pattern: running checks independently without severity resolution. If one check returns "block" and another returns "advise," the agent needs a single clear signal, not conflicting messages.
- Pattern 2.3 — Hook Automation — Hooks fire checks; the pipeline composes them
- Pattern 2.7 — Phase Transition Validation — Phase-aware checks change behavior by skill phase
The skill chain (brain+ → plan+ → tdd+ → verify+ → review+) is described as a conceptual pipeline, but without enforcement agents can skip phases. Skipping plan+ before tdd+ means unstructured implementation. Skipping tdd+ before verify+ means unverified code.
A state machine that tracks which phases the agent has visited and validates transitions. Each phase records its visit with a timestamp. Transition validation checks that prerequisites are met before allowing entry to the next phase.
Key concepts:
- Phase history with timestamps
- Transition validation: `tdd+` requires prior `plan+` visit, `verify+` requires prior `tdd+` visit
- Query methods: `isTddPhase()`, `isVerifyPhase()` for conditional behavior
- Reset capability for new workflows

    class PhaseTracker {
      // Phase name -> time of the most recent visit
      private history = new Map<string, Date>();

      enter(phase: string): void {
        this.validateTransition(phase);
        this.history.set(phase, new Date());
      }

      // Throws when a prerequisite phase has not been visited
      private validateTransition(phase: string): void {
        if (phase === 'tdd+' && !this.history.has('plan+')) {
          throw new Error('Cannot enter tdd+ without prior plan+ visit');
        }
        if (phase === 'verify+' && !this.history.has('tdd+')) {
          throw new Error('Cannot enter verify+ without prior tdd+ visit');
        }
      }
    }

The rig repo implements this in src/skills/phase-tracker.ts with transition validation, query methods, and reset capability.
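The query methods and reset capability listed in the key concepts could be added along these lines. This is a self-contained sketch, not the contents of `src/skills/phase-tracker.ts`: the class name `PhaseTrackerSketch`, the `current` field, and the prerequisite table are assumptions.

```typescript
type PhaseId = "brain+" | "plan+" | "tdd+" | "verify+" | "review+";

class PhaseTrackerSketch {
  // Phase -> time of the most recent visit
  private history = new Map<PhaseId, Date>();
  // Phase the agent is in right now, or null before the first phase
  private current: PhaseId | null = null;

  // Prerequisite phase for each guarded transition.
  private static readonly prerequisites: Partial<Record<PhaseId, PhaseId>> = {
    "tdd+": "plan+",
    "verify+": "tdd+",
  };

  // Blocking validation: an invalid transition throws instead of warning.
  enter(phase: PhaseId): void {
    const required = PhaseTrackerSketch.prerequisites[phase];
    if (required !== undefined && !this.history.has(required)) {
      throw new Error(`Cannot enter ${phase} without prior ${required} visit`);
    }
    this.history.set(phase, new Date());
    this.current = phase;
  }

  // Query methods let hooks change behavior by phase.
  isTddPhase(): boolean {
    return this.current === "tdd+";
  }

  isVerifyPhase(): boolean {
    return this.current === "verify+";
  }

  // Clear all state so a new workflow starts from scratch.
  reset(): void {
    this.history.clear();
    this.current = null;
  }
}
```

A hook could, for example, consult `isTddPhase()` before allowing edits to source files, and call `reset()` when a new feature branch begins.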
Anti-pattern: treating phase tracking as advisory. If the tracker warns but allows invalid transitions, agents learn to ignore it. Phase enforcement must be blocking.
- Pattern 2.2 — The Skill Chain — The chain that phase validation enforces
- Pattern 2.3 — Hook Automation — Hooks can check phase state before allowing operations
Traditional development accumulates complexity. Each feature adds surface area, each bugfix patches over a gap, and each session starts from roughly the same knowledge baseline as the one before. Agents compound this problem: without persistent memory, every session re-derives the same lessons. The first time an agent solves an N+1 query problem, it takes research. The next session facing the same problem starts from zero.
Each unit of engineering work should make subsequent units easier, not harder. Compound engineering extends the skill chain (Pattern 2.2) with a feedback loop: after each cycle of brainstorm → plan → work → review, explicitly capture learnings back into the system so future sessions — by the same agent or different ones — start from a higher knowledge baseline.
The core cycle:
brainstorm → plan → work → review → compound → repeat
^ |
└─── knowledge feeds back ────────────┘
The compounding principle: 80% planning and review, 20% execution. Thorough planning prevents wasted implementation. Review catches issues while context is fresh. Codified learnings prevent re-deriving solutions. Quality is not a separate phase — it is the mechanism that makes future work easier.
What compounds:
- Documented solutions — When a bug is fixed, document the symptoms, root cause, and prevention strategy. Next time the same symptom appears, the lookup takes minutes instead of research.
- Refined plans — Plans that survived review become templates for similar features. Each planning cycle sharpens the next.
- Captured patterns — Architecture decisions, trade-off analyses, and failure modes recorded during review become reference material for future design sessions.
- Constitutional amendments — When a review catches a novel violation, the constitutional rules (Pattern 2.4) expand to cover it.
What does NOT compound:
- Undocumented fixes that live only in conversation history
- Plans that are discarded after implementation without extracting lessons
- Reviews that approve or reject without recording why
- Errors dismissed as "unrelated" rather than investigated and catalogued
The compound-engineering plugin implements this pattern with /ce:compound — a skill that captures solved problems into structured documentation (docs/solutions/) with YAML frontmatter for searchability. Each documented solution reduces the cost of the next occurrence from research to lookup.
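To make the "research to lookup" idea concrete, a documented solution can be modeled as a record that renders to markdown with YAML frontmatter and can be searched by symptom. The `SolutionDoc` fields, `renderSolution`, `findBySymptom`, and the sample entry are hypothetical; they are not the compound-engineering plugin's actual schema or output.

```typescript
// Hypothetical record for a documented solution. These field names are
// assumptions, not the plugin's actual frontmatter schema.
interface SolutionDoc {
  title: string;
  symptoms: string[];
  rootCause: string;
  prevention: string;
  tags: string[];
}

// Render a solution as markdown with YAML frontmatter. JSON.stringify emits
// double-quoted strings and bracketed lists, which YAML accepts as flow syntax.
function renderSolution(doc: SolutionDoc): string {
  return [
    "---",
    `title: ${JSON.stringify(doc.title)}`,
    `tags: ${JSON.stringify(doc.tags)}`,
    `symptoms: ${JSON.stringify(doc.symptoms)}`,
    "---",
    "## Root cause",
    doc.rootCause,
    "## Prevention",
    doc.prevention,
  ].join("\n");
}

// Look up solutions whose recorded symptoms mention the observed text.
function findBySymptom(docs: SolutionDoc[], observed: string): SolutionDoc[] {
  const needle = observed.toLowerCase();
  return docs.filter((d) =>
    d.symptoms.some((s) => s.toLowerCase().includes(needle))
  );
}

// Illustrative entry based on the N+1 query example mentioned earlier.
const nPlusOne: SolutionDoc = {
  title: "N+1 query on order list",
  symptoms: ["Order list endpoint issues one query per order"],
  rootCause: "Line items were loaded lazily inside a loop.",
  prevention: "Load line items with the orders in a single query.",
  tags: ["performance", "orm"],
};

const rendered = renderSolution(nPlusOne);
const hits = findBySymptom([nPlusOne], "one query per order");
```

Recording symptoms separately from the root cause matters because a future session usually starts from what it observes, not from the diagnosis.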
The my-claw project's design rinsing lineage (my-claw case study) demonstrates compounding across sessions: each rinsing phase extracted patterns that the next phase built on, producing more value because it started from a richer foundation.
Compounding across the pattern pyramid:
| Level | What Compounds | Mechanism |
|---|---|---|
| L0 | Clean structure, current docs | Pattern 0.8 — Aggressive Cleanup removes noise each session |
| L1 | Test coverage, verification infrastructure | Stack tests accumulate; each new test catches regressions permanently |
| L2 | Constitutional rules, skills, learnings | Each review that catches a novel violation strengthens the guardrail system |
| L3 | Codebase index, routing rules | Session caching (Pattern 3.8) carries detection forward |
| L4 | Metrics, drift corrections | Measurement creates a tightening feedback loop |
- Disposable sessions — Work that produces code but no learnings. The feature ships, the bug is fixed, but nothing is captured for next time.
- Knowledge silos — Learnings stored in conversation history, personal notes, or tribal memory instead of discoverable documentation.
- Review without extraction — Code reviews that approve or reject without recording the reasoning, missing the opportunity to compound the architectural insight.
- Premature abstraction — Compounding knowledge is not the same as building frameworks. Document the why, not just the what. Let patterns emerge from documented solutions before promoting them into abstractions.
- Pattern 2.2 — The Skill Chain — The workflow pipeline that compound engineering extends
- Pattern 2.4 — Constitutional Rules — Rules that grow through compounding
- Pattern 4.1 — Evidence-Based Claims — Compounding requires evidence, not assumptions
- Reference my-claw Case Study — Design rinsing lineage demonstrates compounding across three phases
The compound-engineering plugin provides the /ce:compound skill for capturing solved problems into structured, searchable documentation. The rig framework's skill chain implements the brainstorm → plan → work → review portion; compound engineering closes the loop by feeding learnings back.
L2 behavioral guardrails automate enforcement of the discipline established in L1. Skills extend base capabilities with project-specific rules. Hooks enforce rules through the tool layer. Constitutional rules declare hard constraints. Zero-defect tolerance ensures systematic self-diagnosis. Compound engineering closes the loop — each cycle of work feeds learnings back into the system, making subsequent cycles more effective.
Together, these patterns make correct behavior the path of least resistance for agentic development.
"Spend a lot of time planning out the work the agent will do." — Peter Steinberger, creator of OpenClaw
The skill chain's brain+ → plan+ phase operationalizes this insight. Fleshing out a plan — challenging the agent, tweaking scope, pushing back on assumptions — before implementation begins prevents wasted tokens on wrong paths. The plan becomes the contract that tdd+, verify+, and review+ enforce.

