Swarm Command implements Shadow Score Spec L2 conformance for independent quality validation of commander outputs. Instead of the spec's sealed code tests, Swarm Command generates sealed acceptance criteria: task-specific assertions that commander outputs must satisfy, generated before commanders execute and validated after they complete.
During the SwarmSpeed 250 self-analysis test run (Havoc Hackathon #48), 3 sealed judges gave scores of 44-46/50 to a document containing critical arithmetic errors. The judges evaluated design quality but never tested internal consistency. Shadow scoring closes this blind spot using the sealed-envelope approach: criteria generated before work begins, hidden from all agents, revealed only at validation time.
The core insight: What you measure, agents optimize for. What you don't measure, they ignore. Shadow scoring measures what you don't tell them about.
| Level | Requirement | Swarm Command |
|---|---|---|
| L1 | Compute + report Shadow Score | ✅ Implemented |
| L2 | L1 + sealed-envelope isolation + commitment hash | ✅ Implemented |
| L3 | L2 + hardening loop + velocity tracking | Partial (hardening loop ✅, velocity tracking not yet) |
Swarm Command implements Shadow Score Spec L2 conformance.
Shadow Score = (sealed_failures / sealed_total) × 100
A Shadow Score of 0% means all sealed criteria passed. Higher scores indicate more failures.
| Shadow Score | Level | Emoji | Meaning |
|---|---|---|---|
| 0% | Perfect | ✅ | All sealed criteria passed |
| 1–15% | Minor | 🟢 | Acceptable, minor gaps only |
| 16–30% | Moderate | 🟡 | Notable gaps, review recommended |
| 31–50% | Significant | 🟠 | Serious gaps, hardening required |
| > 50% | Critical | 🔴 | Fundamental failures, re-work needed |
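
The formula and interpretation scale above can be sketched in a few lines of Python. This is an illustrative sketch, not Swarm Command's actual code: the function names `shadow_score` and `classify` are hypothetical, and the handling of fractional scores that fall between table rows (for example 15.5%) is an assumption, since the table only lists whole-number boundaries.

```python
def shadow_score(sealed_failures: int, sealed_total: int) -> float:
    """Return the percentage of sealed criteria that failed."""
    if sealed_total <= 0:
        raise ValueError("sealed_total must be positive")
    if not 0 <= sealed_failures <= sealed_total:
        raise ValueError("sealed_failures must be between 0 and sealed_total")
    # Multiply before dividing so boundary cases such as 3/10 yield exactly 30.0.
    return 100 * sealed_failures / sealed_total


def classify(score: float) -> str:
    """Map a Shadow Score to a level name from the interpretation scale.

    Assumption: a fractional score between two table rows belongs to the
    higher (more severe) band.
    """
    if score == 0:
        return "perfect"
    if score <= 15:
        return "minor"
    if score <= 30:
        return "moderate"
    if score <= 50:
        return "significant"
    return "critical"
```

For example, 2 failures out of 10 sealed criteria give a score of 20.0, which falls in the Moderate band.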
The Shadow Score Spec defines a 4-phase sealed-envelope protocol. Here's how Swarm Command implements each phase within its swarm execution pipeline:
When: After task decomposition (Phase 1), before commanders execute (Phase 3).
The Nexus generates sealed acceptance criteria from the task specification:
- Analyze the task decomposition and generate 10 binary pass/fail acceptance criteria
- Distribute criteria across 4 categories: `happy_path`, `edge_case`, `error_handling`, `completeness`
- Compute a SHA-256 commitment hash of the sealed criteria
- Store the sealed envelope in Nexus memory, never shared with any agent
```json
{
  "sealed_envelope": {
    "generated_at": "2025-01-15T10:30:00Z",
    "task_hash": "sha256:b7e2...",
    "sealed_hash": "sha256:a3f2...",
    "criteria_count": 10,
    "criteria": [
      {
        "id": "sc-01",
        "category": "happy_path",
        "assertion": "Architecture section defines module boundaries for all 5 domains",
        "expected": "Each domain has a clearly defined module boundary"
      },
      {
        "id": "sc-02",
        "category": "edge_case",
        "assertion": "Implementation handles empty input gracefully",
        "expected": "Empty input produces a structured error, not a crash"
      }
    ]
  }
}
```

Commanders, Squad Leads, Workers, and Reviewers execute normally. They never see the sealed criteria. The sealed envelope remains locked in Nexus memory throughout the entire execution pipeline.
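
To make the commitment step concrete, here is a minimal Python sketch of sealing a criteria list and later checking it for drift. The source does not say how the criteria are serialized before hashing, so the canonical JSON encoding used here is an assumption, as are the function names `commitment_hash`, `seal`, and `verify`. The `generated_at` and `task_hash` fields are omitted for brevity.

```python
import hashlib
import json


def commitment_hash(criteria: list[dict]) -> str:
    """Hash the criteria with SHA-256 over a canonical JSON encoding.

    Assumption: sorted keys and compact separators are used so that the same
    criteria always serialize to the same bytes.
    """
    canonical = json.dumps(criteria, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


def seal(criteria: list[dict]) -> dict:
    """Build a sealed envelope that records the hash at generation time."""
    return {
        "sealed_envelope": {
            "sealed_hash": commitment_hash(criteria),
            "criteria_count": len(criteria),
            "criteria": criteria,
        }
    }


def verify(envelope: dict) -> bool:
    """Return True if the criteria still match the recorded commitment hash."""
    inner = envelope["sealed_envelope"]
    return commitment_hash(inner["criteria"]) == inner["sealed_hash"]
```

Calling `verify` immediately after `seal` returns `True`; editing any field of any criterion afterwards makes it return `False`, which is the drift signal checked at validation time.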
After commanders complete and cross-review finishes:
- Verify the commitment hash: confirm the sealed criteria haven't drifted since Phase 1.5
- Run each sealed criterion against each Commander bundle as a binary pass/fail
- Compute Shadow Score: (failures / total) × 100
- Classify using the interpretation scale
- Produce a Gap Report for each bundle
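
The validation steps can be sketched as a single function that runs every criterion and assembles a report. This is a sketch under stated assumptions: in Swarm Command the pass/fail judgment is made by the Nexus, so the `evaluate` callable and its return shape are hypothetical stand-ins, and `validate_bundle` and `level_for` are illustrative names. The output mirrors the field names of the Gap Report format.

```python
from typing import Callable

# Hypothetical evaluator signature: (criterion, bundle_text) -> (passed, actual, message).
# Any callable with this shape can stand in for the Nexus judgment.
Evaluator = Callable[[dict, str], tuple[bool, str, str]]


def level_for(score: float) -> str:
    """Classify a score using the upper bounds of the interpretation scale."""
    bands = [(0, "perfect"), (15, "minor"), (30, "moderate"), (50, "significant")]
    for upper, name in bands:
        if score <= upper:
            return name
    return "critical"


def validate_bundle(criteria: list[dict], bundle: str,
                    evaluate: Evaluator, sealed_hash: str) -> dict:
    """Run each sealed criterion as a binary check and build a Gap Report."""
    if not criteria:
        raise ValueError("at least one sealed criterion is required")
    failures = []
    for criterion in criteria:
        passed, actual, message = evaluate(criterion, bundle)
        if not passed:
            failures.append({
                "test_name": criterion["id"],
                "category": criterion["category"],
                "expected": criterion["expected"],
                "actual": actual,
                "message": message,
            })
    total = len(criteria)
    score = 100 * len(failures) / total
    return {
        "shadow_score_spec_version": "1.0.0",
        "report": {"shadow_score": score, "level": level_for(score),
                   "sealed_hash": sealed_hash},
        "sealed_tests": {"total": total, "passed": total - len(failures),
                         "failed": len(failures)},
        "failures": failures,
    }
```

Keeping the evaluator as a parameter means the scoring arithmetic can be tested with a trivial stub, independent of how criteria are actually judged.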
If Shadow Score > 15% for any bundle:
- Share only failure messages with the affected Commander, never the criteria themselves
- Commander gets 1 fix cycle to address the failures
- Re-validate and re-compute Shadow Score
- Record pre-hardening and post-hardening scores
What commanders receive during hardening:
```
SHADOW HARDENING - Fix these issues:
- [sc-07] Edge case for empty input not addressed
- [sc-09] Error response format missing HTTP status codes
```
What commanders do NOT receive: criteria list, scoring formula, pass/fail breakdown, or any mention of the sealed-envelope protocol.
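
A minimal sketch of the hardening step, assuming a Gap Report shaped like the one produced at validation. The names `hardening_message` and `harden` are hypothetical, and `fix_and_revalidate` is an assumed callable standing in for "send the message to the Commander, then re-validate the revised bundle". Only the criterion id and failure message are copied into the message; the `expected` and `actual` fields are deliberately left out.

```python
from typing import Callable


def hardening_message(gap_report: dict) -> str:
    """Format the failure messages a Commander is allowed to see."""
    lines = ["SHADOW HARDENING - Fix these issues:"]
    for failure in gap_report["failures"]:
        # Only the id and the message are shared, never the criterion text.
        lines.append(f"- [{failure['test_name']}] {failure['message']}")
    return "\n".join(lines)


def harden(gap_report: dict,
           fix_and_revalidate: Callable[[str], dict],
           threshold: float = 15.0,
           max_cycles: int = 1) -> dict:
    """Run fix cycles while the score exceeds the threshold.

    Returns the pre-hardening score, the post-hardening score, and the
    number of cycles actually used.
    """
    pre_score = gap_report["report"]["shadow_score"]
    cycles = 0
    while gap_report["report"]["shadow_score"] > threshold and cycles < max_cycles:
        gap_report = fix_and_revalidate(hardening_message(gap_report))
        cycles += 1
    return {
        "pre_hardening": pre_score,
        "post_hardening": gap_report["report"]["shadow_score"],
        "cycles": cycles,
    }
```

With the default `max_cycles=1`, a bundle that still scores above 15% after its single fix cycle is not retried; its post-hardening score is simply recorded.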
```
Phase 0    Mission Intake
Phase 1    Task Decomposition (5 domains)
              │
Phase 1.5  ─── SEAL GENERATION ──► Sealed criteria locked, hash recorded
              │
Phase 2    Context Capsule Construction
Phase 3    Commander Deployment (commanders never see criteria)
Phase 4    Execution & Monitoring
Phase 5    Cross-Review
              │
Phase 6    ─── VALIDATION ──► Unseal, validate, compute Shadow Score
              │
              ├── HARDENING ──► If score > 15%, share failure messages, one fix cycle
              │
Phase 7    Consensus Synthesis (Shadow Gate uses Shadow Score)
Phase 8    Final Output (Gap Report included)
```
Each bundle receives a Gap Report in the Shadow Score Spec standard format:
```json
{
  "shadow_score_spec_version": "1.0.0",
  "report": {
    "shadow_score": 20.0,
    "level": "moderate",
    "sealed_hash": "sha256:a3f2..."
  },
  "sealed_tests": {
    "total": 10,
    "passed": 8,
    "failed": 2
  },
  "failures": [
    {
      "test_name": "sc-07",
      "category": "edge_case",
      "expected": "Output handles empty input gracefully",
      "actual": "No empty input handling found in IMPL bundle",
      "message": "Edge case for empty input not addressed"
    },
    {
      "test_name": "sc-09",
      "category": "error_handling",
      "expected": "Error responses include HTTP status codes",
      "actual": "Error format uses string messages without status codes",
      "message": "Error response format missing HTTP status codes"
    }
  ]
}
```

Shadow Scores act as a quality gate in the consensus pipeline (Phase 7, Stage 3):
| Shadow Score | Action |
|---|---|
| 0% (Perfect) or 1–15% (Minor) | Bundle proceeds normally through consensus |
| 16–30% (Moderate) | Gap Report attached to bundle, warning in output |
| 31–50% (Significant) | Bundle QUARANTINED: Nexus re-reviews with failure messages |
| > 50% (Critical) | Bundle REJECTED from synthesis |
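
The gate table maps directly onto a small decision function. The sketch below is illustrative only: the `GateAction` enum, `gate_action`, and `apply_gate` are hypothetical names, and treating fractional scores between rows as belonging to the stricter band is an assumption.

```python
from enum import Enum


class GateAction(Enum):
    PROCEED = "proceed"        # 0% or 1-15%: proceeds normally through consensus
    WARN = "warn"              # 16-30%: Gap Report attached, warning in output
    QUARANTINE = "quarantine"  # 31-50%: Nexus re-reviews with failure messages
    REJECT = "reject"          # > 50%: rejected from synthesis


def gate_action(score: float) -> GateAction:
    """Return the Shadow Gate action for one bundle's Shadow Score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be a percentage between 0 and 100")
    if score <= 15:
        return GateAction.PROCEED
    if score <= 30:
        return GateAction.WARN
    if score <= 50:
        return GateAction.QUARANTINE
    return GateAction.REJECT


def apply_gate(scores: dict[str, float]) -> dict[GateAction, list[str]]:
    """Group bundle names by the action the gate assigns to them."""
    grouped: dict[GateAction, list[str]] = {action: [] for action in GateAction}
    for bundle_name, score in scores.items():
        grouped[gate_action(score)].append(bundle_name)
    return grouped
```

Grouping by action makes it easy to pass only the `PROCEED` and `WARN` bundles on to synthesis while routing quarantined bundles back for re-review.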
The Shadow Score Spec is designed for code-producing agents where sealed tests are executable test files. Swarm Command adapts this for general-purpose AI orchestration:
| Spec Concept | Swarm Command Adaptation |
|---|---|
| Sealed code tests | Sealed acceptance criteria (natural-language assertions) |
| Test runner execution | Nexus evaluates criteria against bundle content |
| Test pass/fail | Binary assertion pass/fail |
| Test suite | Criteria set across 4 categories |
| CI environment | Nexus memory (sealed, commitment-hashed) |
The math is identical: (failures / total) × 100. The isolation is identical: agents never see criteria. The hardening is identical: only failure messages are shared. The adaptation is in what gets tested: acceptance criteria instead of code assertions.
Shadow scoring is configured in `config.yml`:

```yaml
shadow_scoring:
  enabled: true
  spec_version: "1.0.0"
  conformance_level: "L2"
  sealed_criteria_count: 10    # max; per-scale: SS-50=6, SS-100=8, SS-250=10
  hardening:
    enabled: true              # SS-50 overrides to disabled
    max_cycles: 1
    threshold: 15
  categories:
    - happy_path
    - edge_case
    - error_handling
    - completeness
```

Set `enabled: false` to disable shadow scoring entirely (e.g., for cost-sensitive SS-50 deployments).
| Scale | Sealed Criteria | Hardening | Notes |
|---|---|---|---|
| SS-50 | 6 (reduced) | Disabled | Shadow Score computed but no fix cycle |
| SS-100 | 8 | 1 cycle if > 15% | Moderate hardening |
| SS-250 | 10 (full) | 1 cycle if > 15% | Full hardening |
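
One way to read the per-scale table together with the configuration is as a function that derives effective settings for a given scale. This is a sketch, not the actual resolution logic: `ShadowSettings`, `SCALE_CRITERIA`, and `settings_for` are hypothetical names, the config is assumed to be already parsed into a dictionary, and capping the per-scale count at `sealed_criteria_count` is an assumption based on that value being described as a maximum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowSettings:
    sealed_criteria_count: int
    hardening_enabled: bool
    max_cycles: int
    threshold: float


# Per-scale criteria counts taken from the table above.
SCALE_CRITERIA = {"SS-50": 6, "SS-100": 8, "SS-250": 10}


def settings_for(scale: str, config: dict) -> ShadowSettings:
    """Derive effective shadow-scoring settings for one deployment scale."""
    if scale not in SCALE_CRITERIA:
        raise ValueError(f"unknown scale: {scale}")
    base = config["shadow_scoring"]
    hardening = base["hardening"]
    # Assumption: the configured count acts as an upper bound on the per-scale count.
    count = min(SCALE_CRITERIA[scale], base["sealed_criteria_count"])
    # SS-50 computes a Shadow Score but never runs a fix cycle.
    enabled = bool(hardening["enabled"]) and scale != "SS-50"
    return ShadowSettings(
        sealed_criteria_count=count,
        hardening_enabled=enabled,
        max_cycles=hardening["max_cycles"] if enabled else 0,
        threshold=float(hardening["threshold"]),
    )
```

Under these assumptions, the configuration shown earlier resolves to 6 criteria with no hardening for SS-50, and to 10 criteria with one fix cycle at a 15% threshold for SS-250.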
The SHA-256 commitment hash in the sealed-envelope protocol serves as a self-discipline mechanism, not a cryptographic security boundary.

What the hash provides:

- Commitment device: The Nexus commits to specific acceptance criteria before commanders execute, preventing unconscious criteria drift during validation.
- Drift detection: If the criteria are accidentally modified between Phase 1.5 (generation) and Phase 6 (validation), the hash mismatch surfaces the error immediately.
- Audit trail: The recorded hash creates a verifiable record that criteria were locked before execution began.

What the hash does NOT provide:

- Cross-domain tamper resistance: The Nexus generates the criteria, holds them in its own context, computes the hash, and validates against it, all within the same LLM session. There is no separate trust domain, so the hash provides zero protection against an adversarial or compromised Nexus.
- Cryptographic security guarantees: In a traditional sealed-envelope protocol, the seal is held by an independent party. Here, the same agent is both sealer and validator. The "sealed envelope" is a useful metaphor for the workflow pattern, not a literal security boundary.

Even without cross-domain guarantees, the commitment hash is valuable. It enforces a strict temporal separation between criteria generation and criteria evaluation. Without it, the Nexus could unconsciously shift its acceptance bar after seeing commander outputs, a subtle form of confirmation bias. The hash guards against this by turning any change to the sealed criteria into an explicit, detectable mismatch.