Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
19 commits
Select commit Hold shift + click to select a range
74bc805
docs(designs): add v2_PLAN.md — gstack v2 the lightest opinionated sk…
garrytan May 26, 2026
e274e5e
test(parity): T0a — capture v1.44.1 baseline + capture helper + diff …
garrytan May 26, 2026
4d76cbb
feat(resolvers): T2 — ResolverEntry + appliesTo gate infrastructure
garrytan May 26, 2026
da5f268
feat(preamble): T3 — jargon dedup + terse-build flag (Phase A.2 + A.3)
garrytan May 26, 2026
0d68ef1
feat(catalog): T4 — catalog trim + proactive-suggestions.json (Phase …
garrytan May 26, 2026
81fdf9c
test(budget): T5 — hard token budgets + override audit trail (Phase A.6)
garrytan May 26, 2026
6d48d23
test(cso): T6 — pin must-preserve security phrases (Phase A.5)
garrytan May 26, 2026
ebebc95
test(parity): T0b — cathedral parity-suite harness + invariant registry
garrytan May 26, 2026
296937d
test(coverage): T1 — skill coverage matrix + structural-compliance floor
garrytan May 26, 2026
a164597
build(skills): T7 — atomic regenerate + capture v1.45.0.0 baseline
garrytan May 26, 2026
d78eeab
v1.45.0.0 — gstack v2 foundation: catalog tokens drop 56%, eval-first…
garrytan May 26, 2026
fcd4965
chore(catalog): refresh proactive-suggestions.json timestamp after v1…
garrytan May 26, 2026
43e18af
fix(catalog): deterministic proactive-suggestions.json (no per-run ti…
garrytan May 26, 2026
2aff29e
fix(catalog): preserve routing prose when first sentence exceeds 200 …
garrytan May 26, 2026
24742f9
Merge origin/main: rebump v1.45→v1.46 (queue collision with design-da…
garrytan May 26, 2026
03a00f0
fix(catalog): deterministic proactive-suggestions.json across machines
garrytan May 26, 2026
c88825d
test(catalog): unit + regression coverage for catalog-trim helpers
garrytan May 26, 2026
8b94e6d
test(coverage): fill three remaining v1.46.0.0 test gaps
garrytan May 26, 2026
b5f75c1
test(coverage): close 5 remaining v1.46.0.0 test gaps (A-E)
garrytan May 26, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
63 changes: 63 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,68 @@
# Changelog

## [1.46.0.0] - 2026-05-26

## **gstack v2 foundation lands. Catalog tokens drop 56%, eval-first floor covers all 51 skills, hard token + dollar caps gate every PR.**

The always-loaded skill catalog — what every Claude Code session pays for at startup before any real work begins — went from ~9,319 tokens to ~4,045 tokens. That's a 56.6% cut to the surface gstack has been criticized for (third-party review, May 2026: "10K+ tokens before any real code is written"). Heavyweight skills like `/ship`, `/plan-ceo-review`, `/office-hours` still ship their full content, but their frontmatter descriptions trim to one sentence each; the routing prose lives in a new "## When to invoke" body section, and a per-run `scripts/proactive-suggestions.json` registry holds the voice-trigger + proactive-suggest text so agents can pull guidance on demand instead of always-loaded.

This is the v2 foundation release. The architectural break — `sections/*.md.tmpl` pattern, mechanical Read enforcement, eval-coverage annotations — lands in v2.0.0.0 as a coordinated launch. v1.46 absorbs every low-risk win, ships the eval-first floor every future skill must pass, and locks in the v1.44.1 reference baseline so reviewers can audit v1→v2 numbers against a real file (`test/fixtures/parity-baseline-v1.44.1.json`).

### The numbers that matter

Source: `bun run scripts/capture-baseline.ts --tag v1.46.0.0` vs the locked v1.44.1 baseline at `test/fixtures/parity-baseline-v1.44.1.json`. Reproduce locally with `bun test test/skill-size-budget.test.ts`.

| Metric | v1.44.1 | v1.46.0.0 | Δ |
|---|---|---|---|
| Catalog tokens (always-loaded system prompt) | ~9,319 | ~4,045 | **−56.6%** |
| Total SKILL.md corpus | 2,847 KB | 2,813 KB | −1.2% |
| ship.md | 160 KB | 159 KB | −0.5% |
| plan-ceo-review.md | 128 KB | 127 KB | −0.7% |
| office-hours.md | 108 KB | 108 KB | −0.8% |
| Skills with gate-tier eval coverage | 32 of 51 | **51 of 51** | floor achieved |
| Cathedral parity invariants pinned | 0 | **10** | structural + content |
| Token & dollar budget regressions caught at CI | (none) | **5 new test files** | per-skill, corpus, catalog, eval-cost gate, eval-cost periodic |

The corpus barely moved because the catalog trim MOVES routing prose from frontmatter to a body section — it doesn't delete it. The always-loaded surface drops by more than half because catalog text is what Claude Code reads on every session start; body content only loads when the skill is invoked.

### What this means for you

If you use any gstack skill, every session starts ~5,000 tokens lighter before you type anything. Heavyweight invocations like `/ship` cost about the same as before, but session startup feels snappier. If you've been on the fence about installing gstack because of the "fat" reputation, this is the release that addresses it directly: the always-loaded surface is now competitive with stripped-down skill packs while every skill keeps its full body content.

If you contribute skills, the eval-first floor means a new SKILL.md without an entry in `test/skill-coverage-matrix.ts` fails CI. The minimum entry is one line referencing `test/skill-coverage-floor.test.ts` (the free structural-compliance smoke test). Behavioral E2E coverage gets layered on top per skill.

If you run gstack in CI, the new `EVALS_BUDGET_HARD_CAP=$30` cap (per-suite: gate $25 / periodic $70) stops runaway eval costs from a model price change or infinite-retry bug. Override path exists for legit-need-more cases: `EVALS_BUDGET_OVERRIDE_REASON="why this is OK"` logs to `~/.gstack/analytics/spend-overrides.jsonl` for audit.

### Itemized changes

**Added**
- `scripts/capture-baseline.ts` + `test/helpers/capture-parity-baseline.ts` — captures per-skill SKILL.md sizes, token estimates, frontmatter description lengths, and eval coverage flags. Writes JSON snapshots used by the parity and size-budget gates. Locks `test/fixtures/parity-baseline-v1.44.1.json` as the v1→v2 reference.
- `test/helpers/parity-harness.ts` + `test/parity-suite.test.ts` — cathedral parity-eval suite floor. `PARITY_INVARIANTS` registry pins must-preserve phrases per skill family (cso: OWASP/STRIDE; plan-ceo: SCOPE EXPANSION / HOLD SCOPE; ship: VERSION/CHANGELOG/PR) so future compression can't silently strip load-bearing prose.
- `test/skill-coverage-matrix.ts` + `test/skill-coverage-matrix.test.ts` — single source of truth mapping each skill to gate + periodic tests; CI gate asserts every skill has at least one gate-tier entry. 51 skills, 51 entries.
- `test/skill-coverage-floor.test.ts` — per-skill structural-compliance smoke test (file-IO, free). Verifies frontmatter shape, generated header, body non-trivial, no leaked `{{TEMPLATE}}` placeholders, catalog-trim contract on description. 309 assertions across 51 skills.
- `test/skill-size-budget.test.ts` — per-skill SKILL.md byte budget (×1.05 default ratio), total corpus budget, catalog token budget (≤7000 for v1.46). Caught regressions get a per-skill breakdown + override path.
- `test/cso-preserved.test.ts` — pins cso's must-not-strip security guidance phrases (OWASP, STRIDE, daily/comprehensive mode discipline, confidence scoring, active verification). Future compression that hits cso fails CI here.
- `test/helpers/budget-override.ts` — audit-trail logger for `GSTACK_SIZE_BUDGET_OVERRIDE_REASON` and `EVALS_BUDGET_OVERRIDE_REASON`. Append-only JSONL at `~/.gstack/analytics/spend-overrides.jsonl` with timestamp + scope + reason + CI provenance.
- `scripts/proactive-suggestions.json` — per-run registry of routing prose + voice triggers extracted from skill frontmatter during catalog trim. Agents pull on demand instead of paying for it always-loaded.
- `--catalog-mode=full` build flag — restores v1.44 legacy multi-line catalog descriptions. Use when debugging routing regressions or when shipping skills to hosts that depend on the legacy fat catalog.
- `--explain-level=terse` build flag — opt-in compression of `## Writing Style` + `## Completeness Principle` + `## Confusion Protocol` + `## Context Health` preamble sections. Default build keeps the runtime-conditional behavior intact (the model still skips when `EXPLAIN_LEVEL: terse` appears in the preamble echo); terse build makes the compression structural.
- `EVALS_BUDGET_HARD_CAP` environment variable (umbrella $30 default) + per-suite `EVALS_BUDGET_HARD_CAP_GATE=$25`, `EVALS_BUDGET_HARD_CAP_PERIODIC=$70`. Build fails if a single run exceeds; `EVALS_BUDGET_OVERRIDE_REASON` env unblocks + audit-logs.

**Changed**
- Skill frontmatter `description:` blocks across 51 skills trimmed to a single lead sentence + `(gstack)` tag. Routing prose ("Use when asked to...", "Proactively suggest...") and voice triggers moved to a `## When to invoke` body section in each SKILL.md. Always-loaded catalog cost drops ~56%.
- Jargon list (`scripts/jargon-list.json`, 80 terms) no longer inlined into every tier-2+ skill. `## Writing Style` now references the JSON path; agents Read it once per session on first jargon term encountered. Saves ~70 KB of duplicated text across the corpus.
- `ResolverEntry` union type in `scripts/resolvers/types.ts` + `unwrapResolver` helper. Resolvers can now be either bare functions (current behavior) or `{ resolve, appliesTo? }` gated entries. `scripts/gen-skill-docs.ts:444` checks the gate before invocation. Infrastructure for future per-skill resolver gating; all current resolvers stay bare functions and work unchanged.
- `TemplateContext` gains an optional `explainLevel: 'default' | 'terse'` field threaded from the `--explain-level` build flag.

**Fixed**
- Catalog descriptions no longer collide with adjacent YAML fields (initial implementation produced `description: ... (gstack)allowed-tools:` with no newline; fixed by appending `\n` to the replacement).

**For contributors**
- New skills require an entry in `test/skill-coverage-matrix.ts` — at minimum referencing `test/skill-coverage-floor.test.ts` in `gate[]`. The CI gate at `test/skill-coverage-matrix.test.ts` fails fast on missing entries.
- New must-preserve invariants for a skill family go in `PARITY_INVARIANTS` in `test/helpers/parity-harness.ts`. Adding invariants is additive; removing one is a deliberate scope decision.
- The `scripts/jargon-list.json` is the canonical glossary. Add terms there; gen-skill-docs picks them up automatically on next regen.
- `test/fixtures/parity-baseline-v1.44.1.json` is the locked v1→v2 reference. Do not modify; capture new snapshots at later tags via `bun run scripts/capture-baseline.ts --tag <name>`.

## [1.45.0.0] - 2026-05-25

## **Design boards now live 24 hours, not 10 minutes. One daemon hosts every board, one tab survives the whole day.**
Expand Down
14 changes: 9 additions & 5 deletions SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,7 @@
name: gstack
preamble-tier: 1
version: 1.1.0
description: |
Fast headless browser for QA testing and site dogfooding. Navigate pages, interact with
elements, verify state, diff before/after, take annotated screenshots, test responsive
layouts, forms, uploads, dialogs, and capture bug evidence. Use when asked to open or
test a site, verify a deployment, dogfood a user flow, or file a bug with screenshots. (gstack)
description: Fast headless browser for QA testing and site dogfooding. (gstack)
allowed-tools:
- Bash
- Read
Expand All @@ -21,6 +17,14 @@ triggers:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->


## When to invoke this skill

Navigate pages, interact with
elements, verify state, diff before/after, take annotated screenshots, test responsive
layouts, forms, uploads, dialogs, and capture bug evidence. Use when asked to open or
test a site, verify a deployment, dogfood a user flow, or file a bug with screenshots.

## Preamble (run first)

```bash
Expand Down
2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
1.45.0.0
1.46.0.0
103 changes: 15 additions & 88 deletions autoplan/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,16 +2,7 @@
name: autoplan
preamble-tier: 3
version: 1.0.0
description: |
Auto-review pipeline — reads the full CEO, design, eng, and DX review skills from disk
and runs them sequentially with auto-decisions using 6 decision principles. Surfaces
taste decisions (close approaches, borderline scope, codex disagreements) at a final
approval gate. One command, fully reviewed plan out.
Use when asked to "auto review", "autoplan", "run all reviews", "review this plan
automatically", or "make the decisions for me".
Proactively suggest when the user has a plan file and wants to run the full review
gauntlet without answering 15-30 intermediate questions. (gstack)
Voice triggers (speech-to-text aliases): "auto plan", "automatic review".
description: Auto-review pipeline — reads the full CEO, design, eng, and DX review skills from disk and runs them sequentially with auto-decisions using 6 decision principles. (gstack)
benefits-from: [office-hours]
triggers:
- run all reviews
Expand All @@ -30,6 +21,19 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->


## When to invoke this skill

Surfaces
taste decisions (close approaches, borderline scope, codex disagreements) at a final
approval gate. One command, fully reviewed plan out.
Use when asked to "auto review", "autoplan", "run all reviews", "review this plan
automatically", or "make the decisions for me".
Proactively suggest when the user has a plan file and wants to run the full review
gauntlet without answering 15-30 intermediate questions.

Voice triggers (speech-to-text aliases): "auto plan", "automatic review".

## Preamble (run first)

```bash
Expand Down Expand Up @@ -556,84 +560,7 @@ Applies to AskUserQuestion, user replies, and findings. AskUserQuestion Format i
- User-turn override wins: if the current message asks for terse / no explanations / just the answer, skip this section.
- Terse mode (EXPLAIN_LEVEL: terse): no glosses, no outcome-framing layer, shorter responses.

Jargon list, gloss on first use if the term appears:
- idempotent
- idempotency
- race condition
- deadlock
- cyclomatic complexity
- N+1
- N+1 query
- backpressure
- memoization
- eventual consistency
- CAP theorem
- CORS
- CSRF
- XSS
- SQL injection
- prompt injection
- DDoS
- rate limit
- throttle
- circuit breaker
- load balancer
- reverse proxy
- SSR
- CSR
- hydration
- tree-shaking
- bundle splitting
- code splitting
- hot reload
- tombstone
- soft delete
- cascade delete
- foreign key
- composite index
- covering index
- OLTP
- OLAP
- sharding
- replication lag
- quorum
- two-phase commit
- saga
- outbox pattern
- inbox pattern
- optimistic locking
- pessimistic locking
- thundering herd
- cache stampede
- bloom filter
- consistent hashing
- virtual DOM
- reconciliation
- closure
- hoisting
- tail call
- GIL
- zero-copy
- mmap
- cold start
- warm start
- green-blue deploy
- canary deploy
- feature flag
- kill switch
- dead letter queue
- fan-out
- fan-in
- debounce
- throttle (UI)
- hydration mismatch
- memory leak
- GC pause
- heap fragmentation
- stack overflow
- null pointer
- dangling pointer
- buffer overflow
Curated jargon list lives at `~/.claude/skills/gstack/scripts/jargon-list.json` (80+ terms). On the first jargon term you encounter this session, Read that file once; treat the `terms` array as the canonical list. The list is repo-owned and may grow between releases.


## Completeness Principle — Boil the Lake
Expand Down
21 changes: 13 additions & 8 deletions benchmark-models/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,14 +2,7 @@
name: benchmark-models
preamble-tier: 1
version: 1.0.0
description: |
Cross-model benchmark for gstack skills. Runs the same prompt through Claude,
GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost,
and optionally quality via LLM judge. Answers "which model is actually best
for this skill?" with data instead of vibes. Separate from /benchmark, which
measures web page performance. Use when: "benchmark models", "compare models",
"which model is best for X", "cross-model comparison", "model shootout". (gstack)
Voice triggers (speech-to-text aliases): "compare models", "model shootout", "which model is best".
description: Cross-model benchmark for gstack skills. (gstack)
triggers:
- cross model benchmark
- compare claude gpt gemini
Expand All @@ -23,6 +16,18 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->


## When to invoke this skill

Runs the same prompt through Claude,
GPT (via Codex CLI), and Gemini side-by-side — compares latency, tokens, cost,
and optionally quality via LLM judge. Answers "which model is actually best
for this skill?" with data instead of vibes. Separate from /benchmark, which
measures web page performance. Use when: "benchmark models", "compare models",
"which model is best for X", "cross-model comparison", "model shootout".

Voice triggers (speech-to-text aliases): "compare models", "model shootout", "which model is best".

## Preamble (run first)

```bash
Expand Down
19 changes: 12 additions & 7 deletions benchmark/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,7 @@
name: benchmark
preamble-tier: 1
version: 1.0.0
description: |
Performance regression detection using the browse daemon. Establishes
baselines for page load times, Core Web Vitals, and resource sizes.
Compares before/after on every PR. Tracks performance trends over time.
Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals",
"bundle size", "load time". (gstack)
Voice triggers (speech-to-text aliases): "speed test", "check performance".
description: Performance regression detection using the browse daemon. (gstack)
triggers:
- performance benchmark
- check page speed
Expand All @@ -23,6 +17,17 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->


## When to invoke this skill

Establishes
baselines for page load times, Core Web Vitals, and resource sizes.
Compares before/after on every PR. Tracks performance trends over time.
Use when: "performance", "benchmark", "page speed", "lighthouse", "web vitals",
"bundle size", "load time".

Voice triggers (speech-to-text aliases): "speed test", "check performance".

## Preamble (run first)

```bash
Expand Down
18 changes: 11 additions & 7 deletions browse/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,7 @@
name: browse
preamble-tier: 1
version: 1.1.0
description: |
Fast headless browser for QA testing and site dogfooding. Navigate any URL, interact with
elements, verify page state, diff before/after actions, take annotated screenshots, check
responsive layouts, test forms and uploads, handle dialogs, and assert element states.
~100ms per command. Use when you need to test a feature, verify a deployment, dogfood a
user flow, or file a bug with evidence. Use when asked to "open in browser", "test the
site", "take a screenshot", or "dogfood this". (gstack)
description: Fast headless browser for QA testing and site dogfooding. (gstack)
triggers:
- browse a page
- headless browser
Expand All @@ -22,6 +16,16 @@ allowed-tools:
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->


## When to invoke this skill

Navigate any URL, interact with
elements, verify page state, diff before/after actions, take annotated screenshots, check
responsive layouts, test forms and uploads, handle dialogs, and assert element states.
~100ms per command. Use when you need to test a feature, verify a deployment, dogfood a
user flow, or file a bug with evidence. Use when asked to "open in browser", "test the
site", "take a screenshot", or "dogfood this".

## Preamble (run first)

```bash
Expand Down
Loading
Loading