Author: AI PM
Date: 2026-04-24
Audience: pmstack maintainer (you), early adopters, contributors
pmstack v0.1 ships 5 skills as static markdown. Gstack ships 23+ skills wired into a workflow, with telemetry, an installer, multi-host support, and self-update. The gap isn't more skills — it's the system around the skills. Three moves close most of the gap:
- Wire skills into a workflow, not a list. PM work is a pipeline (Discover → Define → Decide → Deliver → Measure). Each skill should hand off to the next.
- Ship infrastructure: installer (done), self-update, opt-in telemetry, taste memory, a real eval suite that scores pmstack on its own outputs (the canonical dogfooding).
- Add the high-leverage skills that PMs actually re-do every week: launch checklist, weekly review, voice-of-customer synthesis, pricing decision, exec readout, decision log.
This memo is the plan to get there. It's prioritized, scoped, and mapped to a release.
Researched 2026-04-24 from github.com/garrytan/gstack.
| Pattern | gstack | pmstack v0.1 | Gap |
|---|---|---|---|
| Install script | `./setup` | None — README said `cp -r` (now fixed in this release) | Closed in v0.2 |
| Slash commands wired | 23+ commands in `.claude/commands/` | Skills referenced but commands never installed | Closed in v0.2 |
| Workflow ordering | Skills are sprint-phase ordered | Skills are alphabetical | Open |
| Output handoff | `/plan-ceo-review` feeds `/plan-eng-review` feeds `/ship` | Each skill produces a standalone artifact | Open |
| Self-update | `/gstack-upgrade` | None | Open |
| Telemetry | Opt-in JSONL, local-only | None | Open |
| Multi-host | Claude Code, Codex, Cursor, Factory, Slate, Kiro, Hermes | Claude Code only | Open (defer) |
| Memory / taste | `gstack-taste-update` decays 5%/week | None | Open |
| MCP integration | gbrain registered as MCP server | None | Open |
| Sub-agent orchestration | `/pair-agent`, `/codex` cross-model second opinion | None | Open |
| Personality | "CEO mode", "voice-friendly triggers" | Generic PM voice in CLAUDE.md | Partially closed |
| Self-eval | Manual | None | Open (and embarrassing — we own the eval skill) |
| Safety | `/guard`, secret scanner, prompt-injection defense | None | Open |
| Process enforcement | "Confusion Protocol", auto-skill suggestion | None | Open |
The single most embarrassing gap: pmstack's agent-eval-design skill exists, and we have no eval suite for pmstack itself. Fixing this is the v0.3 anchor.
- `.claude/commands/*.md` for all 5 skills so `/eval`, `/prd`, etc. fire natively (a command-file sketch follows this list)
- `setup` script with project-scope and `--global` modes
- README correctly tells users how to install
- First eval artifact in `outputs/` (Claude Ultraplan) — proves the loop closes
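For concreteness, a minimal sketch of one command file. Claude Code treats markdown files in `.claude/commands/` as slash commands with the argument available as `$ARGUMENTS`; the frontmatter and body below are illustrative wiring, not final copy, and `eval-<slug>.yaml` is a hypothetical naming scheme.

```markdown
---
description: Design an eval spec for a feature or model capability
---

Use the agent-eval-design skill to produce an eval spec for: $ARGUMENTS

Write the completed spec to outputs/eval-<slug>.yaml and summarize the
pass/fail criteria in chat.
```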
Success metric: A PM clones the repo, runs `./setup`, types `/eval`, and gets a populated YAML in `outputs/`. Time-to-first-output < 2 minutes.
The skill we sell is eval design. We should have the most-evaluated PM toolkit on GitHub.
- Self-eval suite: `evals/pmstack-self.yaml` — for each skill, define ≥10 input scenarios plus a rubric. Rubric scored by Claude Sonnet (a different model from the one generating the artifacts, to blunt self-grading bias). Results published in `evals/results/`. A schema sketch follows this list.
- Golden output set: `examples/golden/` — one perfect output per skill, used as the anchor for "does the skill still produce work this good?"
- Regression alerting: any skill that regresses on the golden set gates a release.
- `/eval-self` command: re-runs the suite on demand.
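The schema is ours to define; a minimal sketch of what `evals/pmstack-self.yaml` could look like, with invented field names and a single scenario shown (the real file needs ≥10 per skill):

```yaml
# evals/pmstack-self.yaml (illustrative schema; field names not final)
grader: claude-sonnet            # scores each output against the rubric
skills:
  eval:
    golden: examples/golden/eval.yaml
    rubric:
      - criterion: "Pass/fail criteria are measurable, not vibes"
        weight: 3
      - criterion: "Includes at least one adversarial scenario"
        weight: 2
    scenarios:
      - id: eval-001
        input: "Design an eval for an AI meeting-notes summarizer"
        must_mention: [pass_threshold, test_cases, failure_modes]
```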
Success metric: Every skill has a public score. Score visible in README. Trend tracked over time.
PM work is a pipeline. Skills should know about each other.
- Skill graph (`skills/_graph.yaml`): `prd` outputs feed `metrics`; `metrics` feeds `eval`; `competitive` feeds `prd`. Each skill reads the prior artifact from `outputs/` if present. (A sketch of the graph file follows this list.)
- `/sprint` orchestrator: `prd → metrics → eval → brief` in one command, with checkpoint approval at each step.
- Decision log auto-update: every skill appends a one-line entry to `decisions-log.md` with the artifact link, so future skills have memory.
- CLAUDE.md "context engineering" pass: explicit references to the workflow, not just communication style.
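The graph format is also ours to define. A minimal sketch, assuming each node declares what it reads and what it feeds (filenames and keys are invented):

```yaml
# skills/_graph.yaml (illustrative; edges mirror the handoffs above)
skills:
  competitive:
    feeds: [prd]
  prd:
    reads: [outputs/competitive-*.md]   # prior artifact, if present
    feeds: [metrics]
  metrics:
    reads: [outputs/prd-*.md]
    feeds: [eval]
  eval:
    reads: [outputs/metrics-*.md]
    feeds: [brief]
  brief:
    reads: [outputs/eval-*.yaml]
```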
Success metric: A new feature can go from "customer quote" → "shipped eval criteria" in < 30 minutes with one orchestrator call.
The skills a Staff PM runs every week that pmstack v0.1 doesn't have.
- `/launch` — launch checklist generator: pre-launch, day-of, and post-launch readouts. Includes legal, ops, support handoff, and exec comms.
- `/voc` — voice-of-customer synthesis: paste in 20–50 quotes / tickets, get themes, frequency, severity, and 3 PRD-ready problem statements (output sketch after this list).
- `/pricing` — pricing decision support: framework-driven (Van Westendorp, anchor effects, willingness-to-pay), forces explicit assumptions.
- `/standup` orchestrator: pulls last week's commits / Linear / Slack into a 5-bullet weekly readout.
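To pin down the `/voc` contract before building it, one possible output shape; every field name and value here is hypothetical, a v0.5 design decision:

```yaml
# outputs/voc-2026-04-24.yaml (hypothetical output shape for /voc)
themes:
  - name: "Export breaks on large workspaces"
    frequency: 14                # quotes touching this theme, of 42
    severity: high               # churn-risk language present
problem_statements:
  - "Teams over 50 seats cannot export reliably, which blocks QBR prep."
```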
Success metric: Adoption — at least one external PM team using the weekly loop in production.
- `/pmstack-upgrade`: `git pull` + show changelog + warn on local edits.
- Opt-in local telemetry: `~/.pmstack/usage.jsonl`, captures `{skill, duration_ms, success_bool, version}`. Never artifact content. Local-only by default; user can opt into upload. (Sample line after this list.)
- `/pmstack-stats`: shows the user their own most-used skills, average duration, eval pass rates over time.
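One line per skill invocation, using exactly the four fields above (values illustrative):

```jsonl
{"skill": "eval", "duration_ms": 48210, "success_bool": true, "version": "0.6.0"}
```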
Success metric: Maintainer (you) can see usage patterns without violating privacy. Users see their own PM productivity dashboard.
- Cross-model second opinion (`/critique` — gstack's `/codex` analog): runs the artifact through GPT/Gemini for adversarial review. Returns a YAML diff: "what would another model push back on?"
- Confusion Protocol: when a skill hits ambiguity, it stops and asks instead of guessing. Encoded as a meta-skill that all 5 skills reference.
- Personality dial: `~/.pmstack/voice.yaml` lets the user choose tone (blunt-Bezos, warm-Doerr, technical-Hsu). Default = blunt. (Sketch after this list.)
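A minimal sketch of `~/.pmstack/voice.yaml`, assuming a single tone key plus one knob; the exact key set is unsettled:

```yaml
# ~/.pmstack/voice.yaml (illustrative; keys are not final)
tone: blunt-bezos        # or: warm-doerr, technical-hsu
pushback: high           # how hard skills challenge weak inputs
```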
- Domain packs: `pmstack-fintech`, `pmstack-devtools`, `pmstack-healthcare`, `pmstack-ai-platform`. Each adds 3–5 domain-specific skills and tunes the existing 5.
- MCP integrations: connect to Linear, Notion, Mixpanel, Amplitude, Slack — so the analysis runs on real data, not the LLM's general knowledge. Per-tool trust policies, read-only by default (policy sketch after this list).
- Multi-host: Cursor and Codex support via the same `setup --host <name>` pattern gstack uses.
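Nothing like the trust-policy file exists yet; a hypothetical `~/.pmstack/mcp-policy.yaml` to show the intent (all tool names and keys invented for illustration):

```yaml
# ~/.pmstack/mcp-policy.yaml (hypothetical per-tool trust policy)
defaults:
  access: read-only
tools:
  linear:
    access: read-only
    scopes: [issues, projects]
  slack:
    access: read-only
    redact: [customer_names]     # strip before the model sees the data
  notion:
    access: read-write           # requires explicit opt-in
```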
Worth saying explicitly so we don't mindlessly clone.
- 23 skills out of the gate. Five well-scoped skills with great evals beat twenty generic ones. Add slowly.
- Sprint metaphor as the only workflow. PM work isn't always sprint-shaped. Roadmap planning, deep-research weeks, and exec prep all break the model.
- Heavy CLI tooling in Bash + Go. pmstack should stay markdown-first. A `setup` script is the right amount of code; a multi-binary toolchain is not.
- Telemetry uploaded by default. Even opt-in cloud telemetry is a hard sell in PM teams (artifacts contain customer quotes, internal strategy, sometimes legal-flagged terms). Local-only with explicit upload.
- Distribution. gstack got 72K stars partly via Garry's reach. What's the launch strategy? (Hacker News post, "PM newsletter" guest essay, conference workshop, all three?)
- Naming. Is "pmstack" too close to gstack? Adjacent enough to ride the SEO, distinct enough to stand alone? I'd keep it.
- License. Inherit MIT from the inspiration? (Probably yes.)
- Contribution model. Skills are inherently opinionated. Do we accept skill PRs, or only "fork your own pack"? Recommend: PRs only for the core 5; encourage forks for everything else.
- Does anyone outside Ryan use pmstack? If not, the v0.5 weekly-loop skills should come after user research with 5 PMs. Building skills no one runs is the worst failure mode in USER.md.
- Does `/eval Claude Ultraplan` produce something a PM would actually take to a launch review? That's the live test of the existing skills' quality. Ship the v0.2 release, run it, gut-check the output.
- Is the workflow / handoff feature (v0.4) more valuable than just adding skills (v0.5)? I'd argue yes — but it depends on whether users already love the existing 5. If they do, v0.5 first.
Ship v0.2 today. It removes the "the README lies" problem.
Then run the v0.3 self-eval next week — it's the most pmstack-coherent thing we can do, and the results will tell us whether the existing skills need tuning before adding more.
Defer v0.4–v1.0 until v0.3 surfaces real evidence about quality and use.