Honest Value Assessment — Gravitee Dev Workflow Plugin

Scored evaluation for a skeptical senior engineer. Design context lives in README.md — this file records scores, honest caveats, and what changed.

TL;DR

Dimension	Score	One-sentence verdict
Session Continuity	8.5/10	Wave-scoped step lookup fixed; stale-step injection on resume eliminated
Planning / Agents	8.5/10	Live module layout + transitive deps at plan time; design validator now visible
TDD Enforcement	7.5/10	Structured "Unexpected GREEN" investigation; still AI-enforced only
Quality Gate	7.5/10	RxJava3 reactive silent failures added; 80% confidence threshold is a prompt, not an algorithm
Analytics	7/10	`filesChanged` per-session (Bug 1 fixed); `reviewRounds` is rework cycles only (Bug 2 fixed); N vs N-1 display
Portability	9/10	git+bash+python3 only; graceful degradation at every optional dependency
Testability	8/10	195 tests covering failure modes; no E2E; malformed state from Claude's Edit untested
Composite	~8.3/10	Genuine workflow value; process-over-automation tradeoffs are explicit and deliberate

Dimension 1 — Session Continuity

Score: 8.5/10

What works: session-loader.sh fires on every SessionStart, reads task-state.md via grep, and injects branch + wave + next step as the first context block before any user input. Four workflow routes: continue, new_task, awaiting_review, free_chat. implement-task Phase 3 skips the confirmation dialog when the injected step matches the first - [ ] line — zero-friction auto-resume.

Fixed (Bug 3): NEXT step extraction now scoped to the current wave section via sed -n "/^### Wave N /,/^### Wave N+1 /p" (session-loader.sh:36). Two-stage fallback handles the last wave. Previously, grep "^- \[ \]" searched the entire file — stale steps from prior waves injected as "next step."

Remaining limitations:

Interface signatures agreed on in planning are in plan-document.md, never auto-injected (zero token overhead by design). Sessions requiring interface alignment require an explicit "show full plan" request.
Silent exit on hook timeout or cwd parse failure (session-loader.sh:11,13) — no error, no context injected. Developer sees a normal fresh session with no warning.

Dimension 2 — Planning: Wave Decomposition + Parallel Agents

Score: 8.5/10

What works: Offline-first ticket resolution (7-day cache → live Jira → stale cache → manual paste). Parallel Opus+Sonnet agent spawn. 5-signal compatibility check with Phase 5b reconciliation if ≥2 signals. Requirements-only constraint on test-writer (doesn't read implementation). Single-class test command scoping enforced. Design validator (Phase 6b) now emits a visible attribution line: "[Design Validator] Scanned Wave plan — N concern(s) found."

Improved (Impr. A): Phase 2 now adds live module inventory (step 7: Glob groups files by module, flags inactive packages) and transitive dependency check (step 8: Grep finds classes that import the primary interface). Injected into task-planner as "Live Module Layout (scanned at plan time)" — overrides static prompt table where they conflict.

Remaining limitations:

Phase 2 orientation depth is still keyword-based (Glob("**/*<keyword>*.java")). Cross-module dependencies not matching the keyword pattern may be missed.
Phase 5b re-spawns both Opus+Sonnet agents — meaningful token cost for plans hitting ≥2 signals. No lightweight pre-check.

Dimension 3 — TDD Enforcement

Score: 7.5/10

What works: Explicit A→B→C→D step ordering. "Tests run: 0" named as a failure mode (not silent success). Scope enforcement via wave Files: list — no cross-wave contamination.

Improved (Impr. B): Unexpected GREEN now has two concrete investigation steps (implement-task/SKILL.md Phase 4 Step B): (1) git log --all --oneline -10 -- <test-file> to detect pre-existing/reverted implementations, (2) explicit assertion quality review to catch trivially-passing tests (assertNotNull, assertTrue(true)).

Remaining limitations:

Entirely AI-enforced. No shell script verifies test was written before production code. Process is as strong as session discipline.
GREEN means tests pass, not tests are meaningful. No mutation testing, no coverage threshold.

Dimension 4 — Quality Gate (Pre-Commit)

Score: 7.5/10

What works: Three parallel agents after all wave steps are GREEN. code-reviewer (naming, complexity, null safety, business-logic placement; 80% confidence filter). type-design-analyzer conditional on Foundation waves. Fix-or-acknowledge loop; gate re-fires after fixes. Skipped (and logged) if pr-review-toolkit absent.

Improved (Impr. C):

silent-failure-hunter now includes RxJava3 reactive chain patterns: subscribe() with no onError handler, flatMap/concatMap with no onErrorResumeNext, Single.fromCallable/Maybe.fromCallable uncovered. These become silent gateway hangs with no stack trace — a catch-block scanner misses all of them.
code-reviewer prompt enriched with APIM-specific reasoning per rule (why method size matters in RxJava3 chains, why null on reactive boundaries is fatal, why thin controllers matter).

Remaining limitations:

"Confidence ≥ 80%" is a prompt instruction, not a calibrated algorithm. Nondeterminism is reduced, not eliminated.
Gate silently skipped with only a Session Log entry if pr-review-toolkit not installed. No terminal warning.

Dimension 5 — Analytics

Score: 7/10

Field-by-field:

Field	Computation	Known imprecision
`sessionId`, `branch`, `ticket`, `timestamp`	Direct from hook input / git	Reliable
`toolUseCount`	`grep -c '"type":"tool_use"'` in main transcript	Sub-agent tool calls may not appear — undercounts complex sessions
`duration`	`NOW_EPOCH - START_EPOCH` from start-time marker	0 if marker missing
`commitsCreated`	`git log --since="$START_TIME"`	Counts fetched/merged commits too — overstates authoring in active repos
`filesChanged`	`git diff ${START_COMMIT}^..HEAD` (fixed, Bug 1)	Fallback to merge-base if no session commits — planning sessions show cumulative
`reviewRounds`	CHANGES_REQUESTED events only (fixed, Bug 2)	0 = clean review, 1+ = rework cycles
`workflowType`, `wavesCompleted`, `totalWaves`, `currentWave`, `prsMerged`	grep / gh CLI	Accurate per format contract; gh degrades to 0 if absent

Improved (Impr. D): session-reviewer.sh now displays the N-1 session before the Last line for direct magnitude comparison. Developers see "tools went from 12 to 35" not just "↑".

Remaining limitations:

branchCompliance conflates task-mix with discipline (10 sessions on one ticket = 100%, same as 1 task + 10 exploratory).
Wave 1 exempt from stall detection tip — not communicated in output. A developer on Wave 1 for 5 sessions sees nothing.

Dimension 6 — Portability

Score: 9/10

bash, git, python3. No jq, no Docker, no npm. Every optional dependency (gh) guarded with explicit fallback. macOS/Linux date compatibility. Plugin auto-discovery via skills/ directory. install.sh fallback for non-plugin environments (hooks only — skills require claude plugin install .).

Remaining limitations: date fallback branch (date -d) not covered by test-hooks.sh. PLUGIN_ROOT fallback assumes hooks/ is one level below root.

Dimension 7 — Testability

Score: 8/10

195 tests across 8 sections in plain bash (no bats, no shunit2). Failure mode coverage: missing markers, no gh CLI, malformed JSON input, timeout fallback, dirty tree stash/restore. ShellCheck on every push.

Remaining limitations: no test for malformed task-state.md written by Claude's Edit tool. No E2E tests (removed in 9e9bf7b — requires a live Claude session).

Composite

Dimension	Score	Primary caveat
Session Continuity	8.5/10	Silent exit on hook timeout
Planning / Agents	8.5/10	Keyword-based orientation; Phase 5b is expensive
TDD Enforcement	7.5/10	AI-enforced; GREEN ≠ meaningful tests
Quality Gate	7.5/10	80% confidence is a prompt, not an algorithm
Analytics	7/10	toolUseCount undercounts; branchCompliance conflates task-mix
Portability	9/10	date fallback untested; install.sh hooks-only
Testability	8/10	No E2E; malformed Edit state untested
Composite	~8.3/10

The remaining gap to 10 is structural: AI-enforced TDD (no shell can force "write test first"), nondeterministic LLM quality gates, and analytics blind spots that require structured logging infrastructure. All are deliberate tradeoffs for zero-infrastructure reach.

Grounded in codebase at commit 283b2f8 (2026-02-28).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Honest Value Assessment — Gravitee Dev Workflow Plugin

TL;DR

Dimension 1 — Session Continuity

Score: 8.5/10

Dimension 2 — Planning: Wave Decomposition + Parallel Agents

Score: 8.5/10

Dimension 3 — TDD Enforcement

Score: 7.5/10

Dimension 4 — Quality Gate (Pre-Commit)

Score: 7.5/10

Dimension 5 — Analytics

Score: 7/10

Dimension 6 — Portability

Score: 9/10

Dimension 7 — Testability

Score: 8/10

Composite

FilesExpand file tree

ASSESSMENT.md

Latest commit

History

ASSESSMENT.md

File metadata and controls

Honest Value Assessment — Gravitee Dev Workflow Plugin

TL;DR

Dimension 1 — Session Continuity

Score: 8.5/10

Dimension 2 — Planning: Wave Decomposition + Parallel Agents

Score: 8.5/10

Dimension 3 — TDD Enforcement

Score: 7.5/10

Dimension 4 — Quality Gate (Pre-Commit)

Score: 7.5/10

Dimension 5 — Analytics

Score: 7/10

Dimension 6 — Portability

Score: 9/10

Dimension 7 — Testability

Score: 8/10

Composite