Architect: opus Date: 2026-04-27 Verdict: Continue with adjustments (DELIBERATE mode)
Planner v1 is architecturally sound at the macro level (CLI engine + skills + hooks + SQLite single-source-of-truth), but 5 concrete adjustments must land before Executor begins Task 6.
- Hook settings.json schema is a moving target — the merge script must detect and emit both legacy flat and nested-array forms (current user-level
~/.claude/settings.jsonmixes both; pushing only one form silently breaks the dispatcher chain). See §4 Risk #1 + §5 Adjustment A. - Cross-tool single-source-of-truth is partially false advertising — Codex/Gemini have no hook layer, so "FSRS 排期共享" only works for
jpCLI calls, not for spontaneously injected practice. Need explicit principle restatement + UX honesty in onboarding. See §3 Violation P2 + §5 Adjustment B. recordAnswerinsrs.ts:69-93is two non-transactional writes (UPDATE reviews + INSERT attempts). With cross-process WAL contention, a Ctrl-C between statements leaves attempt-without-review-update or vice versa. Plan §4 Scenario 2 mentions this, but Task 8 step plan doesn't list the patch. See §4 Risk #3 + §5 Adjustment C.
Strongest defense: Anki has 15 years of SRS research, ~200k vocab/grammar decks for JLPT, mobile sync, audio TTS, image occlusion, hint masking, custom card templates with HTML/JS. AnkiConnect exposes the entire DB as JSON-RPC. A Claude Code skill could literally be 80 lines: /jp → call findCards "is:due deck:N2" → read card → render → user answers → call addReview cardID ease. Same FSRS-5 (Anki has it native since 23.10), same single source of truth, same cross-tool (any LLM can hit AnkiConnect over localhost:8765).
Architect rebuttal: not fatal because:
- AnkiConnect requires Anki desktop running — violates Principle 1 (zero friction); user has to launch Anki
- Anki cards are static rendered HTML — kills D1's mixed-mode advantage (LLM-generated question variation per attempt)
- AnkiConnect is HTTP localhost only — hooks would still need a bridge
- Decks are user-curated and noisy; same N2 grammar exists in 5 decks with different IDs → no stable concept_id
- License/distribution: shipping a curated deck with AnkiConnect dependency is heavier than shipping a 25MB single-binary CLI
Verdict: not fatal, but Planner should add ADR §8 "Alternatives considered" line: Anki + AnkiConnect → rejected because requires Anki running (zero-friction violation) + cards are HTML-rendered (loses D1 mixed-mode benefit).
Strongest defense: bun cold-start (30-50ms) + SQLite open + FSRS query + JSON serialize ≈ 100-200ms. Even at inject_rate=0.15, the hook always runs (the dice roll happens inside the hook); only injection is gated. Three guaranteed bun spawns per turn × 100 prompts/day × 50ms = 15s/day pure overhead.
Architect rebuttal: not fatal but mandates Adjustment D: hook entry-point is .sh with shell-side gate, only invokes bun when the gate passes. Plan must amend Task 8 step 8.5 from "hook 都 chmod +x,每个开头 #!/usr/bin/env bun" to "shell wrapper does pre-roll dice + work-hours check; bun called only when needed".
A3. "LLM-generated questions are inconsistent — same concept might be MCQ one turn, free recall the next, with wildly different difficulty calibration → FSRS rating becomes meaningless noise"
Strongest defense: FSRS-5 trusts that "rating 3" always means roughly the same level of recall confidence. If LLM evaluates differently across sessions (lenient vs strict), card stability/difficulty trajectory is garbage. Static decks (Anki, Bunpro) have calibrated rubrics per card type — you don't.
Architect rebuttal: this is the deepest unfixed flaw. Plan acknowledges abstractly (likelihood M, impact L) but proposed mitigation "concept_id hash 做题型轮换种子" is wrong:
- Hashing makes question-type stable per concept forever, killing D1's variation benefit
- Even if type is stable, LLM grading rubric isn't anchored
Mandates Adjustment E: store evaluation rubric per concept type, skill prompt enforces "rate using rubric: rating_3 = ...".
None of the three antitheses kills the project. Each surfaces a real adjustment:
- A1 → ADR §8 must explicitly reject Anki with reasoning
- A2 → Task 8 wrap with shell pre-gate
- A3 → Task 7 skill markdown must include rubric template per type
Hook design is opt-out by inject_rate; success means user doesn't notice training is happening. Collides with Driver 3 (N2 通过率): without felt progress, motivation drops and user disables hooks before training accumulates. jp daily-push only sends "今日 N 题待复习" (cli.ts:243-255), which is task list not progress.
Trap: design biased toward minimizing intrusion at expense of motivation feedback. Phase 1 will technically work and emotionally fail.
Adequacy: insufficient. Need weekly digest mechanism — cron pushes Friday "本周已答 23 题,正确率 68%, 攻克 N2 概念 +5".
Already covered in A3. Cleanest reading: D1 chose mixed mode for cross-tool stability of concept_id, but didn't extend the contract to rubric. If LLM acts as both renderer AND grader, it must do both consistently.
Trap: plan conflates "题型多样" (good UX) with "评分自由" (bad signal). They can be decoupled: question type randomized per attempt, but grading rubric anchored.
Plan §1 Principle 2 + §9 Codex section claim shared db = shared experience. Technically true but functionally misleading — value prop is "any tool nudges you", but only Claude users get nudges.
Trap: marketing/principle drift. Cross-tool selling point is half real.
Adequacy: marginal. Principle 2 should be downgraded to "reviews.db is single source for any tool that runs jp; only Claude Code has automatic injection".
Principle 5 says hooks must silent-fail (exit 0, empty additionalContext). Risk Scenario 3 mentions debugging — but how would user even know hooks fired? ~/.config/jp-trainer/jp.log mentioned in §5.4 Observability with NDJSON events. Task 8 step 8.2-8.4 doesn't explicitly list "log every hook entry/exit event".
Trap: silent-fail without observable trail = unfixable production bug class.
Adequacy: enforce hooks/lib/common.ts:safeFail() always logs to NDJSON before returning. Add jp logs --hook stop filter.
| # | Principle | Violation | Severity | Where |
|---|---|---|---|---|
| P1 | "依附工作流不打断" | Bun cold-start unconditionally on every Stop/PromptSubmit/PostToolUse — 85%+ invocations do nothing but cost 50-150ms each | High | Task 8 step 8.5 — Adjustment D |
| P2 | "错题本是单一真相 跨 Claude/Codex/Gemini/CLI 共享" | True for db reads, false for automatic injection. Misleading framing. | Medium | Plan §1 Principle 2 — Adjustment B |
| P3 | "CLI 是核心引擎,LLM 是渲染层" | LLM is also evaluation layer (rates 1-4) but not constrained by CLI-side rubric. Drift breaks FSRS calibration. | High | Task 7 — Adjustment E |
| P4 | "概率触发 + 工作时间感知" | inject_max_per_hour / inject_max_per_session promised in Plan §3 D2 but missing from Profile schema profile.ts:9-25 |
Medium | profile.ts gap — Adjustment F |
| P5 | "失败安全 + 可一键关闭" | "可一键关闭" interpreted as "edit settings.json delete entry" — install.sh §12 doesn't ship jp uninstall-hooks CLI subcommand |
Low | Task 12 — minor polish |
P3 and P1 are the meaningful ones. P2 is honesty. P4 is a schema-drift bug. P5 is polish.
~/.claude/settings.json (user-level) uses flat form: { "matcher": "Bash", "command": "..." }. customer-im-client/.claude/settings.json (project) uses nested form: { "matcher": "...", "hooks": [{"type":"command",...}] }. Mixing schemas inside the same event array is undefined behavior.
scripts/install-hooks.mjs (Task 12 step 12.2) describes "push a hook entry" with no schema discriminator. If existing settings.json has flat-form hooks for the same event, appended nested-form entry may either silently disable existing hooks or be silently skipped.
- Mitigation in plan: D5 says "幂等检测 hook 已存在则 skip" — checks
commandsubstring. Doesn't address schema-form coexistence. - Strengthen: install-hooks.mjs must (a) detect existing schema form per event, (b) emit matching form, (c) refuse if mixed. Add
tests/integration/install.test.tscovering both schemas + mixed.
WAL works most of the time across processes when all writers use the same SQLite library. bun:sqlite is a custom build. When jp runs from ~/.local/bin/jp (compiled binary) vs bun run src/cli.ts (dev mode), they may bind two libsqlite versions to the same .db.
- Mitigation in plan: busy_timeout + transaction + retry + daily backup. 80% sufficient.
- Strengthen: db.ts:13 add
PRAGMA synchronous=NORMAL; PRAGMA wal_autocheckpoint=1000. Document bun upgrade may requiresqlite3 reviews.db .recover.
db.run(`UPDATE reviews ...`); // line 78-84
db.run(`INSERT INTO attempts ...`); // line 86-90If process killed between, reviews shows new schedule but attempts table missing the row, OR vice versa. Poisons FSRS data subtly.
- Mitigation in plan: Scenario 2 mentions transaction wrap.
- Adequacy: mentioned but not in step plan. Task 8 doesn't say "patch srs.ts:69".
- Strengthen: new Task 5.5 "data integrity hardening" — wrap recordAnswer in
db.transaction(), add busy_timeout=5000.
Plan binds tightly to hook contract: hookSpecificOutput.additionalContext, hookEventName, exit codes, tool_response.duration_ms. Anthropic shifts hook schema across Claude Code versions. Plan §10 Open Question #3 acknowledges PostToolUse "tool 失败 skip" detection is open.
- Mitigation in plan: none.
- Strengthen: (a)
hooks/lib/contract.tswraps stdin parsing + version detection + safe defaults; (b) install-time "I support Claude Code ≥ 1.0.x" assertion; (c)jp doctorCLI command pings hooks API surface and reports incompatibilities.
(a) macOS Sequoia (26.x) quarantines binaries downloaded via curl/network; curl -L install.sh | bash may produce Gatekeeper-blocked jp. (b) bun version drift: user updates bun → re-runs install → new binary uses different ABI for bun:sqlite.
- Mitigation in plan: Risk row mentions fallback to
bun run src/cli.ts. - Strengthen: install.sh runs
xattr -d com.apple.quarantine ~/.local/bin/jpafter build. Pinengines.bun: ">=1.1.40 <2"in package.json.
readProfile() profile.ts:49-58 does { ...DEFAULT_PROFILE, ...parsed } which silently drops unknown keys.
- Strengthen: enforce
version: 1field check; if profile.version > current code version, warn user.patchProfilewritesprofile.yaml.bak.<timestamp>for rollback.
Driver 3 says "N2 通过率... 6-12 个月后能不能过 JLPT N2... 覆盖率不足 / 题面与考试形式偏差大 = 项目失败". Plan has zero way to measure this before the actual N2 exam.
- Mitigation in plan: none.
- Strengthen: Phase 1.1 must add periodic mock test (
/jp mock-n2serves 30 official-format questions from held-out set, projects "predicted N2 score"). Held-out set indata/seeds/n2-mock-pool.yaml(excluded from default seed-import).
getNextDue (concepts.ts:38) checks todayNewIntroducedCount() >= profile.daily_new_count. Two simultaneous jp next calls (Claude + Codex same minute) can both read count=4, both pick a new concept. Off-by-one.
- Mitigation: accept as-is. Low impact, do not over-engineer.
Where: Task 12 step 12.2 (scripts/install-hooks.mjs).
Change:
- Before write, scan each event array (Stop, UserPromptSubmit, PostToolUse) and detect schema: flat (
{matcher, command}) vs nested ({matcher, hooks:[{type,command,timeout}]}). - If event array non-empty + flat-form, emit jp entry in flat form.
- If empty, prefer nested form.
- If mixed, abort with: "Existing hooks for event X mix schemas. Manually clean up first or run
jp install --force-nested". - Add
tests/integration/install.test.tscases: empty / flat-only / nested-only / jp-already-present (idempotent).
Where: Plan §1 Principle 2 + Task 7 jp-setup.skill.md + README §3. Change:
- Rewrite Principle 2: "reviews.db is the single source of truth across
jpinvocations from any tool. Automatic injection only available in Claude Code (hook layer); Codex/Gemini users invoke/jpmanually." - jp-setup.skill.md step 5 adds "你主要在哪个 tool 用 jp?(Claude / Codex / Gemini / 多个)" → if not Claude, set
inject_rate=0+ explain. - README §3 prominent table: Tool | Auto-inject | Manual /jp | Shared db.
Where: New Task 5.5 inserted before Task 6 OR fold into Task 8 step 8.0. Change:
srs.ts:69-93recordAnswer: wrap UPDATE+INSERT indb.transaction(() => {...})().db.ts:13addPRAGMA busy_timeout = 5000; PRAGMA synchronous = NORMAL; PRAGMA wal_autocheckpoint = 1000;.- New
hooks/lib/common.ts:safeFail(reason)always appends NDJSON{ts, event:"hook_silent_fail", reason}to~/.config/jp-trainer/jp.logbefore exit 0. - Add
jp logs --tail N --event <name>CLI command.
Where: Task 8 step 8.5.
Change: instead of #!/usr/bin/env bun at top, write hooks/stop.sh:
#!/bin/bash
exec 3<&0
PROFILE=~/.config/jp-trainer/profile.yaml
[ -f "$PROFILE" ] || exit 0
RATE=$(grep '^inject_rate:' "$PROFILE" | awk '{print $2}')
ROLL=$(awk -v seed=$RANDOM 'BEGIN{srand(seed); print rand()}')
awk -v r=$ROLL -v rate=$RATE 'BEGIN{exit !(r < rate)}' || exit 0
exec bun ~/.claude/jp-trainer-hooks/stop.ts <&3- Same pattern for user-prompt-submit.sh (work-hours + immersion flag stat) and post-tool-use.sh.
- Bun cold-start happens only on the ~15% of turns that will inject. 100ms × 15 = 15ms/day overhead vs 100ms × 100 = 10s/day.
Where: Task 7 step 7.1 jp.skill.md. Change: Define rubric per concept type:
# vocab rubric
rating_4_easy: word + correct reading + correct meaning, no hesitation
rating_3_good: word correct, minor reading hesitation OR partial meaning
rating_2_hard: meaning correct but had to be prompted, OR wrong reading
rating_1_again: didn't recall, needed full disclosure
# grammar rubric
rating_4_easy: produced form correctly + understood subtle meaning + can give counter-example
rating_3_good: form correct, meaning right, no counter-example
rating_2_hard: form correct but meaning vague, or vice versa
rating_1_again: form wrong or didn't recognize- jp.skill.md instruction: "You MUST evaluate using the rubric above; quote the rubric line you matched in
--feedbackargument when callingjp answer." - Question type rotation: random within rubric-stable types.
Where: profile.ts:9-25 + Task 8 step 8.1.
Change: add inject_max_per_hour, inject_max_per_session, do_not_disturb_until, respect_work_hours. hooks/lib/common.ts:incrementInjectCount(sessionId) writes ~/.config/jp-trainer/sessions/<id>/inject-count.json. New jp dnd <duration> CLI.
Shell pre-gate saves ~85ms × 100 turns = ~8.5s/day per hook. Cost: 3 small bash files, awk dependency. Recommendation: yes.
Without periodic mock tests, Driver 3 ("N2 通过率") unfalsifiable for 6-12 months. Adding held-out 200-question pool + jp mock-n2 is ~1 day. Recommendation: Phase 1.1 minimum.
Planner internally inconsistent. Recommendation: 0.10 for v0.1.0 unknown-user-base; user can jp config inject_rate=0.15 to escalate.
docs/ralplan-planner-v1.md:36-50— Principles 1-5 (subject to P2 rephrase)docs/ralplan-planner-v1.md:184-200— Scenario 1 mitigation (0.10 vs 0.15 inconsistency)docs/ralplan-planner-v1.md:202-215— Scenario 2 mitigation transaction wrapdocs/ralplan-planner-v1.md:332-344— Task 8 step plan, missing shell pre-gatedocs/ralplan-planner-v1.md:457-466— ADR-001 alternatives missing Anki rebuttalsrc/srs.ts:69-93—recordAnswernon-transactional (Risk #3)src/db.ts:13— only journal_mode=WAL (Risk #2)src/profile.ts:9-25— Profile interface missing fields (P4)src/profile.ts:49-58—readProfilesilent default-merge (Risk #6)src/cli.ts:198-210—inject-decideexit-code pattern templatesrc/concepts.ts:38—daily_new_countrace condition (Risk #8, accept)~/.claude/settings.json— flat-form evidencecustomer-im-client/.claude/settings.json— nested-form evidence
Hand-off to Critic: 5 adjustments + 3 open questions. Critic should focus on Q1-Q3 binding decisions, completeness of Risk #7 mitigation, and whether Adjustment E rubric design is detailed enough.