[analyze 1/3] analyze-gather: collect git/PR/codebase/session signal into JSON #75

@willwashburn

Part of the agentworkforce analyze feature. Issue 1 of 3. Foundation for #76 (persona-discoverer) and #77 (CLI subcommand).

Depends on #71: start this after the persona-kit migration (#64 through #71) ships. The pre-migration PersonaSpec and runAgentSelector go away, so any work built on top of them would need to be rewritten.

Goal

Add a pure-TS (no LLM) signal-gathering module that produces a single bounded JSON document describing how the team actually works in a given repo. This JSON becomes the input the persona-discoverer persona will read in issue 2.

The synthesizer is a judgment task best done by an LLM, but the gathering is mechanical and deterministic — keep it in TS so it's fast, cheap, testable, and works the same every run.

Files to touch

New:

  • packages/cli/src/analyze-gather.ts — the gather module.
  • packages/cli/src/analyze-gather.test.ts — Node test runner (match packages/cli/src/cli.test.ts style).

Modify:

  • listSessionTranscriptsForCwd() — a new shared helper. Post-migration location depends on who ends up owning session-transcript discovery:
    • If session lifecycle follows the spawn flow into @agentworkforce/persona-kit, add the helper there and import it from analyze-gather.ts. The handle returned by executePersonaSpawnPlan (see #67, "[persona-kit 4/8] Migrate workforce CLI to use @agentworkforce/persona-kit") is the natural owner, so this is the more likely landing spot.
    • If it stays CLI-layer (because burn-stamp ledger + filesystem walk feel like CLI concerns), extract from the auto-improve flow that currently lives near the bottom of packages/cli/src/cli.ts into packages/cli/src/session-transcripts.ts and import from there.
    • Either way: today's findSessionTranscriptViaStamps + cwd-content scan return one transcript for a just-ended run; the new helper is an enumerator returning all transcripts for a cwd. Both auto-improve and analyze should consume it.

JSON shape

interface AnalyzeGatherResult {
  cwd: string;
  generatedAt: string; // ISO8601
  commits: Array<{ sha: string; author: string; date: string; subject: string; files: Array<{ path: string; added: number; deleted: number }> }>;
  hotFiles: Array<{ path: string; commits: number; added: number; deleted: number }>;
  prs: Array<{ number: number; title: string; author: string; labels: string[]; mergedAt: string; additions: number; deletions: number; changedFiles: number }> | { skipped: true; reason: string };
  codebase: {
    tree: Array<{ dir: string; fileCounts: Record<string /* ext */, number>; isPackageRoot: boolean }>;
    packages: Array<{ path: string; name: string; scripts: string[]; depCount: number; devDepCount: number }>;
  };
  sessions: Array<{ harness: 'claude' | 'codex' | 'opencode'; sessionId: string; transcriptPath: string; startedAt: string; headLines: string[]; tailLines: string[] }> | { skipped: true; reason: string };
}

Bounds (do not exceed)

| Section | Bound |
| --- | --- |
| commits | min(--max-commits flag, last --lookback-days window). Defaults: 500 / 90. |
| hotFiles | top 80 by total churn (added + deleted) |
| prs | 200 most recent merged |
| codebase.tree | walk depth 6; skip node_modules, dist, .git, anything matched by repo's .gitignore |
| codebase.packages | every package.json found within depth bound |
| sessions | 30 most recent for this cwd; head 40 lines + tail 40 lines of each transcript |

These bounds keep the eventual analyzer prompt under ~100k tokens.
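
As an illustration of the hotFiles bound, here is a minimal sketch of aggregating per-file churn from already-parsed commits and keeping the top 80. The helper name aggregateHotFiles and the tie-break on path are assumptions made for the example; only the field names come from AnalyzeGatherResult.

```typescript
// Hypothetical helper: field names mirror AnalyzeGatherResult, everything else is illustrative.
interface FileDelta { path: string; added: number; deleted: number }
interface HotFile { path: string; commits: number; added: number; deleted: number }

function aggregateHotFiles(
  commits: Array<{ files: FileDelta[] }>,
  limit = 80, // hotFiles bound from the table above
): HotFile[] {
  const byPath = new Map<string, HotFile>();
  for (const commit of commits) {
    for (const file of commit.files) {
      const entry =
        byPath.get(file.path) ?? { path: file.path, commits: 0, added: 0, deleted: 0 };
      entry.commits += 1;
      entry.added += file.added;
      entry.deleted += file.deleted;
      byPath.set(file.path, entry);
    }
  }
  const churn = (f: HotFile) => f.added + f.deleted;
  // Sort by total churn, descending; break ties by path (an assumption) so output is stable.
  return Array.from(byPath.values())
    .sort((a, b) => churn(b) - churn(a) || (a.path < b.path ? -1 : a.path > b.path ? 1 : 0))
    .slice(0, limit);
}
```

Because the input is the parsed commit list, this needs no second git invocation, and the deterministic tie-break keeps repeated runs byte-identical.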

Tasks

  • Public API: export async function gather(opts: GatherOptions): Promise<AnalyzeGatherResult> plus export interface GatherOptions { cwd: string; lookbackDays: number; maxCommits: number; includePrs: boolean; includeSessions: boolean; runner?: CommandRunner }.
  • Inject the command runner (so tests can stub git / gh deterministically). Default runner uses child_process.spawnSync — match the pattern in packages/harness-kit/src/detect.ts and cli.ts:748.
  • git log --since=<lookback> --no-merges --numstat --pretty=format:'…' — single invocation, parse all commits + file deltas in one pass.
  • Aggregate hotFiles from commit deltas (do not re-shell out).
  • gh pr list --state merged --json number,title,author,labels,mergedAt,additions,deletions,changedFiles --limit 200. Detect gh absence (spawnSync ENOENT) and unauthenticated state (gh auth status non-zero) — return { skipped: true, reason } rather than crashing.
  • Codebase walk via readdirSync({ withFileTypes: true }) (existing pattern in cli.ts:1307). Skip directories per the bounds table. Tag any dir containing package.json as a package root.
  • Per package: read package.json, extract name, scripts keys, count dependencies + devDependencies. Don't include version strings or dep names (too much noise for too little value at this stage).
  • Sessions: implement listSessionTranscriptsForCwd(cwd) by extracting/generalizing the existing logic at cli.ts:2540-2815. For each transcript file, read head -40 + tail -40 lines without slurping the whole file (sessions can be huge).
  • Pass --no-prs / --no-sessions plumbing through GatherOptions.
  • Write the result to a caller-supplied path (issue 3 will manage the path lifecycle); export the in-memory result too so tests can assert without disk I/O.
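
The gh task above could be sketched roughly as follows. This is not the intended implementation: the issue names CommandRunner but does not define it, so the CommandResult shape (including errorCode), the gatherPrs name, and the reason strings are all assumptions. The sketch also assumes gh returns author as an object with a login field and labels as objects with a name field, which would need checking against real output.

```typescript
// Hypothetical runner contract; the real CommandRunner type may differ.
interface CommandResult {
  status: number | null; // exit code, or null if the process never started
  stdout: string;
  errorCode?: string; // e.g. "ENOENT" when the binary is missing
}
type CommandRunner = (cmd: string, args: string[], cwd: string) => CommandResult;

interface PullRequest {
  number: number; title: string; author: string; labels: string[];
  mergedAt: string; additions: number; deletions: number; changedFiles: number;
}
type PrsSection = PullRequest[] | { skipped: true; reason: string };

const PR_FIELDS = "number,title,author,labels,mergedAt,additions,deletions,changedFiles";

function gatherPrs(runner: CommandRunner, cwd: string, includePrs: boolean): PrsSection {
  // --no-prs: never invoke gh at all.
  if (!includePrs) return { skipped: true, reason: "PR gathering disabled" };

  const auth = runner("gh", ["auth", "status"], cwd);
  if (auth.errorCode === "ENOENT") return { skipped: true, reason: "gh is not installed" };
  if (auth.status !== 0) return { skipped: true, reason: "gh is not authenticated" };

  const list = runner(
    "gh",
    ["pr", "list", "--state", "merged", "--json", PR_FIELDS, "--limit", "200"],
    cwd,
  );
  if (list.status !== 0) return { skipped: true, reason: "gh pr list failed" };

  // Assumption: author arrives as { login } and labels as [{ name }]; flatten both.
  const raw = JSON.parse(list.stdout) as Array<Record<string, any>>;
  return raw.map((pr) => ({
    number: pr.number,
    title: pr.title,
    author: pr.author?.login ?? "",
    labels: (pr.labels ?? []).map((l: { name: string }) => l.name),
    mergedAt: pr.mergedAt,
    additions: pr.additions,
    deletions: pr.deletions,
    changedFiles: pr.changedFiles,
  }));
}
```

Taking the runner as a parameter keeps the function free of child_process, so each skip path can be exercised with a one-line stub.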

Tests

  • Stub a CommandRunner returning canned git log output; assert the parsed commit + numstat structure matches expected shape, including multiline subjects and renamed files ({old => new} syntax).
  • Stub gh returning the canned JSON shape; assert prs parses correctly.
  • Stub gh runner that throws ENOENT → assert prs: { skipped: true, reason: /not installed/i }.
  • Stub gh auth status returning non-zero → assert prs: { skipped: true, reason: /not authenticated/i }.
  • Build a temp directory tree (real fs, but in os.tmpdir()) with nested package.json files and a node_modules to skip → assert codebase walk shape and bounds.
  • includePrs: false → prs: { skipped: true, reason: /disabled/i }, no gh invocation.
  • includeSessions: false → sessions: { skipped: true, reason: /disabled/i }, no session enumeration.
  • Bounds enforced: feed 1000 fake commits → result has exactly maxCommits entries; feed 200 hot file candidates → result has top 80.
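
For the renamed-file case in the first test, a sketch of parsing one --numstat line may help. It assumes git's tab-separated added/deleted/path layout, the brace form dir/{old => new}/file for renames with a shared prefix or suffix, the plain old => new form otherwise, and - counts for binary files. The names parseNumstatLine and resolveRename are hypothetical, and treating binary deltas as 0 is an assumption rather than something the issue specifies.

```typescript
interface NumstatEntry { oldPath: string; path: string; added: number; deleted: number }

// Expand git's rename notation into explicit old and new paths.
function resolveRename(raw: string): { oldPath: string; newPath: string } {
  const braced = raw.match(/^(.*)\{(.*) => (.*)\}(.*)$/);
  if (braced) {
    const [, prefix = "", from = "", to = "", suffix = ""] = braced;
    // An empty side (e.g. "src/{ => lib}/a.ts") would leave "//"; collapse it.
    const join = (mid: string) => (prefix + mid + suffix).replace(/\/{2,}/g, "/");
    return { oldPath: join(from), newPath: join(to) };
  }
  const plain = raw.match(/^(.*) => (.*)$/);
  if (plain) return { oldPath: plain[1] ?? "", newPath: plain[2] ?? "" };
  return { oldPath: raw, newPath: raw };
}

// Parse "added<TAB>deleted<TAB>path"; returns null for lines that are not numstat rows.
function parseNumstatLine(line: string): NumstatEntry | null {
  const parts = line.split("\t");
  if (parts.length < 3) return null;
  const [addedRaw = "", deletedRaw = ""] = parts;
  // Binary files report "-" for both counts; counted as 0 here (an assumption).
  const toCount = (v: string) => (v === "-" ? 0 : /^\d+$/.test(v) ? parseInt(v, 10) : NaN);
  const added = toCount(addedRaw);
  const deleted = toCount(deletedRaw);
  if (isNaN(added) || isNaN(deleted)) return null;
  const { oldPath, newPath } = resolveRename(parts.slice(2).join("\t"));
  return { oldPath, path: newPath, added, deleted };
}
```

A path that literally contains " => " would be misread as a rename by this sketch, which is one reason a fixture covering such names could be worth adding.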

Verification

  • corepack pnpm --filter @agentworkforce/cli test passes (this issue's tests included).
  • Build is clean: corepack pnpm -r build.
  • Manual: write a small harness script that calls gather({ cwd: process.cwd(), … }) against this repo and pretty-prints the result. Eyeball it for sanity — commits cover the lookback window, hot files match what you'd expect, no node_modules in the codebase tree.

Constraints

  • No new runtime deps. Use child_process + fs + path — matches the rest of the CLI.
  • No gh requirement. Gather must produce a useful result on a machine without gh installed.
  • Don't slurp giant files. Some session JSONL transcripts are tens of MB; use streaming reads for head/tail extraction.
  • Don't leak secrets. Skip env files, .npmrc, .git/config from any sampling. We only read package.json (script names + dep counts, not values).
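
One way to honor the "don't slurp" constraint is to consume a transcript as a stream of lines and keep only a fixed number in memory. The sketch below shows that bounded head/tail collection over any synchronous iterable of lines. The name headAndTail is hypothetical, and the choice to keep the tail from overlapping the head on short files is an assumption. It still visits every line once; a variant that seeks from the end of the file would avoid reading the middle at all.

```typescript
// Collect the first n and last n lines while holding at most 2n lines in memory.
function headAndTail(
  lines: Iterable<string>,
  n = 40, // sessions bound: head 40 + tail 40
): { headLines: string[]; tailLines: string[] } {
  const headLines: string[] = [];
  if (n <= 0) return { headLines, tailLines: [] };

  const ring: string[] = new Array<string>(n); // fixed-size circular buffer for the tail
  let tailCount = 0; // number of lines seen after the head filled up

  for (const line of lines) {
    if (headLines.length < n) {
      headLines.push(line);
      continue;
    }
    ring[tailCount % n] = line; // overwrite the oldest tail line
    tailCount += 1;
  }

  // Read the ring back in arrival order, oldest surviving line first.
  const kept = Math.min(tailCount, n);
  const tailLines: string[] = [];
  for (let i = tailCount - kept; i < tailCount; i++) {
    tailLines.push(ring[i % n] as string);
  }
  return { headLines, tailLines };
}
```

Lines that land in the head never enter the ring, so a transcript shorter than 2n lines yields no duplicates. Feeding this from an asynchronous line reader would need the loop changed to for await.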
