
Reconsider diff size scoring: smaller isn't always better #97

@that-github-user


Summary

The current recommendation scoring awards up to 10 points for smaller diffs:

diffSizePoints = (1 - agentLines / maxLines) * 10

This biases toward the agent with the fewest lines changed. But a larger diff might be more thorough — better error handling, more comprehensive tests, edge cases covered, documentation added. Smallest change != best change.
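To make the bias concrete, here is a minimal sketch of the formula above in Python. The agent names and line counts are illustrative, and it assumes `maxLines` is the largest diff among the agents in a run, which the issue does not state explicitly.

```python
def diff_size_points(agent_lines: int, max_lines: int) -> float:
    """Formula quoted in this issue: up to 10 points, more for smaller diffs."""
    if max_lines <= 0:
        # Assumption: with no changed lines anywhere there is nothing to score.
        return 0.0
    return (1 - agent_lines / max_lines) * 10


# Illustrative line counts per agent (hypothetical run).
lines_changed = {"agent-2": 204, "agent-5": 101}

# Assumption: max_lines is the largest diff in the run.
max_lines = max(lines_changed.values())

points = {name: diff_size_points(n, max_lines) for name, n in lines_changed.items()}
for name, value in points.items():
    print(f"{name}: {value:.2f} points")
```

Under that assumption the largest diff always scores 0, and an agent with half as many lines gets roughly a 5-point head start regardless of what the extra lines contain.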

Observed during dogfooding

When running thinktank on itself, I consistently defaulted to describing the recommended agent as "smallest diff" when presenting results. But in several cases the larger-diff agents looked more robust (e.g., Agent #2 with +204 lines may have had better test coverage than Agent #5 with +101 lines).

Proposed alternatives

Option A: Remove diff size from scoring entirely

Let tests + convergence decide. If both are equal, present a tie rather than auto-picking the smaller one.
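A rough sketch of how Option A could look, assuming each agent already has a test score and a convergence score. The `AgentResult` type, its field names, and `recommend` are hypothetical and not taken from the thinktank codebase.

```python
import math
from dataclasses import dataclass


@dataclass
class AgentResult:
    name: str
    test_points: float         # hypothetical: points earned from passing tests
    convergence_points: float  # hypothetical: points earned from agreeing with other agents


def recommend(results: list[AgentResult]) -> list[str]:
    """Return every agent that shares the top score; more than one name means a tie."""
    if not results:
        return []
    totals = {r.name: r.test_points + r.convergence_points for r in results}
    best = max(totals.values())
    # Diff size plays no part, so equal totals stay tied instead of being auto-resolved.
    return [name for name, total in totals.items() if math.isclose(total, best)]


results = [
    AgentResult("agent-2", test_points=50, convergence_points=30),
    AgentResult("agent-5", test_points=50, convergence_points=30),
    AgentResult("agent-3", test_points=40, convergence_points=30),
]
winners = recommend(results)
print("tie between " + ", ".join(winners) if len(winners) > 1 else winners[0])
```

Returning a list rather than a single name leaves it to the presentation layer to decide how a tie is shown to the user.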

Option B: Flip the bias — reward thoroughness

Instead of penalizing larger diffs, reward agents that added more tests or handled more edge cases. Could detect test file additions as a positive signal.
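One possible way to detect test file additions is to match changed paths against common test naming patterns. This is only a sketch: the path patterns, the 50-line threshold for full credit, and the function names are all assumptions and would need tuning per project.

```python
import re

# Assumption: these patterns identify test files (tests/ or __tests__/ directories,
# *.test.ts / *.spec.js style names, and test_*.py modules).
TEST_PATH = re.compile(
    r"(^|/)(tests?|__tests__)/"
    r"|\.(test|spec)\.[jt]sx?$"
    r"|(^|/)test_[^/]+\.py$"
)


def added_test_lines(added_by_file: dict[str, int]) -> int:
    """Sum the added lines that landed in files matching the test patterns."""
    return sum(n for path, n in added_by_file.items() if TEST_PATH.search(path))


def thoroughness_points(
    added_by_file: dict[str, int],
    max_points: float = 10.0,
    lines_for_full_credit: int = 50,  # hypothetical threshold
) -> float:
    """Scale points with added test lines, capped at max_points."""
    test_lines = added_test_lines(added_by_file)
    return min(max_points, max_points * test_lines / lines_for_full_credit)


# Hypothetical per-file added-line counts for one agent's diff.
diff = {
    "src/scoring.ts": 120,
    "src/scoring.test.ts": 60,
    "tests/cli.py": 10,
    "README.md": 14,
}
print(added_test_lines(diff), thoroughness_points(diff))
```

Capping the reward keeps an agent from winning simply by generating a very large test file; counting lines is still a crude proxy for how much behavior the tests actually cover.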

Option C: Make it configurable

Add a --prefer flag: --prefer compact keeps the current behavior, while --prefer thorough rewards larger diffs that come with more test coverage.
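The proposed flag could be wired up roughly as below. This uses Python's `argparse` as a stand-in for whatever CLI parser thinktank actually uses, and the "thorough" rule (reward relative size only when the diff adds test lines) is one hypothetical reading of the option, not a settled design.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="thinktank")
    parser.add_argument(
        "--prefer",
        choices=["compact", "thorough"],
        default="compact",  # matches the current behavior described in this issue
        help="which kind of diff the size component should favor",
    )
    return parser


def size_points(prefer: str, agent_lines: int, max_lines: int, test_lines_added: int) -> float:
    """Size component of the score, 0 to 10, depending on the chosen preference."""
    if max_lines <= 0:
        return 0.0
    if prefer == "compact":
        # Current formula: smaller diffs earn more.
        return (1 - agent_lines / max_lines) * 10
    # "thorough" (hypothetical rule): larger diffs earn more, but only if they add tests.
    if test_lines_added <= 0:
        return 0.0
    return (agent_lines / max_lines) * 10


args = build_parser().parse_args(["--prefer", "thorough"])
print(size_points(args.prefer, agent_lines=204, max_lines=204, test_lines_added=40))
```

Keeping `compact` as the default would make the flag backward compatible, at the cost of leaving the bias in place for anyone who never sets it.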

Option D: Neutral — only penalize outlier-large diffs

Don't reward small diffs, but penalize diffs that are significantly larger than the median (e.g., 3x the median size). This catches "agent went off the rails" without biasing toward minimal changes.
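A minimal sketch of the outlier check, using the 3x-median threshold mentioned above. The flat 10-point penalty and the function name are assumptions chosen for illustration.

```python
from statistics import median


def outlier_penalty(
    agent_lines: int,
    all_lines: list[int],
    factor: float = 3.0,    # "3x the median size" from the proposal
    penalty: float = 10.0,  # hypothetical penalty size
) -> float:
    """Return a penalty only when a diff is far larger than the median diff."""
    if not all_lines:
        return 0.0
    threshold = factor * median(all_lines)
    return penalty if agent_lines > threshold else 0.0


# Hypothetical run with five agents; the median is 150, so the threshold is 450.
all_lines = [101, 120, 150, 204, 900]
penalties = [outlier_penalty(n, all_lines) for n in all_lines]
print(penalties)
```

Using the median rather than the mean keeps a single runaway agent from inflating its own threshold. With very few agents the median is noisy, so a minimum agent count before applying the penalty may be worth considering.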

Acceptance criteria

  • Decide on approach
  • Update scoring formula
  • Update docs/architecture.md scoring section
  • Update tests

Labels: enhancement (New feature or request)