Summary
The current recommendation scoring awards up to 10 points for smaller diffs:
diffSizePoints = (1 - agentLines / maxLines) * 10
This biases the recommendation toward the agent that changed the fewest lines. But a larger diff may be more thorough: better error handling, more comprehensive tests, more edge cases covered, documentation added. The smallest change is not necessarily the best change.
Observed during dogfooding
When running thinktank on itself, I consistently defaulted to describing the recommended agent as "smallest diff" when presenting results. But in several cases, larger-diff agents had more robust solutions (e.g., Agent #2 with +204 lines might have better test coverage than Agent #5 with +101 lines).
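To make the bias concrete, here is a minimal sketch of the formula from the summary applied to the two line counts mentioned above. It assumes, for illustration only, that these two agents were the whole comparison, so maxLines is 204. The function and variable names are hypothetical and are not taken from the thinktank source.

```python
def diff_size_points(agent_lines: int, max_lines: int, weight: float = 10.0) -> float:
    """Diff-size points per the quoted formula: (1 - agentLines / maxLines) * 10."""
    if max_lines <= 0:
        # Assumption: if no agent changed anything, nobody earns diff-size points.
        return 0.0
    return (1 - agent_lines / max_lines) * weight


# Line counts from the dogfooding example (+204 and +101).
lines_changed = {"agent_2": 204, "agent_5": 101}
max_lines = max(lines_changed.values())

points = {name: diff_size_points(n, max_lines) for name, n in lines_changed.items()}
for name, value in points.items():
    print(f"{name}: {value:.2f} diff-size points")
```

Under these assumptions the largest diff always scores 0 and the smaller one earns roughly 5 points, whether or not the extra lines are tests.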
Proposed alternatives
Option A: Remove diff size from scoring entirely
Let tests + convergence decide. If both are equal, present a tie rather than auto-picking the smaller one.
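A rough sketch of what presenting a tie could look like, assuming each agent already has a combined tests + convergence score. The `recommend` function and its `tolerance` parameter are hypothetical names for illustration.

```python
def recommend(scores: dict[str, float], tolerance: float = 1e-9) -> list[str]:
    """Return every agent tied for the top score.

    More than one name in the result means a tie, which the caller would
    present as such instead of auto-picking the smaller diff.
    """
    if not scores:
        return []
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if best - score <= tolerance)


# Hypothetical combined scores (tests + convergence only, no diff-size term).
top = recommend({"agent_2": 80.0, "agent_5": 80.0, "agent_3": 70.0})
if len(top) > 1:
    print("Tie between:", ", ".join(top))
else:
    print("Recommended:", top[0])
```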
Option B: Flip the bias — reward thoroughness
Instead of penalizing larger diffs, reward agents that added more tests or handled more edge cases. Could detect test file additions as a positive signal.
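One possible way to detect test file additions, sketched under several assumptions: the per-file added-line counts are already available, test files can be recognized by path, and the reward is capped so that padding a diff with tests does not score without limit. The path patterns, the cap of 100 lines, and all names are illustrative guesses.

```python
import re

# Assumption: paths like tests/..., __tests__/..., *.test.ts, *.spec.js, test_*.py are tests.
TEST_PATH = re.compile(
    r"(^|/)(tests?|__tests__)/|\.(test|spec)\.[jt]sx?$|(^|/)test_[^/]+\.py$"
)


def count_test_lines(added_by_path: dict[str, int]) -> int:
    """Sum the added lines that landed in files matching TEST_PATH."""
    return sum(n for path, n in added_by_path.items() if TEST_PATH.search(path))


def thoroughness_points(
    added_by_path: dict[str, int], cap_lines: int = 100, weight: float = 10.0
) -> float:
    """Scale added test lines to 0..weight points, saturating at cap_lines."""
    return min(count_test_lines(added_by_path), cap_lines) / cap_lines * weight


# Hypothetical diff: 60 of the 110 added lines are in a test file.
example = {"src/scoring.ts": 40, "src/scoring.test.ts": 60, "README.md": 10}
print(count_test_lines(example), thoroughness_points(example))
```

Line count in test files is only a proxy for coverage, so this would reward test volume, not test quality.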
Option C: Make it configurable
--prefer compact (current behavior) vs --prefer thorough (reward larger diffs with more test coverage).
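A minimal sketch of how such a flag could be parsed, using Python's argparse purely for illustration. thinktank's actual CLI may be structured differently; only the `--prefer` flag and its two values come from this proposal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="thinktank")
    parser.add_argument(
        "--prefer",
        choices=["compact", "thorough"],
        default="compact",  # keeps the current behavior unless the user opts out
        help="compact: reward smaller diffs; thorough: reward diffs with more tests",
    )
    return parser


args = build_parser().parse_args(["--prefer", "thorough"])
print(args.prefer)
```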
Option D: Neutral — only penalize outlier-large diffs
Don't reward small diffs, but penalize diffs that are significantly larger than the median (e.g., 3x the median size). This catches "agent went off the rails" without biasing toward minimal changes.
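A sketch of the outlier check, assuming a flat penalty and the 3x-median threshold from the example above. The penalty size and function name are placeholders, and the line counts other than 101 and 204 are made up for the demonstration.

```python
from statistics import median


def outlier_penalty(
    agent_lines: int, all_lines: list[int], factor: float = 3.0, penalty: float = 10.0
) -> float:
    """Return a penalty only when the diff exceeds factor * median of all diffs."""
    if not all_lines:
        return 0.0
    threshold = factor * median(all_lines)
    return penalty if agent_lines > threshold else 0.0


# Hypothetical run with five agents; one diff is far larger than the rest.
sizes = [101, 120, 204, 150, 900]
for size in sizes:
    print(size, outlier_penalty(size, sizes))
```

With a median of 150 the threshold is 450 lines, so only the 900-line diff is penalized while the 101-line and 204-line diffs are treated the same.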
Acceptance criteria