Reconsider diff size scoring: smaller isn't always better #97

## Summary
The current recommendation scoring awards up to 10 points for smaller diffs:
```
diffSizePoints = (1 - agentLines / maxLines) * 10
```

This biases the recommendation toward the agent that changed the fewest lines. But a larger diff might be **more thorough**: better error handling, more comprehensive tests, more edge cases covered, documentation added. Smallest change != best change.

## Observed during dogfooding
When running thinktank on itself, I consistently defaulted to describing the recommended agent as "smallest diff" when presenting results. But in several cases, larger-diff agents had more robust solutions (e.g., Agent #2 with +204 lines might have better test coverage than Agent #5 with +101 lines).
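To make the bias concrete, here is a minimal sketch that applies the formula from the Summary to the two line counts above. It assumes `maxLines` is the largest diff in the run and that only these two agents are compared; neither assumption is confirmed by the current code.

```typescript
// Current formula from the Summary section.
// Assumption: maxLines is the largest diff among the agents being compared.
function diffSizePoints(agentLines: number, maxLines: number): number {
  return (1 - agentLines / maxLines) * 10;
}

const maxLines = 204; // Agent #2 has the largest diff in this example

const agent2 = diffSizePoints(204, maxLines); // 0 points
const agent5 = diffSizePoints(101, maxLines); // roughly 5.05 points

console.log({ agent2, agent5 });
```

Under these assumptions Agent #5 starts about 5 points ahead purely for changing fewer lines, regardless of what the extra lines in Agent #2 actually do.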

## Proposed alternatives

### Option A: Remove diff size from scoring entirely
Let tests + convergence decide. If agents are tied on both, present a tie rather than auto-picking the one with the smaller diff.
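A rough sketch of what tie handling could look like. The `AgentScore` shape and the `recommend` function are hypothetical names for illustration, not the existing API.

```typescript
// Hypothetical shape: only the two remaining signals are scored.
interface AgentScore {
  id: number;
  testPoints: number;
  convergencePoints: number;
}

// Returns the ids of every agent sharing the top score.
// More than one id means a tie that should be presented as such.
function recommend(agents: AgentScore[]): number[] {
  const totals = agents.map((a) => a.testPoints + a.convergencePoints);
  const best = Math.max(...totals);
  return agents.filter((_, i) => totals[i] === best).map((a) => a.id);
}

const tied = recommend([
  { id: 2, testPoints: 50, convergencePoints: 30 },
  { id: 5, testPoints: 50, convergencePoints: 30 },
  { id: 7, testPoints: 40, convergencePoints: 30 },
]);
console.log(tied); // [2, 5]
```

The exact-equality comparison is only safe if scores are integers; with fractional points a small tolerance would likely be needed.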

### Option B: Flip the bias — reward thoroughness
Instead of penalizing larger diffs, reward agents that added more tests or handled more edge cases. Could detect test file additions as a positive signal.
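One possible way to detect test additions, sketched under assumptions: the `ChangedFile` shape is hypothetical, and the path pattern (a `test`, `tests`, or `__tests__` directory, or a `.test.` / `.spec.` suffix) is a common convention rather than something this project defines.

```typescript
// Hypothetical per-file diff summary.
interface ChangedFile {
  path: string;
  addedLines: number;
}

// Assumed convention for recognizing test files; adjust to the repo's layout.
const TEST_PATH = /(^|\/)(__tests__|tests?)\/|\.(test|spec)\.[cm]?[jt]sx?$/;

// Sums added lines that land in files matching the test-path convention.
function addedTestLines(files: ChangedFile[]): number {
  return files
    .filter((f) => TEST_PATH.test(f.path))
    .reduce((sum, f) => sum + f.addedLines, 0);
}

const testLines = addedTestLines([
  { path: "src/scoring.ts", addedLines: 40 },
  { path: "src/scoring.test.ts", addedLines: 55 },
  { path: "tests/fixtures/run.ts", addedLines: 6 },
]);
console.log(testLines); // 61
```

How many points a given number of added test lines should earn is left open here; that mapping is part of the decision this issue asks for.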

### Option C: Make it configurable
`--prefer compact` (current behavior) vs `--prefer thorough` (reward larger diffs with more test coverage).
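A minimal sketch of reading the flag, assuming it takes its value as the next argument and defaults to `compact` so existing behavior is unchanged. `parsePrefer` and `PreferMode` are hypothetical names.

```typescript
type PreferMode = "compact" | "thorough";

// Reads `--prefer <mode>` from an argument list.
// Assumption: omitting the flag keeps the current (compact) behavior.
function parsePrefer(argv: string[]): PreferMode {
  const i = argv.indexOf("--prefer");
  if (i === -1) return "compact";
  const value = argv[i + 1];
  if (value === "compact" || value === "thorough") return value;
  throw new Error(`--prefer expects "compact" or "thorough", got: ${value}`);
}

console.log(parsePrefer(["run", "--prefer", "thorough"])); // "thorough"
console.log(parsePrefer(["run"])); // "compact"
```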

### Option D: Neutral — only penalize outlier-large diffs
Don't reward small diffs, but penalize diffs that are significantly larger than the median (e.g., 3x the median size). This catches "agent went off the rails" without biasing toward minimal changes.
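The outlier rule could be sketched as follows. The 3x factor comes from the example above; the flat 10-point penalty is a placeholder assumption, and `outlierPenalty` is a hypothetical name.

```typescript
// Median of a list of line counts (input is not mutated).
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Returns a penalty only when a diff exceeds `factor` times the median.
// Assumption: the penalty is a flat amount; the issue does not specify one.
function outlierPenalty(
  agentLines: number,
  allLines: number[],
  factor = 3,
  penalty = 10,
): number {
  if (allLines.length === 0) return 0;
  return agentLines > factor * median(allLines) ? penalty : 0;
}

const allLines = [101, 120, 204, 900]; // median is 162, threshold is 486
console.log(outlierPenalty(204, allLines)); // 0
console.log(outlierPenalty(900, allLines)); // 10
```

With this shape the 101-line and 204-line diffs are treated identically, and only the 900-line diff is penalized.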

## Acceptance criteria
- [ ] Decide on approach
- [ ] Update scoring formula
- [ ] Update docs/architecture.md scoring section
- [ ] Update tests
