
Feature: compare_file_contents tool with semantic diffs #1973

@SamMorrowDrums

Problem

When AI models review diffs, line-based unified diffs can be noisy and token-inefficient. Common scenarios where this hurts:

  • JSON/YAML reformatting: A single value change plus auto-formatting creates a huge diff
  • Config file updates: Version bumps or reordering keys produce misleading diffs
  • CSV/data files: Row shifts make line-based diffs nearly unreadable

Models struggle to identify the actual change amid formatting noise, wasting context tokens and reducing comprehension accuracy.

Proposed Solution

Add a new tool compare_file_contents that:

  1. Takes a repository (owner and repo), two refs (base and head), and a file path
  2. For supported formats (JSON and YAML to start; CSV and TOML planned), produces a semantic diff showing only value changes
  3. For unsupported formats, falls back to unified diff
  4. Always shows the format used and whether fallback was applied
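
The dispatch in steps 2 to 4 could be sketched roughly as follows. This is only an illustration: detecting the format from the file extension, falling back when a supported file fails to parse, and every name here (DetectFormat, Result, Compare, SemanticDiffer, UnifiedDiffer) are assumptions, not part of the proposal.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Format names the comparison strategy chosen for a file (hypothetical type).
type Format string

const (
	FormatJSON Format = "json"
	FormatYAML Format = "yaml"
	FormatText Format = "text" // no semantic differ available: unified diff only
)

// DetectFormat picks a format from the file extension (an assumed heuristic).
func DetectFormat(filePath string) Format {
	switch strings.ToLower(path.Ext(filePath)) {
	case ".json":
		return FormatJSON
	case ".yaml", ".yml":
		return FormatYAML
	default:
		return FormatText
	}
}

// Result carries the diff body plus the metadata described in step 4.
type Result struct {
	Format   Format
	Fallback bool
	Body     string
}

// Header reports the format used and whether fallback was applied.
func (r Result) Header() string {
	return fmt.Sprintf("format: %s, fallback: %t", r.Format, r.Fallback)
}

// Hypothetical hooks standing in for the real differs.
type SemanticDiffer func(f Format, base, head []byte) (string, error)
type UnifiedDiffer func(base, head []byte) string

// Compare tries the semantic differ for supported formats and otherwise
// (or when parsing fails, which is an assumption) uses the unified diff.
func Compare(filePath string, base, head []byte, semantic SemanticDiffer, unified UnifiedDiffer) Result {
	format := DetectFormat(filePath)
	if format != FormatText {
		if body, err := semantic(format, base, head); err == nil {
			return Result{Format: format, Body: body}
		}
	}
	return Result{Format: format, Fallback: true, Body: unified(base, head)}
}

func main() {
	// Stub differs so the sketch runs on its own.
	semantic := func(Format, []byte, []byte) (string, error) { return "<semantic diff>", nil }
	unified := func([]byte, []byte) string { return "<unified diff>" }
	for _, p := range []string{"config/app.json", "deploy/values.yml", "README.md"} {
		r := Compare(p, nil, nil, semantic, unified)
		fmt.Printf("%s -> %s\n", p, r.Header())
	}
}
```

Passing the differs in as functions keeps the dispatch testable without a real diff library behind it.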

Example: Semantic vs Line-based

Line-based diff (noisy):

-{"users":[{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]}
+{
+  "users": [
+    {"id": 1, "name": "Alice"},
+    {"id": 2, "name": "Bobby"}
+  ]
+}

Semantic diff (clear):

users[1].name: "Bob" → "Bobby"
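
One way such output could be produced with only the Go standard library is sketched below. It is a minimal illustration, not the proposed implementation: the function names, the "added"/"removed" line wording, and the "$" root marker are all assumptions. Arrays are compared index by index, so an insertion in the middle of an array would show up as a run of changes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// SemanticJSONDiff decodes both documents and returns one line per changed value.
func SemanticJSONDiff(base, head []byte) ([]string, error) {
	var b, h any
	if err := json.Unmarshal(base, &b); err != nil {
		return nil, fmt.Errorf("parse base: %w", err)
	}
	if err := json.Unmarshal(head, &h); err != nil {
		return nil, fmt.Errorf("parse head: %w", err)
	}
	var out []string
	diffValues("", b, h, &out)
	return out, nil
}

// diffValues recurses into matching containers and reports differing leaves.
func diffValues(path string, base, head any, out *[]string) {
	switch b := base.(type) {
	case map[string]any:
		if h, ok := head.(map[string]any); ok {
			diffObjects(path, b, h, out)
			return
		}
	case []any:
		if h, ok := head.([]any); ok {
			diffArrays(path, b, h, out)
			return
		}
	}
	if !reflect.DeepEqual(base, head) {
		if path == "" {
			path = "$" // assumed marker for the document root
		}
		*out = append(*out, fmt.Sprintf("%s: %s → %s", path, render(base), render(head)))
	}
}

// diffObjects visits the union of keys in sorted order, so key order in the
// source files never affects the output.
func diffObjects(path string, base, head map[string]any, out *[]string) {
	keys := make([]string, 0, len(base)+len(head))
	for k := range base {
		keys = append(keys, k)
	}
	for k := range head {
		if _, ok := base[k]; !ok {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	for _, k := range keys {
		child := k
		if path != "" {
			child = path + "." + k
		}
		bv, inBase := base[k]
		hv, inHead := head[k]
		switch {
		case inBase && inHead:
			diffValues(child, bv, hv, out)
		case inBase:
			*out = append(*out, fmt.Sprintf("%s: removed %s", child, render(bv)))
		default:
			*out = append(*out, fmt.Sprintf("%s: added %s", child, render(hv)))
		}
	}
}

// diffArrays compares elements position by position (a naive strategy).
func diffArrays(path string, base, head []any, out *[]string) {
	for i := 0; i < len(base) || i < len(head); i++ {
		child := fmt.Sprintf("%s[%d]", path, i)
		switch {
		case i < len(base) && i < len(head):
			diffValues(child, base[i], head[i], out)
		case i < len(base):
			*out = append(*out, fmt.Sprintf("%s: removed %s", child, render(base[i])))
		default:
			*out = append(*out, fmt.Sprintf("%s: added %s", child, render(head[i])))
		}
	}
}

// render prints a value as compact JSON.
func render(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return fmt.Sprintf("%v", v)
	}
	return string(b)
}

func main() {
	base := []byte(`{"users":[{"id":1,"name":"Alice"},{"id":2,"name":"Bob"}]}`)
	head := []byte(`{"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bobby"}]}`)
	lines, err := SemanticJSONDiff(base, head)
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l) // users[1].name: "Bob" → "Bobby"
	}
}
```

Because both sides are compared after decoding, whitespace and key order drop out of the result entirely.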

Tool Signature

compare_file_contents(
  owner: string,
  repo: string,
  path: string,
  base: string,    // commit SHA, branch, or tag
  head: string,    // commit SHA, branch, or tag
)
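
On the Go side, the arguments might be decoded and checked along these lines. The struct name, the ParseParams helper, the error wording, and the assumption that all five parameters are required are illustrative only; the signature above does not specify them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CompareFileContentsParams mirrors the signature above (hypothetical type name).
type CompareFileContentsParams struct {
	Owner string `json:"owner"`
	Repo  string `json:"repo"`
	Path  string `json:"path"`
	Base  string `json:"base"` // commit SHA, branch, or tag
	Head  string `json:"head"` // commit SHA, branch, or tag
}

// ParseParams decodes raw JSON arguments and rejects missing fields.
// Treating every field as required is an assumption of this sketch.
func ParseParams(raw []byte) (CompareFileContentsParams, error) {
	var p CompareFileContentsParams
	if err := json.Unmarshal(raw, &p); err != nil {
		return p, fmt.Errorf("invalid arguments: %w", err)
	}
	required := []struct{ name, value string }{
		{"owner", p.Owner},
		{"repo", p.Repo},
		{"path", p.Path},
		{"base", p.Base},
		{"head", p.Head},
	}
	for _, f := range required {
		if f.value == "" {
			return p, fmt.Errorf("missing required parameter: %s", f.name)
		}
	}
	return p, nil
}

func main() {
	raw := []byte(`{"owner":"octo","repo":"demo","path":"config.json","base":"main","head":"feature"}`)
	p, err := ParseParams(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("compare %s at %s..%s in %s/%s\n", p.Path, p.Base, p.Head, p.Owner, p.Repo)
}
```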

Use Cases

  1. Change verification: Model edits a file, uses this tool to confirm only intended changes were made
  2. PR review: Quickly understand what actually changed in config/data files
  3. Debugging: Compare a file across commits without formatting noise

Implementation Notes

  • Start behind a feature flag
  • Semantic diff enabled by default for supported formats (no opt-out needed initially)
  • Pure Go implementation using standard library JSON + yaml.v3
  • Supported formats to start: JSON, YAML
  • Future: CSV, TOML, other structured formats

Why This Helps Models

  • Fewer tokens = more room for reasoning
  • Unambiguous output = clearer before/after semantics
  • Path notation (e.g., users[1].name) is already familiar to models
  • Self-verification = models can check their own edits efficiently
