Fix leakage risk in GSO #189

@hiydavid

Describe the bug

Where the leakage happens:

  1. Split exists: benchmarks are split 85/15 into train and held_out (config.py:146, benchmarks.py:928). Curated benchmarks are forced into train.
  2. Mining function: _mine_benchmark_example_sqls() at optimizer.py:7116 iterates over the benchmarks passed in and takes each question's expected_sql as a ready-made example_sql proposal (lines 7170, 7237). Nothing excludes questions that also appear in the eval set.
  3. Called with the train set: _run_enrichment() receives train_benchmarks and passes them straight to _mine_benchmark_example_sqls(...) (harness.py:2319), so a training question's expected SQL can be pasted into the space config.
  4. Re-evaluated on the same train set: the lever loop calls run_evaluation(..., train_benchmarks, ..., "full", ...) at harness.py:4045. When Genie sees a question that matches an example_sql entry mined from that same question, it has a trivial shortcut to the "right" answer, so the accuracy delta is not evidence that the space got better.
  5. Lever 5 LLM prompt: LEVER_5_INSTRUCTION_PROMPT (config.py:1299–1392) is fed failures_context, which contains the original question text and expected SQL. The prompt tells the LLM "do not duplicate existing example SQL questions" but never "do not copy the benchmark question verbatim."
  6. Held-out check is too late: the held-out evaluation (harness.py:6151) runs only after the lever loop has already accepted patches based on inflated training scores.
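
One possible guard for the mining step, sketched under assumptions rather than taken from the repo: drop any mining candidate whose question also appears in the set that will be scored. The names `normalize_question` and `filter_mining_candidates` are hypothetical, and benchmarks are assumed to be dicts with `question` and `expected_sql` keys.

```python
import re


def normalize_question(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so trivial variants compare equal."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def filter_mining_candidates(candidates, eval_benchmarks):
    """Return only the candidates whose question does not appear in eval_benchmarks.

    Hypothetical helper: both arguments are assumed to be lists of dicts
    carrying a "question" key.
    """
    eval_keys = {normalize_question(b["question"]) for b in eval_benchmarks}
    return [
        b for b in candidates
        if normalize_question(b["question"]) not in eval_keys
    ]
```

If the same list is passed as both arguments, as the current flow effectively does with train_benchmarks, the result is empty. That suggests example SQL would have to be mined from a pool that is disjoint from whatever the lever loop scores against.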

Expected behavior

Example SQL entries added to the space config should not overlap with any benchmark question used for evaluation, in either question text or SQL.
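
The expected behavior could be enforced with a check along these lines before a patch is accepted. This is a minimal sketch, not the repo's code: `find_example_sql_overlap` is a hypothetical name, example entries are assumed to be dicts with `question` and `sql` keys, and benchmarks are assumed to carry `question` and `expected_sql`.

```python
import re


def _norm_question(text: str) -> str:
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def _norm_sql(sql: str) -> str:
    # Deliberately crude: lowercase, collapse whitespace, drop a trailing ";".
    # Lowercasing string literals can over-match, which is the safe direction here.
    return " ".join(sql.lower().split()).rstrip("; ")


def find_example_sql_overlap(example_sqls, benchmarks):
    """Return example entries whose question or SQL matches any benchmark."""
    bench_questions = {_norm_question(b["question"]) for b in benchmarks}
    bench_sqls = {_norm_sql(b["expected_sql"]) for b in benchmarks}
    return [
        ex for ex in example_sqls
        if _norm_question(ex["question"]) in bench_questions
        or _norm_sql(ex["sql"]) in bench_sqls
    ]
```

An empty return value means no exact or near-exact overlap was found. Paraphrased questions would still slip through, so a fuzzier comparison may be needed in practice.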

Labels: bug
