Fix leakage risk in GSO #189

> **This is a public repository.** Do not include customer names, workspace URLs, access tokens, or any customer-identifiable information. If your bug involves a customer environment, report it in **#genie-workbench** on Slack instead.

## Describe the bug

Where the leakage happens:
1. Split exists: benchmarks are split 85/15 into train and held_out (config.py:146, benchmarks.py:928). Curated benchmarks are forced into train.
2. Mining function: _mine_benchmark_example_sqls() at optimizer.py:7116 iterates over the benchmarks passed in and takes each question's expected_sql as a ready-made example_sql proposal (lines 7170, 7237). There is no check to exclude questions that also appear in the eval set.
3. Called with train set: _run_enrichment() receives train_benchmarks and passes them straight to _mine_benchmark_example_sqls(...) (harness.py:2319). So a training question's expected SQL can be pasted into the space config.
4. Re-evaluated on the same train set: the lever loop calls run_evaluation(..., train_benchmarks, ..., "full", ...) at harness.py:4045. When Genie sees a question that matches an example_sql entry mined from that same question, it has a trivial shortcut to the "right" answer, so the accuracy delta is not evidence that the space got better.
5. Lever 5 LLM prompt: LEVER_5_INSTRUCTION_PROMPT (config.py:1299–1392) is fed failures_context, which contains the original question text and expected SQL. It tells the LLM "do not duplicate existing example SQL questions" but never "do not copy the benchmark question verbatim."
6. Held-out check is too late: the held-out evaluation (harness.py:6151) runs only after the lever loop has already accepted patches based on inflated training scores.
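
One possible direction, sketched below, is to partition the train benchmarks into a mining subset and a disjoint scoring subset, so that no question whose expected SQL is mined is also used to score the lever loop. This is an illustration, not the repository's code: the `question` key, `normalize_question`, `split_for_mining`, and the `mining_fraction` parameter are all assumptions made for the example. Only `expected_sql` and `train_benchmarks` come from the description above.

```python
import hashlib
import re


def normalize_question(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace.

    Hypothetical helper: makes trivially different spellings of the
    same question compare equal.
    """
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def split_for_mining(train_benchmarks, mining_fraction=0.5):
    """Partition train benchmarks into (mining, scoring) subsets.

    The bucket is derived from a hash of the normalized question, so the
    split is deterministic and duplicate questions always land together.
    Assumes each benchmark is a dict with a "question" key.
    """
    mining, scoring = [], []
    for bench in train_benchmarks:
        key = normalize_question(bench["question"])
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
        (mining if bucket < mining_fraction else scoring).append(bench)
    return mining, scoring


if __name__ == "__main__":
    benchmarks = [
        {"question": f"Question {i}?", "expected_sql": f"SELECT {i}"}
        for i in range(20)
    ]
    mining, scoring = split_for_mining(benchmarks)
    print(len(mining), "for mining;", len(scoring), "for scoring")
```

Under this sketch, only the mining subset would be passed to the mining step and only the scoring subset to the lever-loop evaluation. The trade-off is that fewer questions are available for each purpose.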

## Steps to reproduce

1. Go to '...'
2. Click on '...'
3. See error

## Expected behavior

SQL examples added to the space config should not overlap with any benchmark question that is used for evaluation, whether by question text or by expected SQL.
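
A guard along these lines could enforce that expectation before a patch is accepted. It is a minimal sketch under stated assumptions: the dict keys `question` and `sql` on example entries, the `question` key on benchmarks, and the helper names `find_overlaps` and `assert_no_overlap` are hypothetical and do not come from the codebase. The SQL comparison is deliberately coarse (case-folded, whitespace-collapsed), so it only catches verbatim or near-verbatim copies.

```python
import re


def _norm_text(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def _norm_sql(sql: str) -> str:
    """Coarse SQL normalization: lowercase, collapse whitespace, drop trailing ';'."""
    return " ".join(sql.lower().split()).rstrip(";").strip()


def find_overlaps(example_sqls, benchmarks):
    """Return example entries whose question or SQL matches a benchmark.

    Assumes examples are dicts with "question" and "sql" keys and
    benchmarks are dicts with "question" and "expected_sql" keys.
    """
    bench_questions = {_norm_text(b["question"]) for b in benchmarks}
    bench_sqls = {_norm_sql(b["expected_sql"]) for b in benchmarks}
    return [
        ex
        for ex in example_sqls
        if _norm_text(ex["question"]) in bench_questions
        or _norm_sql(ex["sql"]) in bench_sqls
    ]


def assert_no_overlap(example_sqls, benchmarks):
    """Raise ValueError if any example entry overlaps with a benchmark."""
    overlaps = find_overlaps(example_sqls, benchmarks)
    if overlaps:
        questions = [ex["question"] for ex in overlaps]
        raise ValueError(f"example SQL overlaps with benchmarks: {questions}")
```

Running `assert_no_overlap` against both the train and held_out benchmarks would flag leaked entries regardless of whether they came from the mining step or from the Lever 5 LLM output.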

## Screenshots

If applicable, add screenshots to help explain the problem.

## Additional context

Add any other context about the problem here.