This is a public repository. Do not include customer names, workspace URLs, access tokens, or any customer-identifiable information. If your bug involves a customer environment, report it in #genie-workbench on Slack instead.
Describe the bug
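Benchmark questions leak into the space config as example SQL and are then evaluated against that same config, so the reported accuracy gains are inflated.
Where the leakage happens: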
- Split exists: benchmarks are split 85/15 into train/held_out (config.py:146, benchmarks.py:928). Curated benchmarks are forced into train.
- Mining function: _mine_benchmark_example_sqls() at optimizer.py:7116 iterates over the benchmarks passed in and turns each question's expected_sql into a ready-made example_sql proposal (lines 7170, 7237). Nothing excludes questions that also appear in the eval set.
- Called with train set: _run_enrichment() receives train_benchmarks and passes them straight to _mine_benchmark_example_sqls(...) (harness.py:2319), so a training question's expected SQL can be pasted into the space config.
- Re-evaluated on the same train set: the lever loop calls run_evaluation(..., train_benchmarks, ..., "full", ...) at harness.py:4045. When Genie sees a question that matches an example_sql entry mined from that same question, it has a trivial shortcut to the "right" answer, so the accuracy delta is not evidence that the space got better.
- Lever 5 LLM prompt: LEVER_5_INSTRUCTION_PROMPT (config.py:1299–1392) is fed failures_context, which contains the original question text and expected SQL. It tells the LLM "do not duplicate existing example SQL questions" but never "do not copy the benchmark question verbatim."
- Held-out check is too late: the held-out evaluation (harness.py:6151) runs only after the lever loop has already accepted patches based on inflated training scores.
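One possible direction, sketched below, is to keep the benchmarks that feed example_sql mining disjoint from the benchmarks that score the lever loop. This is only an illustration under stated assumptions: the `Benchmark` dataclass, `normalize_question`, and `partition_train` are hypothetical names, not code from this repository, and the real benchmark records may be shaped differently.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    # Hypothetical record shape, assumed for this sketch only.
    question: str
    expected_sql: str


def normalize_question(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def partition_train(train_benchmarks, mining_fraction=0.5):
    """Split train benchmarks into a mining pool and a scoring pool.

    The bucket is derived from a hash of the normalized question, so the
    split is deterministic and near-duplicate questions land in the same pool.
    """
    mining, scoring = [], []
    for bench in train_benchmarks:
        key = normalize_question(bench.question)
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        bucket = digest[0] / 256  # value in [0, 1)
        if bucket < mining_fraction:
            mining.append(bench)
        else:
            scoring.append(bench)
    return mining, scoring
```

With a split like this, only the mining pool would be handed to the example_sql miner and only the scoring pool to the lever loop's evaluation, so an accepted patch can no longer be scored on a question it was copied from.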
Steps to reproduce
- Go to '...'
- Click on '...'
- See error
Expected behavior
Example SQL entries added to the space config should not overlap with any benchmark question that is later used for evaluation.
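The expected behavior could be enforced with an overlap check run before evaluation. The sketch below is a minimal version that assumes questions are available as plain strings and that exact matching after normalization is sufficient. `find_overlaps` and `_normalize` are hypothetical helpers, not functions from this repository, and paraphrased questions would need a fuzzier comparison.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def find_overlaps(example_questions, benchmark_questions):
    """Return (example, benchmark) pairs whose normalized text is identical."""
    by_key = {_normalize(q): q for q in benchmark_questions}
    overlaps = []
    for example in example_questions:
        match = by_key.get(_normalize(example))
        if match is not None:
            overlaps.append((example, match))
    return overlaps
```

An empty result is the expected state. A non-empty result lists the example SQL questions that would give Genie a shortcut on the corresponding benchmark.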