Background
Follow-up from the Term Mapping v1 work (PR #483, PRD #469). The PRD listed multiple suggester engines, but only HeuristicSuggester ships in v1. The engines/ directory and engine_metadata JSON column on term_mapping_suggestions are already scaffolded for additional engines.
Goal
Add an LLM-as-judge re-rank engine that runs on top of the heuristic suggester to disambiguate mid-confidence candidates.
Design (from original PRD)
- Fires only for heuristic confidences in the mid-range (configurable threshold) and only when explicitly enabled on the run.
- Uses the existing Databricks Foundation Model client (no new SDK dependency).
- Output: confidence ∈ [0, 1] plus a short rationale.
- The judge's confidence multiplies the heuristic confidence; the product is rounded to preserve the high-confidence auto-mark threshold semantics.
- Engine name llm_judge (or similar); rationale stored on the existing reason field; engine-specific bookkeeping in engine_metadata.
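A minimal sketch of the gating and combination rule described above, for discussion only. The names JudgeVerdict, in_mid_range, and combine_confidence, the threshold defaults (0.4 and 0.8), and the rounding precision are all assumptions for illustration; none of them come from the PRD or the existing codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical container for the judge's output."""
    confidence: float  # judge confidence, expected in [0, 1]
    rationale: str     # short explanation, destined for the reason field


def in_mid_range(heuristic_conf: float, low: float = 0.4, high: float = 0.8) -> bool:
    """True when the heuristic confidence is in the slice the judge should see.

    The low/high defaults are placeholders; the issue only says the
    threshold is configurable.
    """
    return low <= heuristic_conf < high


def combine_confidence(heuristic_conf: float, verdict: JudgeVerdict, ndigits: int = 4) -> float:
    """Multiply heuristic and judge confidences, then round.

    Because the judge confidence is in [0, 1], the product can never
    exceed the heuristic confidence.
    """
    if not 0.0 <= verdict.confidence <= 1.0:
        raise ValueError("judge confidence must be in [0, 1]")
    return round(heuristic_conf * verdict.confidence, ndigits)


verdict = JudgeVerdict(confidence=0.5, rationale="Candidate matches a different unit.")
final = combine_confidence(0.7, verdict)
```

With these placeholder values, a heuristic confidence of 0.7 falls in the mid-range and is halved by the judge, so it ends up further from any auto-mark threshold than before.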
Scope
- New engines/llm_judge.py implementing the Suggester protocol (or a complementary Reranker protocol; TBD during design).
- Wire into TermMappingManager.create_run orchestration: heuristic runs first, then the judge re-ranks the mid-confidence slice.
- Per-run toggle in the run-config dialog (engines: ['heuristic', 'llm_judge']).
- LLM consent flow reuses the existing app-wide one-time consent dialog.
- Cost guardrails: hard cap on number of judge calls per run; configurable in Settings.
- Unit tests for engine glue (mock the FM client).
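One possible shape for the orchestration step, assuming the Reranker-protocol option is chosen. Everything here is hypothetical: the Suggestion dataclass, the Reranker.rerank signature, apply_judge, and the warning text are illustrative stand-ins, not the actual TermMappingManager API. The sketch shows the heuristic output being passed through the judge only for the mid-confidence slice, with a hard cap on judge calls that degrades to a warning instead of failing.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Suggestion:
    """Hypothetical stand-in for a term_mapping_suggestions row."""
    term: str
    confidence: float
    reason: str = ""
    engine_metadata: dict = field(default_factory=dict)


class Reranker(Protocol):
    """Assumed complementary protocol; the real one is TBD during design."""
    def rerank(self, suggestion: Suggestion) -> Suggestion: ...


def apply_judge(
    suggestions: list[Suggestion],
    judge: Reranker,
    *,
    low: float,
    high: float,
    max_calls: int,
) -> tuple[list[Suggestion], list[str]]:
    """Re-rank mid-confidence suggestions, stopping at max_calls judge calls."""
    calls = 0
    warnings: list[str] = []
    result: list[Suggestion] = []
    for s in suggestions:
        if not (low <= s.confidence < high):
            result.append(s)  # outside the mid-range: keep heuristic result
            continue
        if calls >= max_calls:
            if not warnings:  # surface a single per-run warning
                warnings.append(f"llm_judge call cap ({max_calls}) reached")
            result.append(s)  # over the cap: fall back to heuristic result
            continue
        calls += 1
        result.append(judge.rerank(s))
    return result, warnings


class HalvingJudge:
    """Test double that halves confidence instead of calling an LLM."""
    def rerank(self, suggestion: Suggestion) -> Suggestion:
        return Suggestion(
            term=suggestion.term,
            confidence=suggestion.confidence * 0.5,
            reason="stub rationale",
            engine_metadata={"heuristic_confidence": suggestion.confidence},
        )


heuristic_out = [Suggestion("a", 0.95), Suggestion("b", 0.6), Suggestion("c", 0.5)]
reranked, run_warnings = apply_judge(
    heuristic_out, HalvingJudge(), low=0.4, high=0.8, max_calls=1
)
```

In this run the high-confidence suggestion is untouched, the first mid-range suggestion is re-ranked, and the second is left at its heuristic confidence because the cap of one call has been reached.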
Out of scope (separate follow-ups)
- Embedding-based semantic engine.
- Per-engine confidence calibration UI.
Acceptance
- Run-config dialog exposes the engine toggle.
- A run with llm_judge enabled produces suggestions whose engine_metadata records both heuristic and judge confidences plus the judge's rationale.
- Auto-apply threshold behavior is preserved: because the judge's confidence is in [0, 1], the judge can only lower the heuristic confidence or leave it unchanged, and the final confidence never exceeds 1.0.
- Cost cap is enforced; runs that hit the cap surface a per-run warning rather than failing.
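As a rough illustration of how the engine_metadata criterion could be checked in a unit test with a mocked FM client. The client interface is an assumption: the real Databricks Foundation Model client's method names are not given in this issue, so a hypothetical complete(prompt) method returning a JSON string is used. judge_suggestion and the metadata key names are likewise illustrative.

```python
import json
from unittest.mock import Mock


def judge_suggestion(fm_client, source_term: str, candidate: str, heuristic_conf: float) -> dict:
    """Ask the (assumed) FM client for a verdict and build the stored fields."""
    prompt = (
        f"Rate from 0 to 1 how well '{candidate}' maps to '{source_term}'. "
        'Reply as JSON: {"confidence": <float>, "rationale": <string>}'
    )
    raw = fm_client.complete(prompt)  # hypothetical method name
    parsed = json.loads(raw)
    # Clamp defensively so a malformed reply cannot raise the final confidence.
    judge_conf = min(1.0, max(0.0, float(parsed["confidence"])))
    return {
        "confidence": round(heuristic_conf * judge_conf, 4),
        "reason": parsed["rationale"],
        "engine_metadata": {
            "engine": "llm_judge",
            "heuristic_confidence": heuristic_conf,
            "judge_confidence": judge_conf,
            "judge_rationale": parsed["rationale"],
        },
    }


# Mock the FM client so no network call is made.
fm_client = Mock()
fm_client.complete.return_value = json.dumps(
    {"confidence": 0.5, "rationale": "Names overlap but units differ."}
)

result = judge_suggestion(fm_client, "cust_id", "Customer Identifier", 0.6)
```

The clamp is a design choice of this sketch rather than something the PRD specifies; it keeps the "judge can only lower" guarantee even if the model returns an out-of-range number.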