[Feature]: Term Mapping — LLM-as-judge suggester engine #485

@larsgeorge-db

Background

Follow-up from the Term Mapping v1 work (PR #483, PRD #469). The PRD listed multiple suggester engines, but only HeuristicSuggester ships in v1. The engines/ directory and engine_metadata JSON column on term_mapping_suggestions are already scaffolded for additional engines.

Goal

Add an LLM-as-judge re-rank engine that runs on top of the heuristic suggester to disambiguate mid-confidence candidates.

Design (from original PRD)

  • Fires only for heuristic confidences in the mid-range (configurable threshold) and only when explicitly enabled on the run.
  • Uses the existing Databricks Foundation Model client (no new SDK dependency).
  • Output: confidence ∈ [0, 1] plus a short rationale.
  • The judge's confidence multiplies the heuristic confidence; the product is rounded to preserve the high-confidence auto-mark threshold semantics.
  • Engine name llm_judge (or similar); rationale stored on the existing reason field; engine-specific bookkeeping in engine_metadata.
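The gating and multiplication rules above can be sketched roughly as follows. This is only an illustration, not the PRD's implementation: the function names, the mid-range bounds (0.4 and 0.8), and the four-decimal rounding are all assumptions made for the example.

```python
def in_mid_range(heuristic_conf: float, low: float = 0.4, high: float = 0.8) -> bool:
    """Return True when the heuristic confidence falls in the judge's band.

    The default bounds are hypothetical; the issue only says the range is configurable.
    """
    return low <= heuristic_conf < high


def combine_confidence(heuristic_conf: float, judge_conf: float, ndigits: int = 4) -> float:
    """Multiply the two confidences and round the product.

    Both inputs must lie in [0, 1], so the raw product cannot exceed the
    heuristic confidence. Rounding could nudge it up by a tiny amount, so the
    result is capped at the heuristic value (an assumption about the intent).
    """
    for name, value in (("heuristic_conf", heuristic_conf), ("judge_conf", judge_conf)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return min(heuristic_conf, round(heuristic_conf * judge_conf, ndigits))
```

With these assumptions, a heuristic confidence of 0.7 and a judge confidence of 0.5 combine to roughly 0.35, so a candidate that was below the auto-mark threshold before the judge ran stays below it afterwards.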

Scope

  • New engines/llm_judge.py implementing the Suggester protocol (or a complementary Reranker protocol — TBD during design).
  • Wire into TermMappingManager.create_run orchestration: heuristic runs first, then the judge re-ranks the mid-confidence slice.
  • Per-run toggle in the run-config dialog (engines: ['heuristic', 'llm_judge']).
  • LLM consent flow reuses the existing app-wide one-time consent dialog.
  • Cost guardrails: hard cap on number of judge calls per run; configurable in Settings.
  • Unit tests for engine glue (mock the FM client).
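One possible shape for the re-rank step, the call cap, and a mocked client is sketched below. Everything here is hypothetical: the `Suggestion` dataclass, the `Reranker` protocol, the `LLMJudgeReranker` class, the metadata key names, and the default bounds are stand-ins, and the `judge` callable takes the place of the real Foundation Model client, whose interface this issue does not describe.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class Suggestion:
    # Hypothetical stand-in for a row of term_mapping_suggestions.
    term: str
    confidence: float
    reason: str = ""
    engine_metadata: dict = field(default_factory=dict)


class Reranker(Protocol):
    def rerank(self, suggestions: list[Suggestion]) -> list[Suggestion]: ...


@dataclass
class LLMJudgeReranker:
    # `judge` replaces the FM client call; it returns (confidence, rationale).
    judge: Callable[[Suggestion], tuple[float, str]]
    max_calls: int = 50   # hard cap on judge calls per run (assumed default)
    low: float = 0.4      # assumed mid-range bounds
    high: float = 0.8
    warnings: list[str] = field(default_factory=list)

    def rerank(self, suggestions: list[Suggestion]) -> list[Suggestion]:
        calls = 0
        for s in suggestions:
            if not (self.low <= s.confidence < self.high):
                continue  # outside the mid-range slice: leave untouched
            if calls >= self.max_calls:
                # Cap reached: record a per-run warning instead of failing.
                self.warnings.append(f"judge call cap of {self.max_calls} reached")
                break
            judge_conf, rationale = self.judge(s)
            calls += 1
            judge_conf = min(1.0, max(0.0, judge_conf))  # clamp to [0, 1]
            heuristic_conf = s.confidence
            s.confidence = min(heuristic_conf, round(heuristic_conf * judge_conf, 4))
            s.reason = rationale
            s.engine_metadata = {
                **s.engine_metadata,
                "heuristic_confidence": heuristic_conf,
                "judge_confidence": judge_conf,
                "judge_rationale": rationale,
            }
        return suggestions


def fake_judge(s: Suggestion) -> tuple[float, str]:
    """Stub used in place of the FM client for unit tests."""
    return 0.5, f"stub rationale for {s.term}"


if __name__ == "__main__":
    run = [Suggestion("a", 0.9), Suggestion("b", 0.6), Suggestion("c", 0.5)]
    engine = LLMJudgeReranker(judge=fake_judge, max_calls=1)
    for s in engine.rerank(run):
        print(s.term, s.confidence, s.engine_metadata)
    print(engine.warnings)
```

Passing the judge in as a plain callable keeps the engine glue testable without network access, which is one way to satisfy the "mock the FM client" item. Whether the real engine should implement Suggester or a separate Reranker protocol remains open, as noted above.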

Out of scope (separate follow-ups)

  • Embedding-based semantic engine.
  • Per-engine confidence calibration UI.

Acceptance

  • Run-config dialog exposes the engine toggle.
  • A run with llm_judge enabled produces suggestions whose engine_metadata records both heuristic and judge confidences plus the judge's rationale.
  • Auto-apply threshold behavior is preserved: because both confidences lie in [0, 1], the judge can only lower the final confidence relative to the heuristic value, never raise it.
  • Cost cap is enforced; runs that hit the cap surface a per-run warning rather than failing.
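The engine_metadata and confidence criteria above could be checked with a small helper along these lines. This is a sketch under assumptions: the key names (`heuristic_confidence`, `judge_confidence`, `judge_rationale`) and the `acceptance_problems` function are invented for illustration, since the issue does not fix the JSON layout of engine_metadata.

```python
import json

# Hypothetical key names; the actual engine_metadata layout is not specified.
REQUIRED_KEYS = ("heuristic_confidence", "judge_confidence", "judge_rationale")


def acceptance_problems(final_confidence: float, engine_metadata_json: str) -> list[str]:
    """Return a list of violated acceptance criteria (empty when all pass)."""
    meta = json.loads(engine_metadata_json)
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in meta]
    if problems:
        return problems
    if not 0.0 <= meta["judge_confidence"] <= 1.0:
        problems.append("judge confidence outside [0, 1]")
    if final_confidence > meta["heuristic_confidence"]:
        # The judge may lower the heuristic confidence but must not raise it.
        problems.append("final confidence exceeds heuristic confidence")
    return problems
```

A test for a judged run could then assert that `acceptance_problems(...)` returns an empty list for every suggestion the judge touched.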

Labels

scope/ontology (Ontology related feature), tech/python (Pull requests that update python code), type/feature (Feature requests)
