MLflow run status and rapidfire_evals.db pipeline status disagree: runs marked FINISHED while their pipeline is still 'ongoing' #223

## Summary

After a FIQA evals grid search (8 configs × 4 shards) was interrupted mid-run, the two stores RapidFire treats as authoritative for run state diverged. For pipelines 6, 7, and 8, none of which completed their 4th shard, MLflow reports the runs as `FINISHED` with ~18 min duration, while `rapidfire_evals.db` correctly shows them as `ongoing` with `shards_completed = 3`. The two stores are supposed to describe the same state; they currently don't, which misleads every downstream consumer.

## Data (from a real interrupted run)

Setup: evals-mode grid search with 8 configs × 4 shards, `num_shards=4`, actor 0 was the only worker. Controller died at 22:36:57 (mid-vLLM-reinit for pipeline 5 shard 3) — see the related orphan-state bug.

**`rapidfire_evals.db.pipelines`:**

| pipeline_id | status | current_shard_id | shards_completed | metric_run_id |
| --- | --- | --- | --- | --- |
| 1 | completed | 4 | 4 | `47a39f2f…` |
| 2 | completed | 4 | 4 | `692cc126…` |
| 3 | completed | 4 | 4 | `a3170f47…` |
| 4 | completed | 4 | 4 | `b834db58…` |
| 5 | ongoing | 3 | 3 | `1d83ac8d…` |
| 6 | ongoing | 3 | 3 | `8dc7feb5…` |
| 7 | ongoing | 3 | 3 | `bbbb74b6…` |
| 8 | ongoing | 3 | 3 | `8c8dd365…` |

**MLflow (`mlflow experiments/search` for experiment id 1), same moment:**

| run_name | run_id | status | duration |
| --- | --- | --- | --- |
| 1 | `47a39f2f…` | FINISHED | 19.7 min |
| 2 | `692cc126…` | FINISHED | 20.0 min |
| 3 | `a3170f47…` | FINISHED | 20.3 min |
| 4 | `b834db58…` | FINISHED | 21.7 min |
| 5 | `1d83ac8d…` | RUNNING → KILLED (manually) | — |
| 6 | `8dc7feb5…` | **FINISHED** | 17.4 min |
| 7 | `bbbb74b6…` | **FINISHED** | 17.9 min |
| 8 | `8c8dd365…` | **FINISHED** | 19.4 min |

Pipelines 6, 7, 8 never ran their 4th shard (the actor got through shards 0-2 for each, then the kernel died while initializing pipeline 5's 4th-shard engine), yet MLflow's `set_terminated(..., status='FINISHED')` fired for all three. This creates permanent disagreement between `rapidfire_evals.db` (says they're still ongoing) and MLflow (says they're done). Whichever you trust is wrong about something.

## Why this matters

- Anything reading MLflow (dashboards, tools, the UI's MLflow tab) sees "FINISHED" runs with partial metrics and no way to know the pipeline never actually completed.
- Anything reading the dispatcher / evals DB (e.g. `rapidfireaipro/converge/backend/core/autopilot_agent/datafetch.py` which hits `/dispatcher/get-all-runs`) sees the pipeline as still ongoing forever.
- The two sources aren't even useful to cross-check: they just confuse users into thinking different things are true.

## Where the divergence probably comes from

Grep of `rapidfireai/evals/` shows two places that end MLflow runs and one place that flips `pipelines.status`:

- `rapidfireai/evals/scheduling/controller.py:805` — `self.metric_manager.end_run(metric_run_id)` inside the "Phase 8" per-pipeline final-metrics block (inside `run_multi_pipeline_inference`).
- `rapidfireai/evals/actors/query_actor.py:309` — `mlflow.end_run()` in an actor path (looks like inference-engine cleanup between configs).
- `rapidfireai/evals/scheduling/controller.py:1195` — `db.set_pipeline_status(pipeline_id, PipelineStatus.COMPLETED)` — ONLY runs when `shards_completed >= num_shards`.

Hypothesis: `query_actor.py:309` is called during actor cleanup / inference-engine swap **before** the pipeline is actually finished. That would close the MLflow run (→ MLflow shows FINISHED) without touching `pipelines.status` (→ evals DB keeps `ongoing`). When the actor / controller then dies before getting back to shard 3, `pipelines.status` never advances and the divergence is baked in permanently.

I can't prove this is the code path without stepping through, but the bug's shape matches: exactly the pipelines whose MLflow run ran on an actor that later switched to a different config are the ones marked FINISHED-but-not-COMPLETED.
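To make the hypothesis concrete, here is a toy model of the suspected sequence. It is not RapidFire code: `ToyTracker`, `ToyPipelineRow`, and both functions are hypothetical stand-ins. The only MLflow behavior it relies on is that the fluent `mlflow.end_run()` terminates a run as `FINISHED` by default.

```python
from dataclasses import dataclass, field

NUM_SHARDS = 4


@dataclass
class ToyTracker:
    """Hypothetical stand-in for MLflow: maps run_id -> status."""
    status: dict = field(default_factory=dict)

    def start_run(self, run_id):
        self.status[run_id] = "RUNNING"

    def end_run(self, run_id):
        # Mirrors the fluent mlflow.end_run() default, which ends as FINISHED.
        self.status[run_id] = "FINISHED"


@dataclass
class ToyPipelineRow:
    """Hypothetical stand-in for one row of the pipelines table."""
    status: str = "ongoing"
    shards_completed: int = 0


def actor_processes_shard(row, tracker, run_id):
    row.shards_completed += 1
    # Suspected premature close when the actor swaps to another config.
    tracker.end_run(run_id)


def controller_maybe_complete(row):
    # Status only flips once every shard is done.
    if row.shards_completed >= NUM_SHARDS:
        row.status = "completed"


tracker = ToyTracker()
row = ToyPipelineRow()
tracker.start_run("run-6")
for _ in range(3):  # shards 0-2 finish, shard 3 never runs
    actor_processes_shard(row, tracker, "run-6")
    controller_maybe_complete(row)
# Controller dies here, before shard 3.
print(tracker.status["run-6"], row.status, row.shards_completed)
```

If the real code follows this shape, the tracker ends up `FINISHED` while the row stays `ongoing` with 3 shards, which matches the tables above for pipelines 6, 7, and 8.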

## Fix hints

### Most robust: single source of truth, single writer

Only the `controller.py` code path that ends a pipeline should call MLflow `end_run` on that pipeline's `metric_run_id`, and it should do so immediately before or after `set_pipeline_status(..., PipelineStatus.COMPLETED)`. Remove `mlflow.end_run()` from the actor cleanup path: actors shouldn't finalize tracking for a run whose lifecycle they don't own.

### Safer incremental

If the actor needs to drop its MLflow context between configs (because it's mutating some active-run state in the MLflow client), note that `mlflow.end_run(status='RUNNING')` is not a thing. Instead, the actor can:

- Call `MlflowClient().set_tag(run_id, 'actor_released', 'true')` and detach without terminating the run (the fluent `mlflow.set_tag` takes no `run_id` and only targets the active run), or
- Keep the run object open and simply stop logging to it from this actor's session, or
- Use `MlflowClient` APIs that let you log to a run from any context without tying it to a "currently active" run.
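The first and third options above could look roughly like this. It is a sketch, not the actor's actual code: the helper names, the `ndcg` metric, and `RecordingClient` (a fake used only so the example runs without an MLflow server) are all hypothetical. The two client calls it makes, `log_metric(run_id, key, value, step=...)` and `set_tag(run_id, key, value)`, are the ones `MlflowClient` exposes.

```python
def log_shard_metrics(client, run_id, metrics, step):
    """Log metrics to an existing run by id, without activating or ending it."""
    for key, value in metrics.items():
        client.log_metric(run_id, key, value, step=step)


def release_run(client, run_id):
    """Record that this actor moved on; the run itself stays open."""
    client.set_tag(run_id, "actor_released", "true")


class RecordingClient:
    """Hypothetical fake exposing the two client methods used above."""

    def __init__(self):
        self.calls = []

    def log_metric(self, run_id, key, value, timestamp=None, step=None):
        self.calls.append(("log_metric", run_id, key, value, step))

    def set_tag(self, run_id, key, value):
        self.calls.append(("set_tag", run_id, key, value))


client = RecordingClient()
log_shard_metrics(client, "run-6", {"ndcg": 0.5}, step=2)
release_run(client, "run-6")
print(client.calls)
```

Against a real tracking server, `client` would be `MlflowClient()` from `mlflow.tracking`. Neither helper calls `end_run` or `set_terminated`, so the run's status is left for the controller to decide.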

### Defensive: reconcile on completion

Where the controller calls `set_pipeline_status(..., PipelineStatus.COMPLETED)` (controller.py:1195), also call `MlflowClient.set_terminated(metric_run_id, 'FINISHED', end_time)`. Where it calls `set_pipeline_status(..., PipelineStatus.FAILED)` (controller.py:1320), pass `'FAILED'` instead. This guarantees the two stores agree at transition time.
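A minimal sketch of that pairing, assuming a `db` object with `set_pipeline_status` and an `MlflowClient`-like `client`. The function name, the status strings standing in for `PipelineStatus`, and the two fakes are hypothetical; `set_terminated(run_id, status, end_time)` is the real client signature, with `end_time` in epoch milliseconds.

```python
import time

# Assumed mapping from evals-DB status strings to MLflow terminal statuses.
TERMINAL_MLFLOW_STATUS = {"completed": "FINISHED", "failed": "FAILED"}


def set_status_and_sync(db, client, pipeline_id, metric_run_id, status):
    """Write the DB status, then mirror terminal states onto the MLflow run."""
    db.set_pipeline_status(pipeline_id, status)
    mlflow_status = TERMINAL_MLFLOW_STATUS.get(status)
    if mlflow_status is not None:
        end_time_ms = int(time.time() * 1000)  # MLflow expects epoch millis
        client.set_terminated(metric_run_id, mlflow_status, end_time=end_time_ms)


class FakeDb:
    """Hypothetical stand-in for the evals DB handle."""

    def __init__(self):
        self.status = {}

    def set_pipeline_status(self, pipeline_id, status):
        self.status[pipeline_id] = status


class FakeClient:
    """Hypothetical stand-in for MlflowClient."""

    def __init__(self):
        self.terminated = {}

    def set_terminated(self, run_id, status=None, end_time=None):
        self.terminated[run_id] = status


db, client = FakeDb(), FakeClient()
set_status_and_sync(db, client, 6, "run-6", "ongoing")    # non-terminal: DB only
set_status_and_sync(db, client, 1, "run-1", "completed")  # terminal: both stores
print(db.status, client.terminated)
```

The two writes are not atomic, so a crash between them can still leave the stores out of sync; that is one reason the auditing helper below is worth having as well.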

### Auditing helper

Add a small CLI `rapidfireai doctor --reconcile` (or new subcommand) that detects and prints rows where `pipelines.status ∈ {'ongoing', 'completed'}` disagrees with the corresponding MLflow run status (via `metric_run_id`). Users could run it after any interrupted experiment.
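The detection step might be as small as the sketch below. It assumes the `pipelines` table has the `pipeline_id`, `status`, and `metric_run_id` columns shown in the data section; the column types, the `EXPECTED` mapping, and `find_mismatches` are hypothetical. The demo uses an in-memory database seeded with the values from the tables above, with run ids left truncated as in the issue.

```python
import sqlite3

# Assumed policy: which MLflow statuses are consistent with each DB status.
EXPECTED = {"completed": {"FINISHED"}, "ongoing": {"RUNNING"}}


def find_mismatches(conn, mlflow_status_by_run_id):
    """Return (pipeline_id, db_status, mlflow_status) for disagreeing rows."""
    rows = conn.execute(
        "SELECT pipeline_id, status, metric_run_id FROM pipelines "
        "ORDER BY pipeline_id"
    ).fetchall()
    mismatches = []
    for pipeline_id, db_status, run_id in rows:
        mlflow_status = mlflow_status_by_run_id.get(run_id, "MISSING")
        allowed = EXPECTED.get(db_status)
        if allowed is not None and mlflow_status not in allowed:
            mismatches.append((pipeline_id, db_status, mlflow_status))
    return mismatches


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipelines (pipeline_id INTEGER, status TEXT, metric_run_id TEXT)"
)
conn.executemany(
    "INSERT INTO pipelines VALUES (?, ?, ?)",
    [
        (1, "completed", "47a39f2f…"), (2, "completed", "692cc126…"),
        (3, "completed", "a3170f47…"), (4, "completed", "b834db58…"),
        (5, "ongoing", "1d83ac8d…"), (6, "ongoing", "8dc7feb5…"),
        (7, "ongoing", "bbbb74b6…"), (8, "ongoing", "8c8dd365…"),
    ],
)
mlflow_statuses = {
    "47a39f2f…": "FINISHED", "692cc126…": "FINISHED",
    "a3170f47…": "FINISHED", "b834db58…": "FINISHED",
    "1d83ac8d…": "KILLED", "8dc7feb5…": "FINISHED",
    "bbbb74b6…": "FINISHED", "8c8dd365…": "FINISHED",
}
mismatches = find_mismatches(conn, mlflow_statuses)
for row in mismatches:
    print(row)
```

On this data it flags pipelines 5 through 8. In a real helper the status mapping could be built with `MlflowClient().get_run(run_id).info.status` for each `metric_run_id`.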

## Related

- RapidFireAI/rapidfireai#222 — orphaned `ongoing` rows after controller death. Same family; probably wants a unified fix.
- RapidFireAI/rapidfireai-pro#(TBD for Converge hallucination caused by trusting the dispatcher-side view of this).

## Environment

- Ubuntu 24.04.3 LTS on GCP (kernel 6.14.0-1021-gcp)
- Python 3.12.3
- `rapidfireai` 0.15.3rc5 + `rapidfireai-pro` 0.15.3rc7
- MLflow 3.9.0 (sqlite backend: `~/rapidfireai/db/rapidfire_mlflow.db`)
- Evals DB: `~/rapidfireai/db/rapidfire_evals.db`
- Mode: `--evals`
