MLflow run status and rapidfire_evals.db pipeline status disagree: runs marked FINISHED while their pipeline is still 'ongoing' #223

@kamran-rapidfireAI

Summary

After a FIQA evals grid search (8 configs × 4 shards) was interrupted mid-run, the two authoritative stores of run state that RapidFire maintains diverged. For pipelines 6, 7, and 8, none of which ever completed their 4th shard, MLflow reports the runs as FINISHED with durations of 17.4 to 19.4 min, while rapidfire_evals.db correctly shows them as ongoing with shards_completed = 3. These two stores are supposed to describe the same thing; they currently don't, which confuses every downstream consumer.

Data (from a real interrupted run)

Setup: evals-mode grid search with 8 configs × 4 shards, num_shards=4, actor 0 was the only worker. Controller died at 22:36:57 (mid-vLLM-reinit for pipeline 5 shard 3) — see the related orphan-state bug.

rapidfire_evals.db.pipelines:

| pipeline_id | status | current_shard_id | shards_completed | metric_run_id |
|---|---|---|---|---|
| 1 | completed | 4 | 4 | 47a39f2f… |
| 2 | completed | 4 | 4 | 692cc126… |
| 3 | completed | 4 | 4 | a3170f47… |
| 4 | completed | 4 | 4 | b834db58… |
| 5 | ongoing | 3 | 3 | 1d83ac8d… |
| 6 | ongoing | 3 | 3 | 8dc7feb5… |
| 7 | ongoing | 3 | 3 | bbbb74b6… |
| 8 | ongoing | 3 | 3 | 8c8dd365… |

MLflow (mlflow experiments/search for experiment id 1), same moment:

| run_name | run_id | status | duration |
|---|---|---|---|
| 1 | 47a39f2f… | FINISHED | 19.7 min |
| 2 | 692cc126… | FINISHED | 20.0 min |
| 3 | a3170f47… | FINISHED | 20.3 min |
| 4 | b834db58… | FINISHED | 21.7 min |
| 5 | 1d83ac8d… | RUNNING → KILLED (manually) | |
| 6 | 8dc7feb5… | FINISHED | 17.4 min |
| 7 | bbbb74b6… | FINISHED | 17.9 min |
| 8 | 8c8dd365… | FINISHED | 19.4 min |

Pipelines 6, 7, 8 never ran their 4th shard (the actor got through shards 0-2 for each, then the kernel died while initializing pipeline 5's 4th-shard engine), yet MLflow's set_terminated(..., status='FINISHED') fired for all three. This creates permanent disagreement between rapidfire_evals.db (says they're still ongoing) and MLflow (says they're done). Whichever you trust is wrong about something.

Why this matters

  • Anything reading MLflow (dashboards, tools, the UI's MLflow tab) sees "FINISHED" runs with partial metrics and no way to know the pipeline never actually completed.
  • Anything reading the dispatcher / evals DB (e.g. rapidfireaipro/converge/backend/core/autopilot_agent/datafetch.py which hits /dispatcher/get-all-runs) sees the pipeline as still ongoing forever.
  • The two sources aren't even useful to cross-check: they just confuse users into thinking different things are true.

Where the divergence probably comes from

Grep of rapidfireai/evals/ shows two places that end MLflow runs and one place that flips pipelines.status:

  • rapidfireai/evals/scheduling/controller.py:805: self.metric_manager.end_run(metric_run_id), inside the "Phase 8" per-pipeline final-metrics block (in run_multi_pipeline_inference).
  • rapidfireai/evals/actors/query_actor.py:309: mlflow.end_run() in an actor path (looks like inference-engine cleanup between configs).
  • rapidfireai/evals/scheduling/controller.py:1195: db.set_pipeline_status(pipeline_id, PipelineStatus.COMPLETED), which ONLY runs when shards_completed >= num_shards.

Hypothesis: query_actor.py:309 is called during actor cleanup / inference-engine swap before the pipeline is actually finished. That would close the MLflow run (→ MLflow shows FINISHED) without touching pipelines.status (→ evals DB keeps ongoing). When the actor / controller then dies before getting back to shard 3, pipelines.status never advances and the divergence is baked in permanently.

I can't prove this is the code path without stepping through, but the bug's shape matches: exactly the pipelines whose MLflow run ran on an actor that later switched to a different config are the ones marked FINISHED-but-not-COMPLETED.

Fix hints

Most robust: single source of truth, single writer

Only the controller.py code path that completes a pipeline should be allowed to call MLflow end_run on that pipeline's metric_run_id, and it should do so immediately before or after set_pipeline_status(..., PipelineStatus.COMPLETED). Remove mlflow.end_run() from the actor cleanup path: actors shouldn't finalize tracking for a run whose lifecycle they don't own.

Safer incremental

If the actor needs to drop its MLflow context between configs (because it's mutating some active-run state in the MLflow client), mlflow.end_run(status='RUNNING') is not a thing, but you can:

  • Call MlflowClient().set_tag(run_id, 'actor_released', 'true') and detach without terminating the run (the fluent mlflow.set_tag takes only a key and a value and applies to the active run), or
  • Keep the run object open and simply stop logging to it from this actor's session, or
  • Use MlflowClient APIs that let you log to a run from any context without tying it to a "currently active" run.
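
The last option can be sketched roughly as follows. This is a minimal sketch, not RapidFire's actual code: `log_shard_metrics` and `release_run` are hypothetical helpers, and `client` is assumed to be an `mlflow.tracking.MlflowClient` (any object exposing the same `log_metric` and `set_tag` methods would work). Because every call names the run by ID, nothing depends on the fluent API's active run, so the actor has no reason to call `mlflow.end_run()`.

```python
from typing import Any, Mapping


def log_shard_metrics(client: Any, run_id: str,
                      metrics: Mapping[str, float], shard_id: int) -> None:
    """Log one shard's metrics to an explicitly named run."""
    for key, value in metrics.items():
        # MlflowClient.log_metric(run_id, key, value, step=...) writes to the
        # named run and does not change that run's status.
        client.log_metric(run_id, key, float(value), step=shard_id)


def release_run(client: Any, run_id: str) -> None:
    """Record that this actor stopped using the run, without terminating it."""
    # 'actor_released' is a hypothetical tag name; tag values are stored as strings.
    client.set_tag(run_id, "actor_released", "true")
```

With MLflow installed, the actor would pass in `MlflowClient()` imported from `mlflow.tracking` and the pipeline's metric_run_id; the run stays open until the controller terminates it.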

Defensive: reconcile on completion

In set_pipeline_status(..., PipelineStatus.COMPLETED) (controller.py:1195), also call MlflowClient.set_terminated(metric_run_id, 'FINISHED', end_time). In set_pipeline_status(..., PipelineStatus.FAILED) (1320), call it with 'FAILED'. This guarantees the two stores agree at transition time.
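
A minimal sketch of that reconciliation, assuming a wrapper around the existing status update. `set_status_and_sync`, the `db` and `client` parameters, and the local `PipelineStatus` enum (including its 'failed' value) are stand-ins for illustration; the only real MLflow call is `MlflowClient.set_terminated(run_id, status, end_time)`, which takes `end_time` in milliseconds since the epoch.

```python
import time
from enum import Enum
from typing import Any, Optional


class PipelineStatus(Enum):  # stand-in for RapidFire's own enum
    ONGOING = "ongoing"
    COMPLETED = "completed"
    FAILED = "failed"


# Terminal pipeline states and the MLflow run status each one maps to.
MLFLOW_STATUS = {
    PipelineStatus.COMPLETED: "FINISHED",
    PipelineStatus.FAILED: "FAILED",
}


def set_status_and_sync(db: Any, client: Any, pipeline_id: int,
                        metric_run_id: str, status: PipelineStatus,
                        end_time_ms: Optional[int] = None) -> None:
    """Update the evals DB, then mirror terminal states into MLflow."""
    db.set_pipeline_status(pipeline_id, status)
    mlflow_status = MLFLOW_STATUS.get(status)
    if mlflow_status is None:
        return  # non-terminal status: leave the MLflow run open
    if end_time_ms is None:
        end_time_ms = int(time.time() * 1000)
    client.set_terminated(metric_run_id, mlflow_status, end_time_ms)
```

Writing the DB first means a crash between the two calls leaves the MLflow run open rather than prematurely FINISHED, which is the less misleading failure mode for the case described in this issue.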

Auditing helper

Add a small CLI rapidfireai doctor --reconcile (or new subcommand) that detects and prints rows where pipelines.status ∈ {'ongoing', 'completed'} disagrees with the corresponding MLflow run status (via metric_run_id). Users could run it after any interrupted experiment.
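
The comparison at the core of such a helper could look like the sketch below. It is not an existing RapidFire command: `find_disagreements` and the `EXPECTED` mapping are assumptions, and the inputs are plain Python structures so the logic stays independent of how the rows are fetched (for example from pipelines in rapidfire_evals.db and from MLflow via metric_run_id). The sample data reuses the rows from the interrupted run above, with run IDs truncated as in the tables.

```python
from typing import Dict, List, Tuple

# MLflow statuses treated as consistent with each pipeline status (an assumption).
# A pipeline status missing from this mapping is always reported as a mismatch.
EXPECTED = {
    "ongoing": {"RUNNING", "SCHEDULED"},
    "completed": {"FINISHED"},
}


def find_disagreements(
    pipelines: List[Tuple[int, str, str]],  # (pipeline_id, status, metric_run_id)
    mlflow_status: Dict[str, str],          # metric_run_id -> MLflow run status
) -> List[Tuple[int, str, str]]:
    """Return (pipeline_id, db_status, mlflow_status) for every mismatch."""
    mismatches = []
    for pipeline_id, db_status, run_id in pipelines:
        actual = mlflow_status.get(run_id, "MISSING")
        if actual not in EXPECTED.get(db_status, set()):
            mismatches.append((pipeline_id, db_status, actual))
    return mismatches


# Rows from the tables in this issue (run IDs are the truncated prefixes).
pipelines = [
    (1, "completed", "47a39f2f"), (2, "completed", "692cc126"),
    (3, "completed", "a3170f47"), (4, "completed", "b834db58"),
    (5, "ongoing", "1d83ac8d"), (6, "ongoing", "8dc7feb5"),
    (7, "ongoing", "bbbb74b6"), (8, "ongoing", "8c8dd365"),
]
mlflow_status = {
    "47a39f2f": "FINISHED", "692cc126": "FINISHED",
    "a3170f47": "FINISHED", "b834db58": "FINISHED",
    "1d83ac8d": "KILLED",  # manually killed after the crash
    "8dc7feb5": "FINISHED", "bbbb74b6": "FINISHED", "8c8dd365": "FINISHED",
}

for pipeline_id, db_status, run_status in find_disagreements(pipelines, mlflow_status):
    print(f"pipeline {pipeline_id}: evals DB={db_status}, MLflow={run_status}")
```

On this data the function flags pipelines 5 through 8. Pipeline 5 appears because its run was manually killed while the evals DB still says ongoing; 6, 7, and 8 are the FINISHED-but-ongoing cases this issue is about.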

Related

Environment

  • Ubuntu 24.04.3 LTS on GCP (kernel 6.14.0-1021-gcp)
  • Python 3.12.3
  • rapidfireai 0.15.3rc5 + rapidfireai-pro 0.15.3rc7
  • MLflow 3.9.0 (sqlite backend: ~/rapidfireai/db/rapidfire_mlflow.db)
  • Evals DB: ~/rapidfireai/db/rapidfire_evals.db
  • Mode: --evals
