Summary
After a FIQA evals grid search (8 configs × 4 shards) was interrupted mid-run, the two authoritative stores RapidFire maintains about run state diverged. For pipelines 6, 7, 8 — none of which ever completed their 4th shard — MLflow reports their runs as FINISHED with ~18 min duration, while rapidfire_evals.db correctly shows them as ongoing with shards_completed = 3. These two stores are supposed to describe the same thing; they currently don't, which confuses every downstream consumer.
Data (from a real interrupted run)
Setup: evals-mode grid search with 8 configs × 4 shards, num_shards=4, actor 0 was the only worker. Controller died at 22:36:57 (mid-vLLM-reinit for pipeline 5 shard 3) — see the related orphan-state bug.
rapidfire_evals.db.pipelines:
| pipeline_id | status | current_shard_id | shards_completed | metric_run_id |
|---|---|---|---|---|
| 1 | completed | 4 | 4 | 47a39f2f… |
| 2 | completed | 4 | 4 | 692cc126… |
| 3 | completed | 4 | 4 | a3170f47… |
| 4 | completed | 4 | 4 | b834db58… |
| 5 | ongoing | 3 | 3 | 1d83ac8d… |
| 6 | ongoing | 3 | 3 | 8dc7feb5… |
| 7 | ongoing | 3 | 3 | bbbb74b6… |
| 8 | ongoing | 3 | 3 | 8c8dd365… |
MLflow (mlflow experiments/search for experiment id 1), same moment:
| run_name | run_id | status | duration |
|---|---|---|---|
| 1 | 47a39f2f… | FINISHED | 19.7 min |
| 2 | 692cc126… | FINISHED | 20.0 min |
| 3 | a3170f47… | FINISHED | 20.3 min |
| 4 | b834db58… | FINISHED | 21.7 min |
| 5 | 1d83ac8d… | RUNNING → KILLED (manually) | — |
| 6 | 8dc7feb5… | FINISHED | 17.4 min |
| 7 | bbbb74b6… | FINISHED | 17.9 min |
| 8 | 8c8dd365… | FINISHED | 19.4 min |
Pipelines 6, 7, 8 never ran their 4th shard (the actor got through shards 0-2 for each, then the kernel died while initializing pipeline 5's 4th-shard engine), yet MLflow's set_terminated(..., status='FINISHED') fired for all three. This creates permanent disagreement between rapidfire_evals.db (says they're still ongoing) and MLflow (says they're done). Whichever you trust is wrong about something.
Why this matters
- Anything reading MLflow (dashboards, tools, the UI's MLflow tab) sees "FINISHED" runs with partial metrics and no way to know the pipeline never actually completed.
- Anything reading the dispatcher / evals DB (e.g. `rapidfireaipro/converge/backend/core/autopilot_agent/datafetch.py`, which hits `/dispatcher/get-all-runs`) sees the pipeline as still ongoing forever.
- The two sources aren't even useful to cross-check: they just confuse users into thinking different things are true.
Where the divergence probably comes from
Grep of rapidfireai/evals/ shows two places that end MLflow runs and one place that flips pipelines.status:
- `rapidfireai/evals/scheduling/controller.py:805`: `self.metric_manager.end_run(metric_run_id)` inside the "Phase 8" per-pipeline final-metrics block (inside `run_multi_pipeline_inference`).
- `rapidfireai/evals/actors/query_actor.py:309`: `mlflow.end_run()` in an actor path (looks like inference-engine cleanup between configs).
- `rapidfireai/evals/scheduling/controller.py:1195`: `db.set_pipeline_status(pipeline_id, PipelineStatus.COMPLETED)`, which ONLY runs when `shards_completed >= num_shards`.
Hypothesis: query_actor.py:309 is called during actor cleanup / inference-engine swap before the pipeline is actually finished. That would close the MLflow run (→ MLflow shows FINISHED) without touching pipelines.status (→ evals DB keeps ongoing). When the actor / controller then dies before getting back to shard 3, pipelines.status never advances and the divergence is baked in permanently.
I can't prove this is the code path without stepping through, but the bug's shape matches: exactly the pipelines whose MLflow run ran on an actor that later switched to a different config are the ones marked FINISHED-but-not-COMPLETED.
Fix hints
Most robust: single source of truth, single writer
Only the `controller.py` code path that ends a pipeline should be allowed to call MLflow `end_run` on that pipeline's `metric_run_id`, and the call should happen immediately before or after `set_pipeline_status(..., PipelineStatus.COMPLETED)`. Remove `mlflow.end_run()` from the actor cleanup path: actors shouldn't finalize tracking for a run whose lifecycle they don't own.
Safer incremental
If the actor needs to drop its MLflow context between configs (because it's mutating some active-run state in the MLflow client), it has to do so without terminating the run. `mlflow.end_run(status='RUNNING')` is not a thing, but you can:
- Call `MlflowClient().set_tag(run_id, 'actor_released', 'true')` and detach without terminating the run (the fluent `mlflow.set_tag` takes only a key and a value and writes to the active run, so the client form is the one that accepts a `run_id`), or
- Keep the run object open and simply stop logging to it from this actor's session, or
- Use `MlflowClient` APIs that let you log to a run from any context without tying it to a "currently active" run.
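A minimal sketch of the "log by `run_id`, never end the run" option, assuming the actor is handed the pipeline's `metric_run_id`. The helper names `log_shard_metrics` and `release_run`, the `accuracy` metric, and the `run-6` ID are hypothetical; `log_metric(run_id, key, value, step=...)` and `set_tag(run_id, key, value)` are the `MlflowClient` methods it relies on. An in-memory stand-in replaces the real client so the example runs without a tracking server.

```python
from typing import Mapping, Optional, Protocol


class RunClient(Protocol):
    """The two MlflowClient methods this sketch relies on."""

    def log_metric(self, run_id: str, key: str, value: float,
                   timestamp: Optional[int] = None,
                   step: Optional[int] = None) -> None: ...

    def set_tag(self, run_id: str, key: str, value: str) -> None: ...


def log_shard_metrics(client: RunClient, run_id: str, shard_id: int,
                      metrics: Mapping[str, float]) -> None:
    """Log one shard's metrics to an explicit run_id.

    No active run is opened or closed, so nothing here can terminate the run.
    """
    for key, value in metrics.items():
        client.log_metric(run_id, key, value, step=shard_id)


def release_run(client: RunClient, run_id: str) -> None:
    """Tag the run as released by this actor, leaving its status untouched."""
    client.set_tag(run_id, "actor_released", "true")


class RecordingClient:
    """Hypothetical in-memory stand-in for MlflowClient (records calls only)."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def log_metric(self, run_id, key, value, timestamp=None, step=None):
        self.calls.append(("log_metric", run_id, key, value, step))

    def set_tag(self, run_id, key, value):
        self.calls.append(("set_tag", run_id, key, value))


# In real use this would be mlflow.tracking.MlflowClient().
client = RecordingClient()
log_shard_metrics(client, "run-6", shard_id=2, metrics={"accuracy": 0.5})
release_run(client, "run-6")
```

Because the actor never enters or leaves an active-run context in this shape, swapping inference engines between configs has no way to flip the run to FINISHED.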
Defensive: reconcile on completion
In the `set_pipeline_status(..., PipelineStatus.COMPLETED)` path (controller.py:1195), also call `MlflowClient.set_terminated(metric_run_id, 'FINISHED', end_time)`. In the `set_pipeline_status(..., PipelineStatus.FAILED)` path (controller.py:1320), pass `'FAILED'` instead. This guarantees the two stores agree at transition time.
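The reconcile-on-completion idea could look roughly like the sketch below. It assumes a single controller-side helper (here called `finalize_pipeline`, a hypothetical name) that writes both stores together. Plain strings stand in for `PipelineStatus` members, and the two stand-in classes are illustrative, not RapidFire or MLflow code. The `set_terminated(run_id, status, end_time)` call follows the `MlflowClient` signature, with `end_time` in milliseconds since the epoch.

```python
import time
from typing import Optional, Protocol


class PipelineStore(Protocol):
    def set_pipeline_status(self, pipeline_id: int, status: str) -> None: ...


class TrackingClient(Protocol):
    def set_terminated(self, run_id: str, status: Optional[str] = None,
                       end_time: Optional[int] = None) -> None: ...


# Assumed mapping from terminal evals-DB statuses to MLflow run statuses.
MLFLOW_STATUS_FOR = {"completed": "FINISHED", "failed": "FAILED"}


def finalize_pipeline(db: PipelineStore, client: TrackingClient,
                      pipeline_id: int, metric_run_id: str, status: str) -> None:
    """Move a pipeline to a terminal status in both stores.

    The mapping lookup happens first, so a non-terminal status such as
    "ongoing" raises KeyError before either store is touched.
    """
    mlflow_status = MLFLOW_STATUS_FOR[status]
    end_time_ms = int(time.time() * 1000)  # MLflow expects epoch milliseconds
    db.set_pipeline_status(pipeline_id, status)
    client.set_terminated(metric_run_id, mlflow_status, end_time_ms)


class FakeStore:
    """Hypothetical stand-in for the evals DB wrapper."""

    def __init__(self) -> None:
        self.status: dict[int, str] = {}

    def set_pipeline_status(self, pipeline_id, status):
        self.status[pipeline_id] = status


class FakeTracking:
    """Hypothetical stand-in for MlflowClient."""

    def __init__(self) -> None:
        self.status: dict[str, str] = {}

    def set_terminated(self, run_id, status=None, end_time=None):
        self.status[run_id] = status


db, tracking = FakeStore(), FakeTracking()
finalize_pipeline(db, tracking, 4, "run-4", "completed")
finalize_pipeline(db, tracking, 5, "run-5", "failed")
```

The two writes are still not atomic: if the process dies between them, the stores can disagree, which is why an auditing pass remains useful alongside this.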
Auditing helper
Add a small CLI rapidfireai doctor --reconcile (or new subcommand) that detects and prints rows where pipelines.status ∈ {'ongoing', 'completed'} disagrees with the corresponding MLflow run status (via metric_run_id). Users could run it after any interrupted experiment.
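A rough sketch of the detection half of such a helper, assuming the `pipelines` table has the `pipeline_id`, `status` and `metric_run_id` columns shown in the data above. The function name `find_divergent_pipelines`, the status mapping, and the `run-N` IDs are hypothetical. The MLflow lookup is injected as a callable so the example runs against an in-memory SQLite database; in real use it would presumably wrap `MlflowClient().get_run(run_id).info.status`.

```python
import sqlite3
from typing import Callable

# Assumed mapping: what MLflow should report for each evals-DB status.
EXPECTED_MLFLOW_STATUS = {"ongoing": "RUNNING", "completed": "FINISHED"}


def find_divergent_pipelines(
    conn: sqlite3.Connection,
    get_mlflow_status: Callable[[str], str],
) -> list[tuple[int, str, str, str]]:
    """Return (pipeline_id, evals_status, metric_run_id, mlflow_status)
    for every pipeline whose two stores disagree."""
    rows = conn.execute(
        "SELECT pipeline_id, status, metric_run_id FROM pipelines "
        "WHERE status IN ('ongoing', 'completed') ORDER BY pipeline_id"
    ).fetchall()
    divergent = []
    for pipeline_id, evals_status, run_id in rows:
        mlflow_status = get_mlflow_status(run_id)
        if mlflow_status != EXPECTED_MLFLOW_STATUS[evals_status]:
            divergent.append((pipeline_id, evals_status, run_id, mlflow_status))
    return divergent


# Demo with an in-memory DB and placeholder run IDs (not the real ones).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipelines (pipeline_id INTEGER, status TEXT, metric_run_id TEXT)"
)
conn.executemany(
    "INSERT INTO pipelines VALUES (?, ?, ?)",
    [(4, "completed", "run-4"), (5, "ongoing", "run-5"), (6, "ongoing", "run-6")],
)
fake_mlflow = {"run-4": "FINISHED", "run-5": "RUNNING", "run-6": "FINISHED"}

divergent = find_divergent_pipelines(conn, fake_mlflow.__getitem__)
for pipeline_id, evals_status, run_id, mlflow_status in divergent:
    print(f"pipeline {pipeline_id}: evals={evals_status} mlflow={mlflow_status} ({run_id})")
```

Only detection is shown here; which store should win when they disagree is a policy decision the sketch deliberately leaves open.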
Related
- … `ongoing` rows after controller death. Same family; probably wants a unified fix.
Environment
- Ubuntu 24.04.3 LTS on GCP (kernel 6.14.0-1021-gcp)
- Python 3.12.3
- `rapidfireai` 0.15.3rc5 + `rapidfireai-pro` 0.15.3rc7
- MLflow 3.9.0 (sqlite backend: `~/rapidfireai/db/rapidfire_mlflow.db`)
- Evals DB: `~/rapidfireai/db/rapidfire_evals.db`
- Mode: `--evals`