Add staleness determination to backend and mermaid diagrams by petebachant · Pull Request #558 · calkit/calkit-cloud

petebachant · 2026-04-15T23:05:21Z

TODO

Don't slow down pipeline page load--maybe cache

Copilot

Pull request overview

Adds backend logic to determine DVC pipeline “staleness” per stage (and overall), and uses that information to enrich pipeline/figure/publication API responses and color Mermaid pipeline diagrams accordingly.

Changes:

Introduces app.pipeline utilities to compute per-stage staleness and color Mermaid diagrams by status.
Extends API response models (Pipeline/Figure/Publication) to include stage_status(es) and overall pipeline status.
Updates project routes to compute and attach stage status metadata (and infer stages from dvc.lock outs when missing).
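
As a rough illustration of the Mermaid coloring described above, the sketch below appends `classDef` and `class` statements to an existing flowchart. The function name, the status labels ("up-to-date", "stale"), the class names, and the colors are all assumptions made for this example and are not taken from the PR; it also assumes Mermaid node IDs equal the stage names.

```python
def color_mermaid_by_status(mermaid: str, statuses: dict[str, str]) -> str:
    """Append Mermaid styling so each stage node is colored by its status."""
    # Hypothetical status labels mapped to (class name, fill color)
    palette = {
        "up-to-date": ("fresh", "#2e7d32"),
        "stale": ("stale", "#b26a00"),
    }
    lines = [mermaid.rstrip()]
    # Define one Mermaid class per status
    for class_name, color in palette.values():
        lines.append(f"    classDef {class_name} fill:{color},color:#fff;")
    # Assign each stage node to the class matching its status
    for stage, status in sorted(statuses.items()):
        if status in palette:
            lines.append(f"    class {stage} {palette[status][0]};")
    return "\n".join(lines)


diagram = "flowchart TD\n    collect --> plot"
colored = color_mermaid_by_status(
    diagram, {"collect": "up-to-date", "plot": "stale"}
)
```

Stages with an unrecognized status are left unstyled, so the diagram still renders if a new status is introduced later.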

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File	Description
backend/app/pipeline.py	New staleness computation + Mermaid coloring + stage lookup helpers.
backend/app/tests/test_pipeline.py	Unit tests covering core staleness scenarios and stage lookup.
backend/app/models/core.py	Adds StageStatus model and exposes stage status fields in Pipeline/Figure/Publication responses.
backend/app/api/routes/projects/core.py	Computes stage statuses in pipeline/figures/publications endpoints and attaches status data to responses.

Copilot · 2026-04-16T03:15:19Z

+    if path in outs_index:
+        return outs_index[path]
+    ptr = _read_dvc_pointer_md5(tree, path)
+    if ptr is not None:
+        return ptr
+    if tree.is_file(path):
+        return _tree_file_md5(tree, path)


_resolve_current_dep_md5() returns outs_index[path] for any dep that is also listed as an out elsewhere in dvc.lock, without checking whether that out actually exists in the repo tree (as a file or .dvc pointer) or in object storage. This can mark downstream stages as up-to-date even when their input artifact is missing (the upstream stage would be stale, but the downstream stage’s deps wouldn’t reflect that). Consider resolving deps by checking the tree / pointer first and only falling back to outs_index if the artifact is actually available, or treat missing upstream outs as modified inputs.

Suggested change

-    if path in outs_index:
-        return outs_index[path]
-    ptr = _read_dvc_pointer_md5(tree, path)
-    if ptr is not None:
-        return ptr
-    if tree.is_file(path):
-        return _tree_file_md5(tree, path)
+    ptr = _read_dvc_pointer_md5(tree, path)
+    if ptr is not None:
+        return ptr
+    if tree.is_file(path):
+        return _tree_file_md5(tree, path)
+    if path in outs_index:
+        return outs_index[path]
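
The other option mentioned in the comment, treating a missing upstream out as a modified input, could look roughly like the sketch below. `DictTree`, `resolve_dep_md5`, and the `in_storage` callable are hypothetical stand-ins for the repo tree and the object storage check, and the `.dvc` pointer lookup is left out for brevity.

```python
import hashlib


class DictTree:
    """Hypothetical stand-in for the repo tree: maps path -> file bytes."""

    def __init__(self, files):
        self.files = files

    def is_file(self, path):
        return path in self.files

    def read_bytes(self, path):
        return self.files[path]


def resolve_dep_md5(path, tree, outs_index, in_storage):
    """Return the current md5 for a dep, or None if the artifact is missing."""
    # 1. A file committed to the tree is the most direct evidence
    if tree.is_file(path):
        return hashlib.md5(tree.read_bytes(path)).hexdigest()
    # 2. Trust the locked out md5 only if the object is actually available
    md5 = outs_index.get(path)
    if md5 is not None and in_storage(md5):
        return md5
    # 3. Otherwise report the input as missing so the caller can mark the
    #    downstream stage as having a modified dep
    return None


tree = DictTree({"data/raw.csv": b"a,b\n"})
present = resolve_dep_md5("data/raw.csv", tree, {}, lambda md5: False)
locked = resolve_dep_md5("out.bin", tree, {"out.bin": "abc"}, lambda md5: True)
missing = resolve_dep_md5("out.bin", tree, {"out.bin": "abc"}, lambda md5: False)
```

Returning `None` for a missing artifact keeps the decision about how to label the stage in the caller, rather than burying it in the resolver.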

Copilot · 2026-04-16T03:15:19Z

+def _md5_in_object_storage(
+    md5: str | None, owner_name: str, project_name: str, fs
+) -> bool:
+    if not md5:
+        return False
+    try:
+        return (
+            get_data_fpath_for_md5(
+                owner_name=owner_name,
+                project_name=project_name,
+                md5=md5,
+                fs=fs,
+            )
+            is not None
+        )
+    except Exception as e:
+        logger.warning(f"Object storage existence check failed for {md5}: {e}")
+        return False


_md5_in_object_storage() calls get_data_fpath_for_md5() (which can do multiple remote fs.exists checks) every time it’s invoked. compute_stage_statuses() may call this repeatedly for the same md5s across stages/outs/deps, which can add noticeable latency on pipeline-heavy repos. Consider memoizing md5→bool for the duration of compute_stage_statuses() (or functools.lru_cache on _md5_in_object_storage keyed by md5/owner/project) so each md5 triggers at most one object-store probe per request.
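
A minimal sketch of the per-request memoization suggested above, assuming the real probe can be wrapped as a callable that takes only an md5. `make_cached_probe` and the fake probe are hypothetical names for illustration; in the PR the wrapped callable would presumably close over `owner_name`, `project_name`, and `fs`.

```python
from __future__ import annotations

from typing import Callable


def make_cached_probe(
    probe: Callable[[str], bool],
) -> Callable[[str | None], bool]:
    """Wrap an md5 existence probe so each md5 is checked at most once."""
    cache: dict[str, bool] = {}

    def cached(md5: str | None) -> bool:
        # Mirror the original guard: an empty or missing md5 is never present
        if not md5:
            return False
        if md5 not in cache:
            cache[md5] = probe(md5)
        return cache[md5]

    return cached


# Usage with a fake probe that records how often it is called
calls: list[str] = []


def fake_probe(md5: str) -> bool:
    calls.append(md5)
    return md5 == "aaa"


exists = make_cached_probe(fake_probe)
results = [exists("aaa"), exists("aaa"), exists("bbb"), exists(None)]
```

Building the cache inside a closure created once per `compute_stage_statuses()` call keeps entries from outliving the request. A module-level `functools.lru_cache` would persist across requests and requires every argument to be hashable, which may not hold for `fs`.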

Copilot · 2026-04-16T03:15:19Z

    ck_info = get_ck_info_from_repo(repo)
    pipeline = get_dvc_pipeline_from_repo(repo)
    publications = ck_info.get("publications", [])
    overleaf_info = calkit.overleaf.get_sync_info(
        wdir=repo.working_dir, ck_info=ck_info, fix_legacy=False
    )
    resp = []
    tree = app.projects.get_repo_tree_for_ref(repo, ref)
    ck_info_full, dvc_lock_outs, zip_path_map = (
        app.projects.get_ck_info_and_dvc_outs_from_tree(project, tree)
    )
+    dvc_lock: dict = {}
+    if tree.is_file("dvc.lock"):
+        dvc_lock = ryaml.load(tree.read_bytes("dvc.lock").decode()) or {}
+    stage_statuses = {}
+    try:
+        stage_statuses = compute_stage_statuses(
+            dvc_yaml=pipeline,
+            dvc_lock=dvc_lock,
+            tree=tree,
+            owner_name=project.owner_account_name,
+            project_name=project.name,
+            fs=get_object_fs(),
+        )


compute_stage_statuses() is run against tree = get_repo_tree_for_ref(repo, ref) (so it reflects ref), but dvc_yaml is taken from pipeline = get_dvc_pipeline_from_repo(repo), which reads dvc.yaml from the working tree checkout. Since get_repo(..., ref=...) intentionally does not checkout/mutate the working tree for ref-based reads, this can compute stage statuses using a dvc.yaml that doesn't correspond to ref. Consider loading/parsing dvc.yaml from tree (similar to the figures route) when ref is provided, so the status computation is consistent.

Copilot · 2026-04-16T03:15:20Z

+        tree = app.projects.get_repo_tree_for_ref(repo, ref)
+        dvc_lock: dict = {}
+        if tree.is_file("dvc.lock"):
+            dvc_lock = ryaml.load(tree.read_bytes("dvc.lock").decode()) or {}
+        fs = get_object_fs()
+        stage_statuses = compute_stage_statuses(
+            dvc_yaml=dvc_pipeline,
+            dvc_lock=dvc_lock,
+            tree=tree,
+            owner_name=project.owner_account_name,
+            project_name=project.name,
+            fs=fs,
+        )


When ref is provided, tree = get_repo_tree_for_ref(repo, ref) represents that historical ref, but dvc_pipeline / mermaid / dvc_content were read from the working tree checkout (which get_repo(..., ref=...) intentionally does not checkout). This means compute_stage_statuses() can end up diffing a dvc.yaml from HEAD against a dvc.lock/tree from ref, producing incorrect staleness results. Consider loading dvc.yaml (and params/calkit.yaml if needed) from tree when ref is set, or otherwise document/disable ref for this endpoint.
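
One way to keep the inputs consistent is to read both files from the same tree, along the lines of the sketch below. `load_mapping_from_tree` and `DictTree` are hypothetical, and `json.loads` is used only as a stand-in parser so the example runs without third-party packages; the routes above would pass their YAML loader (`ryaml.load`) instead.

```python
import json
from typing import Any, Callable


class DictTree:
    """Hypothetical stand-in for the ref's tree: maps path -> file bytes."""

    def __init__(self, files: dict):
        self.files = files

    def is_file(self, path: str) -> bool:
        return path in self.files

    def read_bytes(self, path: str) -> bytes:
        return self.files[path]


def load_mapping_from_tree(
    tree: Any, path: str, parse: Callable[[str], Any]
) -> dict:
    """Read and parse a file from the tree; return {} if absent or empty."""
    if not tree.is_file(path):
        return {}
    return parse(tree.read_bytes(path).decode()) or {}


# Both documents come from the same tree, so they describe the same ref
tree = DictTree(
    {"dvc.yaml": b'{"stages": {"train": {"cmd": "python train.py"}}}'}
)
dvc_yaml = load_mapping_from_tree(tree, "dvc.yaml", json.loads)
dvc_lock = load_mapping_from_tree(tree, "dvc.lock", json.loads)
```

With this shape, `dvc_yaml` and `dvc_lock` can both be handed to the status computation without either one depending on the working tree checkout.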

Copilot · 2026-04-16T03:15:20Z

    return Pipeline(
        dvc_stages=dvc_pipeline["stages"],
        mermaid=mermaid,
        dvc_yaml=dvc_content,
        calkit_yaml=calkit_content,
+        stage_statuses=stage_statuses,


Pipeline.stage_statuses is typed as dict[str, models.core.StageStatus], but here you pass through the dict returned by compute_stage_statuses, whose values are app.pipeline.StageStatus (a different Pydantic model). With Pydantic v2 this is likely to raise a validation error (Input should be a valid dictionary or instance of StageStatus). Consider converting to the API model before constructing Pipeline (e.g., dump each status to a dict / validate into models.core.StageStatus), or unify the StageStatus model so both layers use the same type.

Suggested change

-    return Pipeline(
-        dvc_stages=dvc_pipeline["stages"],
-        mermaid=mermaid,
-        dvc_yaml=dvc_content,
-        calkit_yaml=calkit_content,
-        stage_statuses=stage_statuses,
+    pipeline_stage_statuses = {
+        stage_name: (
+            stage_status.model_dump()
+            if isinstance(stage_status, BaseModel)
+            else stage_status
+        )
+        for stage_name, stage_status in stage_statuses.items()
+    }
+    return Pipeline(
+        dvc_stages=dvc_pipeline["stages"],
+        mermaid=mermaid,
+        dvc_yaml=dvc_content,
+        calkit_yaml=calkit_content,
+        stage_statuses=pipeline_stage_statuses,

Copilot · 2026-04-16T03:15:20Z

+            if auto_stage is not None:
+                pub["stage"] = auto_stage
+        if pub.get("stage"):
            pub["stage_info"] = pipeline.get("stages", {}).get(pub["stage"])


When stage is auto-detected via find_stage_for_path, it may include @... expansions from dvc.lock, but pipeline.get('stages', ...) is keyed by base stage names from dvc.yaml. That can make stage_info unexpectedly None. Consider looking up stage info with the base name (e.g., pub['stage'].split('@')[0]) while still keeping the full stage name for stage_status lookup.

Suggested change

-            pub["stage_info"] = pipeline.get("stages", {}).get(pub["stage"])
+            base_stage = pub["stage"].split("@", 1)[0]
+            pub["stage_info"] = pipeline.get("stages", {}).get(base_stage)
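
To show how the base-name lookup behaves for both expanded and plain stage names, here is a small standalone sketch. `lookup_stage_info` and the sample stage names are hypothetical; only the `split("@", 1)[0]` step comes from the suggestion above.

```python
def lookup_stage_info(stages: dict, stage_name: str):
    """Look up a stage definition by base name, ignoring any '@' suffix."""
    # "plot@fig1" -> "plot"; a name without "@" is returned unchanged
    base_stage = stage_name.split("@", 1)[0]
    return stages.get(base_stage)


stages = {"plot": {"cmd": "python plot.py"}}
expanded = lookup_stage_info(stages, "plot@fig1")
plain = lookup_stage_info(stages, "plot")
unknown = lookup_stage_info(stages, "paper@v2")
```

The full name (including the `@...` part) stays available to the caller for the `stage_status` lookup, since only the local `base_stage` is truncated.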

petebachant added 4 commits April 15, 2026 10:14

Add pipeline staleness to backend endpoints

c6994c7

Color pipeline mermaid diagram based on stage status

4c285c7

Darken colors

eee8099

Merge branch 'main' into stale

93ea1e3

petebachant added this to Calkit Apr 16, 2026

petebachant moved this to In progress in Calkit Apr 16, 2026

petebachant requested a review from Copilot April 16, 2026 03:09

Copilot started reviewing on behalf of petebachant April 16, 2026 03:09

Copilot AI reviewed Apr 16, 2026

petebachant added 2 commits April 18, 2026 07:35

Merge in main

33b0dff

Merge branch 'stale' of github.com:calkit/calkit-cloud into stale

341f12e

petebachant wants to merge 6 commits into main from stale

2 participants
