fix(argo): persist zarr-source cache across retries via shared PVC #172

@lhoupert

Description

Problem

Each Argo retry spawns a new pod with a fresh ephemeral `/tmp`. The fsspec `simplecache` populated during a failed run is lost, so every retry re-downloads the full source product from EODC over HTTPS. This wastes time and increases the chance of hitting the same transient-failure window again.

Proposed fix

Add a workflow-scoped `volumeClaimTemplates` PVC (20 Gi) to `eopf-explorer-convert-v1-s2-template.yaml` and mount it at `/cache/zarr-source`. Set `ZARR_SOURCE_CACHE_DIR=/cache/zarr-source` on the convert pod.

No changes to `scripts/convert_v1_s2.py` — the script already reads `ZARR_SOURCE_CACHE_DIR`.
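For readers unfamiliar with the pattern, here is a minimal sketch of how a script could turn `ZARR_SOURCE_CACHE_DIR` into an fsspec `simplecache` location. It is not the code in `scripts/convert_v1_s2.py`: the helper names `resolve_cache_dir` and `open_cached_store` and the `DEFAULT_CACHE_DIR` fallback are hypothetical, and the real script may wire the cache differently.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical fallback used when ZARR_SOURCE_CACHE_DIR is unset (assumption,
# not taken from the real script).
DEFAULT_CACHE_DIR = Path(tempfile.gettempdir()) / "zarr-source-cache"


def resolve_cache_dir(env=None):
    """Return the cache directory from the environment, creating it if needed."""
    env = os.environ if env is None else env
    cache_dir = Path(env.get("ZARR_SOURCE_CACHE_DIR") or DEFAULT_CACHE_DIR)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


def open_cached_store(url, cache_dir):
    """Map a remote Zarr store through fsspec's simplecache (requires fsspec)."""
    import fsspec  # imported lazily so resolve_cache_dir works without fsspec

    # "simplecache::" chains a local whole-file cache in front of the remote
    # filesystem; cache_storage selects where the cached copies are written.
    return fsspec.get_mapper(
        f"simplecache::{url}",
        simplecache={"cache_storage": str(cache_dir)},
    )
```

With the PVC mounted at `/cache/zarr-source` and the environment variable set, `resolve_cache_dir()` would return that path, so cached files land on the volume rather than in the pod's ephemeral `/tmp`.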

The PVC is scoped to one workflow execution and cleaned up by Argo GC after completion.

Expected outcome

On a transient-failure retry (OOM, network blip, SIGTERM), the new pod finds previously-fetched source chunks in the cache and only re-downloads the chunks it hasn't seen yet — reducing HTTP load on EODC and shortening retry time.
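The retry behaviour can be illustrated with a small stand-alone simulation. This is a sketch of the general cache-hit logic only, not fsspec's `simplecache` implementation: `fetch_chunks`, `fake_download`, and the chunk keys are all made up for illustration, and a temporary directory stands in for the shared PVC.

```python
import hashlib
import tempfile
from pathlib import Path


def fetch_chunks(keys, cache_dir, download):
    """Return chunk bytes per key, calling download() only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    data = {}
    downloaded = []
    for key in keys:
        # Hash the key so nested chunk paths map to flat file names.
        cached = cache_dir / hashlib.sha256(key.encode()).hexdigest()
        if not cached.exists():
            # Write under a temporary name, then rename, so a pod killed
            # mid-download does not leave a truncated file that looks valid.
            tmp = cached.with_suffix(".part")
            tmp.write_bytes(download(key))
            tmp.replace(cached)
            downloaded.append(key)
        data[key] = cached.read_bytes()
    return data, downloaded


def fake_download(key):
    """Stand-in for an HTTPS fetch (hypothetical)."""
    return key.encode()


cache = tempfile.mkdtemp()  # plays the role of the shared volume
keys = ["b02/0.0", "b02/0.1", "b03/0.0"]

# Attempt 1 is interrupted after fetching two chunks.
_, first = fetch_chunks(keys[:2], cache, fake_download)
# Attempt 2 runs in a "new pod" but sees the same cache directory.
_, retry = fetch_chunks(keys, cache, fake_download)

print(first)  # ['b02/0.0', 'b02/0.1']
print(retry)  # ['b03/0.0']
```

The second call fetches only the chunk that the first attempt never reached, which is the saving the shared PVC is meant to provide.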

Pre-flight check

Before applying, confirm that Argo PVC GC is enabled:
```bash
kubectl get configmap workflow-controller-configmap -n argo -o yaml | grep pvcAutoDelete
```

Verification

  1. Apply the updated workflow YAML
  2. Trigger a retry by deleting the running convert pod: `kubectl delete pod -n devseed <convert-pod-name>`
  3. On the new pod, confirm the cache is populated: `kubectl exec -n devseed <new-convert-pod-name> -- ls -lh /cache/zarr-source/`
  4. After workflow completes, confirm PVC is cleaned up: `kubectl get pvc -n devseed | grep zarr-source-cache`
