CMIP6 dataset support by mcgibbon · Pull Request #1115 · ai2cm/ace

mcgibbon · 2026-05-04T17:45:05Z

Short description of why the PR is needed and how it satisfies those requirements, in sentence form.

Changes:

symbol (e.g. fme.core.my_function) or script and concise description of changes or added feature
Can group multiple related symbols on a single bullet
Tests added
If dependencies changed, "deps only" image rebuilt and "latest_deps_only_image.txt" file updated

Resolves # (delete if none)

Adds scripts/cmip6_data/ with a planning README, config dataclasses (dacite + YAML), and two starter configs for the forthcoming inventory and processing scripts. No data-moving code yet.

Adds inventory.py that pulls the Pangeo GCS catalog as a gzipped CSV (no intake-esm dep), filters to our table/experiments/variables, parses variant_label into r/i/p/f, and prints a coverage summary. An --enrich-stores flag is wired in for future store-level enrichment. Inventory run reveals that Pangeo's daily coverage of ``ta`` is nil (3 models vs ~47 for ua/va/hus) and ``ps`` is zero at daily cadence. Drops both from the required core set; adds DIAGNOSTIC_VARIABLES so the inventory still reports their coverage. README documents the hypsometric derivation of layer-mean T from zg + hus that will substitute for ``ta`` in the processing step.

Atmosphere-only emulators need an SST / sea-ice / land-mask lower boundary. Pangeo publishes effectively no daily forcing data, so we ingest the monthly versions and interpolate onto the daily axis: - ts (Amon), siconc (SImon) -> linear time-interp to daily - sftlf, orog (fx) -> static per-model Changes: - config: FORCING_VARIABLES and STATIC_VARIABLES constants; a CatalogQuery dataclass so InventoryConfig holds one query per table_id; forcing and static variable knobs in DefaultsConfig threaded through resolve(). - inventory: loop over CatalogQuery list, concatenate, and summarise forcing coverage per source_id. - pilot config: time_subset narrowed to a single year per experiment (2010 historical, 2015 ssp585); per-label member cap (max_members_per_f=3) and require_i=1 documented as defaults. - README: forcings section, storage layout updates, processing steps now include forcing interpolation and static broadcasting. Inventory re-run confirms ~21 source_ids satisfy core + ts + siconc + sftlf + orog, ~25 if orog is optional.

Pressure-level fields get regridded to a rectangular (lat, lon) grid where the 45 latitudes are Gauss-Legendre quadrature nodes and the 90 longitudes are equispaced at 4 deg. SHTs on the processed data are exact, which matters for the spherical-harmonic model we're targeting. - Add scripts/cmip6_data/grid.py with make_target_grid("F22.5"), copied lightly from scripts/era5/pipeline/xr-beam-pipeline.py for self-containment. - Replace TargetGrid.lat_step/lon_step with TargetGrid.name = "F22.5". - Small test_grid.py covering shape, symmetry, bounds, and lon offset. - README notes the SHT-compat rationale and references the helper.

- One shared below_surface_mask(time, plev, lat, lon) as uint8 per dataset, derived at ingest. Primary source: NaN union across the 3D plev variables (captures day-to-day surface-pressure variation via the model's own masking). Fallback: zg < orog (also time-varying via zg). Datasets where neither is available are dropped. - Fill method renamed from "persistence_down" to "nearest_above" — a clearer name, since it's nearest-neighbor-in-the-vertical, not time persistence. Each column's below-surface levels inherit the lowest above-surface level's value, handling any number of consecutive masked bottom levels. - README restructured: resolved issues collapsed to one-liner summaries, Issue 7 summary added, remaining open issues listed in working order.

Follows the scripts/data_process convention: inner zarr v3 chunk of time=1 (per-timestep read granularity for training's short windows), outer shard of time=365 (~one shard per year per variable). At F22.5 x plev8, per-shard sizes are ~47 MB for 3D fields, ~6 MB for 2D, ~12 MB for the mask — all healthy GCS object sizes. - ChunkingConfig replaced `time: int = 32` with `chunk_time: int = 1` and `shard_time: Optional[int] = 365`. None = unsharded (debug). - README moves Issue 6 to Resolved.

Label for the embedding space is (source_id, p), with realization, initialization, and forcing treated as within-label variation. Stored both as primitive fields (label_source_id, label_physics_index) and as a composite string label of the form "{source_id}.p{p}" (always includes .pN even for single-p models, so it maps 1:1 to CMIP6). - config.make_label(source_id, p) centralizes the format. - README moves Issue 2 to Resolved; Issue 9 (index schema) is the last remaining decision before writing process.py.

All pre-processing decisions are now locked in. Remaining work is process.py itself (Tasks 11 + 12). - index.py: DatasetIndexRow dataclass with identity/provenance/ processing/output/status groups. rows_to_dataframe JSON-encodes the list/dict fields for flat tabular output. write_index writes CSV unconditionally and parquet when an engine (pyarrow/fastparquet) is available (logged-and-skipped otherwise). write_sidecar drops a metadata.json alongside each successful zarr. - test_index.py: round-trip + sidecar coverage, runs standalone or under pytest. - README: moved Issue 9 summary into Resolved; added a short Dependencies section covering the optional pyarrow install.

Per-dataset processing script. Reads the inventory + YAML config, applies the Issue 3 selection rules, and for each selected dataset runs the pipeline: open core + forcing + static zstores, validate cell_methods, regrid to F22.5 via xESMF, compute the time-varying below_surface_mask, nearest-above fill the 3D plev variables, derive layer-mean T from zg + hus, interpolate monthly forcings onto daily, write zarr with zarr v3 chunks+shards, drop metadata.json sidecar. Resumability: the sidecar is the done-marker. On re-run, a complete sidecar causes the dataset to be skipped; a zarr directory without one is deleted and re-processed. --force rebuilds everything; --dry-run / --source-ids / --max-datasets help iterate on subsets. Other: - ProcessConfig.index_path removed; index is derived as <output_directory>/index.{csv,parquet}. - pilot.yaml updated: inventory_path points at inventory.csv (parquet engine still optional). - README: storage-layout box updated, process.py contract documented, Open Issues collapsed to "end-to-end validation remains". Dry-run against the committed inventory.csv selects 219 tasks across our 53 day-table source_ids (modulo the max_members_per_f=3 cap), confirming selection logic round-trips.

Plain pip-installable dep list. esmpy (the ESMF python binding that xesmf needs) can be flaky to install via pip; README recommends ``conda install -c conda-forge xesmf esmpy`` if the pip path fails. Pyarrow is included so index.parquet output is enabled by default.

1. **Stale zarr v2 encoding propagated to v3 write.** The Pangeo source zarrs carry zarr v2 codec metadata (``compressors=Blosc(...)`` plus ``_FillValue=1e+20``) that xarray surfaces as per-variable encoding. When we later do ``ds.to_zarr(..., zarr_format=3)`` that encoding either errors outright ("Expected a BytesBytesCodec. Got Blosc.") or, with explicit chunk/shard overrides, silently corrupts the write. Added ``_clear_stale_encoding`` that strips ``compressors``, ``filters``, ``preferred_chunks``, ``chunks``, ``shards``, ``_FillValue``, ``missing_value``, ``dtype`` before writing. 2. **xesmf regridder is not thread-safe under dask.** Leaving the regrid lazy and letting dask parallelise the zarr write across timestep chunks caused per-chunk ESMF calls to race, producing non-deterministic NaN holes in 3D fields (ua ~57% NaN, va ~42%, hus ~63%, zg 0% in the first buggy run). Calling ``.load()`` on the dataset before writing forces regridding to run once, sequentially, on the main thread. Memory cost for a 1-year F22.5 dataset is ~400 MB, well within budget. 3. write_index now creates its output directory first, so failed-only runs don't hit "non-existent directory" on index emit. Verified on CanESM5 historical r1i1p1f1, 2010 full year: all core 3D and 2D fields end up NaN-free in the readback; the only remaining NaNs are in ts / siconc at the monthly-to-daily interp edges (first ~15 days of Jan, last ~15 of Dec) and in siconc over land. Those are real (ts interp can't extrapolate outside Jan-16..Dec-16; siconc is undefined on land). Fill or extrapolation for those is a follow-up.

- _interp_monthly_to_daily now does a ``.bfill("time").ffill("time")`` after the linear interp so days before the first monthly sample and after the last monthly sample take the nearest monthly value (constant extrapolation). Eliminates the ~30 days of NaN we were leaving at the Jan 1-15 / Dec 17-31 edges. - siconc is emitted alongside a new time-invariant ``siconc_mask`` (uint8, 2D lat-lon, 1=valid/ocean, 0=land/no-source-coverage) taken from the first timestep's pre-fill NaN pattern. The stored siconc values then get land NaN replaced with 0 so the data is clean; the mask tells downstream code which cells are semantically meaningful. Verified on CanESM5 historical r1i1p1f1 2010: every variable in the written zarr is NaN-free, and the extrapolated ts/siconc head and tail match the first/last interpolated day as expected.

Three changes surfaced by the first sanity-check run: 1. **Derive layer-T before the zg fill, not after.** Nearest-above fill on ``zg`` forces ``zg[plev=i] == zg[plev=i+1]`` in below-surface columns, so ``dz = 0`` there and the hypsometric formula collapsed ``ta_derived_layer_*`` to 0 K over Tibet / Antarctica / Greenland. Reorder: compute derived T from unfilled ``zg + hus``, then nearest-above fill the level variables, then cascading-fill the layer T values top-down (``_fill_derived_layer_T`` iterates from layer N-2 down to 0, inheriting the layer above where either bounding level is below surface). Global-mean ``ta_derived_layer_0`` now matches ``tas`` within a few K. 2. **Sanity-check pass.** ``_run_sanity_checks`` runs after the dataset is materialised, before write. It range-checks winds, humidity, temperatures, pressures, precipitation, fluxes, ocean fraction, and orography; it also checks the derived layer-T magnitudes and compares the global mean of layer 0 to ``tas``. Findings are appended to ``row.warnings`` and logged — advisory only, never fail the dataset. 3. **siconc rounding clip.** Conservative regridding from the ocean grid can leave sub-epsilon negatives and mild >100 overshoot; clip to [0, 100] after the fillna(0) for land cells. Also moves the ``.load()`` out of ``_write_zarr`` into ``process_one`` itself, so the sanity-check pass runs on materialised data for free. README gets a new section on masking the derived T layer grid: since ``ta_derived_layer_i`` lives on the layer between ``plev[i]`` and ``plev[i+1]`` (not on plev itself), downstream code needs to OR the two bounding levels of ``below_surface_mask`` to mask layer ``i``. Snippet included. Verified on CanESM5 historical r1i1p1f1 2010: all derived layer-T values are in a physically sensible range (layer 0 mean 273 K, cooling monotonically through the troposphere, stratospheric warm nose at layer 6), and the sanity-check pass is clean.

Three iterations on process.py surfaced by running 4 models: 1. **360_day calendar in ``_apply_time_subset``.** UKESM1-0-LL publishes with the 360_day calendar, where every month has 30 days. Our config uses ``2010-12-31`` as the pilot historical end; that day doesn't exist under 360_day and ``ds.sel(time=slice(..., "2010-12-31"))`` raises ``invalid day number``. All 7 UKESM1-0-LL tasks failed before this fix. Added a small ``_clip_date_for_calendar`` helper that caps the day to 30 under 360_day. Noleap / Gregorian are unaffected since Dec 31 exists in both. 2. **Preserve conservative regridding's integral — don't clip.** Radiation fluxes came out with sub-epsilon negatives (e.g. rsds min = -0.004 W/m^2) from conservative regridding rounding. Initial instinct was to clip, but that breaks the area-weighted integral conservation which is the whole point of conservative regridding. Instead: widen the sanity bounds by a small ``_EPS`` tolerance (0.01 in the variable's natural units) so noise near the physical boundary doesn't fire. Also widened ``hfss`` / ``hfls`` upper bounds to cover real tropical extremes (~1000-1200 W/m^2 at daily means), which were firing on MPI-ESM1-2-LR. 3. **Removed siconc clip.** We were clipping siconc to [0, 100] for the same reason I was about to clip radiation — this also breaks conservation. Dropped the clip; the sanity bound tolerates the small overshoot. README gets a short paragraph on the conservative-integral-is-sacred policy. Verified: re-ran UKESM1-0-LL and got 7 ok / 2 skipped / 0 failed. The 2 skipped are real data gaps (``psl`` missing from r5i1p1f3, ``ua`` missing from r2i1p1f2 ssp585). Known issue for follow-up: re-running process.py with --source-ids shrinks the task list and thus the in-memory rows list, which means index.csv gets rewritten to only cover the re-run subset. Sidecars on disk are unaffected and are canonical; next commit will have write_index scan all sidecars so the central index is always complete.

Previously, running process.py with --source-ids or --max-datasets rebuilt the central ``index.csv`` / ``index.parquet`` using only the rows for datasets attempted in that run — so every narrow re-run truncated the index to its own subset. The per-zarr metadata.json sidecars were unaffected (they are per-directory) but the central index stopped reflecting the true on-disk state. - ``_scan_all_sidecars`` walks ``<output_directory>/**/metadata.json`` with fsspec's glob and deserialises each into a ``DatasetIndexRow``. - ``_merge_rows_for_index`` unions this-run's rows with every sidecar on disk, with this-run's row winning for any overlapping ``(source_id, experiment, variant_label)`` (so a fresh failure/skip overrides an older successful sidecar). Failed / skipped rows from this run are retained even without a sidecar. - main() calls ``_merge_rows_for_index`` before ``write_index``; the Done log now reports both this run's status breakdown and the total index size. Verified on a UKESM1-0-LL re-run: index went from 10 rows (UKESM1 only) to 27 rows (full CanESM5 / GFDL-CM4 / MPI-ESM1-2-LR / UKESM1 coverage, 25 ok + 2 skipped).

The 5-model batch-2 run surfaced three model-specific failure modes; two of them are now handled gracefully so one bad variable can't take down a whole dataset: 1. **Outer-join on xr.merge leaks mismatched axes** (ACCESS-CM2 historical r1 failed with 'DataArray' object has no attribute 'dt'). Different day-table variables for the same member can be published on subtly different grids (ACCESS-CM2: lat=145 for ua/va, lat=144 for hus/zg/tas; ACCESS-CM2 ssp585 r1: time axis differs across variables). Outer join expanded the merged axes with NaN-padding, which broke downstream ``.dt.calendar`` and would have introduced data NaN too. Switched to ``join="inner"`` — intersection of coord indexes, so we keep only the overlap that every variable actually has. 2. **Curvilinear ocean siconc regrid fails per-model** (all of CESM2 and CNRM-CM6-1's attempted tasks failed with ESMF ``rc = 506``). Their SImon siconc is on a tripolar ocean grid that xesmf's default bilinear/conservative can't handle the same way CanESM5's worked. Wrapped each forcing variable in a try/except: a single forcing failure logs a warning and skips that variable only. The dataset still writes with the other forcings + full core/optional state. 3. **``.dt.calendar`` guarded with a try/except** as a belt-and-suspenders against any remaining object-dtype time coords; failure just leaves native_calendar blank with a warning. Also: gitignore PET*.ESMF_LogFile debris emitted by xesmf on regrid failures. Ready to re-run the 10 failures from batch-2 to validate.

Three fixes surfaced by batch-2 retry. Each is independent and narrowly scoped. 1. **Force cftime-decoded times** in ``_open_zstore``. CMIP6 ssp585 runs extend to 2100 but some publishers emit ``time_bnds`` or tbnds that push decoded bounds past 2262 (numpy's ``datetime64[ns]`` ceiling). xarray's default falls back to cftime *only when it has to*, which means some variables for a single dataset get datetime64 times and others get cftime. The mixed-dtype ``time`` coord then breaks ``.interp(time=...)`` with ``UFuncTypeError`` during monthly-to-daily forcing interpolation (seen on MRI-ESM2-0 ssp585 r1). Pass ``decode_times=xr.coders.CFDatetimeCoder(use_cftime=True)`` unconditionally so everything is cftime, every time. 2. **``_apply_time_subset`` guards the calendar lookup** the same way ``process_one`` does. If the opened dataset's time coord can't produce ``.dt.calendar`` (object dtype with mixed members, etc.) fall back to ``standard`` and continue. ACCESS-CM2 ssp585 r1 still failed after batch-2 with a ``.dt`` attribute error in ``_apply_time_subset`` that sneaked past the earlier guard in process_one. 3. **Regrid-source normalizer handles curvilinear ocean grids** properly. Renames ``vertices_longitude``/``vertices_latitude`` and ``lon_bnds``/``lat_bnds`` to ``lon_b``/``lat_b`` (xesmf's expected names), AND converts CF-style ``(N, M, 4)``-vertex bounds to xesmf's shared-corner ``(N+1, M+1)`` mesh via ``cf_xarray.bounds_to_vertices``. Without this, conservative regridding from a tripolar ocean grid fails with ``KeyError: 'lon_b'`` or ``lat and lon should be both 1D or 2D``. 4. Keep the full monthly dataset (not ``[[var]]``) for forcing regrids so the ``(N, M, 4)`` bounds coords — which have an extra ``d2`` / ``vertices`` dim the variable itself doesn't — survive the open. Also: siconc / sftlf / sftof added to ``FLUX_LIKE_VARIABLES`` so they use conservative regridding (physically the right choice for area-fractions; bilinear on a {0,1}-ish field tends to drift). Known remaining: CESM2 SImon siconc still trips ESMF ``rc=506`` even with correct (N+1, M+1) bounds — likely a pole-handling issue in the tripolar grid. The try/except from the previous commit skips siconc for those datasets and writes the rest; adding proper pole-cell handling in ``_normalize_regrid_source`` is follow-up.

ACCESS-CM2 ssp585 r1i1p1f1 on Pangeo has variables backed by inconsistent zarr stores — some covering 2015-2100 (standard ssp585), some 2201-2300 (ssp585-over continuation), and others 2251-2300 (just the tail). With join="inner" the intersection is empty; .sel(time=slice(...)) then fails with a cryptic NoneType error because the CFTimeIndex has no date_type. Detect the empty-merge case explicitly and emit a descriptive skipped reason instead of a failed row. One-off Pangeo-side data inconsistency, not a modeling issue. Across a 9-model batch: 41/46 ok, 5 skipped (missing core vars for 4 members + this time-overlap issue), 0 failed.

Two specific follow-ups surfaced during the batch-2 run: - ACCESS-CM2 ssp585 r1i1p1f1's variables on Pangeo point at inconsistent underlying stores (some 2015-2100, some 2201-2300, some just 2251-2300). To be raised with Pangeo. - CESM2 SImon siconc trips ESMF rc=506 even with proper bounds. Likely tripolar-pivot cell issue; siconc gracefully skipped via the per-forcing try/except. Both documented in the Open Issues section so they're easy to find later.

…aries CESM2-WACCM historical r2, r3 and ssp585 r1 fail in xarray with ``cannot reindex or align ... the (pandas) index has duplicate values``. The naive fix is ``ds.drop_duplicates("time")``, but that quietly merges over simulation boundaries (e.g., a historical run spliced to an SSP continuation) — the two sides carry different data and the join creates a silent step change in the training target. Instead, ``_resolve_time_duplicates``: 1. Detects duplicate timestamps. 2. Loads the values at each duplicate pair and compares them with ``np.allclose(equal_nan=True, rtol=1e-5)``. 3. If every duplicate pair is data-identical, returns the deduplicated dataset and a warning noting how many duplicates were dropped (classic file-splice redundancy — safe to merge). 4. If any duplicate pair has materially different data, raises ``_SimulationBoundaryError``; process_one catches it and skips the dataset with a loud reason telling the user to republish the two halves in separate zarr stores. Applied on both the day-table per-variable open and the monthly forcing open.

Two cheap checks on each materialised dataset before write, logged as advisory warnings (never fail a dataset): 1. **Time stride uniformity.** Consecutive timestamps should be exactly 86400 s apart for daily data. Any gap that deviates by more than 1 s is flagged with its location — catches missing days, cadence switches, or calendar-conversion mishaps even when timestamps are otherwise clean. 2. **Day-to-day global-mean jumps on ``tas`` and ``psl``.** A real simulation-boundary discontinuity can survive our ``_resolve_time_duplicates`` pass (different timestamps, not duplicates) but still shows up as a dramatic overnight jump in global means. Tolerances: 2 K for tas, 500 Pa for psl — both well above any real climate-scale single-day swing in a global mean. Helper ``_time_delta_seconds`` is datetime-flavour-agnostic (works for both cftime and numpy datetime64).

HadGEM3-GC31-LL's day-table variables publish on a mix of staggered Arakawa C-grid points: ua/va on u-points and scalars on cell centres. When we merged them with ``join="inner"`` the intersection of their lat arrays was empty (different grids with no coincident values), and xesmf crashed on the empty-lat source with ``zero-size array to reduction operation maximum``. Silent "override" of the grid is unsafe — the staggering is physical. Refactor process_one to regrid each day-table variable independently to the common F22.5 target, *then* merge on time. With every piece already on the same uniform lat/lon, the merge's spatial join is a no-op and the time join correctly intersects different-time-window variables (e.g., ACCESS-CM2 ssp585 r1's non-overlapping variable time ranges). Side effects of the refactor: - cell_methods + native_calendar inspection moved into the per-var loop; they now use the source dataset directly instead of the (sometimes-fragile) merged dataset. - Per-variable time_subset so we don't regrid bytes we'd throw away. - Same _SimulationBoundaryError handling, empty-time guard, and regrid-methods accounting as before.

The previous commit's per-variable regrid was correct for staggered grids but unacceptably slow — it built one xesmf Regridder per (variable, method), so ~30 ESMF calls per dataset. CESM2-WACCM historical r2 hung for hours. Compromise: bucket variables by their *native grid fingerprint* (hash of lat + lon arrays) and regrid each bucket together. Models where every day variable shares a grid (the common case) collapse to a single bucket and 1-2 Regridder builds, matching the original fast path. Models with staggered grids end up with 2-3 buckets and a few extra Regridder builds — a small constant overhead, not a 30x explosion. Adds ``_grid_fingerprint(ds)`` returning a hashable identifier of ``ds.lat`` + ``ds.lon`` byte arrays. Bucket count is logged in ``row.warnings`` when more than one to make the staggered-grid case visible in the index. Verification on CESM2-WACCM and HadGEM3-GC31-LL coming next; the prior batch was killed before completion.

Two related fixes for performance + correctness on sources with duplicated time indices (CESM2-WACCM r2 psl had every timestamp duplicated, causing dedupe to hang): 1. ``_apply_time_subset`` no longer uses ``ds.sel(time=slice(...))`` — that path goes through pandas's slice_locs which raises ``Cannot get left slice bound for non-unique label`` when the index is duplicated. Use a boolean mask + ``isel`` instead; positional indexing doesn't care about duplicate labels. 2. ``_resolve_time_duplicates`` now loads the entire variable into memory once and compares duplicates with vectorised numpy slicing, rather than calling ``.load()`` per duplicate pair. For the CESM2-WACCM r2 psl test (730 timestamps after year subset, half duplicates), the rewrite drops dedupe from ~10 minutes (one GCS round-trip per pair) to ~2 seconds. Process_one runs the time-subset *before* the dedupe so dedupe sees a tiny window, not the full multi-decade dataset. The fast path lets us afford that order.

Previously ``_resolve_time_duplicates`` would silently dedupe any time-duplicates whose data was bit-identical, with only a warning. That was too lenient — a real publishing artefact and a real simulation-boundary stitch can both look "data-identical" if the overlap is short or the publisher rounded consistently. We want a human in the loop. - ``DefaultsConfig.allow_dedupe = False`` by default; encountering any duplicate timestamps now skips the dataset with a descriptive reason that points at the per-dataset override mechanism. - ``Override.allow_dedupe: bool | None`` flips the default to True for matching ``(source_id, experiment, variant_label)`` triples. ``ResolvedDatasetConfig`` carries the resolved value through to ``process_one`` and the helper. - New ``_DuplicateTimestampsError`` raised when duplicates are seen with ``allow_dedupe=False``; caught at the day-vars open loop and converted to a skipped-dataset row. Pilot config: enable dedupe explicitly for CESM2-WACCM ``historical r2i1p1f1`` (the case where Pangeo's mirror stitched ``psl`` 1850-2014 twice end-to-end, bit-identically). README's known-Pangeo-issues section gets that finding too. Also widen the ``sftlf`` sanity bound from 100% + ε to 105% — CMIP6 publishers' binary {0, 100} land masks come back at up to ~103% from xesmf's conservative regridder edge-weighting. Anything over 105% is still unphysical and trips the warning.

Sanity checks during the batch-4 ingest of these two models hit warnings that point at source-side data quality, not bugs in the pipeline: - INM-CM4-8 has a subset of grid cells where ``zg`` at 10 hPa is ~22 km instead of the expected ~32 km. Layer 6 (50-10 hPa) then collapses to ~2.9 km thick and the hypsometric ``ta_derived_layer_6`` reads ~61 K. Real polar stratosphere is 190-220 K, so this is clearly a published-data anomaly. Dataset still writes; the warning is recorded in its sidecar. - CESM2-FV2 ``sftlf`` reaches ~114% from conservative regridding, beyond the few-percent edge overshoot we'd expect from a binary source mask. Suggests source cells with values outside [0, 100] or unusual fill-values being weighted in. Both filed under "Known model-side data quirks" in the README so they can be reviewed alongside the Pangeo-side issues already listed.

Joins ``index.csv`` with the inventory to answer "what's in this dataset, what was available in the source, and what wasn't" — one row per attempted dataset across all 4 outputs: - ``presence.csv`` and ``presence.parquet``: wide pivot (~30 variable columns + identity/status/calendar/mask metadata). Cell encoding: 2 = written, 1 = available in source but not written (skipped/failed datasets), 0 = not in source. - ``presence.png``: 3-tone heatmap (gray / amber / green) with columns grouped by category (core → derived → forcing → static → extra → optional) and rows sorted by source_id. A side stripe shows dataset status (ok / skipped / failed). Skipped rows show up as walls of amber, making it obvious at a glance which models had data the pipeline couldn't ingest and why. - ``presence.md``: per-model rollup with one-line summaries + category coverage matrix. For not-ok datasets, derived layer-T and the masks are inferred from the source's input variables (``zg``+``hus`` for derived T, ``ua``/``va``/``hus``/``zg`` for ``below_surface_mask``, ``siconc`` for ``siconc_mask``). README updated with usage notes. Run on the current 21-model index reports 76 ok / 7 skipped / 0 failed and matches the per-source breakdown from the index.

For each ok dataset in index.csv: - Open the zarr. - For every data variable (skipping mask channels), compute area-weighted scalar statistics with Gauss-Legendre quadrature weights — the natural cell-area weighting on the F22.5 grid: * mean, std, p01/p50/p99 (mean ± std lies about heavy-tailed distributions; quantiles tell the truth) * skewness, excess kurtosis (flag variables that need log/Box-Cox transformation before normalisation; pr ends up at skew=3.8 and excess kurtosis=35 in CanESM5 historical r1 — exactly the pattern that warrants log(1+pr)) * mean and std of the one-step finite difference along time — the natural normalisation for tendency targets * lag-1 autocorrelation of the area-weighted spatial-mean time series (high values like 0.99 for huss say "tendency is the interesting signal here") * finite_fraction (sanity check) - 3D variables produce one row per plev. - Write a per-dataset stats.nc next to data.zarr (xarray-structured Dataset with one variable per <var>__<stat>; plev coord for 3D). - Append rows to a tidy aggregate written as <output_directory>/stats.csv (and stats.parquet when pyarrow is installed). Resumability via existing stats.nc files: subsequent runs reuse them unless --force is set; rows for the aggregate are reconstructed from the existing nc. Spatial maps (time_mean / time_std), time series, and cross-variable correlation matrices are intentional follow-ups; the scalars layer covers the immediate normalisation-stats need.

The schema change in the previous commit (eday_ts → surface_temperature, orog → HGTsfc) means v0/ partial outputs from the terminated cmip6-daily-pilot-hgf4t run are incompatible with the new code. Bump the version marker on prod configs (Pangeo + ESGF + zarr-to-netCDF mirror) so the next run writes to a fresh v1/ prefix; partial v0/ outputs stay where they are for now and are no longer in the path. ``external_forcings_directory`` keeps its version-independent path — the externals haven't changed, so they're reused. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When the on-disk schema changes, old archive data needs to be brought forward. This commit adds the framework + the first migration (captures this session's ``eday_ts → surface_temperature`` / ``orog → HGTsfc`` rename). Pieces: - ``schema_version.py``: ``SCHEMA_VERSION`` constant (currently ``"0.1.0"``), ``Migration`` dataclass, ``parse_version`` / ``version_lt`` helpers. - ``migrations/``: one module per schema step, plus an ``__init__.py`` that registers the ordered ``MIGRATIONS`` tuple and exposes ``chain_for(current, target)``. Adding a new migration is: new ``_<from>_to_<to>.py`` with a ``MIGRATION = Migration(...)`` instance, import + append in the registry, bump ``SCHEMA_VERSION``. - ``migrate.py``: CLI driver. Walks ``index.csv``, reads each sidecar's ``schema_version`` (missing → ``"0.0.0"``), composes the chain, applies each step, persists the updated sidecar after every step so partial failures resume cleanly. Supports ``--dry-run``, ``--source-ids``, ``--workers`` (spawn ProcessPoolExecutor to dodge the gcsfs/gRPC fork issue). - ``DatasetIndexRow.schema_version`` field, stamped by ``process.py`` / ``process_esgf.py`` at the same point ``row.status = "ok"`` is set. First migration 0.0.0 → 0.1.0: opens the zarr, renames any ``eday_ts``/``orog`` to ``surface_temperature``/``HGTsfc``, preserves existing ``original_name`` attrs, rewrites the zarr in place (chunking inferred from the source's encoding so layout is preserved), regenerates ``stats.nc`` (its ``<var>__<stat>`` keys reference variable names), and stamps the sidecar with the new version + an audit ``migrations`` list. No-op when neither variable is in the dataset. Tests (14 total, 3 layers): 1. Unit — version parsing, chain composition, sidecar dispatching. 2. Migration-specific — applies 0.0.0→0.1.0 to a hand-crafted minimal ``eday_ts``/``orog`` zarr; verifies rename, attrs, sidecar audit, stats regeneration, untouched-variable preservation, no-op path. 3. Integration with the processing pipeline — runs ``process_one`` end-to-end on a synthetic mini CMIP6 source tree (no GCS), then inverts the rename to simulate older output, then rolls the migration forward and verifies the result matches what ``process_one`` produced directly. This is the regression guard against "we updated the pipeline but forgot to update the migration" drift. 116 tests pass overall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous ``day_regridded.load()`` step pulled the entire dataset into memory before ``to_zarr`` — at prod scale (75-86 years × all variables ≈ 25-30 GB plus xarray/dask overhead) it ran past the 32 GB highmem-sim-pool limit, OOMKill'ing every prod ``process-dataset`` pod. The full-load existed to dodge a real xesmf bug (the C library isn't thread-safe; threaded dask scheduling produces NaN chunks on concurrent Regridder calls), but the same outcome is reachable via the synchronous dask scheduler — chunk-by-chunk serial evaluation, which streams memory while keeping xesmf happy. Wraps the post-regrid pipeline (NaN count, flatten_plev, K-harmonization, rename, sanity checks, write_zarr) in ``with dask.config.set(scheduler="synchronous"):``. Inline stats now re-opens the just-written zarr (target resolution, small) rather than reusing the in-memory dataset, so it doesn't re-trigger any of the upstream regrid graph. Same applies in process_esgf.py. Peak memory per pod drops from ``Σ all-time × all-variables`` down to ``one time-chunk × max single variable``: roughly 20-30 GB → a few hundred MB on a 32 GB pod. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Even with the synchronous-scheduler streaming write, dask doesn't aggressively GC completed task results across independent variables in a single ``to_zarr(ds_with_all_vars)`` call. For the highest- resolution CMIP6 models (HadGEM3-GC31-MM, CNRM-CM6-1-HR, CESM2-WACCM etc.) at prod-scale time ranges, the accumulated intermediates still top the 32 GB pod limit on the trailing 7-of-444 datasets even after the streaming fix. ``write_zarr`` now writes ``data_vars`` in batches of ``cfg.chunking.variable_batch_size`` (default 8). First batch ``mode="w"`` creates the store and writes coords; subsequent batches ``mode="a"`` append new data_vars. Each batch is an independent dask compute, so the previous batch's intermediates are released before the next graph builds. A single ``zarr.consolidate_metadata`` pass runs once after all batches. Peak memory ceiling drops from ``num_vars × per-chunk`` to ``variable_batch_size × per-chunk`` — for a 50-var output and 8-var batches that's ~6× reduction. ``variable_batch_size = None`` falls back to the single-pass write (useful for debugging or for tiny datasets where the overhead of N mode="a" calls isn't worth it). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``compute_below_surface_mask`` calls ``bool(nan_union.any())``, which is the first all-data compute after regrid. It reduces 4 full 3D regridded variables (75-86 yr × 8 plev × 45 × 90) to a single boolean. Under the previous threaded scheduler dask materialises many regridded input chunks in parallel, and on native-resolution high-res sources (EC-Earth3 T255, GFDL-CM4 C192) that tops the 32 GB pod limit even though the result is a single bit. The existing ``with dask.config.set(scheduler="synchronous"):`` block started later, just before the NaN sanity count and write — so the below-surface-mask reduction wasn't covered. Move the config-set up to immediately after the per-grid regrid (in both process.py and process_esgf.py) so every all-data compute on the regridded dataset runs serially. Use a non-context-manager ``dask.config.set(...)`` call instead of a giant ``with`` block, because each Argo worker processes one dataset then exits — leaking the scheduler choice across the rest of the function is exactly what we want, and avoids re-indenting ~50 lines of pipeline code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The dedupe path used to call ``arr = ds[var_name].load().values`` to materialise the entire variable before comparing duplicate-pair values. On models like CESM2-WACCM (~hundreds of duplicate timestamps × 86 yr × 8 plev × 288 × 192 at native resolution) one variable is ~50+ GB and OOMs the pod just to compare a few hundred duplicate pairs. Iterate over the duplicate timestamps directly and ``isel(...).load()`` only the copies for each duplicate value. Memory per iteration is ``n_copies × per-timestep``, ~MB-scale even for high-res models. Equivalent semantics: every copy must match a reference (transitivity holds), and any mismatch still raises ``SimulationBoundaryError``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Builds a lazy dask-backed dataset whose total in-memory size dwarfs one batch-chunk's worth of data, runs ``write_zarr`` under the synchronous scheduler that prod uses, and asserts that peak Python-side allocation stays within 3× ``variable_batch_size × chunk_time × per-var-chunk``. Sized so num_time_chunks (4) and num_variable_batches (6) both exceed the tolerance factor, so a regression that loads "one whole time slab" or "all variables" instead of "one chunk-batch" would be caught. Uses tracemalloc rather than psutil RSS — on this build numpy's data buffers go through Python's allocator, so tracemalloc captures the dask chunk materialisations without RSS noise from BLAS pools, imports, or glibc not returning freed memory. Observed peak ~47 MiB vs 24.7 MiB per-batch-chunk (1.9×) on the total 593 MiB dataset — comfortable headroom under the 3× bound. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The broadcast variables added by ``causal_monthly_to_daily`` / ``causal_annual_to_daily`` / ``attach_external_forcings`` (``luh2_forest``, ``input4mips_so2``, ``input4mips_bc``, ``amon_ts`` and the rest of the monthly-causal surface variables) end up with per-day dask chunks (size 1 along ``time``), because xarray's ``isel(time=day_idx)`` with a large array index re-chunks per-element. Zarr v3 refuses to write a dataset where multiple dask chunks would land in the same zarr chunk — it raises: ValueError: Specified Zarr chunks encoding['chunks']=(365, 45, 90) for variable named 'luh2_forest' would overlap multiple Dask chunks ... could lead to corrupted data This blew up ~440 of 444 datasets in the prod run silently (``status="failed"`` rows have no sidecar, so the workflow's pod "completed" count didn't catch it). Fix: in ``write_zarr``, rechunk the dataset along ``time`` to ``cfg.chunking.chunk_time`` before encoding. Variables already at the target chunk size are a no-op; the small broadcast variables get coalesced into ~KB-scale chunks, which is trivial. Add an integration test that builds a dataset with mixed-chunk variables and verifies ``write_zarr`` produces a valid output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous fix (``ds.chunk({"time": chunk_time})`` before write) handles per-day broadcast chunks but doesn't catch every pattern. Production showed sfcWind / psl / TMP2m / PRATEsfc still failing with the same chunk-overlap ``ValueError`` even after rechunking — likely from leap-year boundaries shifting some upstream chunks off ``chunk_time`` multiples in ways ``ds.chunk(...)`` doesn't always coalesce. ``align_chunks=True`` lets ``to_zarr`` rechunk anything still misaligned right before writing, so the on-disk chunk grid is guaranteed correct regardless of upstream chunking history. Under the synchronous scheduler the extra in-place rechunk holds at most a few adjacent chunks (sub-MB at F22.5 resolution). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirror the Pangeo cloud config's excludes. AWI-CM-1-1-MR publishes 3D variables on a 19-level pressure grid (1, 5, 10, 20, 30, 50, 70, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000 hPa) rather than the plev8 set the pipeline assumes, which produces per-level variables (``ua5``, ``ua20``, ...) and derived layers (``ta_derived_layer_5_1``, ...) that don't match the rest of the cohort. Pilot data showed all 5 AWI-CM-1-1-MR variants ended up with 19 mislabeled per-level variables instead of 8 — see ``report/README.md`` §2 and §6. Pangeo config already excluded these two; this commit adds them to ESGF (where AWI-CM-1-1-MR has 316 inventory rows that would otherwise be processed via the ESGF path). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Iteration-friendly pilot to validate the prod pipeline end-to-end on enough time to exercise prod-scale chunking and multi-chunk write paths, without committing to the full 1940-2100 horizon. Key differences from the prior 1-year pilot: * 2 years per scenario (2010-2011 hist, 2015-2016 ssp), so the streaming write crosses chunk boundaries and the cross-year stats reductions produce non-trivial values. * Explicit ``chunking.chunk_time = 365`` to match what prod will use. The previous pilot used ``chunk_time = 1`` (dataclass default) which happened to dodge the chunk-misalignment bugs that hit prod. * Includes AWI-CM-1-1-MR and KACE-1-0-G in the ESGF exclude list. Output lands in a fresh ``v0-pilot-v4/`` so prior pilots' artefacts stay readable for comparison. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``exclude_source_ids`` is too blunt when only one publisher file of a given (model, experiment) is bad. The pilot turned up two such cases: * HadGEM3-GC31-MM historical r2i1p1f3 ships ~2.85e6 NaN cells in its 3D state pre-fill; sibling variants r1i1p1f3 and r3i1p1f3 of the same model+experiment are clean. * EC-Earth3 historical r4i1p1f1 ships ~1.68e6 NaN cells; every other EC-Earth3 variant across all 5 experiments is clean. Without a variant-level exclude we'd either drop the whole model (losing healthy realizations) or carry the bad slabs through the nearest-above fill, which produces uninformative flat bottom-plev columns in those grid cells. New ``Selection.exclude_variants: list[VariantKey]`` takes ``{source_id, experiment, variant_label}`` triples and drops only those datasets. Wired through both ``process.select_datasets`` and ``process_esgf.select_esgf_datasets``. Both prod configs and the new v4 pilot configs add the two known-bad variants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``cf_xarray.bounds_to_vertices`` uses ``apply_ufunc`` with the 4-vertex dim as a core dim and refuses to run when that dim is chunked. CMIP6 ocean zarrs (Oday.tos, Omon.* on curvilinear grids) ship with ``vertices`` pre-chunked, which broke ~150 of 294 v0-pilot datasets with the cryptic error:: ValueError: dimension vertices on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. Calling ``.load()`` on the bounds DataArray before passing to ``bounds_to_vertices`` fixes the chunking-as-core-dim issue. The bounds array itself is tiny (lat × lon × 4 × 8 bytes ≈ a few MB at native resolution), so eagerly loading is cheap and runs once per source-table-load anyway. Verified locally: ACCESS-CM2 and CanESM5 Oday.tos sources now build conservative regridders successfully where they previously raised this error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Root cause for the unphysical ``simon_sitemptop`` values in the pilot report (cohort mean ~88 K under the ice mask, well below freezing). xesmf's default conservative / bilinear weighted average treats source NaN cells as 0, so a target cell that partially covers source NaN (no ice in part of the sub-cell, ice in the rest; or land NaN next to ocean for ``oday_tos``) gets a value pulled toward 0 by the NaN cells' implicit-zero contribution. ``skipna=True`` makes the regridder skip NaN cells in the weighting — target cells get the average of the *valid* source cells covering them, which is the right semantics for masked variables. xesmf still propagates NaN to target cells whose entire source footprint was NaN (``na_thres=1.0`` default), so the downstream ``emit_mask_and_fill`` still detects the mask pattern correctly. For atmospheric variables that are fully populated (most of ``ua/va/hus/zg/tas/...``) ``skipna`` is a no-op. The 3D below-surface NaN cases that did have NaN benefit from it too — target cells over high topography no longer get pulled toward 0 from below-surface contamination. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bumps the realization cap so models that publish 4+ historical members per (p, f) — CanESM5 (6 members), HadGEM3-GC31-LL (4), EC-Earth3 family — contribute their full ensemble to training instead of being clipped at 3. Also updates the markdown to reflect the current state: * README.md member-cap references (now 5) * README.md known-issues entries for EC-Earth3 r4i1p1f1 and HadGEM3-GC31-MM r2i1p1f3 now note they're excluded via ``exclude_variants``. * README.md adds AWI-CM-1-1-MR known-issues entry (non-plev8 pressure grid; excluded via ``exclude_source_ids``). * report/README.md adds a status banner and rewrites sections (a) sftlf scales, (c) heavy-NaN variants, (d) ocean regrid, (g) temperature units to point at the commits that landed fixes since the v0-pilot was generated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Just the v4 configs with v0-pilot-v5/ as the output prefix. Bumped so the next pilot run lands in a clean directory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``compute_below_surface_mask`` falls back to ``ds["zg"] < orog`` when the NaN-union path doesn't find anything. CMCC-CM2-SR5 publishes only ``ua`` of the four 3D core variables and is accepted into the pipeline via ``max_core_missing = 3``. With no NaN pattern and no ``zg``, the function used to raise ``KeyError: 'zg'`` and fail the dataset entirely. Return ``(None, "none")`` instead — there's nothing to build a mask from, but the rest of the dataset (surface met, radiation, ua10 etc.) is still valid and worth keeping. Regression test added covering the missing-zg + present-orog combination. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI: * Add ``scripts/cmip6_data/conftest.py`` that skips collection of every test under that directory when ``cf_xarray``, ``xesmf``, ``dask`` or ``bottleneck`` aren't importable. The cmip6_data tests need those packages (they ship in ``cmip6_data/requirements.txt`` and the cmip6-processing image), but CI runs ``pytest .`` in the fme env which doesn't have them — producing ~25 spurious ``ModuleNotFoundError`` failures per CI run. With the conftest, the fme env stays green and the cmip6_data tests still run under the cmip6_data env / docker image. * Fix ``fme/ace/data_loading/test_cmip6.py::test_netcdf4_build`` unpacking — ``DatasetItem`` is now a 5-tuple (added ``source_ids: frozenset[str] | None``); the test was still unpacking 4. Pipeline: * Drop scalar non-dim auxiliary coords (e.g. CMIP6's ``type = "sea_ice"`` scalar string) in ``write_zarr`` before the batched mode="w"/mode="a" writes. xarray re-encodes the same string coord with different fixed-width dtypes between batches, producing ``ValueError: Mismatched dtypes for variable type ... |S6 vs |S255`` on IPSL-CM5A2-INCA historical. These coords are per-source-variable CF metadata, not whole-dataset facts. * Bump xarray ``file_cache_maxsize`` to 1024 in ``process_esgf.py main`` — the per-variable lazy-open path accumulates a NetCDF file handle per file across the all_day_vars loop, exceeding the 128 default for CMCC-ESM2-like cases (~30 vars × ~10 historical chunks) and raising the obscure ``KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ...]`` in ``file_manager._acquire_with_cache_info`` when the LRU evicts an entry mid-read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Probing the source directly confirmed the bfzkw reshape failure is upstream on Pangeo's mirror, not in our code: >>> ds = xr.open_zarr( ... "gs://cmip6/CMIP6/ScenarioMIP/MOHC/HadGEM3-GC31-LL/" ... "ssp245/r4i1p1f3/day/ua/gn/v20190908/", ... consolidated=True) >>> ds.ua.shape, ds.ua.encoding["chunks"] ((2160, 8, 145, 192), (187, 8, 145, 192)) >>> ds.ua.isel(time=0, plev=0).load() ValueError: cannot reshape array of size 133632000 into shape (187,8,145,192) A single one-element read fails — the on-disk chunk payload contains 600 timesteps' worth of data (``600 × 8 × 145 × 192 = 133,632,000``) but the chunk-shape metadata declares 187. Sibling r1/r2/r3 variants of the same model+experiment are fine, so add only the bad triple to ``exclude_variants``. Applied to both v4 and v5 pilot configs and both prod configs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous commit added the bad variant to *both* the Pangeo and ESGF exclude lists, which blocked ESGF from picking it up. The Pangeo mirror is the broken party — the ESGF inventory has all 18 day variables for HadGEM3-GC31-LL ssp245 r4i1p1f3 — so only the Pangeo configs should exclude it; ESGF stays clean and processes the variant via the normal backfill path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1) ``clamp_static_fractions`` previously divided ``sftlf`` by 100 unconditionally. CMIP6 CMOR spec puts ``sftlf`` in percent (0-100), and the vast majority of publishers comply, but at least FGOALS-f3-L ships it as 0-1 fraction despite declaring ``units = %``. Without a scale check we then divide a fraction by 100 and end up with ``land_fraction`` ~ 0.003 (100× too small) — flagged in report/v5 as a cohort outlier (z = -16.8). New behaviour: inspect the input max. ``max <= 1.0 + _EPS`` means we have a fraction already and copy through; anything higher gets clipped to [0, 100] and divided by 100. A warning records which branch fired so the regridder defect / publisher anomaly stays visible. 2) Stats period ``1979-2015`` -> ``1979-2014``. The window is meant to align with ERA5's well-observed modern era; ending at 2014-12-31 keeps the stats free of SSP-scenario drift (the prior 2015 endpoint includes one year of ssp* data which muddies a "modern reanalysis era" interpretation). Pure config-default change — existing pilot v5 stats.nc files were generated with the prior period and aren't being touched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pipeline's built-in ``_GLOBAL_MEAN_JUMP_TOL`` sanity check flagged 5 v5-pilot datasets with day-to-day psl jumps of ~5900 Pa — all of them in the AWI-ESM family. Investigation showed both ``AWI-ESM-1-1-LR`` and ``AWI-ESM-1-REcoM`` ship corrupted Jan 1 data **every year**: psl drops ~5900 Pa (vs ~53 Pa typical), TMP2m spikes 1.5 K, sfcWind 0.4 m/s, DSWRFsfc 3 W/m². PRATEsfc and Q2m are unaffected. Signature: variables with a large instantaneous-vs-daily-mean gap show the biggest jump; stochastic-mean variables don't. Consistent with each annual file's first day being an instantaneous snapshot rather than a daily mean (a CMIP6 publishing bug at AWI's side). 0.27% of timesteps would be many-sigma outliers in training, so both models are excluded from all configs (Pangeo + ESGF, pilot + prod). README's "Known model-side data quirks" section and the v5 partial report both updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…CM ssp585 r1 Two related fixes plus a third variant exclude. 1) ``emit_mask_and_fill`` returned ``(~nan_pattern).astype("uint8")``, which inherits attrs from the source DataArray. After ``harmonize_temperature_to_kelvin`` had set ``units = "K"`` on the parent temperature variable, the downstream harmonize loop ran a second time over every data_var (including the inherited-attr mask), saw ``units = "degC"`` on the mask (the pre-harmonize state for SST sources), and added 273.15 to its 0/1 values. The on-disk masks ended up holding 273.15 / 274.15 for every ``oday_tos_mask``, ``simon_*_mask``, etc. in v5 (44 of 44 sampled). Fix at source: explicitly set ``mask.attrs`` to ``{"units": "1", "long_name": "valid-cell mask (1 = data present, 0 = absent)"}``, stripping the inherited temperature metadata. Belt-and-braces: the harmonize loops in process.py and process_esgf.py now skip ``*_mask`` channels. Regression test added. 2) CESM2-WACCM ssp585 r1i1p1f1 excluded via ``exclude_variants`` — the source publishes 41 consecutive days of all-NaN data starting 2016-02-15 (publisher missing chunk). Uniform ``finite_fraction = 0.944`` across every plev level is the signature of "missing days", as opposed to the below-surface fill pattern. Same shape as the previously-excluded HadGEM3-GC31-MM r2 and EC-Earth3 r4 cases. 3) v5 report updated with results from the five additional checks the user asked for (time monotonicity, total-water-path budget, mask alignment, fraction conservation, spatial outlier sniff). Mask-alignment surfaced the bug in #1. The other four were clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ``training.md``: stats period ``1979-2015`` → ``1979-2014`` (matches the config change in commit 98897f3 that switched the modern-era window's end-date to keep stats free of SSP drift). * ``normalization/README.md``: stale references to raw CMIP6 names refreshed to the post-rename / post-clamp output names: ``sftlf`` (0–100 %) → ``land_fraction`` (0–1), ``orog`` → ``HGTsfc``, ``ts``/``siconc`` → ``amon_ts`` / ``simon_sea_ice_fraction``, ``below_surface_mask`` / ``siconc_mask`` → the actual per-plev / per-output mask names. * ``README.md``: stats period default list matches config (``1979-2014`` not ``1979-2015``). Spot-checked the other markdown files (``goals.md``, ``bias_correction.md``, ``physics_uncertainty.md``, ``benchmark/results.md``, ``report/README.md``, ``report/v5/README.md``, ``normalization/outputs/*.md``) — those use sftlf etc. as historical context (describing CMIP6 raw data, not pipeline output) and don't need updating, or are recent reports that already reflect current state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``CMIP_TO_OUTPUT_RENAMES`` in config.py maps ``Eday.ts`` -> ``surface_temperature`` at write time, matching the SHIELD/ERA5 baseline. ``amon_ts`` stays prefixed because it's a different cadence (monthly causal vs daily). The README and training docs still talked about ``eday_ts`` as the output name in several places; those are now updated to ``surface_temperature`` with a note on the rename. The sibling renames (``Eday.prw`` -> ``water_vapor_path``, the sea-ice fraction rescales) are now listed alongside it in the rename-convention paragraph. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The README said pre-1959 CO2 used "constant-extrapolation to the 1959 NOAA value (315.97 ppm)". That description is stale — ``external_forcings.py`` actually backfills 1940-1958 from the hardcoded ``_GISS_PRE_MAUNA_LOA_CO2`` dict, which is the NASA GISS Etheridge-et-al.-Law-Dome + Scripps Mauna Loa composite record (rises 311.3 ppm in 1940 to 315.34 ppm in 1958, captures the WWII-era growth slowdown). Update both the bullet list and the source-substitution table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``make_normalization.py`` previously ran once and wrote a single set of cross-source pooled stats files (``centering.nc`` etc.) at the root of ``output_directory``. Two gaps that block multi-period training: 1. No period awareness — stats were always computed over each dataset's full time range, so the per-dataset ``stats.nc`` files' ``period`` dim wasn't reflected in the pooled artefacts. 2. No per-source output — ``fme.core.per_source_normalizer. PerSourceNormalizationConfig`` reads ``{data_dir}/{subdirectory}/<source>/{centering,scaling}.nc``, but nothing in the cmip6_data scripts ever wrote that layout. This commit: * Adds a ``--period <name>`` flag that looks up a configured ``StatsPeriod`` and applies its ``[start, end]`` window to every zarr's time axis (calendar-aware via ``clip_date_for_calendar``). Datasets with no overlap are silently dropped from the contributing set. * Writes pooled files into a period-named subdir ``{output_directory}/normalization_{period}/`` so the three periods coexist without overwriting each other. * Adds ``_write_per_source`` that emits the same four scalar nc files per source under ``{output_directory}/per_source_normalization_{period}/ <source_id>/{centering,scaling,residual_*}.nc``. Per-source data was already being computed internally as ``per_model`` — we just persist it now so training can read it. * Updates ``_global_attrs`` to record the period name + window and the scope (``pooled`` vs ``per_source``) on every output file. Argo workflow: the ``make-normalization`` step now declares a ``period`` input parameter, and the entry-point ``withItems`` loop fans out into three parallel calls (``full``, ``1940-2014``, ``1979-2014``). A single ``--normalization`` submit produces all six output directories (3 periods × {pooled, per-source}). Training-side selection stays simple: set ``per_source_normalization.subdirectory: per_source_normalization_<period>`` in the train config to pick the period. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The b4nvt run sat at 73/603 for 7+ hours because of per-task overhead under the synchronous dask scheduler. With the dataclass default ``chunk_time=1``, an 86-yr ssp dataset has ~1.6M tasks per write pass (86 yr × 365 d × ~50 vars), each task ~5 ms of Python+dask overhead — ~2 hours just in graph traversal per dataset, on top of the actual I/O and regrid compute. The default ``chunk_time=1`` was inherited from ``scripts/data_process`` which doesn't shard — there it bought small-window read locality. With sharding (which we use), the GCS object granularity is the shard, not the inner chunk, so small inner chunks only generate more tasks with no read-side benefit. The training side will convert to netCDF anyway, so read amplification at training time isn't a concern. Prod v2 changes: * ``output_directory`` v1 → v2 (clean slate). * ``chunk_time``: 1 → 360 (matches 360_day calendars cleanly, ~year inner chunks for the others). * ``shard_time``: 365 → 7200 (~20-year shards; ~5 shards per variable on the 86-yr ssp datasets vs ~86 with shard_time=365). * ``variable_batch_size``: 8 (unchanged). Profiling logs added so future "why is it slow" questions are answerable from the pod logs: * ``processing.py write_zarr``: per-batch wall-clock + total + consolidate-metadata splits. Records batch index, var count, and the specific variables in each batch. * ``process.py`` / ``process_esgf.py process_one``: ``[stage X]`` splits at the heavy boundaries — regrid graph build, below-surface mask compute, surface-and-ocean loop, write_zarr, inline stats. Final ``Finished … in Xs total`` summary per dataset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Zarr v3's sharding codec requires the outer (shard) chunk size to be a multiple of the inner chunk size. Our ``_encoding_for`` was setting ``shards = min(shard_time, n_time)`` which is fine for the long historical/ssp datasets but fails when a publisher ships a shortened dataset. Concrete case from prod lnlqt: ``MRI-ESM2-0 ssp245`` r2-r5 ship only 5844 timesteps (16 years, not the full 86-year ssp245 run). With ``shard_time=7200`` and ``chunk_time=360`` the encoding came out as ``shards=5844``, and zarr raised:: ValueError: The array's chunk_shape (got (5844, 45, 90)) needs to be divisible by the shard's inner chunk_shape (got (360, 45, 90)). 4 of 603 datasets failed in lnlqt for this reason; all ``MRI-ESM2-0 ssp245`` variants in the cohort. Fix: round the requested shard size down to the nearest multiple of ``chunk_time``. For MRI-ESM2-0 ssp245 that's ``(5844 // 360) * 360 = 5760``; the trailing 84 timesteps live in a partial shard that zarr handles transparently. Regression test ``test_write_zarr_handles_non_divisible_time_length`` covers the exact MRI-ESM2-0 dimensions (5844 timesteps, shard_time=7200, chunk_time=360) and asserts the resulting encoding shards = 5760. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CESM2 historical r5i1p1f1 OOM-killed in prod lnlqt at batch 8/10 of write_zarr — pattern consistent with dask/zarr state accumulating across the synchronous ``to_zarr`` calls (file handles, codec buffers, async loop refs). At small unit-test scale the leak rate is mild (~3 MiB per batch on my workstation), but at prod scale (27000 timesteps × 77 vars × 45×90 grid) the same per-batch amplifies and pushes a 75-year dataset past the 32 GiB pod limit late in the write. Two changes: * ``write_zarr`` now logs ``rss A → B MiB`` on each batch line, so the memory trajectory is visible in pod logs. The cost is one ``psutil.Process().memory_info()`` call per batch; psutil is in ``cmip6_data/requirements.txt`` already. * ``test_write_zarr_no_rss_growth_across_batches`` reproduces the multi-batch pattern at small scale (64 vars × 8 batches × 1080 days × 22×44 grid) and asserts RSS at the last batch is within 50 MiB of the RSS at batch 2 (skipping batch 1's warmup). Won't catch a slow leak; will catch a runaway one. The 50 MiB cap at this scale corresponds to many GiB at prod scale, so it's a useful regression guard even if not exhaustive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mcgibbon added 30 commits April 24, 2026 15:50

Scaffold CMIP6 daily pilot scripts directory

1d66e94

Adds scripts/cmip6_data/ with a planning README, config dataclasses (dacite + YAML), and two starter configs for the forthcoming inventory and processing scripts. No data-moving code yet.

Ignore diagnostic scratch zarrs in scripts/cmip6_data

798dfe2

Consolidate unresolved issues in README under Open Issues

cfdbb31

mcgibbon and others added 30 commits May 23, 2026 23:05

Add v5 pilot configs (clean output dir)

5a01611

Just the v4 configs with v0-pilot-v5/ as the output prefix. Bumped so the next pilot run lands in a clean directory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CMIP6 dataset support#1115

CMIP6 dataset support#1115
mcgibbon wants to merge 153 commits into
mainfrom
scripts/cmip6-daily-pilot

mcgibbon commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mcgibbon commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants