era5_rechunking: add script to rechunk and concatenate ERA5 forcing files by ezhilsabareesh8 · Pull Request #131 · ACCESS-NRI/om3-scripts

ezhilsabareesh8 · 2026-06-15T03:31:21Z

Adds era5_rechunking/ with a Python script and PBS script to rechunk ERA5 single-level monthly files from the rt52 archive into yearly files optimised for ACCESS-OM3 / DATM.

Related issue ACCESS-NRI/access-om3-configs#1381

What it does:

Reads directly from /g/data/rt52/era5/single-levels/reanalysis/{stream}/{year}/
Rechunks from [93, 91, 180] (ERA5 default) → [1, 721, 1440] (one full-resolution timestep per chunk)
Concatenates 12 monthly files into one yearly file per stream
Preserves raw int16 values bit-for-bit (no unpack/repack)
Covers all 13 ERA5 streams, 1940–2026

Test output is here /g/data/tm70/ek4684/era5_rechunk_test/

anton-seaice

This appears somewhat complicated

are using xarray (with Dask) or nco/ncks not good options ?
do we need this level of configuration in input options ?

anton-seaice · 2026-06-17T23:36:20Z

I was thinking something like this:

import xarray as xr

from dask.distributed import Client

client = Client()

client.dashboard_link

ds = xr.open_mfdataset(
    '/g/data/rt52/era5/single-levels/reanalysis/2t/1940/*.nc',
    chunks={'latitude':-1,'longitude':-1}
)

ds_new = ds.t2m.chunk({'time':1})

ds_new.to_netcdf('2t_1940_era5.nc')

is that too slow ?

ezhilsabareesh8 · 2026-06-19T00:39:53Z

is that too slow ?

Changed to xarray in this commit 4b0efbf, it is taking 3 minutes per variable now

anton-seaice

ill get back to this tomorrow

anton-seaice · 2026-06-24T01:48:35Z

+#PBS -l storage=gdata/tm70+gdata/xp65+gdata/rt52
+#PBS -l wd
+#PBS -j oe
+#PBS -o /g/data/tm70/ek4684/era5_rechunked_1h/logs/yearly/


for sharability, probably best not to set this

Suggested change

#PBS -o /g/data/tm70/ek4684/era5_rechunked_1h/logs/yearly/

anton-seaice · 2026-06-24T01:49:40Z

+# ── user configuration ─────────────────────────────────────────────────────────
+# Directory where yearly rechunked files will be written:
+#   {OUTPUT_DIR}/{stream}/{stream}_era5_oper_sfc_{YYYYMMDD}-{YYYYMMDD}.nc
+OUTPUT_DIR="/g/data/tm70/ek4684/era5_rechunked_1h_yearly"


Suggested change

OUTPUT_DIR="/g/data/tm70/ek4684/era5_rechunked_1h_yearly"

OUTPUT_DIR="/g/data/${PROJECT}/${USER}/era5_rechunked_1h_yearly"

or similar might me more shareable

I think put it on /scratch rather than /g/data/ ?

I changed the default submit-script output path to /scratch/${PROJECT}/${USER}/era5_rechunked_1h_yearly

anton-seaice · 2026-06-24T01:50:23Z

@@ -0,0 +1,31 @@
+#!/usr/bin/bash
+# Copyright 2025 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.


Suggested change

# Copyright 2025 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.

# Copyright 2026 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.

anton-seaice · 2026-06-24T02:00:34Z

is submit_era5_yearly_rechuncked.sh a clearer filename ? It's consistent with https://github.com/ACCESS-NRI/om3-scripts/blob/main/external_tidal_generation/submit_bottom_roughness.sh then ?

Yes, agreed. I renamed it to submit_era5_yearly_rechunked.sh for consistency with the other submit scripts

anton-seaice · 2026-06-24T02:02:02Z

+module use /g/data/xp65/public/modules
+module load conda/analysis3-26.02
+
+python3 "${SCRIPT_DIR}/make_era5_yearly_rechunked.py" \


SCRIPT_DIR is only used in one place, so may as just use PBS_O_WORKDIR directly here

Removed SCRIPT_DIR and used ${PBS_O_WORKDIR} directly

anton-seaice · 2026-06-24T02:02:35Z

@@ -0,0 +1,372 @@
+#!/usr/bin/env python3
+# Copyright 2025 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.


Suggested change

# Copyright 2025 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.

# Copyright 2026 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.

Changed in this commit 4086207

anton-seaice · 2026-06-24T02:13:15Z

+    m0 = re.search(r"(\d{8})-(\d{8})", source_files[0].name)
+    ml = re.search(r"(\d{8})-(\d{8})", source_files[-1].name)
+    start = m0.group(1) if m0 else f"{year}0101"
+    end = ml.group(2) if ml else f"{year}1231"


end = ml.group(2) if ml else f"{year}1231"

In what case does this else statement get reached ?

That fallback was defensive for filenames that did not match the expected YYYYMMDD-YYYYMMDD pattern, but it is no longer needed because the output dates now come from the dataset time coordinate.

anton-seaice · 2026-06-24T02:15:17Z

+        logging.error(f"No .nc files in {source_dir}")
+        return False
+
+    # Derive output filename from first/last source file date stamps


Suggested change

# Derive output filename from first/last source file date stamps

# In case this year is a partially year, find first/last source file date stamps

anton-seaice · 2026-06-24T06:26:43Z

+    m0 = re.search(r"(\d{8})-(\d{8})", source_files[0].name)
+    ml = re.search(r"(\d{8})-(\d{8})", source_files[-1].name)
+    start = m0.group(1) if m0 else f"{year}0101"
+    end = ml.group(2) if ml else f"{year}1231"


Why are we getting dates from the filenames rather than the data ?

e.g. would

start = str(ds.time.min().dt.strftime("%Y%m%d").values) end = str(ds.time.max().dt.strftime("%Y%m%d").values)

be better ?

Yes, using the data is better. I changed this to derive start and end from ds.time.min() and ds.time.max(), so partial-year files reflect their actual time coverage instead of relying on filename parsing

ezhilsabareesh8 · 2026-06-25T02:10:36Z

@anton-seaice Test output of the latest commit is here /scratch/tm70/ek4684/era5_rechunk_test/csf/csf_era5_oper_sfc_19790101-19791231.nc

anton-seaice

This didn't work for me (using the submission script).

Requesting one node on sapphirerapids and then only running 12 workers at once was ok - but i don't really undersand why it needs that much RAM to succeed

anton-seaice · 2026-06-25T00:38:48Z

+    args = parser.parse_args()
+
+    # ── single-task mode ──────────────────────────────────────────────────────
+    if args.stream and args.year:


Do we need this section?

Can args.stream be a list of streams, and args.year be a range of years to process? And then just treat them the same as "multi-task mode"

anton-seaice · 2026-06-25T00:39:29Z

+    if args.workers == 1:
+        for task in tasks:
+            stream, year, rc, _ = _subprocess_task(task)
+            if rc != 0:
+                failures.append((stream, year))


for task in tasks: stream, year, rc, _ = _subprocess_task(task) if rc != 0: failures.append((stream, year))

Do we need this?

It appears to work fine to comment this out and use multiprocessing.Pool(args.workers) with 1 worker

anton-seaice · 2026-06-25T00:42:52Z

+ALL_STREAMS = [
+    "10u",
+    "10v",
+    "2d",
+    "2t",
+    "ci",
+    "cp",
+    "csf",
+    "lsf",
+    "lsp",
+    "msl",
+    "ssr",
+    "ssrd",
+    "strd",
+]
+
+STREAM_TO_VARNAME = {
+    "10u": "u10",
+    "10v": "v10",
+    "2d": "d2m",
+    "2t": "t2m",
+    "ci": "siconc",
+    "cp": "cp",
+    "csf": "csf",
+    "lsf": "lsf",
+    "lsp": "lsp",
+    "msl": "msl",
+    "ssr": "ssr",
+    "ssrd": "ssrd",
+    "strd": "strd",
+}
+


Suggested change

ALL_STREAMS = [

"10u",

"10v",

"2d",

"2t",

"ci",

"cp",

"csf",

"lsf",

"lsp",

"msl",

"ssr",

"ssrd",

"strd",

]

STREAM_TO_VARNAME = {

"10u": "u10",

"10v": "v10",

"2d": "d2m",

"2t": "t2m",

"ci": "siconc",

"cp": "cp",

"csf": "csf",

"lsf": "lsf",

"lsp": "lsp",

"msl": "msl",

"ssr": "ssr",

"ssrd": "ssrd",

"strd": "strd",

}

# Streams used by ERA5 datamode in OM3

# Stream : variable name mapping in ERA5 data

STREAM_TO_VARNAME = {

"10u": "u10",

"10v": "v10",

"2d": "d2m",

"2t": "t2m",

"ci": "siconc",

"cp": "cp",

"csf": "csf",

"lsf": "lsf",

"lsp": "lsp",

"msl": "msl",

"ssr": "ssr",

"ssrd": "ssrd",

"strd": "strd",

}

ALL_STREAMS = list(STREAM_TO_VARNAME.keys())

anton-seaice · 2026-06-25T00:52:35Z

+    "10v": "v10",
+    "2d": "d2m",
+    "2t": "t2m",
+    "ci": "siconc",


why is siconc in this list ?

anton-seaice · 2026-06-25T01:04:27Z

+                )
+
+    print(f"Tasks to run:  {len(tasks)}")
+    print(f"Tasks skipped: {skipped}  (output already exists)")


Suggested change

print(f"Tasks skipped: {skipped} (output already exists)")

if not args.overwrite:

print(f"Tasks skipped: {skipped} (output already exists)")

anton-seaice · 2026-06-25T01:10:11Z

+    status = "OK" if result.returncode == 0 else "FAIL"
+    if result.returncode != 0:
+        tail = (result.stderr or result.stdout or "").strip()[-500:]
+        print(f"[{status}] {stream}/{year} ({elapsed:.0f}s)\n  {tail}", flush=True)


When it fails it didn't give enough useful information to understand why:

This is the log extract

[FAIL] 10u/1954 (161s) /g/data/tm70/as2285/making_inputs/om3-scripts/scripts_common.py:118: UserWarning: /g/data/tm70/as2285/making_inputs/om3-scripts/era5_rechunking/make_era5_yearly_rechunked.py contains uncommitted changes! Commit and push your changes before generating any production output. warn(

anton-seaice · 2026-06-25T01:43:47Z

+        [str(f) for f in source_files],
+        combine="by_coords",
+        mask_and_scale=False,
+        chunks={"time": 744, "latitude": 721, "longitude": 1440},


There isn't enough RAM in the qsub job to do this operation - it would need >10GB per task. by the time each whole file is uncompressed.

Not really sure what to suggest

anton-seaice · 2026-06-25T01:47:01Z

+
+    # Provenance
+    this_file = os.path.normpath(__file__)
+    now_iso = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")


Suggested change

now_iso = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")

date_str = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")

we don't normally use utc time for stuff

anton-seaice · 2026-06-25T03:49:46Z

+def _output_exists(stream, year, output_base):
+    return bool(
+        glob.glob(
+            os.path.join(output_base, stream, f"{stream}_era5_oper_sfc_{year}*-*.nc")


This method a bit dangerous because it doesn't detect if the file is complete or not, but i dont have a better suggestion

ezhilsabareesh8 added 2 commits June 15, 2026 13:18

Add ERA5 rechunking scripts

0cb1442

Add output path

66caef5

ezhilsabareesh8 requested a review from anton-seaice June 15, 2026 03:31

format script

8374e91

anton-seaice reviewed Jun 17, 2026

View reviewed changes

Optimise make_era5_yearly_rechunked.py

4b0efbf

anton-seaice reviewed Jun 24, 2026

View reviewed changes

ezhilsabareesh8 added 2 commits June 25, 2026 11:45

Incorporate review comments

4086207

Rename run_make_era5_yearly_rechunked.sh

2343b84

anton-seaice reviewed Jun 25, 2026

View reviewed changes

ezhilsabareesh8 added 2 commits June 29, 2026 10:52

Prevent incomplete ERA5 yearly output files

939f81f

Fix formatting

e86057f

	OUTPUT_DIR="/g/data/tm70/ek4684/era5_rechunked_1h_yearly"
	OUTPUT_DIR="/g/data/${PROJECT}/${USER}/era5_rechunked_1h_yearly"

		@@ -0,0 +1,31 @@
		#!/usr/bin/bash
		# Copyright 2025 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.

	# Copyright 2025 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.
	# Copyright 2026 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.

		@@ -0,0 +1,372 @@
		#!/usr/bin/env python3
		# Copyright 2025 ACCESS-NRI and contributors. See the top-level COPYRIGHT file for details.

	# Derive output filename from first/last source file date stamps
	# In case this year is a partially year, find first/last source file date stamps

	print(f"Tasks skipped: {skipped} (output already exists)")
	if not args.overwrite:
	print(f"Tasks skipped: {skipped} (output already exists)")

	now_iso = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
	date_str = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%SZ")

Uh oh!

Conversation

ezhilsabareesh8 commented Jun 15, 2026

Uh oh!

anton-seaice left a comment

Choose a reason for hiding this comment

Uh oh!

anton-seaice commented Jun 17, 2026

Uh oh!

ezhilsabareesh8 commented Jun 19, 2026

Uh oh!

anton-seaice left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ezhilsabareesh8 Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ezhilsabareesh8 commented Jun 25, 2026

Uh oh!

anton-seaice left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

ezhilsabareesh8 Jun 25, 2026 •

edited

Loading