add CalAdapt WRF download function for CMIP6 hourly met data #3
divine7022 wants to merge 19 commits into
dlebauer left a comment
Looks great! A few comments / suggestions.
    year_start <- paste0(year, "-01-01T00:00:00")
    year_end <- paste0(year, "-12-31T23:00:00")

    ## fetch each variable, using session cache to avoid redundant S3 reads
From what I understand, writing the rds then loading it 10k times will be less efficient than using a format that is indexed like parquet or netcdf etc.
great point. I ran the numbers before changing anything: tested rds vs NetCDF (non-proxy and proxy) vs Parquet on WRF data from S3. Proxy mode looks like the obvious win on paper, but the stars docs say it directly: "operations requiring data access automatically fetch the underlying data, defeating lazy evaluation benefits". st_extract is one of those operations: it materializes the full grid on every call, then pays the proxy overhead on top.
the reason indexed formats don't help us here is that our access pattern is "load full grid, extract one cell, done", so there's no partial read to optimize. rds deserialization is simply faster than GDAL's NetCDF parsing for stars objects at this size class. As a bonus, rds preserves the units attribute on the stars object, whereas a NetCDF write_stars / read_stars round trip strips it.
Keeping rds for now. Open to revisiting, and I can add an in-memory env cache so the same (model, scenario, var, year) only hits disk once per R session if you want.
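For reference, the in-memory env cache mentioned above could look roughly like the sketch below. It is only an illustration in base R: the helper name `read_grid_cached`, the `.grid_cache` environment, the key layout, and the model/scenario/variable values are all hypothetical and not taken from the PR code, and a small matrix stands in for a real stars grid.

```r
# Hypothetical helper: names and key layout are assumptions, not the PR's code.
.grid_cache <- new.env(parent = emptyenv())

read_grid_cached <- function(rds_path, model, scenario, var, year) {
  key <- paste(model, scenario, var, year, sep = "|")
  if (!exists(key, envir = .grid_cache, inherits = FALSE)) {
    # First request in this R session: pay the readRDS() cost once
    assign(key, readRDS(rds_path), envir = .grid_cache)
  }
  get(key, envir = .grid_cache, inherits = FALSE)
}

# Usage with a stand-in object instead of a real stars grid
tmp <- tempfile(fileext = ".rds")
saveRDS(matrix(1:6, nrow = 2), tmp)
g1 <- read_grid_cached(tmp, "model_a", "ssp370", "t2", 2030)
unlink(tmp)  # the file is gone, but the cached copy still serves later calls
g2 <- read_grid_cached(tmp, "model_a", "ssp370", "t2", 2030)
identical(g1, g2)
```

The trade-off is memory: every cached grid stays resident until the session ends, so this only makes sense while the grids remain small.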
    model = model,
    scenario = scenario,
    timescale = "1hr",
    resolution = "d01",
can grid size be a function argument?
done, exposed as the resolution param, which defaults to "d01".
heads-up: WUS-D3 publishes d01 (45 km), d02 (9 km), and d03 (3 km), all at hourly/daily/monthly timescales. Coverage varies by model + scenario + variable combination, so caladaptaer::cae_check_variables() is the right thing to run before launching production runs at d02 or d03.
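A minimal sketch of how a resolution argument with a "d01" default can be validated in base R is shown below. The function name `wrf_grid_spacing_km` is hypothetical and not part of the PR; only the three domain codes and their grid spacings come from the comment above.

```r
# Hypothetical helper, for illustration only: maps a WUS-D3 domain code
# to its nominal grid spacing in km, rejecting anything else.
wrf_grid_spacing_km <- function(resolution = c("d01", "d02", "d03")) {
  # match.arg() picks the first choice ("d01") when the argument is missing
  resolution <- match.arg(resolution)
  spacing <- c(d01 = 45, d02 = 9, d03 = 3)
  unname(spacing[resolution])
}

wrf_grid_spacing_km()        # default domain
wrf_grid_spacing_km("d03")   # finest domain
```

Using `match.arg()` means an unsupported value such as "d04" fails immediately with an error instead of producing a bad catalog query later.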
    # precip: WRF hourly accumulation (mm) -> CF flux (kg/m2/s)
    # 1 mm water = 1 kg/m2, divide by 3600s for hourly timestep
    if ("prec" %in% names(dat.list)) {
      dat.list[["prec"]] <- dat.list[["prec"]] / 3600
set hour_to_second <- PEcAn.utils::ud_convert(1, 'h', 's') at top of file. Hard-coded conversion factors are more likely to be in error; even when they seem straightforward, it is better to be explicit.
looked into this. there are two things:
1. the "hardcoded factors are error-prone" argument is real in general; agreed that ud_convert is the cleaner pattern when there's any chance the unit could change.
2. but for this specific code, timescale = "1hr" is hardcoded right next to the conversion in the same function (lines apart). I also verified the catalog: WUS-D3 only publishes 1hr, day, and mon for the WRF activity, and no 3hr table exists. So there's no realistic future state where the timescale changes without someone editing this exact block of code.
the closest analog in download.NLDAS.R uses precipitation_flux / 3600 with the same hardcoded value, so we're consistent with existing precedent.
reverted my ud_convert commit and went back to / 3600. I could swap back if you would rather standardize on the named-constant pattern across both functions.
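For comparison, the named-constant pattern discussed in this thread could be sketched as follows. This is a base R illustration under the thread's stated assumption that the input is an hourly accumulation in mm; the names `SECONDS_PER_HOUR` and `precip_mm_per_hour_to_flux` are hypothetical. In the package itself the constant would come from PEcAn.utils::ud_convert(1, 'h', 's'), as suggested in the review.

```r
# Assumption: input is hourly accumulated precipitation in mm.
# Named constant instead of a bare 3600; hypothetical name.
SECONDS_PER_HOUR <- 60 * 60

precip_mm_per_hour_to_flux <- function(prec_mm) {
  # 1 mm of water over 1 m2 has a mass of 1 kg, so mm per hour equals
  # kg/m2 per hour; dividing by the seconds in an hour gives kg/m2/s.
  prec_mm / SECONDS_PER_HOUR
}

flux <- precip_mm_per_hour_to_flux(c(0, 3.6, 7.2))
flux
```

Both variants produce the same numbers; the difference is only whether the source of the factor is visible at the call site.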
        testthat (>= 3.1.7),
        withr
    Remotes:
        github::lebauerapproach/caladaptR,
closing in favor of PR against PecanProject/pecan
Description
adds download.CalAdaptWRF() to PEcAn.data.atmosphere
new met driver that pulls hourly WRF dynamically downscaled CMIP6 projections from the Cal-Adapt Analytics Engine (WUS-D3 dataset, Rahimi et al. 2024). Data sits on public AWS S3, no auth needed.
this is implemented as a part of CCMMF, where we need future climate forcing at ~200 California sites for SIPNET runs under multiple GCMs and SSPs.
looked at how pecan handles met downloads, and this follows the same pattern as CRUNCEP/GFDL: the download function does everything (fetch, extract, convert, write CF) in one shot, so we skip the met2CF and extract.nc stages. Main reason: WRF uses a Lambert Conformal grid, and extract.nc / closest_xy assume lat-lon grids with NARR-style bounds, so they can't handle this projection. Adding CalAdaptWRF to the skip list in met.process.R was the cleanest path.
the tricky part is that pecan's papply calls download.CalAdaptWRF once per site, so it never sees the full site list. Naively, that means re-reading the same WRF grid from S3 for every site. The 45 km grid is small (~20-30 MB per var per year), so we cache the full grid as rds in tempdir() on the first site. Sites 2 through N just do readRDS() and extract their grid cell locally. For 200 sites x 8 vars x 20 years, that cuts S3 round trips from 32,000 to 160. The cache is cleaned up automatically when R exits.
added caladapt_wrf column to pecan_standard_met_table
9 output variables total:
data coverage:
implements -- ccmmf/organization#82 and ccmmf/organization#83
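The per-site caching described above can be sketched roughly as follows. This is an illustration in base R, not the PR's implementation: `cached_grid`, `fake_fetch`, the cache file naming, and the model/scenario/variable values are hypothetical, and `fake_fetch` stands in for the real S3 read.

```r
# Hypothetical sketch: fetch_fun stands in for the real S3 read.
cached_grid <- function(model, scenario, var, year, fetch_fun,
                        cache_dir = tempdir()) {
  cache_file <- file.path(
    cache_dir,
    paste0(paste("wrf", model, scenario, var, year, sep = "_"), ".rds")
  )
  if (!file.exists(cache_file)) {
    # First site: fetch the full grid once and store it on disk
    saveRDS(fetch_fun(model, scenario, var, year), cache_file)
  }
  # Every site, including the first, reads the local copy
  readRDS(cache_file)
}

# Count how often the stand-in "remote" read is triggered
n_fetches <- 0
fake_fetch <- function(model, scenario, var, year) {
  n_fetches <<- n_fetches + 1
  matrix(runif(4), nrow = 2)
}

cache_dir <- tempfile("wrf_cache_")
dir.create(cache_dir)
for (site in 1:5) {
  grid <- cached_grid("model_a", "ssp370", "t2", 2030, fake_fetch, cache_dir)
}
n_fetches       # 1: only the first site triggered a fetch

# Round-trip arithmetic from the description
200 * 8 * 20    # 32000 uncached reads: sites x vars x years
8 * 20          # 160 cached reads: vars x years
```

Keying the cache file on (model, scenario, var, year) is what makes the number of remote reads independent of the number of sites.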