Migrate to use Pooch for data ingestion and update example data source #1955
Merged
Commits (11):
- 3e8e485: Update varnames (VeckoTheGecko)
- 9db3522: Add pooch (VeckoTheGecko)
- 73d5ad1: Update data downloading to use pooch (VeckoTheGecko)
- d3eec4e: Update example data host to parcels-data repo (VeckoTheGecko)
- 224087b: Update dev docs for EXAMPLE_DATA_FILES (VeckoTheGecko)
- 6580f42: remove platformdirs from dependencies (VeckoTheGecko)
- 6a5a8d3: Update test name (VeckoTheGecko)
- 9c967b3: Add v4 dev note (VeckoTheGecko)
- 7a169a0: Fix mypy (VeckoTheGecko)
- afa0ac5: Merge branch 'v4-dev' into pooch (VeckoTheGecko)
- 9736dcd: update function name (VeckoTheGecko)
Diff (conda environment file), adding pooch as a dependency:

```diff
@@ -3,4 +3,5 @@ channels:
   - conda-forge
 dependencies:
   - parcels
+  - pooch
   - trajan
```
Diff (example data module), first hunk:

```diff
@@ -1,16 +1,33 @@
 import os
 from datetime import datetime, timedelta
 from pathlib import Path
-from urllib.request import urlretrieve
 
-import platformdirs
+import pooch
 import xarray as xr
 
 from parcels.tools._v3to4 import patch_dataset_v4_compat
 
-__all__ = ["download_example_dataset", "get_data_home", "list_example_datasets"]
+__all__ = ["download_example_dataset", "list_example_datasets"]
 
-example_data_files = {
+
+# When modifying existing datasets in a backwards incompatible way,
+# make a new release in the repo and update the DATA_REPO_TAG to the new tag
+DATA_REPO_TAG = "main"
+
+DATA_URL = f"https://github.com/OceanParcels/parcels-data/raw/{DATA_REPO_TAG}/data"
+
+# Keys are the dataset names. Values are the filenames in the dataset folder. Note that
+# you can specify subfolders in the dataset folder putting slashes in the filename list.
+# e.g.,
+# "my_dataset": ["file0.nc", "folder1/file1.nc", "folder2/file2.nc"]
+# my_dataset/
+# ├── file0.nc
+# ├── folder1/
+# │   └── file1.nc
+# └── folder2/
+#     └── file2.nc
+#
+# See instructions at https://github.com/OceanParcels/parcels-data for adding new datasets
+EXAMPLE_DATA_FILES: dict[str, list[str]] = {
     "MovingEddies_data": [
         "moving_eddiesP.nc",
         "moving_eddiesU.nc",
```

Comment on lines +18 to +29 (VeckoTheGecko, Contributor, Author):
> @fluidnumerics-joe are these instructions clear? Do you think you can add your datasets now?
Second hunk:

```diff
@@ -79,24 +96,32 @@
 }
 
 
-example_data_url = "http://oceanparcels.org/examples-data"
-
-
-def get_data_home(data_home=None):
-    """Return a path to the cache directory for example datasets.
+def _create_pooch_registry() -> dict[str, None]:
+    """Collapses the mapping of dataset names to filenames into a pooch registry.
 
-    This directory is used by :func:`load_dataset`.
+    Hashes are set to None for all files.
+    """
+    registry: dict[str, None] = {}
+    for dataset, filenames in EXAMPLE_DATA_FILES.items():
+        for filename in filenames:
+            registry[f"{dataset}/{filename}"] = None
+    return registry
 
-    If the ``data_home`` argument is not provided, it will use a directory
-    specified by the ``PARCELS_EXAMPLE_DATA`` environment variable (if it exists)
-    or otherwise default to an OS-appropriate user cache location.
-    """
+
+POOCH_REGISTRY = _create_pooch_registry()
+
+
+def _get_pooch(data_home=None):
     if data_home is None:
-        data_home = os.environ.get("PARCELS_EXAMPLE_DATA", platformdirs.user_cache_dir("parcels"))
-    data_home = os.path.expanduser(data_home)
-    if not os.path.exists(data_home):
-        os.makedirs(data_home)
-    return data_home
+        data_home = os.environ.get("PARCELS_EXAMPLE_DATA")
+    if data_home is None:
+        data_home = pooch.os_cache("parcels")
+
+    return pooch.create(
+        path=data_home,
+        base_url=DATA_URL,
+        registry=POOCH_REGISTRY,
+    )
 
 
 def list_example_datasets() -> list[str]:
```

(erikvansebille marked a review conversation in this hunk as resolved.)
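The registry construction in the hunk above can be sketched standalone. This is a minimal, self-contained version under the assumption that the real module builds a pooch registry exactly this way; the dataset names and filenames below are hypothetical stand-ins, and `create_registry` is a local rename of the diff's `_create_pooch_registry`:

```python
# Collapse a dataset->filenames mapping into a pooch-style registry
# (relative path -> hash). Hashes are None, which tells pooch to skip
# checksum verification for these files. Hypothetical example data.
EXAMPLE_DATA_FILES: dict[str, list[str]] = {
    "my_dataset": ["file0.nc", "folder1/file1.nc"],
    "other_dataset": ["flow.nc"],
}


def create_registry(data_files: dict[str, list[str]]) -> dict[str, None]:
    registry: dict[str, None] = {}
    for dataset, filenames in data_files.items():
        for filename in filenames:
            # Slashes in a filename become subfolders under the dataset folder
            registry[f"{dataset}/{filename}"] = None
    return registry


registry = create_registry(EXAMPLE_DATA_FILES)
print(sorted(registry))
# ['my_dataset/file0.nc', 'my_dataset/folder1/file1.nc', 'other_dataset/flow.nc']
```

Because the registry keys are relative paths joined onto `base_url`, this is also what makes the slash-as-subfolder convention in the `EXAMPLE_DATA_FILES` comment work: pooch mirrors the same relative layout in the local cache.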
Third hunk:

```diff
@@ -109,7 +134,7 @@
     datasets : list of str
         The names of the available example datasets.
     """
-    return list(example_data_files.keys())
+    return list(EXAMPLE_DATA_FILES.keys())
 
 
 def download_example_dataset(dataset: str, data_home=None):
```
Fourth hunk:

```diff
@@ -133,26 +158,30 @@
         Path to the folder containing the downloaded dataset files.
     """
     # Dev note: `dataset` is assumed to be a folder name with netcdf files
-    if dataset not in example_data_files:
+    if dataset not in EXAMPLE_DATA_FILES:
         raise ValueError(
-            f"Dataset {dataset!r} not found. Available datasets are: " + ", ".join(example_data_files.keys())
+            f"Dataset {dataset!r} not found. Available datasets are: " + ", ".join(EXAMPLE_DATA_FILES.keys())
         )
+    odie = _get_pooch(data_home=data_home)
 
-    cache_folder = get_data_home(data_home)
-    dataset_folder = Path(cache_folder) / dataset
+    cache_folder = Path(odie.path)
+    dataset_folder = cache_folder / dataset
 
-    if not dataset_folder.exists():
-        dataset_folder.mkdir(parents=True)
+    for file_name in odie.registry:
+        if file_name.startswith(dataset):
+            should_patch = dataset == "GlobCurrent_example_data"
+            odie.fetch(file_name, processor=v4_compat_patch if should_patch else None)
 
-    for filename in example_data_files[dataset]:
-        filepath = dataset_folder / filename
-        if not filepath.exists():
-            url = f"{example_data_url}/{dataset}/{filename}"
-            urlretrieve(url, str(filepath))
-
-            should_patch = dataset == "GlobCurrent_example_data"
-
-            if should_patch:
-                xr.load_dataset(filepath).pipe(patch_dataset_v4_compat).to_netcdf(filepath)
-
     return dataset_folder
+
+
+def v4_compat_patch(fname, action, pup):
+    """
+    Patch the GlobCurrent example dataset to be compatible with v4.
+
+    See https://www.fatiando.org/pooch/latest/processors.html#creating-your-own-processors
+    """
+    if action == "fetch":
+        return fname
+    xr.load_dataset(fname).pipe(patch_dataset_v4_compat).to_netcdf(fname)
+    return fname
```

(erikvansebille marked a review conversation in this hunk as resolved.)
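The `v4_compat_patch` function above follows pooch's post-download processor pattern: a callable receiving `(fname, action, pup)`, where `action` is `"download"` (new file), `"update"` (stale file re-downloaded), or `"fetch"` (file already in the cache), and which must return the path to hand back to the caller. The sketch below exercises that pattern without pooch or xarray; `patch_processor` and its trivial text rewrite are hypothetical stand-ins for the real netCDF patch:

```python
import pathlib
import tempfile


def patch_processor(fname, action, pup):
    """Stand-in for a pooch processor: patch a file once, after download.

    Mirrors v4_compat_patch's structure: on "fetch" (cached, already
    patched) return the path untouched; on "download"/"update" rewrite
    the file in place. A text substitution stands in for the xarray patch.
    """
    if action == "fetch":
        return fname
    path = pathlib.Path(fname)
    path.write_text(path.read_text().replace("v3", "v4"))
    return fname


# Simulate the calls pooch would make for a fresh and then a cached file:
with tempfile.TemporaryDirectory() as tmp:
    f = pathlib.Path(tmp) / "data.txt"
    f.write_text("format: v3")
    patch_processor(str(f), "download", None)  # freshly downloaded -> patched
    print(f.read_text())                       # format: v4
    patch_processor(str(f), "fetch", None)     # cached -> left as-is
    print(f.read_text())                       # format: v4
```

Doing the patch inside a processor (rather than after `urlretrieve`, as in the removed code) means the file is rewritten exactly once per download, and subsequent `fetch` calls return the already-patched cached copy.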