Commit a7a76ad

feat: support generic http(s) urls (#8)
* feat: support generic http(s) urls
* fix: add tests
* update README and add CHANGELOG
* improve test url
* fix: bring back monkeypatching
1 parent 99a38cc commit a7a76ad

7 files changed

Lines changed: 253 additions & 144 deletions

CHANGELOG.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+<!--
+SPDX-FileCopyrightText: Contributors to PyPSA-Eur <https://github.com/pypsa/pypsa-eur>
+SPDX-License-Identifier: CC-BY-4.0
+-->
+
+# Changelog
+
+## Unreleased — **Breaking change**
+
+This release accepts all HTTP(S) URLs and therefore conflicts with `snakemake-storage-plugin-http`.
+**You must uninstall `snakemake-storage-plugin-http`** before upgrading, otherwise Snakemake will
+raise *"Multiple suitable storage providers found"* for any HTTP(S) URL.
+
+### Added
+
+- Generic HTTP(S) fallback: any `http://` or `https://` URL is now accepted, with size and
+  mtime read from `Content-Length` and `Last-Modified` response headers. Servers that do not
+  support `HEAD` requests are handled gracefully (size and mtime default to 0). No checksum
+  is available for generic URLs.
+
+### Removed
+
+- Dependency on `snakemake-storage-plugin-http` — this plugin now handles all HTTP(S) URLs
+  directly, with no monkey-patching required.
+
+## v0.4.0 — Google Cloud Storage support
+
+### Added
+
+- Support for `storage.googleapis.com` URLs with checksum verification via the GCS JSON API
+  (`md5Hash` field) and mtime from GCS object metadata.
+
+## v0.3.0 — data.pypsa.org support
+
+### Added
+
+- Support for `data.pypsa.org` URLs with checksum verification via `manifest.yaml` files
+  discovered by searching up the directory tree.
+- Redirect support: manifest entries can specify a `redirect` field to point to another path.
+
+## v0.2.0 — Dynamic versioning and zstd support
+
+### Added
+
+- Dynamic versioning via `setuptools-scm`.
+- `zstandard` dependency for decompressing Cloudflare-compressed responses.
+
+## v0.1.0 — Initial release
+
+### Added
+
+- Snakemake storage plugin for Zenodo URLs (`zenodo.org`, `sandbox.zenodo.org`) with:
+  - Local filesystem caching via `Cache` class
+  - Checksum verification from Zenodo API
+  - Adaptive rate limiting using `X-RateLimit-*` headers with exponential backoff retry
+  - Concurrent download limiting via semaphore
+  - Progress bars with `tqdm-loggable`
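
Review note: the "Generic HTTP(S) fallback" entry describes reading size and mtime from response headers and defaulting both to 0 when `HEAD` is unavailable. The plugin's own code is not part of this excerpt, so the following is only a minimal sketch of that behaviour under stated assumptions: the function name `size_and_mtime` is hypothetical, and passing `None` stands in for "the `HEAD` request failed or is unsupported".

```python
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional, Tuple


def size_and_mtime(headers: Optional[Mapping[str, str]]) -> Tuple[int, float]:
    """Return (size in bytes, mtime as a POSIX timestamp) from response headers.

    Hypothetical helper: pass None when the HEAD request failed or is
    unsupported. Missing or malformed values fall back to 0.
    """
    if headers is None:
        return 0, 0.0

    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    try:
        size = int(lowered.get("content-length", "0"))
    except ValueError:
        size = 0

    try:
        mtime = parsedate_to_datetime(lowered["last-modified"]).timestamp()
    except (KeyError, TypeError, ValueError):
        # Header absent or not a valid HTTP date.
        mtime = 0.0

    return size, mtime
```

A caller would pass the header mapping of whatever HTTP client it uses. One consequence worth keeping in mind when reviewing: a server without `Last-Modified` yields mtime 0, which is the same value the README reports for the immutable Zenodo and data.pypsa.org sources.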

README.md

Lines changed: 23 additions & 14 deletions
@@ -11,6 +11,7 @@ A Snakemake storage plugin for downloading files via HTTP with local caching, ch
 - **zenodo.org** - Zenodo data repository (checksum from API)
 - **data.pypsa.org** - PyPSA data repository (checksum from manifest.yaml)
 - **storage.googleapis.com** - Google Cloud Storage (checksum from GCS JSON API)
+- **any http(s) URL** - Generic fallback with size/mtime from HTTP headers

 ## Features

@@ -19,7 +20,7 @@ A Snakemake storage plugin for downloading files via HTTP with local caching, ch
 - **Rate limit handling**: Automatically respects Zenodo's rate limits using `X-RateLimit-*` headers with exponential backoff retry
 - **Concurrent download control**: Limits simultaneous downloads to prevent overwhelming servers
 - **Progress bars**: Shows download progress with tqdm
-- **Immutable URLs**: Returns mtime=0 for Zenodo and data.pypsa.org (persistent URLs); uses actual mtime for GCS
+- **Immutable URLs**: Returns mtime=0 for Zenodo and data.pypsa.org (persistent URLs); uses actual mtime for GCS and generic HTTP
 - **Environment variable support**: Configure via environment variables for CI/CD workflows

 ## Installation
@@ -67,7 +68,7 @@ If you don't explicitly configure it, the plugin will use default settings autom

 ## Usage

-Use Zenodo, data.pypsa.org, or Google Cloud Storage URLs directly in your rules. Snakemake automatically detects supported URLs and routes them to this plugin:
+Use any HTTP(S) URL directly in your rules. Snakemake automatically routes all HTTP(S) URLs to this plugin:

 ```python
 rule download_zenodo:
@@ -93,6 +94,14 @@ rule download_gcs:
         "resources/cba_projects.zip"
     shell:
         "cp {input} {output}"
+
+rule download_generic:
+    input:
+        storage("https://example.com/data/dataset.csv"),
+    output:
+        "resources/dataset.csv"
+    shell:
+        "cp {input} {output}"
 ```

 Or if you configured a tagged storage entity:
@@ -116,7 +125,7 @@ The plugin will:
    - Progress bar showing download status
    - Automatic rate limit handling with exponential backoff retry
    - Concurrent download limiting
-   - Checksum verification (from Zenodo API, data.pypsa.org manifest, or GCS metadata)
+   - Checksum verification where available (Zenodo API, data.pypsa.org manifest, GCS metadata)
 4. Store in cache for future use (if caching is enabled)

 ### Example: CI/CD Configuration
@@ -148,19 +157,19 @@ The plugin automatically:

 ## URL Handling

-- Handles URLs from `zenodo.org`, `sandbox.zenodo.org`, `data.pypsa.org`, and `storage.googleapis.com`
-- Other HTTP(S) URLs are handled by the standard `snakemake-storage-plugin-http`
-- Both plugins can coexist in the same workflow
-
-### Plugin Priority
+This plugin accepts **all HTTP(S) URLs** and replaces `snakemake-storage-plugin-http`. It provides
+enhanced support for specific sources:

-When using `storage()` without specifying a plugin name, Snakemake checks all installed plugins:
-- **Cached HTTP plugin**: Only accepts zenodo.org, data.pypsa.org, and storage.googleapis.com URLs
-- **HTTP plugin**: Accepts all HTTP/HTTPS URLs (including zenodo.org)
+| Source | Checksum | mtime | Immutable |
+|---|---|---|---|
+| `zenodo.org`, `sandbox.zenodo.org` | ✓ (from API) | — | ✓ |
+| `data.pypsa.org` | ✓ (from manifest.yaml) | — | ✓ |
+| `storage.googleapis.com` | ✓ (from GCS API) | ✓ | — |
+| any other HTTP(S) | — | ✓ (Last-Modified) | — |

-If both plugins are installed, supported URLs would be ambiguous - both plugins accept them.
-Typically snakemake would raise an error: **"Multiple suitable storage providers found"** if you try to use `storage()` without specifying which plugin to use, ie. one needs to explicitly call the Cached HTTP provider using `storage.cached_http(url)` instead of `storage(url)`,
-but we monkey-patch the http plugin to refuse zenodo.org, data.pypsa.org, and storage.googleapis.com URLs.
+Generic HTTP URLs are treated as mutable: size and mtime are read from `Content-Length` and
+`Last-Modified` response headers. Servers that do not support `HEAD` requests are handled
+gracefully (size and mtime default to 0).

 ## License
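
Review note: the new "URL Handling" table maps each source to its checksum, mtime and immutability behaviour, with every other HTTP(S) URL falling through to the generic row. A minimal sketch of that routing follows. The names `SOURCES`, `GENERIC` and `classify` are hypothetical, and exact hostname matching is an assumption, since the diff does not show how the plugin itself matches hosts.

```python
from urllib.parse import urlparse

# Capabilities per source, copied from the README table in this diff.
SOURCES = {
    "zenodo.org": {"checksum": True, "mtime": False, "immutable": True},
    "sandbox.zenodo.org": {"checksum": True, "mtime": False, "immutable": True},
    "data.pypsa.org": {"checksum": True, "mtime": False, "immutable": True},
    "storage.googleapis.com": {"checksum": True, "mtime": True, "immutable": False},
}

# Fallback row: "any other HTTP(S)".
GENERIC = {"checksum": False, "mtime": True, "immutable": False}


def classify(url: str) -> dict:
    """Return the capability row that applies to an HTTP(S) URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"not an HTTP(S) URL: {url!r}")
    # Assumption: hosts are matched exactly, ignoring case.
    host = (parsed.hostname or "").lower()
    return SOURCES.get(host, GENERIC)
```

With this shape, `classify("https://example.com/data/dataset.csv")` returns the generic row, which is the case the new `download_generic` rule exercises.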

pyproject.toml

Lines changed: 2 additions & 3 deletions
@@ -17,10 +17,9 @@ dependencies = [
     "httpx ~= 0.27",
     "platformdirs ~= 4.0",
     "reretry ~= 0.11",
-    "snakemake-interface-common ~= 1.14",
+    "snakemake-interface-common >=1.14,<2.0",
     "snakemake-interface-storage-plugins >=4.2,<5.0",
-    "snakemake-storage-plugin-http ~= 0.3",
-    "tqdm-loggable ~= 0.2",
+    "tqdm-loggable ~= 0.3",
     "typing-extensions ~= 4.15",
     "zstandard ~=0.25.0",
 ]
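
Review note: this hunk mixes the compatible-release operator (`~=`) with explicit ranges. Under PEP 440, `~= 1.14` means `>=1.14, ==1.*`, so rewriting it as `>=1.14,<2.0` should be largely a change of notation, while `~= 0.2` to `~= 0.3` raises the lower bound. The sketch below expands `~=` into an approximate range for plain numeric versions only. The helper name is hypothetical, and pre-release handling is deliberately ignored.

```python
def expand_compatible_release(version: str) -> str:
    """Approximate `~= version` as an explicit `>=lower,<upper` range.

    Hypothetical helper for plain dotted numeric versions only (no
    pre-release, post-release or epoch segments).
    """
    version = version.strip()
    parts = version.split(".")
    if len(parts) < 2:
        # PEP 440 does not allow `~=` with a single-segment version.
        raise ValueError("compatible release needs at least two segments")
    # Drop the last segment, then bump the new last segment for the upper bound.
    prefix = [int(part) for part in parts[:-1]]
    prefix[-1] += 1
    upper = ".".join(str(number) for number in prefix)
    return f">={version},<{upper}"


for spec in ("1.14", "0.3", "0.25.0"):
    print(f"~= {spec}  ->  {expand_compatible_release(spec)}")
```

For the pins in this hunk, the expansion shows that `zstandard ~=0.25.0` stays within the 0.25 series, whereas `tqdm-loggable ~= 0.3` accepts any later 0.x release.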
