
Commit ea6b522

Eli authored and committed
Add process_elev script and change site_key to the hardwired name station_id.
This simplifies away site_id.
1 parent 7f9540d commit ea6b522

10 files changed

Lines changed: 270 additions & 223 deletions


AGENTS.md

Lines changed: 9 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -93,6 +93,15 @@ Ad hoc reading with pd.read_csv discouraged.
9393
* or they can take a string that evaluates to a path using `config_file()` in `dbase_config.py`
9494
* for this reason the use of Path rather than str is often not preferred.
9595

96+
### Coordinates
97+
98+
Coordinates are the **single responsibility of the station registry** (`station_dbase.csv`).
99+
100+
- Registry columns: `agency_lat`, `agency_lon` (WGS84, agency-reported), `x`, `y` (EPSG:26910, adjusted)
101+
- Output file headers use: `latitude`, `longitude`, `projection_x_coordinate`, `projection_y_coordinate`
102+
- Dropbox recipes **must not** contain literal coordinate values — they are auto-populated from the registry during processing. Any of `lat`, `lon`, `latitude`, `longitude`, `agency_lat`, `agency_lon`, `x`, `y`, `projection_x_coordinate`, `projection_y_coordinate` in a recipe metadata section will raise an error.
103+
- To fix missing or wrong coordinates, update the registry CSV — not the recipe.
104+
96105
## Tests
97106

98107
- `tests/` — unit and integration tests with monkeypatched config; no real repo needed

README-dropbox.md

Lines changed: 129 additions & 113 deletions
Original file line numberDiff line numberDiff line change
@@ -2,149 +2,165 @@
22

33
## Overview
44

5-
The Dropbox Data Processing System is a component of the DMS Datastore package designed to facilitate the collection, transformation, and storage of time-series data. It provides a flexible configuration-based mechanism to process data files from various sources and integrate them into a standardized repository format.
5+
The dropbox system reads unformatted time-series data files from arbitrary sources,
6+
applies transforms, attaches standardized metadata, and writes formatted CSV files
7+
into a staging area. Optionally it reconciles staged files into a repository.
68

7-
## Key Components
9+
The entry point is a YAML specification file (a "recipe") that describes one or more
10+
data ingestion tasks. Recipes use [OmegaConf](https://omegaconf.readthedocs.io/)
11+
for variable interpolation.
812

9-
### 1. `dropbox_data.py`
13+
## CLI
1014

11-
This is the main processing script that handles data collection, metadata enrichment, and storage. It reads configuration from a YAML specification file and processes data according to the defined rules.
12-
13-
### 2. `dropbox_spec.yaml`
14-
15-
This YAML configuration file defines data sources, collection parameters, and metadata specifications. It serves as the blueprint for how data should be processed.
16-
17-
## How It Works
18-
19-
The system follows these steps:
20-
21-
1. Reads a YAML specification file
22-
2. For each data entry in the specification:
23-
- Locates source files based on patterns and locations
24-
- Reads time-series data
25-
- Augments with metadata (either directly specified or inferred)
26-
- Produces standardized output files in a designated location
27-
28-
## Usage
15+
```bash
16+
dms dropbox --input dropbox_spec.yaml # run all entries
17+
dms dropbox --input dropbox_spec.yaml --name ccfb # run one entry by name
18+
dms dropbox --input dropbox_spec.yaml --debug # verbose logging
19+
dms dropbox --input dropbox_spec.yaml --logdir ./logs --quiet
20+
```
2921

30-
### Basic Usage
22+
Options:
23+
- `--input` (required): Path to the YAML recipe file.
24+
- `--name` (repeatable): Run only the named recipe entry/entries.
25+
- `--logdir`: Directory for log files.
26+
- `--debug`: Enable debug-level logging.
27+
- `--quiet`: Suppress console output.
3128

32-
To process data according to the specification:
29+
## Programmatic Use
3330

3431
```python
3532
from dms_datastore.dropbox_data import dropbox_data
36-
37-
# Process data using the specification file
38-
dropbox_data("path/to/dropbox_spec.yaml")
33+
dropbox_data("dropbox_spec.yaml")
34+
dropbox_data("dropbox_spec.yaml", selected_names=["ccfb"])
3935
```
4036

41-
Alternatively, you can run the script directly:
37+
## Recipe Structure
4238

43-
```bash
44-
python -m dms_datastore.dropbox_data
39+
```yaml
40+
# Top-level variables available via ${...} interpolation
41+
dropbox_home: //cnrastore-bdo/Modeling_Data/repo_staging/dropbox
42+
target_tz: "Etc/GMT+8"
43+
44+
data:
45+
- name: <unique recipe entry name>
46+
skip: false # optional, set true to skip
47+
48+
collect:
49+
file_pattern: "*.csv" # glob or filename template (see below)
50+
location: "${dropbox_home}/subdir"
51+
recursive_search: false
52+
reader: read_ts # currently the only supported reader
53+
reader_args: {} # optional kwargs passed to reader
54+
selector: null # column name to select, or null
55+
wildcard: null # null | time_shard | time_overlap
56+
merge_method: ts_splice # ts_splice | ts_merge (for time_overlap)
57+
merge_args: {} # kwargs to merge function
58+
splice_args: {} # optional: {rename: value} or {rename: {old: new}}
59+
60+
transforms: # optional, applied in order
61+
- dst_tz # string form (no args)
62+
- name: coarsen # dict form (with args)
63+
args:
64+
grid: 2min
65+
preserve_vals: [0.0]
66+
67+
metadata:
68+
station_id: <id> # required (literal, infer_from_filename, or infer_from_agency_id)
69+
subloc: default # required
70+
source: <source> # required
71+
agency: <agency> # required (literal or registry_lookup)
72+
param: <param> # required
73+
unit: <unit> # required
74+
time_zone: Etc/GMT+8 # required
75+
freq: infer # required (literal freq string, "infer", or None for irregular)
76+
# Other fields as needed (station_name: registry_lookup, etc.)
77+
# Coordinates are NOT allowed here — they are auto-populated from the registry.
78+
79+
output:
80+
repo_name: formatted # must match a repo in dstore_config.yaml
81+
staging:
82+
dir: ./drop_staging # must exist; staged files written here
83+
write_args: # optional kwargs to write_ts_csv
84+
float_format: "%.4f"
85+
chunk_years: false
86+
reconcile: # optional; if present, staged files are reconciled into repo
87+
#repo_data_dir: ./fake_repo # override target dir (omit to use repo root from config)
88+
prefer: staged # staged | repo
89+
allow_new_series: true
90+
inspection:
91+
recent_years: 3
92+
p3: 0.15
93+
p10: 0.05
4594
```
4695
47-
### Configuration Specification
48-
49-
The `dropbox_spec.yaml` file has the following structure:
50-
51-
- `dropbox_home`: Base directory for data processing
52-
- `dest`: Destination folder for processed files
53-
- `data`: List of data sources to process, each with:
54-
- `name`: Descriptive name for the data source
55-
- `skip`: Optional flag to skip processing (True/False)
56-
- `collect`: Collection parameters including:
57-
- `name`: Collection method name
58-
- `file_pattern`: Pattern for matching files
59-
- `location`: Source directory path
60-
- `recursive_search`: Whether to search subdirectories
61-
- `reader`: Reading method (e.g., "read_ts")
62-
- `selector`: Column selector (optional)
63-
- `metadata`: Static metadata fields including:
64-
- `station_id`: Station identifier (or "infer_from_agency_id" for dynamic inference)
65-
- `source`: Data source name
66-
- `agency`: Agency name
67-
- `param`: Parameter type (flow, temp, etc.)
68-
- `sublocation`: Sub-location identifier
69-
- `unit`: Measurement unit
70-
- `metadata_infer`: Optional rules for inferring metadata from filenames:
71-
- `regex`: Regular expression pattern
72-
- `groups`: Mapping of regex groups to metadata fields
73-
74-
## Example Configuration
75-
76-
Below is an example entry from the configuration file:
96+
## Metadata Sentinels
7797
78-
```yaml
79-
- name: USGS Aquarius flows
80-
skip: False
81-
collect:
82-
name: file_search
83-
recursive_search: True
84-
file_pattern: "Discharge.ft^3_s.velq@*.EntireRecord.csv"
85-
location: "//cnrastore-bdo/Modeling_Data/repo_staging/dropbox/usgs_aquarius_request_2020/**"
86-
reader: read_ts
87-
metadata:
88-
station_id: infer_from_agency_id
89-
source: aquarius
90-
agency: usgs
91-
param: flow
92-
sublocation: default
93-
unit: ft^3/s
94-
metadata_infer:
95-
regex: .*@(.*)\.EntireRecord.csv
96-
groups:
97-
1: agency_id
98-
```
98+
Recipe metadata values can be:
99+
- **Literal**: `station_id: anh` — used as-is.
100+
- **`infer_from_filename`**: Parsed from the filename using the `file_pattern` template.
101+
- **`registry_lookup`**: Looked up from the station registry CSV by station_id or agency_id.
102+
Supported fields: `station_name`, `agency`, `agency_id`.
103+
- **`infer_from_agency_id`**: Special value for `station_id` — resolves station_id from the
104+
registry by matching `agency_id`.
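
The sentinel rules above can be sketched as a small resolver. This is a minimal illustration, not the package's implementation: `resolve_metadata`, `find_row`, the in-memory `REGISTRY` list, and its placeholder values are all hypothetical stand-ins for the registry CSV.

```python
# Hypothetical in-memory registry with placeholder values; the real registry is a CSV.
REGISTRY = [
    {"station_id": "anh", "agency_id": "A0001", "agency": "dwr",
     "station_name": "Example Station"},
]

# Fields the README lists as supported by registry_lookup.
LOOKUP_FIELDS = ("station_name", "agency", "agency_id")


def find_row(registry, key, value):
    """Return the first registry row whose `key` equals `value`."""
    for row in registry:
        if row[key] == value:
            return row
    raise KeyError(f"No registry row with {key}={value!r}")


def resolve_metadata(metadata, registry, inferred=None):
    """Replace sentinel values in a recipe metadata dict (assumed semantics)."""
    inferred = dict(inferred or {})
    out = dict(metadata)

    # infer_from_filename: take the value parsed from the file name.
    for key, value in metadata.items():
        if value == "infer_from_filename":
            out[key] = inferred[key]

    # infer_from_agency_id: resolve station_id by matching agency_id.
    if out.get("station_id") == "infer_from_agency_id":
        row = find_row(registry, "agency_id", inferred["agency_id"])
        out["station_id"] = row["station_id"]
    else:
        row = find_row(registry, "station_id", out["station_id"])

    # registry_lookup: copy supported fields from the matched row.
    for key, value in metadata.items():
        if value == "registry_lookup":
            if key not in LOOKUP_FIELDS:
                raise ValueError(f"registry_lookup is not supported for {key!r}")
            out[key] = row[key]
    return out


meta = {"station_id": "infer_from_agency_id", "agency": "registry_lookup",
        "station_name": "registry_lookup", "param": "flow"}
resolved = resolve_metadata(meta, REGISTRY, inferred={"agency_id": "A0001"})
```

Literal values such as `param: flow` pass through unchanged, which keeps the three sentinel cases the only special handling.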
99105

100-
## Key Classes and Functions
106+
## Coordinate Policy
101107

102-
### DataCollector
108+
Geospatial coordinates are **always auto-populated from the station registry** (e.g.
109+
`station_dbase.csv`). Recipe authors must not include coordinate fields in `metadata:`.
103110

104-
A class that handles file discovery based on specified patterns:
111+
The following keys are banned in recipe metadata sections:
105112

106-
```python
107-
collector = DataCollector(name, location, file_pattern, recursive)
108-
files = collector.data_file_list()
109-
```
113+
> `lat`, `lon`, `latitude`, `longitude`, `agency_lat`, `agency_lon`,
114+
> `x`, `y`, `projection_x_coordinate`, `projection_y_coordinate`
110115

111-
### get_spec
116+
If any of these appear, the recipe will fail with an error directing the user to
117+
add the station to the registry instead.
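
A check of this kind might look like the following sketch. The helper name `check_no_coordinates`, its signature, the `ValueError` type, and the message wording are assumptions; only the list of banned keys comes from this document.

```python
# Keys listed above as banned in recipe metadata sections.
BANNED_COORD_KEYS = frozenset({
    "lat", "lon", "latitude", "longitude", "agency_lat", "agency_lon",
    "x", "y", "projection_x_coordinate", "projection_y_coordinate",
})


def check_no_coordinates(entry_name, metadata):
    """Raise ValueError if a recipe metadata dict contains coordinate keys.

    Hypothetical helper: name, signature, and exception type are assumptions.
    """
    found = sorted(BANNED_COORD_KEYS.intersection(metadata))
    if found:
        raise ValueError(
            f"Recipe entry '{entry_name}' sets coordinate field(s) {found}; "
            "add or fix the station in the registry instead."
        )


# A metadata section without coordinate keys passes silently.
check_no_coordinates("ccfb", {"station_id": "ccfb", "subloc": "default"})
```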
112118

113-
Loads and caches the YAML specification:
119+
The registry provides:
120+
- `agency_lat` / `agency_lon` — agency-reported WGS84 coordinates (written to file
121+
headers as `latitude` / `longitude`)
122+
- `x` / `y` — projected coordinates in EPSG:26910 (UTM Zone 10N), potentially
123+
adjusted for accuracy (written as `projection_x_coordinate` / `projection_y_coordinate`)
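
As a rough illustration of the mapping described in this list, the sketch below renames registry columns to the header keys. The function name `coordinate_headers` and the numeric values are placeholders, and the registry row is represented as a plain dict rather than a CSV record.

```python
# Registry column -> output header key, following the list above.
REGISTRY_TO_HEADER = {
    "agency_lat": "latitude",
    "agency_lon": "longitude",
    "x": "projection_x_coordinate",
    "y": "projection_y_coordinate",
}


def coordinate_headers(registry_row):
    """Return the coordinate header fields for one registry row (hypothetical helper)."""
    return {
        header: registry_row[column]
        for column, header in REGISTRY_TO_HEADER.items()
        if column in registry_row
    }


# Placeholder values, not real station coordinates.
row = {"station_id": "anh", "agency_lat": 38.0, "agency_lon": -121.5,
       "x": 600000.0, "y": 4200000.0}
headers = coordinate_headers(row)
```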
114124

115-
```python
116-
spec = get_spec("dropbox_spec.yaml")
117-
```
125+
## Wildcard Modes
118126

119-
### populate_meta
127+
The `collect.wildcard` field controls how multiple files matching `file_pattern` are handled:
120128

121-
Enriches metadata using the station database:
129+
- **omitted / null**: Pattern must match exactly one file.
130+
- **`time_shard`**: Pass the glob pattern directly to the reader (year-sharded/blocked files). Lexicographic order of the file names is assumed to match chronological order.
131+
- **`time_overlap`**: Glob, read each file individually, then merge via `merge_method`.
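
The three modes can be pictured as a small dispatch function. This is a sketch under stated assumptions: `collect_series` is a hypothetical name, and `reader` and `merge` are caller-supplied callables standing in for the real reader and for `merge_method`.

```python
import glob
import os


def collect_series(location, file_pattern, wildcard, reader, merge=None):
    """Dispatch on the wildcard mode (illustrative only, not the package's code)."""
    pattern = os.path.join(location, file_pattern)
    matches = sorted(glob.glob(pattern))

    if wildcard is None:
        # Omitted / null: the pattern must match exactly one file.
        if len(matches) != 1:
            raise ValueError(
                f"Expected exactly one file for {pattern}, found {len(matches)}"
            )
        return reader(matches[0])

    if wildcard == "time_shard":
        # The reader receives the glob pattern itself and handles the shards.
        return reader(pattern)

    if wildcard == "time_overlap":
        # Read each file individually, then merge the pieces.
        if merge is None:
            raise ValueError("time_overlap requires a merge function")
        return merge([reader(path) for path in matches])

    raise ValueError(f"Unknown wildcard mode: {wildcard!r}")
```

Sorting the matches keeps the `time_overlap` inputs in lexicographic order, which is the same ordering assumption stated for `time_shard`.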
122132

123-
```python
124-
meta_out = populate_meta(file_path, listing, metadata)
125-
```
133+
## Filename Templates (Inference Mode)
126134

127-
### infer_meta
135+
When `file_pattern` contains `{field}` placeholders (e.g.
136+
`{source}_{station_id}_{agency_id}_{param}_{syear}_{eyear}.csv`), the system enters
137+
"inference mode": each matched file's name is parsed to extract metadata fields marked
138+
`infer_from_filename`. In this mode, `wildcard` must be omitted — each file produces
139+
a separate output.
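
One way to picture the parsing step is to turn the template into a regular expression with named groups. The sketch below assumes field values never contain underscores; `template_to_regex` and `parse_filename` are hypothetical names, and the sample file name is made up.

```python
import re


def template_to_regex(template):
    """Compile '{source}_{station_id}.csv'-style templates into a named-group regex."""
    # re.split with one capture group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", template)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))       # literal text between fields
        else:
            pieces.append(f"(?P<{part}>[^_]+)")  # assumption: no underscores in values
    return re.compile("^" + "".join(pieces) + "$")


def parse_filename(template, filename):
    """Return the fields extracted from `filename`, or raise ValueError."""
    match = template_to_regex(template).match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match template {template!r}")
    return match.groupdict()


fields = parse_filename(
    "{source}_{station_id}_{agency_id}_{param}_{syear}_{eyear}.csv",
    "nwis_anh_A0001_flow_2020_2024.csv",  # made-up file name
)
```

Fields marked `infer_from_filename` in the recipe would then take their values from the resulting dict.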
128140

129-
Extracts metadata from file names based on regex patterns:
141+
## Transforms
130142

131-
```python
132-
metadata = infer_meta(file_path, listing)
133-
```
143+
Transforms are applied to the time series after reading (and after merging if applicable).
144+
Built-in transforms:
134145

135-
## Output
146+
- **`dst_st` / `dst_tz`**: Convert from local (DST-aware) time to a fixed timezone.
147+
Args: `src_tz`, `target_tz`.
148+
- **`coarsen`**: Reduce irregular high-frequency data to a regular grid.
149+
Args: `grid`, `preserve_vals`, `qwidth`, `hyst`, `heartbeat_freq`.
136150

137-
Processed files are saved in the destination directory (`dest`) specified in the configuration. Each file is named according to the pattern:
151+
Custom transforms can be registered via `register_transform(name, func)`.
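
To make the registration idea concrete, here is a self-contained stand-in for a name-to-function registry. It is not the package's implementation: the local `register_transform`, the `apply_transforms` helper, and the convention that a transform takes the series first and keyword arguments after are all assumptions, and plain lists stand in for time series.

```python
# Stand-in registry; the real package keeps its own.
_TRANSFORMS = {}


def register_transform(name, func):
    """Register `func` under `name` (local stand-in for the documented call)."""
    _TRANSFORMS[name] = func


def apply_transforms(series, specs):
    """Apply transforms in order; each spec is a name or a {name, args} dict."""
    for spec in specs or []:
        if isinstance(spec, str):
            name, args = spec, {}                          # string form (no args)
        else:
            name, args = spec["name"], spec.get("args", {})  # dict form (with args)
        series = _TRANSFORMS[name](series, **args)
    return series


def scale(values, factor=1.0):
    """Toy transform: multiply every value by `factor`."""
    return [v * factor for v in values]


register_transform("scale", scale)
register_transform("drop_first", lambda values: values[1:])

result = apply_transforms(
    [1.0, 2.0, 3.0],
    ["drop_first", {"name": "scale", "args": {"factor": 2.0}}],
)
```

The two spec shapes mirror the string and dict forms shown under `transforms:` in the recipe structure.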
138152

139-
```
140-
{source}_{station_id}_{agency_id}_{param}.csv
141-
```
153+
## Failure Handling
154+
155+
Each recipe entry is processed independently. If one fails, the error is logged and
156+
processing continues with the next entry. At the end, if any entries failed, a
157+
`RuntimeError` is raised listing all failed entry names. Use `--name <entry>` to
158+
rerun individual failures.
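
The continue-then-raise behavior described here can be sketched as follows. `run_entries` and the `process_entry` callable are hypothetical; only the overall pattern (log, continue, raise `RuntimeError` at the end) comes from the text above.

```python
import logging

logger = logging.getLogger("dropbox_sketch")


def run_entries(entries, process_entry):
    """Process each entry independently; raise once at the end if any failed."""
    failed = []
    for entry in entries:
        name = entry["name"]
        if entry.get("skip", False):
            continue  # entries marked skip: true are not processed
        try:
            process_entry(entry)
        except Exception:
            # Log the traceback and keep going with the next entry.
            logger.exception("Recipe entry %s failed", name)
            failed.append(name)
    if failed:
        raise RuntimeError("Failed recipe entries: " + ", ".join(failed))
```

Collecting the names before raising means a single run reports every failure, and each can then be retried individually.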
142159

143-
Files may be chunked by year depending on the specified options.
160+
## Examples
144161

145-
## Additional Notes
162+
See `examples/dropbox/` for working recipes:
163+
- `dropbox_spec.yaml` — single-file and wildcard patterns
164+
- `dropbox_spec_ccf.yaml` — structure gate data with transforms (coarsen, DST)
165+
- `dropbox_daily.yaml` — template-based inference mode for daily NWIS data
146166

147-
- The system relies on a station database for lookup of station details
148-
- Time-series data is standardized with a "value" column
149-
- Metadata includes geospatial coordinates and projection information
150-
- Files can be chunked by year for easier management of large datasets

dms_datastore/_version.py

Lines changed: 9 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
1-
# file generated by setuptools-scm
1+
# file generated by vcs-versioning
22
# don't change, don't track in version control
3+
from __future__ import annotations
34

45
__all__ = [
56
"__version__",
@@ -10,25 +11,14 @@
1011
"commit_id",
1112
]
1213

13-
TYPE_CHECKING = False
14-
if TYPE_CHECKING:
15-
from typing import Tuple
16-
from typing import Union
17-
18-
VERSION_TUPLE = Tuple[Union[int, str], ...]
19-
COMMIT_ID = Union[str, None]
20-
else:
21-
VERSION_TUPLE = object
22-
COMMIT_ID = object
23-
2414
version: str
2515
__version__: str
26-
__version_tuple__: VERSION_TUPLE
27-
version_tuple: VERSION_TUPLE
28-
commit_id: COMMIT_ID
29-
__commit_id__: COMMIT_ID
16+
__version_tuple__: tuple[int | str, ...]
17+
version_tuple: tuple[int | str, ...]
18+
commit_id: str | None
19+
__commit_id__: str | None
3020

31-
__version__ = version = '0.5.4.dev5+g38eda8888.d20260212'
32-
__version_tuple__ = version_tuple = (0, 5, 4, 'dev5', 'g38eda8888.d20260212')
21+
__version__ = version = '0.6.26.dev1+g5a6f01d8c.d20260502'
22+
__version_tuple__ = version_tuple = (0, 6, 26, 'dev1', 'g5a6f01d8c.d20260502')
3323

34-
__commit_id__ = commit_id = 'g38eda8888'
24+
__commit_id__ = commit_id = 'g5a6f01d8c'

dms_datastore/config_data/dstore_config.yaml

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,16 @@ default_repo: screened
2020
# The registry files provide supplementary metadata that can be used for provider resolution,
2121
# geographical and other purposes.
2222
registries:
23-
continuous: station_dbase.csv
23+
continuous:
24+
file: station_dbase.csv
25+
# column_map: registry CSV column name -> canonical metadata key used in data file headers.
26+
# Unmapped columns pass through by name (e.g. agency_id stays agency_id).
27+
column_map:
28+
name: station_name
29+
lat: latitude
30+
lon: longitude
31+
x: projection_x_coordinate
32+
y: projection_y_coordinate
2433
processed_synthetic: processed_registry.csv
2534
structures: structures_registry.csv
2635
daily: station_dbase.csv
