|
2 | 2 |
|
3 | 3 | ## Overview |
4 | 4 |
|
5 | | -The Dropbox Data Processing System is a component of the DMS Datastore package designed to facilitate the collection, transformation, and storage of time-series data. It provides a flexible configuration-based mechanism to process data files from various sources and integrate them into a standardized repository format. |
| 5 | +The dropbox system reads unformatted time-series data files from arbitrary sources, |
| 6 | +applies transforms, attaches standardized metadata, and writes formatted CSV files |
| 7 | +into a staging area. Optionally it reconciles staged files into a repository. |
6 | 8 |
|
7 | | -## Key Components |
| 9 | +The entry point is a YAML specification file (a "recipe") that describes one or more |
| 10 | +data ingestion tasks. Recipes use [OmegaConf](https://omegaconf.readthedocs.io/) |
| 11 | +for variable interpolation. |
8 | 12 |
|
9 | | -### 1. `dropbox_data.py` |
| 13 | +## CLI |
10 | 14 |
|
11 | | -This is the main processing script that handles data collection, metadata enrichment, and storage. It reads configuration from a YAML specification file and processes data according to the defined rules. |
12 | | - |
13 | | -### 2. `dropbox_spec.yaml` |
14 | | - |
15 | | -This YAML configuration file defines data sources, collection parameters, and metadata specifications. It serves as the blueprint for how data should be processed. |
16 | | - |
17 | | -## How It Works |
18 | | - |
19 | | -The system follows these steps: |
20 | | - |
21 | | -1. Reads a YAML specification file |
22 | | -2. For each data entry in the specification: |
23 | | - - Locates source files based on patterns and locations |
24 | | - - Reads time-series data |
25 | | - - Augments with metadata (either directly specified or inferred) |
26 | | - - Produces standardized output files in a designated location |
27 | | - |
28 | | -## Usage |
| 15 | +```bash |
| 16 | +dms dropbox --input dropbox_spec.yaml # run all entries |
| 17 | +dms dropbox --input dropbox_spec.yaml --name ccfb # run one entry by name |
| 18 | +dms dropbox --input dropbox_spec.yaml --debug # verbose logging |
| 19 | +dms dropbox --input dropbox_spec.yaml --logdir ./logs --quiet |
| 20 | +``` |
29 | 21 |
|
30 | | -### Basic Usage |
| 22 | +Options: |
| 23 | +- `--input` (required): Path to the YAML recipe file. |
| 24 | +- `--name` (repeatable): Run only the named recipe entry/entries. |
| 25 | +- `--logdir`: Directory for log files. |
| 26 | +- `--debug`: Enable debug-level logging. |
| 27 | +- `--quiet`: Suppress console output. |
31 | 28 |
|
32 | | -To process data according to the specification: |
| 29 | +## Programmatic Use |
33 | 30 |
|
34 | 31 | ```python |
35 | 32 | from dms_datastore.dropbox_data import dropbox_data |
36 | | - |
37 | | -# Process data using the specification file |
38 | | -dropbox_data("path/to/dropbox_spec.yaml") |
| 33 | +dropbox_data("dropbox_spec.yaml") |
| 34 | +dropbox_data("dropbox_spec.yaml", selected_names=["ccfb"]) |
39 | 35 | ``` |
40 | 36 |
|
41 | | -Alternatively, you can run the script directly: |
| 37 | +## Recipe Structure |
42 | 38 |
|
43 | | -```bash |
44 | | -python -m dms_datastore.dropbox_data |
| 39 | +```yaml |
| 40 | +# Top-level variables available via ${...} interpolation |
| 41 | +dropbox_home: //cnrastore-bdo/Modeling_Data/repo_staging/dropbox |
| 42 | +target_tz: "Etc/GMT+8" |
| 43 | + |
| 44 | +data: |
| 45 | + - name: <unique recipe entry name> |
| 46 | + skip: false # optional, set true to skip |
| 47 | + |
| 48 | + collect: |
| 49 | + file_pattern: "*.csv" # glob or filename template (see below) |
| 50 | + location: "${dropbox_home}/subdir" |
| 51 | + recursive_search: false |
| 52 | + reader: read_ts # currently the only supported reader |
| 53 | + reader_args: {} # optional kwargs passed to reader |
| 54 | + selector: null # column name to select, or null |
| 55 | + wildcard: null # null | time_shard | time_overlap |
| 56 | + merge_method: ts_splice # ts_splice | ts_merge (for time_overlap) |
| 57 | + merge_args: {} # kwargs to merge function |
| 58 | + splice_args: {} # optional: {rename: value} or {rename: {old: new}} |
| 59 | + |
| 60 | + transforms: # optional, applied in order |
| 61 | + - dst_tz # string form (no args) |
| 62 | + - name: coarsen # dict form (with args) |
| 63 | + args: |
| 64 | + grid: 2min |
| 65 | + preserve_vals: [0.0] |
| 66 | + |
| 67 | + metadata: |
| 68 | + station_id: <id> # required (literal, infer_from_filename, or infer_from_agency_id) |
| 69 | + subloc: default # required |
| 70 | + source: <source> # required |
| 71 | + agency: <agency> # required (literal or registry_lookup) |
| 72 | + param: <param> # required |
| 73 | + unit: <unit> # required |
| 74 | + time_zone: Etc/GMT+8 # required |
| 75 | + freq: infer # required (literal freq string, "infer", or None for irregular) |
| 76 | + # Other fields as needed (station_name: registry_lookup, etc.) |
| 77 | + # Coordinates are NOT allowed here — they are auto-populated from the registry. |
| 78 | + |
| 79 | + output: |
| 80 | + repo_name: formatted # must match a repo in dstore_config.yaml |
| 81 | + staging: |
| 82 | + dir: ./drop_staging # must exist; staged files written here |
| 83 | + write_args: # optional kwargs to write_ts_csv |
| 84 | + float_format: "%.4f" |
| 85 | + chunk_years: false |
| 86 | + reconcile: # optional; if present, staged files are reconciled into repo |
| 87 | + #repo_data_dir: ./fake_repo # override target dir (omit to use repo root from config) |
| 88 | + prefer: staged # staged | repo |
| 89 | + allow_new_series: true |
| 90 | + inspection: |
| 91 | + recent_years: 3 |
| 92 | + p3: 0.15 |
| 93 | + p10: 0.05 |
45 | 94 | ``` |
46 | 95 |
|
47 | | -### Configuration Specification |
48 | | - |
49 | | -The `dropbox_spec.yaml` file has the following structure: |
50 | | - |
51 | | -- `dropbox_home`: Base directory for data processing |
52 | | -- `dest`: Destination folder for processed files |
53 | | -- `data`: List of data sources to process, each with: |
54 | | - - `name`: Descriptive name for the data source |
55 | | - - `skip`: Optional flag to skip processing (True/False) |
56 | | - - `collect`: Collection parameters including: |
57 | | - - `name`: Collection method name |
58 | | - - `file_pattern`: Pattern for matching files |
59 | | - - `location`: Source directory path |
60 | | - - `recursive_search`: Whether to search subdirectories |
61 | | - - `reader`: Reading method (e.g., "read_ts") |
62 | | - - `selector`: Column selector (optional) |
63 | | - - `metadata`: Static metadata fields including: |
64 | | - - `station_id`: Station identifier (or "infer_from_agency_id" for dynamic inference) |
65 | | - - `source`: Data source name |
66 | | - - `agency`: Agency name |
67 | | - - `param`: Parameter type (flow, temp, etc.) |
68 | | - - `sublocation`: Sub-location identifier |
69 | | - - `unit`: Measurement unit |
70 | | - - `metadata_infer`: Optional rules for inferring metadata from filenames: |
71 | | - - `regex`: Regular expression pattern |
72 | | - - `groups`: Mapping of regex groups to metadata fields |
73 | | - |
74 | | -## Example Configuration |
75 | | - |
76 | | -Below is an example entry from the configuration file: |
| 96 | +## Metadata Sentinels |
77 | 97 |
|
78 | | -```yaml |
79 | | -- name: USGS Aquarius flows |
80 | | - skip: False |
81 | | - collect: |
82 | | - name: file_search |
83 | | - recursive_search: True |
84 | | - file_pattern: "Discharge.ft^3_s.velq@*.EntireRecord.csv" |
85 | | - location: "//cnrastore-bdo/Modeling_Data/repo_staging/dropbox/usgs_aquarius_request_2020/**" |
86 | | - reader: read_ts |
87 | | - metadata: |
88 | | - station_id: infer_from_agency_id |
89 | | - source: aquarius |
90 | | - agency: usgs |
91 | | - param: flow |
92 | | - sublocation: default |
93 | | - unit: ft^3/s |
94 | | - metadata_infer: |
95 | | - regex: .*@(.*)\.EntireRecord.csv |
96 | | - groups: |
97 | | - 1: agency_id |
98 | | -``` |
| 98 | +Recipe metadata values can be: |
| 99 | +- **Literal**: `station_id: anh` — used as-is. |
| 100 | +- **`infer_from_filename`**: Parsed from the filename using the `file_pattern` template. |
| 101 | +- **`registry_lookup`**: Looked up from the station registry CSV by station_id or agency_id. |
| 102 | + Supported fields: `station_name`, `agency`, `agency_id`. |
| 103 | +- **`infer_from_agency_id`**: Special value for `station_id` — resolves station_id from the |
| 104 | + registry by matching `agency_id`. |
99 | 105 |
|
100 | | -## Key Classes and Functions |
| 106 | +## Coordinate Policy |
101 | 107 |
|
102 | | -### DataCollector |
| 108 | +Geospatial coordinates are **always auto-populated from the station registry** (e.g. |
| 109 | +`station_dbase.csv`). Recipe authors must not include coordinate fields in `metadata:`. |
103 | 110 |
|
104 | | -A class that handles file discovery based on specified patterns: |
| 111 | +The following keys are banned in recipe metadata sections: |
105 | 112 |
|
106 | | -```python |
107 | | -collector = DataCollector(name, location, file_pattern, recursive) |
108 | | -files = collector.data_file_list() |
109 | | -``` |
| 113 | +> `lat`, `lon`, `latitude`, `longitude`, `agency_lat`, `agency_lon`, |
| 114 | +> `x`, `y`, `projection_x_coordinate`, `projection_y_coordinate` |
110 | 115 |
|
111 | | -### get_spec |
| 116 | +If any of these appear, the recipe will fail with an error directing the user to |
| 117 | +add the station to the registry instead. |
112 | 118 |
|
113 | | -Loads and caches the YAML specification: |
| 119 | +The registry provides: |
| 120 | +- `agency_lat` / `agency_lon` — agency-reported WGS84 coordinates (written to file |
| 121 | + headers as `latitude` / `longitude`) |
| 122 | +- `x` / `y` — projected coordinates in EPSG:26910 (UTM Zone 10N), potentially |
| 123 | + adjusted for accuracy (written as `projection_x_coordinate` / `projection_y_coordinate`) |
114 | 124 |
|
115 | | -```python |
116 | | -spec = get_spec("dropbox_spec.yaml") |
117 | | -``` |
| 125 | +## Wildcard Modes |
118 | 126 |
|
119 | | -### populate_meta |
| 127 | +The `collect.wildcard` field controls how multiple files matching `file_pattern` are handled: |
120 | 128 |
|
121 | | -Enriches metadata using the station database: |
| 129 | +- **omitted / null**: Pattern must match exactly one file. |
| 130 | +- **`time_shard`**: Pass the glob pattern directly to the reader (year-sharded/blocked files). Lexicographical sorting is assumed to match chronological. |
| 131 | +- **`time_overlap`**: Glob, read each file individually, then merge via `merge_method`. |
122 | 132 |
|
123 | | -```python |
124 | | -meta_out = populate_meta(file_path, listing, metadata) |
125 | | -``` |
| 133 | +## Filename Templates (Inference Mode) |
126 | 134 |
|
127 | | -### infer_meta |
| 135 | +When `file_pattern` contains `{field}` placeholders (e.g. |
| 136 | +`{source}_{station_id}_{agency_id}_{param}_{syear}_{eyear}.csv`), the system enters |
| 137 | +"inference mode": each matched file's name is parsed to extract metadata fields marked |
| 138 | +`infer_from_filename`. In this mode, `wildcard` must be omitted — each file produces |
| 139 | +a separate output. |
128 | 140 |
|
129 | | -Extracts metadata from file names based on regex patterns: |
| 141 | +## Transforms |
130 | 142 |
|
131 | | -```python |
132 | | -metadata = infer_meta(file_path, listing) |
133 | | -``` |
| 143 | +Transforms are applied to the time series after reading (and after merging if applicable). |
| 144 | +Built-in transforms: |
134 | 145 |
|
135 | | -## Output |
| 146 | +- **`dst_st` / `dst_tz`**: Convert from local (DST-aware) time to a fixed timezone. |
| 147 | + Args: `src_tz`, `target_tz`. |
| 148 | +- **`coarsen`**: Reduce irregular high-frequency data to a regular grid. |
| 149 | + Args: `grid`, `preserve_vals`, `qwidth`, `hyst`, `heartbeat_freq`. |
136 | 150 |
|
137 | | -Processed files are saved in the destination directory (`dest`) specified in the configuration. Each file is named according to the pattern: |
| 151 | +Custom transforms can be registered via `register_transform(name, func)`. |
138 | 152 |
|
139 | | -``` |
140 | | -{source}_{station_id}_{agency_id}_{param}.csv |
141 | | -``` |
| 153 | +## Failure Handling |
| 154 | + |
| 155 | +Each recipe entry is processed independently. If one fails, the error is logged and |
| 156 | +processing continues with the next entry. At the end, if any entries failed, a |
| 157 | +`RuntimeError` is raised listing all failed entry names. Use `--name <entry>` to |
| 158 | +rerun individual failures. |
142 | 159 |
|
143 | | -Files may be chunked by year depending on the specified options. |
| 160 | +## Examples |
144 | 161 |
|
145 | | -## Additional Notes |
| 162 | +See `examples/dropbox/` for working recipes: |
| 163 | +- `dropbox_spec.yaml` — single-file and wildcard patterns |
| 164 | +- `dropbox_spec_ccf.yaml` — structure gate data with transforms (coarsen, DST) |
| 165 | +- `dropbox_daily.yaml` — template-based inference mode for daily NWIS data |
146 | 166 |
|
147 | | -- The system relies on a station database for lookup of station details |
148 | | -- Time-series data is standardized with a "value" column |
149 | | -- Metadata includes geospatial coordinates and projection information |
150 | | -- Files can be chunked by year for easier management of large datasets |
|
0 commit comments