| 1 | +# dms_datastore — Workspace Instructions |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +`dms_datastore` is a Python library and CLI toolkit for the Delta Modeling Section (DMS) that downloads, formats, screens, and manages continuous time-series data from water-quality and hydrological agencies (USGS, CDEC, NOAA, NCRO, DES, etc.). Data flows through four stages: **raw → formatted → screened → processed**. |
| 6 | + |
| 7 | +## Build and Test |
| 8 | + |
| 9 | +The `dms_datastore` conda environment is assumed to exist. Always activate it before running any tests or install commands. |
| 10 | + |
| 11 | +```bash |
| 12 | +# Install (development mode) |
| 13 | +conda activate dms_datastore |
| 14 | +pip install --no-deps -e . |
| 15 | + |
| 16 | +# Unit/integration tests (no real repo required) |
| 17 | +conda activate dms_datastore && pytest |
| 18 | + |
| 19 | +# Integration tests against a real repository |
| 20 | +conda activate dms_datastore && pytest test_repo/ --repo=<path_to_repo> |
| 21 | + |
| 22 | +# Single file |
| 23 | +conda activate dms_datastore && pytest tests/test_filename.py |
| 24 | +``` |
| 25 | + |
| 26 | +pytest is configured in `pyproject.toml` under `[tool.pytest.ini_options]`: strict markers are enforced, results are written as JUnit XML, and `setup.py` and `build/` are ignored. |
| 27 | + |
| 28 | +## Architecture |
| 29 | + |
| 30 | +| Layer | Modules | Purpose | |
| 31 | +|---|---|---| |
| 32 | +| Public API | `__init__.py` | Re-exports `read_ts_repo`, `read_ts`, `write_ts_csv` | |
| 33 | +| CLI | `__main__.py` | Click group `dms` aggregating all subcommands | |
| 34 | +| Config | `dstore_config.py`, `config_data/dstore_config.yaml` | Repo roots, station DBs, variable/source mappings | |
| 35 | +| File naming | `filename.py` | Parse/render filenames via `interpret_fname` / `meta_to_filename` | |
| 36 | +| I/O | `read_ts.py`, `write_ts.py` | Low-level CSV read/write with YAML front-matter | |
| 37 | +| Multi-file read | `read_multi.py` | `read_ts_repo` — resolves source priority, merges year-sharded files | |
| 38 | +| Download | `download_*.py` | One module per agency (CDEC, NWIS, NOAA, NCRO, DES, HRRR, HYCOM, …) | |
| 39 | +| Pipeline | `populate_repo.py`, `update_repo.py` | Orchestrate download → format → screen | |
| 40 | +| QA/QC | `auto_screen.py`, `screeners.py` | YAML-driven screening; flags stored as `user_flag` column | |
| 41 | +| Utilities | `inventory.py`, `merge_files.py`, `coarsen_file.py`, `rationalize_time_partitions.py`, `reconcile_data.py` | Repo maintenance | |
| 42 | + |
| 43 | +## File Naming Convention |
| 44 | + |
| 45 | +Pattern: `{agency}_{station_id@subloc}_{agency_id}_{variable}_{syear}_{eyear}.csv` |
| 46 | + |
| 47 | +- `@subloc` is omitted when subloc is `default`/`None` |
| 48 | +- End year `9999` means open-ended (actively updated) |
| 49 | +- A modifier is appended to the variable as `variable@modifier`, e.g. `ec@daily` |
| 50 | + |
| 51 | +Examples: |
| 52 | +- `usgs_anh@north_11303500_flow_2024.csv` |
| 53 | +- `cdec_sac_11447650_flow_2020_9999.csv` |
| 54 | + |
| 55 | +See [dms_datastore/filename.py](../dms_datastore/filename.py) for `meta_to_filename` / `interpret_fname`. |
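The naming pattern above can be illustrated with a small regex. This is only a sketch, not the library's `interpret_fname`: the function name `split_fname` and the returned keys are hypothetical, and it assumes that no field contains an underscore and that the end year may be absent, as in the first example.

```python
import re

# Hypothetical sketch of the naming pattern; not the code in filename.py.
# Assumes fields never contain "_" and that the end year is optional.
FNAME_RE = re.compile(
    r"^(?P<agency>[^_]+)_"
    r"(?P<station_id>[^_@]+)(?:@(?P<subloc>[^_]+))?_"
    r"(?P<agency_id>[^_]+)_"
    r"(?P<variable>[^_@]+)(?:@(?P<modifier>[^_]+))?_"
    r"(?P<syear>\d{4})(?:_(?P<eyear>\d{4}))?\.csv$"
)


def split_fname(fname):
    """Split a data file name into its fields, or raise ValueError."""
    match = FNAME_RE.match(fname)
    if match is None:
        raise ValueError(f"not a recognized data file name: {fname}")
    meta = match.groupdict()
    # An end year of 9999 marks an open-ended, actively updated series.
    meta["open_ended"] = meta["eyear"] == "9999"
    return meta


meta = split_fname("cdec_sac_11447650_flow_2020_9999.csv")
```

Here `meta["subloc"]` is `None` because the name carries no `@subloc` part, and `meta["open_ended"]` is `True`.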
| 56 | + |
| 57 | +## Data File Format |
| 58 | + |
| 59 | +CSV files with `#`-commented YAML front-matter: |
| 60 | + |
| 61 | +```csv |
| 62 | +# format: dwr-dms-1.0 |
| 63 | +# date_formatted: 2024-01-15T12:00:00 |
| 64 | +# source_info: |
| 65 | +# siteName: MOKELUMNE R A ANDRUS ISLAND |
| 66 | +datetime,value,user_flag |
| 67 | +2020-01-01 00:00:00,1.5,0 |
| 68 | +``` |
| 69 | + |
| 70 | +- Index column: `datetime` |
| 71 | +- Always two data columns: `value` (float) and `user_flag` (`Int64`, nullable) |
| 72 | +- `user_flag != 0` → anomalous; masked by `read_ts` by default (`read_flagged=True`) |
| 73 | +- Files are year-sharded; wildcards handled automatically by `read_ts` |
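To make the layout concrete, here is a standard-library sketch that separates the commented header from the data rows and masks flagged values. It is not `read_ts`: the function `read_sketch`, its `mask_flagged` parameter, and the sample rows are hypothetical, and it assumes that masking means replacing the value with `None` and that an empty `user_flag` counts as unflagged.

```python
import csv

# Hypothetical sample in the documented layout; the values are made up.
SAMPLE = """\
# format: dwr-dms-1.0
# date_formatted: 2024-01-15T12:00:00
datetime,value,user_flag
2020-01-01 00:00:00,1.5,0
2020-01-01 00:15:00,9.9,1
2020-01-01 00:30:00,1.6,
"""


def read_sketch(text, mask_flagged=True):
    """Return (header_lines, rows); rows are (datetime, value, flag) tuples."""
    header, body = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            # Drop the comment marker and one following space, keeping
            # any further indentation so nested YAML stays intact.
            header.append(line[2:] if line.startswith("# ") else line[1:])
        elif line.strip():
            body.append(line)
    rows = []
    for rec in csv.DictReader(body):
        flag = int(rec["user_flag"]) if rec["user_flag"] else None
        value = float(rec["value"]) if rec["value"] else None
        if mask_flagged and flag not in (None, 0):
            value = None  # assumption: a flagged value is masked as missing
        rows.append((rec["datetime"], value, flag))
    return header, rows


header, rows = read_sketch(SAMPLE)
```

The header lines come back as plain text with the `#` removed, so they could be passed to a YAML parser; that step is left out to keep the sketch dependency-free.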
| 74 | + |
| 75 | +## Key Conventions |
| 76 | + |
| 77 | +- **Station IDs with sublocation**: `station_id@subloc` (e.g. `anh@north`, `msd@bottom`) |
| 78 | +- **Variables with modifier**: `param@modifier` (e.g. `ec@daily`) |
| 79 | +- **Units**: SI for most variables; stage/flow in ft / cfs; salinity as specific conductivity at 25°C (µS/cm) |
| 80 | +- **Source priority** is declared per agency in `dstore_config.yaml` and resolved by `read_ts_repo` — do not hard-code provider preferences in code |
| 81 | +- **Config paths** are resolved by `dstore_config.config_file(label)`, which checks the current working directory first, then `config_data/` |
| 82 | +- New download modules must register as a Click command in `__main__.py` and add an entry point in `pyproject.toml` |
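The lookup order described for config paths (working directory first, then `config_data/`) can be sketched as below. This is an illustration only: `find_config` and its parameters are hypothetical, and it takes a file name directly, whereas the real `config_file` takes a label whose mapping to a file name is not shown here.

```python
from pathlib import Path


def find_config(filename, cwd, package_config_dir):
    """Return the first existing match: cwd, then the packaged config dir.

    Hypothetical helper; it only mirrors the search order stated above.
    """
    for base in (Path(cwd), Path(package_config_dir)):
        candidate = base / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(filename)
```

Passing both directories explicitly, rather than reading them from global state, keeps such a helper easy to exercise in tests with `tmp_path`.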
| 83 | + |
| 84 | +## Tests |
| 85 | + |
| 86 | +- `tests/` — unit and integration tests with monkeypatched config; no real repo needed |
| 87 | +- `test_repo/` — integration tests; pass `--repo=<path>` to pytest |
| 88 | +- Use `tmp_path` and `monkeypatch` for config isolation |
| 89 | +- Do not couple unit tests to the shared repo path |
| 90 | + |
| 91 | +## Key Reference Files |
| 92 | + |
| 93 | +- [README.md](../README.md) — full data model, flags, units, configuration system |
| 94 | +- [README-dropbox.md](../README-dropbox.md) — Dropbox data ingestion via `dropbox_spec.yaml` |
| 95 | +- [README-commands.md](../README-commands.md) — CLI command reference |
| 96 | +- [dms_datastore/config_data/dstore_config.yaml](../dms_datastore/config_data/dstore_config.yaml) — central config |