Commit 05d9daf

Add workspace instructions and project overview to documentation

1 parent 1b0d0b8 commit 05d9daf

File tree

1 file changed (+96, -0)
.github/copilot-instructions.md

Lines changed: 96 additions & 0 deletions
# dms_datastore — Workspace Instructions

## Project Overview

`dms_datastore` is a Python library and CLI toolkit for the Delta Modeling Section (DMS) that downloads, formats, screens, and manages continuous time-series data from water-quality and hydrological agencies (USGS, CDEC, NOAA, NCRO, DES, etc.). Data flows through four stages: **raw → formatted → screened → processed**.

## Build and Test

The `dms_datastore` conda environment is assumed to exist. Always activate it before running any tests or install commands.
```bash
# Install (development mode)
conda activate dms_datastore
pip install --no-deps -e .

# Unit/integration tests (no real repo required)
conda activate dms_datastore && pytest

# Integration tests against a real repository
conda activate dms_datastore && pytest test_repo/ --repo=<path_to_repo>

# Single file
conda activate dms_datastore && pytest tests/test_filename.py
```
pytest is configured in `pyproject.toml` (`[tool.pytest.ini_options]`): strict markers, JUnit XML output, ignores `setup.py` and `build/`.
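A representative fragment consistent with the options just described (the authoritative contents live in `pyproject.toml` itself; the exact flag values here are assumptions):

```toml
[tool.pytest.ini_options]
# strict markers, JUnit XML output, ignore setup.py and build/
addopts = "--strict-markers --junit-xml=test-results.xml --ignore=setup.py --ignore=build"
```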
## Architecture

| Layer | Modules | Purpose |
|---|---|---|
| Public API | `__init__.py` | Re-exports `read_ts_repo`, `read_ts`, `write_ts_csv` |
| CLI | `__main__.py` | Click group `dms` aggregating all subcommands |
| Config | `dstore_config.py`, `config_data/dstore_config.yaml` | Repo roots, station DBs, variable/source mappings |
| File naming | `filename.py` | Parse/render filenames via `interpret_fname` / `meta_to_filename` |
| I/O | `read_ts.py`, `write_ts.py` | Low-level CSV read/write with YAML front-matter |
| Multi-file read | `read_multi.py` | `read_ts_repo` — resolves source priority, merges year-sharded files |
| Download | `download_*.py` | One module per agency (CDEC, NWIS, NOAA, NCRO, DES, HRRR, HYCOM, …) |
| Pipeline | `populate_repo.py`, `update_repo.py` | Orchestrate download → format → screen |
| QA/QC | `auto_screen.py`, `screeners.py` | YAML-driven screening; flags stored as `user_flag` column |
| Utilities | `inventory.py`, `merge_files.py`, `coarsen_file.py`, `rationalize_time_partitions.py`, `reconcile_data.py` | Repo maintenance |
## File Naming Convention

Pattern: `{agency}_{station_id@subloc}_{agency_id}_{variable}_{syear}_{eyear}.csv`

- `@subloc` is omitted when subloc is `default`/`None`
- End year `9999` means open-ended (actively updated)
- `variable@modifier` encodes e.g. `ec@daily`

Examples:

- `usgs_anh@north_11303500_flow_2024.csv`
- `cdec_sac_11447650_flow_2020_9999.csv`

See [dms_datastore/filename.py](../dms_datastore/filename.py) for `meta_to_filename` / `interpret_fname`.
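The pattern above can be approximated with a regex; this is a hypothetical sketch for orientation only, not the actual `interpret_fname` implementation:

```python
import re

# Hypothetical regex for {agency}_{station_id@subloc}_{agency_id}_{variable}_{syear}_{eyear}.csv
# (eyear optional, matching the single-year example above).
FNAME_RE = re.compile(
    r"^(?P<agency>[a-z]+)_"
    r"(?P<station_id>[a-z0-9]+)(?:@(?P<subloc>[a-z0-9]+))?_"
    r"(?P<agency_id>[A-Za-z0-9]+)_"
    r"(?P<variable>[a-z0-9]+(?:@[a-z0-9]+)?)_"
    r"(?P<syear>\d{4})"
    r"(?:_(?P<eyear>\d{4}))?\.csv$"
)

def parse_fname(name):
    """Split a data file name into its metadata fields (sketch)."""
    m = FNAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized data file name: {name}")
    return m.groupdict()
```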
## Data File Format

CSV files with `#`-commented YAML front-matter:

```csv
# format: dwr-dms-1.0
# date_formatted: 2024-01-15T12:00:00
# source_info:
#   siteName: MOKELUMNE R A ANDRUS ISLAND
datetime,value,user_flag
2020-01-01 00:00:00,1.5,0
```

- Index column: `datetime`
- Always two data columns: `value` (float) and `user_flag` (`Int64`, nullable)
- `user_flag != 0` → anomalous; masked by `read_ts` by default (`read_flagged=True`)
- Files are year-sharded; wildcards handled automatically by `read_ts`
## Key Conventions

- **Station IDs with sublocation**: `station_id@subloc` (e.g. `anh@north`, `msd@bottom`)
- **Variables with modifier**: `param@modifier` (e.g. `ec@daily`)
- **Units**: SI for most variables; stage/flow in ft / cfs; salinity as specific conductivity at 25°C (µS/cm)
- **Source priority** is declared per agency in `dstore_config.yaml` and resolved by `read_ts_repo` — do not hard-code provider preferences in code
- **Config paths** are resolved by `dstore_config.config_file(label)` — checks cwd first, then `config_data/`
- New download modules must register as a Click command in `__main__.py` and add an entry point in `pyproject.toml`
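The shared `id@qualifier` convention for stations and variables can be handled with a tiny helper; this is a hypothetical sketch, not an API exported by the library:

```python
def split_qualified(name, default=None):
    """Split 'anh@north' -> ('anh', 'north'); plain 'sac' -> ('sac', default).

    The same convention covers variables, e.g. 'ec@daily' -> ('ec', 'daily').
    """
    base, sep, qualifier = name.partition("@")
    return (base, qualifier) if sep else (base, default)
```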
## Tests

- `tests/` — unit and integration tests with monkeypatched config; no real repo needed
- `test_repo/` — integration tests; pass `--repo=<path>` to pytest
- Use `tmp_path` and `monkeypatch` for config isolation
- Do not couple unit tests to the shared repo path
## Key Reference Files

- [README.md](../README.md) — full data model, flags, units, configuration system
- [README-dropbox.md](../README-dropbox.md) — Dropbox data ingestion via `dropbox_spec.yaml`
- [README-commands.md](../README-commands.md) — CLI command reference
- [dms_datastore/config_data/dstore_config.yaml](../dms_datastore/config_data/dstore_config.yaml) — central config
