
Commit 9799b78 (1 parent: 5af26f3)

Add documentation for Dropbox Data Processing System in README.md

2 files changed: 158 additions & 0 deletions

File: README-dropbox.md (150 additions & 0 deletions)

# Dropbox Data Processing System

## Overview

The Dropbox Data Processing System is a component of the DMS Datastore package designed to facilitate the collection, transformation, and storage of time-series data. It provides a flexible, configuration-based mechanism for processing data files from various sources and integrating them into a standardized repository format.

## Key Components

### 1. `dropbox_data.py`

This is the main processing script. It handles data collection, metadata enrichment, and storage, reading its configuration from a YAML specification file and processing data according to the defined rules.

### 2. `dropbox_spec.yaml`

This YAML configuration file defines data sources, collection parameters, and metadata specifications. It serves as the blueprint for how data should be processed.

## How It Works

The system follows these steps:

1. Reads a YAML specification file
2. For each data entry in the specification:
   - Locates source files based on patterns and locations
   - Reads the time-series data
   - Augments it with metadata (either directly specified or inferred)
   - Produces standardized output files in a designated location

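The steps above can be sketched as a small driver loop. This is an illustrative sketch only: the helper names (`find_files`, `infer_from_name`) and the wiring are hypothetical stand-ins for the package's actual collection and inference machinery, and the example filename is invented.

```python
import re

def find_files(collect):
    # Stand-in for file discovery; the real system searches `location`
    # for files matching `file_pattern`. Here we read a test hook.
    return collect.get("_example_files", [])

def infer_from_name(path, rules):
    # Apply metadata_infer-style regex/groups rules, if present.
    if not rules:
        return {}
    m = re.match(rules["regex"], path)
    if not m:
        return {}
    return {field: m.group(num) for num, field in rules["groups"].items()}

def process_entries(spec):
    """Sketch of the per-entry processing loop (hypothetical wiring)."""
    outputs = []
    for entry in spec["data"]:                  # step 2: each data entry
        if entry.get("skip", False):            # honor the optional skip flag
            continue
        files = find_files(entry["collect"])    # locate source files
        for path in files:
            meta = dict(entry.get("metadata", {}))           # static metadata
            meta.update(infer_from_name(path, entry.get("metadata_infer")))
            outputs.append((path, meta))        # stand-in for writing output
    return outputs
```

In the real system, the final step would read the time series with the configured `reader` and write the standardized file rather than collecting tuples.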
## Usage

### Basic Usage

To process data according to the specification:

```python
from dms_datastore.dropbox_data import dropbox_data

# Process data using the specification file
dropbox_data("path/to/dropbox_spec.yaml")
```

Alternatively, you can run the script directly:

```bash
python -m dms_datastore.dropbox_data
```

### Configuration Specification

The `dropbox_spec.yaml` file has the following structure:

- `dropbox_home`: Base directory for data processing
- `dest`: Destination folder for processed files
- `data`: List of data sources to process, each with:
  - `name`: Descriptive name for the data source
  - `skip`: Optional flag to skip processing (True/False)
  - `collect`: Collection parameters, including:
    - `name`: Collection method name
    - `file_pattern`: Pattern for matching files
    - `location`: Source directory path
    - `recursive_search`: Whether to search subdirectories
  - `reader`: Reading method (e.g., "read_ts")
  - `selector`: Column selector (optional)
  - `metadata`: Static metadata fields, including:
    - `station_id`: Station identifier (or "infer_from_agency_id" for dynamic inference)
    - `source`: Data source name
    - `agency`: Agency name
    - `param`: Parameter type (flow, temp, etc.)
    - `sublocation`: Sub-location identifier
    - `unit`: Measurement unit
  - `metadata_infer`: Optional rules for inferring metadata from filenames:
    - `regex`: Regular expression pattern
    - `groups`: Mapping of regex groups to metadata fields

## Example Configuration

Below is an example entry from the configuration file:

```yaml
- name: USGS Aquarius flows
  skip: False
  collect:
    name: file_search
    recursive_search: True
    file_pattern: "Discharge.ft^3_s.velq@*.EntireRecord.csv"
    location: "//cnrastore-bdo/Modeling_Data/repo_staging/dropbox/usgs_aquarius_request_2020/**"
  reader: read_ts
  metadata:
    station_id: infer_from_agency_id
    source: aquarius
    agency: usgs
    param: flow
    sublocation: default
    unit: ft^3/s
  metadata_infer:
    regex: .*@(.*)\.EntireRecord.csv
    groups:
      1: agency_id
```

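To make the `metadata_infer` rule above concrete, here is how its regex behaves on a filename of the kind the `file_pattern` would match (the filename itself, including the station number, is invented for illustration):

```python
import re

# Invented filename of the shape matched by the example's file_pattern.
fname = "Discharge.ft^3_s.velq@11455420.EntireRecord.csv"

# The regex from the metadata_infer block of the example entry.
m = re.match(r".*@(.*)\.EntireRecord.csv", fname)

# Per the `groups` section, capture group 1 maps to agency_id.
agency_id = m.group(1)
print(agency_id)  # 11455420
```

With `station_id: infer_from_agency_id`, this extracted `agency_id` is what the system would use to look the station up.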
## Key Classes and Functions

### DataCollector

A class that handles file discovery based on specified patterns:

```python
collector = DataCollector(name, location, file_pattern, recursive)
files = collector.data_file_list()
```

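The constructor call above comes from the source; the discovery behavior itself can be approximated with the standard library. The class body below is an assumption about that behavior, not the package's actual code:

```python
import fnmatch
import os

class DataCollectorSketch:
    """Approximation of DataCollector's file discovery via os.walk/fnmatch."""

    def __init__(self, name, location, file_pattern, recursive):
        self.name = name
        self.location = location
        self.file_pattern = file_pattern
        self.recursive = recursive

    def data_file_list(self):
        matches = []
        for root, dirs, files in os.walk(self.location):
            for f in files:
                if fnmatch.fnmatch(f, self.file_pattern):
                    matches.append(os.path.join(root, f))
            if not self.recursive:
                break  # non-recursive: stop after the top-level directory
        return sorted(matches)
```

The `recursive_search` flag in the spec would correspond to the `recursive` argument here.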
### get_spec

Loads and caches the YAML specification:

```python
spec = get_spec("dropbox_spec.yaml")
```

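Load-and-cache behavior of this kind is commonly implemented with a module-level cache keyed on path. A dependency-free sketch under that assumption (not the package's actual implementation; the `loader` argument stands in for the YAML parser, e.g. `yaml.safe_load`):

```python
_SPEC_CACHE = {}

def get_spec_sketch(path, loader):
    """Parse the spec file once per path and reuse the parsed result.

    `loader` is a stand-in for the YAML parser; injecting it keeps
    this sketch free of third-party dependencies.
    """
    if path not in _SPEC_CACHE:
        with open(path) as f:
            _SPEC_CACHE[path] = loader(f.read())
    return _SPEC_CACHE[path]
```

Repeated calls with the same path return the identical cached object, so the file is read and parsed only once.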
### populate_meta

Enriches metadata using the station database:

```python
meta_out = populate_meta(file_path, listing, metadata)
```

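The call signature above is from the source. What "enrichment" plausibly looks like is sketched below; the in-memory station table, its fields, and its values are invented for illustration and are not the package's station database:

```python
# Hypothetical station table keyed by station_id; the real system
# looks these details up in its station database.
STATION_DB = {
    "fpt": {"lat": 38.456, "lon": -121.500, "projection": "EPSG:4326"},
}

def populate_meta_sketch(metadata):
    """Merge static metadata with station details found by station_id."""
    meta = dict(metadata)
    station = STATION_DB.get(meta.get("station_id"), {})
    meta.update(station)  # add coordinates/projection when the station is known
    return meta
```

This mirrors the note later in this document that output metadata carries geospatial coordinates and projection information.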
### infer_meta

Extracts metadata from file names based on regex patterns:

```python
metadata = infer_meta(file_path, listing)
```

## Output

Processed files are saved in the destination directory (`dest`) specified in the configuration. Each file is named according to the pattern:

```
{source}_{station_id}_{agency_id}_{param}.csv
```

Files may be chunked by year depending on the specified options.

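Assembling the documented filename pattern and chunking rows by calendar year might look like the following sketch. The pattern is from this document; the placement of a year tag and the row layout are assumptions, and the real writer is part of the package, not this code:

```python
from datetime import datetime

def output_name(meta, year=None):
    """Build the documented filename pattern, optionally tagged by year.

    Where the year appears in chunked filenames is an assumption here.
    """
    base = "{source}_{station_id}_{agency_id}_{param}".format(**meta)
    return f"{base}_{year}.csv" if year else f"{base}.csv"

def chunk_by_year(rows):
    """Group (timestamp, value) rows into {year: [rows]} chunks."""
    chunks = {}
    for ts, value in rows:
        chunks.setdefault(ts.year, []).append((ts, value))
    return chunks
```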
## Additional Notes

- The system relies on a station database for lookup of station details
- Time-series data is standardized with a "value" column
- Metadata includes geospatial coordinates and projection information
- Files can be chunked by year for easier management of large datasets

File: README.md (8 additions & 0 deletions)

@@ -163,6 +163,14 @@ Data is fetched through download scripts (`download_noaa`, `download_cdec`, `dow

The system handles cases where data for the same station comes from different sources. The `src_priority` mechanism in `read_ts_repo` ensures that data from higher-priority sources is preferred.

### Dropbox Data Processing

The Dropbox Data Processing System provides a mechanism for importing ad-hoc or one-off data files into the repository. This is particularly useful for integrating data that was sourced as files rather than through automated downloads, or for processing data from non-standard sources.

The system uses a YAML configuration file (`dropbox_spec.yaml`) to define data sources, collection patterns, and metadata handling rules. The `dropbox_data.py` script processes these configurations to locate, transform, and store the data in the standardized repository format.

See [README-dropbox.md](README-dropbox.md) for detailed documentation on this system.

## Configuration System

The datastore uses a configuration system based on YAML files and Python modules to manage various aspects of data handling, station metadata, and screening processes.