The Dropbox Data Processing System is a component of the DMS Datastore package designed to facilitate the collection, transformation, and storage of time-series data. It provides a flexible configuration-based mechanism to process data files from various sources and integrate them into a standardized repository format.
This is the main processing script that handles data collection, metadata enrichment, and storage. It reads configuration from a YAML specification file and processes data according to the defined rules.
This YAML configuration file defines data sources, collection parameters, and metadata specifications. It serves as the blueprint for how data should be processed.
The system follows these steps:
- Reads a YAML specification file
- For each data entry in the specification:
  - Locates source files based on patterns and locations
  - Reads time-series data
  - Augments with metadata (either directly specified or inferred)
- Produces standardized output files in a designated location
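The per-entry loop above can be sketched as follows. This is a simplified illustration rather than the package's actual internals: the `collect`, `read_ts`, and `write_out` callables are hypothetical stand-ins injected for clarity.

```python
def process_spec(spec, collect, read_ts, write_out):
    """Walk a parsed dropbox spec (a dict) and process each data entry.

    collect, read_ts and write_out are injected stand-ins for the
    package's file search, reader and writer (hypothetical interfaces).
    """
    processed = []
    for entry in spec["data"]:
        # Honor the optional skip flag on an entry
        if entry.get("skip", False):
            continue
        # Locate source files based on the collect parameters
        for path in collect(entry["collect"]):
            series = read_ts(path)                   # read time-series data
            meta = dict(entry.get("metadata", {}))   # directly specified metadata
            # Produce standardized output in the designated location
            write_out(spec["dest"], path, series, meta)
            processed.append((path, meta))
    return processed
```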
To process data according to the specification:

```python
from dms_datastore.dropbox_data import dropbox_data

# Process data using the specification file
dropbox_data("path/to/dropbox_spec.yaml")
```

Alternatively, you can run the script directly:

```shell
python -m dms_datastore.dropbox_data
```

The dropbox_spec.yaml file has the following structure:
- dropbox_home: Base directory for data processing
- dest: Destination folder for processed files
- data: List of data sources to process, each with:
  - name: Descriptive name for the data source
  - skip: Optional flag to skip processing (True/False)
  - collect: Collection parameters including:
    - name: Collection method name
    - file_pattern: Pattern for matching files
    - location: Source directory path
    - recursive_search: Whether to search subdirectories
    - reader: Reading method (e.g., "read_ts")
    - selector: Column selector (optional)
  - metadata: Static metadata fields including:
    - station_id: Station identifier (or "infer_from_agency_id" for dynamic inference)
    - source: Data source name
    - agency: Agency name
    - param: Parameter type (flow, temp, etc.)
    - sublocation: Sub-location identifier
    - unit: Measurement unit
  - metadata_infer: Optional rules for inferring metadata from filenames:
    - regex: Regular expression pattern
    - groups: Mapping of regex groups to metadata fields
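Given that structure, a quick sanity check over a parsed spec entry might look like this. This is a sketch: the required-key sets are assumptions drawn from the field list above, not rules enforced by the package.

```python
# Assumed minimum keys for a usable collect block (based on the fields above)
REQUIRED_COLLECT_KEYS = {"name", "file_pattern", "location"}

def check_entry(entry):
    """Return a list of problems found in one 'data' entry of a parsed spec."""
    problems = []
    if "name" not in entry:
        problems.append("entry is missing 'name'")
    missing = REQUIRED_COLLECT_KEYS - entry.get("collect", {}).keys()
    if missing:
        problems.append("collect is missing keys: %s" % sorted(missing))
    meta = entry.get("metadata", {})
    # A dynamically inferred station_id needs a metadata_infer rule to act on
    if meta.get("station_id") == "infer_from_agency_id" and "metadata_infer" not in entry:
        problems.append("station_id is inferred but no metadata_infer rule is given")
    return problems
```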
Below is an example entry from the configuration file:
```yaml
- name: USGS Aquarius flows
  skip: False
  collect:
    name: file_search
    recursive_search: True
    file_pattern: "Discharge.ft^3_s.velq@*.EntireRecord.csv"
    location: "//cnrastore-bdo/Modeling_Data/repo_staging/dropbox/usgs_aquarius_request_2020/**"
    reader: read_ts
  metadata:
    station_id: infer_from_agency_id
    source: aquarius
    agency: usgs
    param: flow
    sublocation: default
    unit: ft^3/s
  metadata_infer:
    regex: .*@(.*)\.EntireRecord.csv
    groups:
      1: agency_id
```

The DataCollector class handles file discovery based on specified patterns:

```python
collector = DataCollector(name, location, file_pattern, recursive)
files = collector.data_file_list()
```

get_spec loads and caches the YAML specification:

```python
spec = get_spec("dropbox_spec.yaml")
```

populate_meta enriches metadata using the station database:

```python
meta_out = populate_meta(file_path, listing, metadata)
```

infer_meta extracts metadata from file names based on regex patterns:

```python
metadata = infer_meta(file_path, listing)
```

Processed files are saved in the destination directory (dest) specified in the configuration. Each file is named according to the pattern:

```
{source}_{station_id}_{agency_id}_{param}.csv
```
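To make the inference concrete, the metadata_infer rule from the example entry can be applied with Python's re module, and the merged metadata then fills the filename pattern. The file name and the "fpt" station code below are made-up illustrations; in the real system the station_id comes from the station database lookup.

```python
import re

# Hypothetical file matched by the example entry's file_pattern
filename = "Discharge.ft^3_s.velq@11455420.EntireRecord.csv"

# metadata_infer: the regex group maps to a metadata field ({1: agency_id})
match = re.match(r".*@(.*)\.EntireRecord.csv", filename)
inferred = {"agency_id": match.group(1)}

# Merge static metadata from the spec entry with the inferred fields
meta = {"source": "aquarius", "param": "flow", **inferred}
meta["station_id"] = "fpt"  # made up; really resolved via the station database

# Output files follow {source}_{station_id}_{agency_id}_{param}.csv
out_name = "{source}_{station_id}_{agency_id}_{param}.csv".format(**meta)
```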
Files may be chunked by year depending on the specified options.
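As an illustration of per-year chunking, rows can be grouped by calendar year like this. This is a simplified stand-in: the actual writer splits the output files themselves, not in-memory rows.

```python
from collections import defaultdict
from datetime import datetime

def chunk_by_year(rows):
    """Group (timestamp, value) rows by calendar year."""
    chunks = defaultdict(list)
    for stamp, value in rows:
        chunks[stamp.year].append((stamp, value))
    return dict(chunks)
```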
- The system relies on a station database for lookup of station details
- Time-series data is standardized with a "value" column
- Metadata includes geospatial coordinates and projection information
- Files can be chunked by year for easier management of large datasets
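The "value" column note can be illustrated with a toy normalizer. This is a sketch assuming a simple two-column CSV with a datetime column; the real readers (e.g. read_ts) handle many agency-specific formats.

```python
import csv
import io

def standardize_columns(csv_text, value_col):
    """Rename a source-specific data column to the standard 'value' name."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{"datetime": row["datetime"], "value": row[value_col]} for row in rows]
```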