- [Challenges and Exceptions](#challenges-and-exceptions)
- [Installation](#installation)
The system handles cases where data for the same station comes from different sources. The `src_priority` mechanism in `read_ts_repo` ensures that data from higher-priority sources is preferred.
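The effect of source prioritization can be illustrated with a small, self-contained pandas sketch (plain pandas, not the datastore's actual implementation): values from the higher-priority source win, and the lower-priority source only fills gaps.

```python
import pandas as pd

# Hypothetical readings for one station from two sources, listed in priority order
idx = pd.date_range("2020-01-01", periods=4, freq="h")
usgs = pd.Series([1.0, None, 3.0, None], index=idx)  # higher priority, with gaps
cdec = pd.Series([9.0, 2.0, 9.0, 4.0], index=idx)    # lower-priority fallback

# Higher-priority values win; lower-priority values only fill missing slots
merged = usgs.combine_first(cdec)
print(merged.tolist())  # [1.0, 2.0, 3.0, 4.0]
```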
## Configuration System
The datastore uses a configuration system based on YAML files and Python modules to manage various aspects of data handling, station metadata, and screening processes.
### Configuration Files
The main configuration files are:
- **dstore_config.yaml**: The central configuration file that defines paths to critical datasets, repository locations, source priorities, and screening configurations.
- **dstore_config.py**: A Python module that reads the YAML configuration and provides functions to access various configuration elements.
#### Station Database and Variable Mappings
The configuration system points to several key data files:
- **station_dbase.csv**: Contains the master database of all stations with their metadata, including:
  - `id`: Internal unique identifier for each station
  - `agency_id`: The ID used by the agency that operates the station
  - Geographic coordinates, station name, and other metadata
- **variable_mappings.csv**: Maps between agency-specific variable codes/names and the standardized variable naming used within the datastore system.
- **variable_definitions.csv**: Defines the standard variables used in the system along with their units and descriptive information.
- **station_subloc.csv**: Contains information about sublocations (e.g., depths, sensor positions) for stations where a simple station ID is insufficient.
### Source Priority Configuration
The `source_priority` section in `dstore_config.yaml` defines the preferred data sources for each agency:
```yaml
source_priority:
  ncro: ['ncro','cdec']
  dwr_ncro: ['ncro']
  des: ['des']
  dwr_des: ['des']
  usgs: ['usgs']
  noaa: ['noaa']
  usbr: ['cdec']
  dwr_om: ['cdec']
  dwr: ['cdec']
  ebmud: ['usgs','ebmud','cdec']
```
This configuration specifies the priority order for data sources when multiple sources exist for the same station. For example, for EBMUD stations the system first tries USGS data, then EBMUD's own data, and finally falls back to CDEC if neither higher-priority source is available.
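The lookup itself amounts to scanning the agency's priority list in order. A small self-contained sketch (`pick_source` and the availability set are illustrative, not part of the datastore API):

```python
# Priority lists as they appear in dstore_config.yaml
source_priority = {"ebmud": ["usgs", "ebmud", "cdec"]}

def pick_source(agency, available):
    """Return the first source in the agency's priority list that has data."""
    for src in source_priority[agency]:
        if src in available:
            return src
    return None

# Suppose no USGS record exists for this station, so EBMUD's own data is used
print(pick_source("ebmud", {"ebmud", "cdec"}))  # ebmud
```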
### Screen Configuration
The screening configuration (referenced by `screen_config` and `screen_config_v20230126` in `dstore_config.yaml`) specifies automated data quality checking rules. The screening configuration YAML file contains rule sets for:
1. **Bounds checking**: Defining acceptable minimum and maximum values for variables
2. **Spike detection**: Parameters for identifying and flagging data spikes
3. **Repetition checking**: Rules for flagging suspicious repetitions in data
4. **Custom screening functions**: Advanced screening algorithms for specific data types
The `auto_screen.py` module applies these rules to incoming data to flag potential quality issues automatically, which can later be reviewed by users.
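The exact schema of the screening file is not documented here, but a rule set of this kind might be sketched as follows (all field names and limits are illustrative assumptions, not the actual schema):

```yaml
# Illustrative sketch only - field names and values are assumptions
ec:
  bounds: {min: 0.0, max: 60000.0}   # acceptable physical range
  spike: {threshold: 500.0}          # jump size that triggers a spike flag
  repetition: {max_repeats: 12}      # flag long runs of identical values
```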
### Using the Configuration System
The `dstore_config.py` module provides several functions to interact with the configuration:
- `station_dbase()`: Returns the station database as a pandas DataFrame
- `sublocation_df()`: Returns the sublocations database
- `configuration()`: Returns the entire configuration dictionary
- `get_config()` or `config_file()`: Returns the path to a specific configuration file
The module implements caching to avoid repeatedly loading the same configuration files.
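This caching follows the standard memoization pattern; a minimal self-contained sketch of the idea (plain `functools`, not the module's actual code):

```python
import functools

@functools.lru_cache(maxsize=None)
def load_config(path):
    """Parse the configuration at `path` once; later calls reuse the result."""
    # A real implementation would parse YAML here; a stand-in dict keeps this runnable
    return {"config_file": path}

first = load_config("dstore_config.yaml")
second = load_config("dstore_config.yaml")
print(first is second)  # True - the second call returned the cached object
```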
## Accessing Datastore Data
The `read_ts_repo` function is the primary way to access data from the datastore. This function handles the complex task of finding the appropriate data files, prioritizing sources based on the configuration, and returning the data as a pandas DataFrame.
### Using the `read_ts_repo` Function
The `read_ts_repo` function requires station identification and variable information to retrieve data. It handles file path construction, source prioritization, and data consolidation automatically.
Basic syntax:
```python
import matplotlib.pyplot as plt

from dms_datastore.read_multi import read_ts_repo

# Basic usage - retrieve data for a station and variable
data = read_ts_repo(station_id="sjj", variable="flow")

# With sublocation - for stations where position matters
data = read_ts_repo(station_id="msd", variable="elev", subloc="bottom")

# Specifying date ranges (after loading)
data = read_ts_repo(station_id="mrz", variable="elev", subloc="upper").loc[
    "2020-01-01":"2020-12-31"
]

# Plotting retrieved data
flow = read_ts_repo(station_id="sjj", variable="flow").loc["2020"]
plt.plot(flow)
plt.title("San Joaquin River Flow at Jersey Point (2020)")
plt.xlabel("Date")
plt.ylabel("Flow (cfs)")
plt.grid(True)
plt.tight_layout()
plt.show()
```
### Caching Data Access
For repeated access to the same data, especially when additional processing is involved, the datastore provides a caching mechanism through the `@cache_dataframe` decorator:
```python
from dms_datastore.read_multi import read_ts_repo
from dms_datastore.caching import cache_dataframe

@cache_dataframe()
def get_filtered_flow(station, variable):
    """Retrieve and process flow data with caching for improved performance."""
    ts = read_ts_repo(station_id=station, variable=variable)
    # ... filtering/processing of `ts` would go here ...
    return ts
```