Skip to content

Latest commit

Β 

History

History
418 lines (318 loc) Β· 14.2 KB

File metadata and controls

418 lines (318 loc) Β· 14.2 KB

Database structure

This document describes the SQLite database produced by digimuh-ingest. The schema follows a star schema design: small dimension tables hold entity metadata; large fact tables hold time-series measurements with foreign keys pointing back to the dimensions.

Design decisions

Why SQLite?

The dataset is large (~8.9 GB CSV, ~842 million rows) but single-user and read-heavy. SQLite handles this well: no server process, a single portable .db file, built-in date/time functions, and indexed queries that are orders of magnitude faster than scanning thousands of CSV files.

Why a star schema?

The raw data arrives as thousands of per-animal CSV files with no relational structure. A star schema gives us:

  • Normalised entity storage β€” animal IDs, sensor names, and barn names each appear once in their dimension table rather than repeated in every row.
  • Provenance β€” every row carries a file_id foreign key to the source_files dimension, so any datum can be traced back to its original CSV.
  • Extensibility β€” the animals dimension can later be enriched with breed, birth date, dam/sire, etc. without touching the fact tables.

Why timestamps stay inline (not normalised)?

Timestamps are stored directly in each fact table as ISO-8601 TEXT strings. A timestamp dimension table would itself contain hundreds of millions of rows (the gouna table alone has sub-second precision) and every query would need a join through it. SQLite's built-in date(), time(), and strftime() functions operate natively on ISO-8601 text, so inline storage is both simpler and faster.

Why the EU ear tag as INTEGER PRIMARY KEY?

EU ear tag numbers are 15-digit integers that are permanent, unique, and never change. SQLite's INTEGER PRIMARY KEY is special: it becomes an alias for the internal rowid, giving the fastest possible lookup. A surrogate key would add a layer of indirection (an extra join) for zero space savings β€” the 8-byte integer is stored in fact tables either way.

Dimension tables

animals

Column Type Description
animal_id INTEGER PRIMARY KEY EU ear tag number (15-digit); IS the rowid

Currently contains only the ID. Future columns: breed, birth date, sex, dam/sire lineage.

sensors

Column Type Description
sensor_id INTEGER PRIMARY KEY Auto-incrementing surrogate key
sensor_name TEXT NOT NULL UNIQUE LoRaWAN sensor name (e.g. "CU-1")

barns

Column Type Description
barn_id INTEGER PRIMARY KEY Auto-incrementing surrogate key
barn_name TEXT NOT NULL UNIQUE smaXtec barn sensor name (e.g. "NewBridge")

source_files

Column Type Description
file_id INTEGER PRIMARY KEY Auto-incrementing surrogate key
filename TEXT NOT NULL Basename of the original CSV file
folder TEXT NOT NULL Containing folder name, or "(standalone)"

Unique constraint on (filename, folder).

Fact tables

Every fact table includes a file_id INTEGER NOT NULL REFERENCES source_files(file_id) column for provenance tracing.

allocations

Records which barn group/pen an animal was assigned to and when.

Column Type FK β†’
animal_id INTEGER animals
datetime_enter TEXT β€”
datetime_exit TEXT β€”
group INTEGER β€”
file_id INTEGER source_files

Source: output_allocations/allocations.csv

diseases

Health events and diagnoses from HerdePlus.

Column Type FK β†’
animal_id INTEGER animals
disease_first_day TEXT β€”
disease_stop_day TEXT β€”
disease_description TEXT β€”
herdeplus_tier_id INTEGER β€”
herdeplus_gesund_id INTEGER β€”
disease_category TEXT β€”
file_id INTEGER source_files

Source: herdeplus_diseases.csv (standalone)

herdeplus

Milking events, MLP (MilchleistungsprΓΌfung) test-day results, and calving records from the HerdePlus herd management system.

Column Type FK β†’
animal_id INTEGER animals
timestamp TEXT β€”
herdeplus_milked_duration_sec REAL β€”
herdeplus_milked_milk_flow REAL β€”
herdeplus_milked_mkg REAL β€”
herdeplus_mlp_mkg REAL β€”
herdeplus_mlp_fat_percent REAL β€”
herdeplus_mlp_fkg REAL β€”
herdeplus_mlp_protein_percent REAL β€”
herdeplus_mlp_ekg_percent REAL β€”
herdeplus_mlp_lactose REAL β€”
herdeplus_mlp_cell_count REAL β€”
herdeplus_mlp_urea REAL β€”
herdeplus_mlp_f_e REAL β€”
herdeplus_mlp_lkg REAL β€”
herdeplus_mlp_ecm REAL β€”
herdeplus_calving_lactation TEXT β€”
file_id INTEGER source_files

Source: outputs_herdeplus_mlp_gemelk_kalbung/{animal_id}_herdeplus_*.csv

bcs

Body Condition Score assessments.

Column Type FK β†’
animal_id INTEGER animals
timestamp TEXT β€”
bcs_wert REAL β€”
file_id INTEGER source_files

Source: outputs_bcs/{animal_id}_bcs_*.csv

gouna

Respiration frequency from Gouna sensors.

Column Type FK β†’
animal_id INTEGER animals
timestamp TEXT β€”
respirationfrequency REAL β€”
file_id INTEGER source_files

Source: outputs_gouna/{animal_id}_gouna_*.csv

smaxtec_derived

Derived metrics from the smaXtec rumen bolus. This is the largest table (~824 million rows at full ingestion) with 30+ columns of computed indices covering activity, estrus detection, calving prediction, rumination, rumen pH, temperature, and motility.

Column Type FK β†’
animal_id INTEGER animals
timestamp TEXT β€”
act REAL β€”
act_decrease_index REAL β€”
act_estrus_index REAL β€”
act_estrus_preprocess REAL β€”
act_group_heat_index REAL β€”
act_group_ratio REAL β€”
act_index REAL β€”
act_pasture_index REAL β€”
calving_index REAL β€”
drink_cycles_v2 REAL β€”
heat_index REAL β€”
in_reticulum REAL β€”
mot_period REAL β€”
mot_pulse_width REAL β€”
mot_pulse_width_median REAL β€”
ph REAL β€”
ph_under_58 REAL β€”
rum_classification TEXT β€”
rum_dec_index REAL β€”
rum_index REAL β€”
temp REAL β€”
temp_dec_index REAL β€”
temp_group_ratio_svm_inc_index REAL β€”
temp_height_index REAL β€”
temp_inc_index REAL β€”
temp_limit_crossing REAL β€”
temp_normal_index REAL β€”
temp_svm_inc_index REAL β€”
temp_without_drink_cycles REAL β€”
mot_period_rum_6h_we REAL β€”
mot_period_not_rum_6h_we REAL β€”
file_id INTEGER source_files

Source: outputs_smaxtec_derived/{animal_id}_smaxtec_derived_*.csv

Note: Due to sparse data in early rows, some columns may be typed as TEXT by the auto-inference engine. SQLite's dynamic typing means numeric values are still stored and compared correctly regardless of the declared affinity.

smaxtec_events

Discrete reproductive and health events from smaXtec.

Column Type FK β†’
animal_id INTEGER animals
timestamp TEXT β€”
cow INTEGER β€”
event_type TEXT β€”
value REAL β€”
file_id INTEGER source_files

Source: outputs_smaxtec_events/{animal_id}_events.csv

Note: The cow column contains the animal_id redundantly (the canonical animal_id is extracted from the filename). Both are kept for completeness.

smaxtec_water_intake

Daily water intake estimates derived from rumen temperature drops.

Column Type FK β†’
animal_id INTEGER animals
timestamp TEXT β€”
water_intake_liter REAL β€”
file_id INTEGER source_files

Source: outputs_smaxtec_water_intake/{animal_id}_smaxtec_derived_*.csv

smaxtec_barns

Barn climate measurements from smaXtec barn sensors.

Column Type FK β†’
barn_id INTEGER barns
timestamp TEXT β€”
rawtemp REAL β€”
rawhum REAL β€”
temp REAL β€”
hum REAL β€”
temp_hum_index REAL β€”
file_id INTEGER source_files

Source: outputs_smaxtec_barns/{barn_name}_smaxtec_raw_*.csv

lorawan

LoRaWAN environmental sensor readings (battery and current).

Column Type FK β†’
sensor_id INTEGER sensors
timestamp TEXT β€”
battery_level REAL β€”
current_ampere REAL β€”
file_id INTEGER source_files

Source: outputs_lorawan/{sensor_name}_LoRaWAN_raw_*.csv

dwd_weather

Daily weather summaries from the Deutscher Wetterdienst (DWD).

Column Type FK β†’
dt TEXT β€”
thi_max REAL β€”
qb_thi TEXT β€”
num_values_thi INTEGER β€”
enthalpy_max REAL β€”
qb_enthalpy TEXT β€”
num_values_enthalpy INTEGER β€”
file_id INTEGER source_files

Source: outputs_dwd.csv (standalone)

hobo_weather

Weather station readings from HOBO loggers. Column names include sensor serial numbers from the HOBO export format.

Column Type FK β†’
datetime TEXT β€”
21136553_b_battery_v REAL β€”
21141733_1_temperature REAL β€”
21141733_2_rh REAL β€”
21141733_3_dew_point REAL β€”
21141733_b_battery_level REAL β€”
21141735_1_solar_radiation REAL β€”
21141735_b_battery_level REAL β€”
21141737_1_wetness REAL β€”
21141737_b_battery_level REAL β€”
21141734_1_wind_speed REAL β€”
21141734_2_gust_speed REAL β€”
21141734_3_wind_direction REAL β€”
21141734_b_battery_level REAL β€”
file_id INTEGER source_files

Source: outputs_hobo/hobo_exports_*.csv

Indexes

All indexes are created after bulk insertion to avoid write overhead.

Index Table Columns
idx_bcs_animal_ts bcs animal_id, timestamp
idx_gouna_animal_ts gouna animal_id, timestamp
idx_herdeplus_animal_ts herdeplus animal_id, timestamp
idx_smaxtec_derived_animal_ts smaxtec_derived animal_id, timestamp
idx_smaxtec_events_animal_ts smaxtec_events animal_id, timestamp
idx_smaxtec_water_animal_ts smaxtec_water_intake animal_id, timestamp
idx_alloc_animal allocations animal_id
idx_alloc_enter allocations datetime_enter
idx_disease_animal diseases animal_id
idx_disease_start diseases disease_first_day
idx_lorawan_sensor_ts lorawan sensor_id, timestamp
idx_smaxtec_barns_barn_ts smaxtec_barns barn_id, timestamp
idx_dwd_dt dwd_weather dt
idx_hobo_ts hobo_weather datetime
idx_source_folder source_files folder

The composite (entity_id, timestamp) indexes accelerate the primary query pattern: "give me all measurements for entity X between dates A and B".

Planned additions

  • Lactation view combining herdeplus milking + BCS + disease windows per lactation cycle
  • Materialised summary tables for daily/weekly aggregates (if view performance is insufficient)

Analysis views

The file src/digimuh/create_views.sql defines a three-layer view hierarchy. Views are created automatically when any analysis script connects to the database.

Layer 0 β€” hourly aggregates

View Purpose Grouped by
v_smaxtec_hourly Hourly means of temp, activity, rumination, motility, pH animal Γ— day Γ— hour

Layer 1 β€” daily summaries

These views aggregate each data source into one row per animal (or sensor/barn) per day.

View Source table(s) Key columns
v_smaxtec_daily smaxtec_derived temp (mean/min/max/range), activity, rumination, motility, pH, drinking, estrus/calving indices
v_herdeplus_daily herdeplus total milk yield, mean duration/flow, MLP test-day values (fat, protein, FPR, SCC, urea, lactose, ECM)
v_gouna_daily gouna mean/min/max respiration frequency
v_water_daily smaxtec_water_intake total water intake (litres)
v_bcs_latest bcs BCS value and assessment date
v_barn_daily smaxtec_barns barn temp/humidity/THI (mean and max)

Layer 2 β€” analysis-specific joins

Each analysis view joins the daily summaries needed for a specific research question.

View Joins Purpose
v_analysis_ketosis herdeplus + smaxtec + water + diseases FPR, rumination, milk yield, rumen pH, with disease ground truth. Includes a fpr_flag column (+1 = energy deficit, βˆ’1 = acidosis, 0 = normal) and is_sick binary from disease records.
v_analysis_heat_stress smaxtec + water + gouna + herdeplus + DWD weather Rumen temp, respiration, activity, water intake, milk yield, all aligned with ambient THI and enthalpy.
v_analysis_digestive smaxtec + herdeplus + water Motility/pH daily profiles alongside MLP test-day composition for time-lagged cross-correlation analysis.
v_analysis_circadian smaxtec_hourly + diseases Hourly temp/activity/rumination profiles per animal-day with disease status for circadian rhythm analysis.
v_analysis_motility smaxtec_derived (minimal aggregation) Raw motility series (contraction intervals, pulse widths) with concurrent pH, rumination, and temperature for entropy computation.

Example queries using views

-- Daily milk yield vs. heat load for a specific cow
SELECT day, milk_yield_kg, rumen_temp_mean, dwd_thi_max, water_liter
FROM v_analysis_heat_stress
WHERE animal_id = 276001260919234
ORDER BY day;

-- MLP test days with FPR above ketosis threshold
SELECT animal_id, day, mlp_fpr, rum_index_mean, milk_yield_kg, disease_category
FROM v_analysis_ketosis
WHERE fpr_flag = 1
ORDER BY day;

-- Hourly temperature profile for circadian analysis
SELECT hour, AVG(temp_clean_mean) AS temp, AVG(act_index_mean) AS activity
FROM v_analysis_circadian
WHERE animal_id = 276001260919234
  AND day = '2023-07-15'
GROUP BY hour
ORDER BY hour;