Skip to content

Latest commit

 

History

History
168 lines (113 loc) · 7.92 KB

File metadata and controls

168 lines (113 loc) · 7.92 KB

Pipeline Decisions

This file records the concrete QC parameters, harmonisation rules, and per-wave deviations applied during preprocessing. It pairs with README.md (pipeline overview) and feature_alignment.md (cross-wave schema alignment).


1. ESM-level quality control

Applied to EsmResponse.csv before feature extraction:

  • Expired prompts removed. ESM prompts were delivered during waking hours (10:00–22:00) and expired after 10 minutes; unanswered prompts are dropped.
  • Late submissions removed. Any response submitted more than 10 minutes after the trigger time is dropped (same rule as expiry).
  • Prompt schedule. D1 used ~15 prompts/day. D2 and D3 used a reduced per-day rate with a participant-facing progress counter and improved monitoring, resulting in tighter distributions of per-participant response counts.

2. Participant-level quality control

A participant is retained only if all of the following coverage thresholds are met across the wave:

  • Wearable coverage ≥ 50% of eligible waking hours.
  • Smartphone coverage ≥ 50% of eligible waking hours.
  • ESM response rate ≥ 30% of delivered prompts.

Per-wave outcomes:

Wave Recruited Retained Excluded Exclusion rate
D1 102 92 10 9.8%
D2 112 99 13 11.6%
D3 114 106 8 7.0%
Total 328 297 31

Excluded participants are not present in the released UserInfo.csv, EsmResponse.csv, or any *.pkl.


3. Sensor QC

  • Polar H10 ECG is available only for participants during their assigned sub-period in D1 and D2. ECG-derived features therefore have missing rows outside those sub-periods. D3 does not include ECG at all.
  • Keystroke events (keyevent_*) were added from D2 onward. D1 has no keystroke features.
  • Fitness-derived features (FDI, FST, FCL, FAC) are produced only for D2 and D3; D1 has no fitness block.
  • Wi-Fi similarity (WIFI_COS|JAC|EUC|MAN) is present in D1 and D2; skipped in D3 preprocessing.
  • Bluetooth feature set differs by wave:
    • D1 → BT_RSSI bucketed signal-strength features (113 columns).
    • D2 / D3 → BT_classType and BT_DeviceType categorical encodings.

These differences are documented empirically in feature_alignment.md and handled automatically by the Setting C loader via common-feature intersection (see ../benchmark/setting_c/README.md).


4. Label harmonisation

4.1 Stress and Disturbance scale

The ESM scale differs between waves:

Wave Stress Disturbance
D1 −3 to +3 −3 to +3
D2 −3 to +3 −3 to +3
D3 0 to +6 0 to +6

For cross-wave tasks (Setting C), D3 Stress and Disturbance are shifted by −3 so that all waves share the [−3, +3] range. The binary label in pkl[1] is already computed on the harmonised scale; the raw-value array pkl[3] preserves each wave's original scale and must be shifted manually if used.

The EDA helper normalize_label_series in ../EDA/utils.py applies the same shift automatically:

def normalize_label_series(series):
    # shift [0, 6] to [-3, +3]
    smin, smax = series.min(), series.max()
    if smin >= 0 and smax > 3:
        return series - 3
    return series

4.2 Binary labelling convention

Each raw ESM label is binarised (pkl[1]) for classification:

  • Shared labels (Valence, Arousal, Stress, Disturbance): y = 1 when the harmonised raw value is > 0 (HIGH), else 0 (LOW).
  • D1 Attention / Mental / Duration / Change, D2 Attention / Mental / Duration / ValenceChange / ArousalChange: same HIGH / LOW threshold as shared labels.
  • D3 PANAS-style affect words (0 to +6): y = 1 when raw value > 3 (i.e., "moderate or above"); else 0.

5. Feature extraction time windows

The following windows are emitted per sensor (where applicable):

  • Current: value at the ESM trigger timestamp.
  • Immediate past: 15-minute and 30-minute windows ending at the trigger.
  • Yesterday epochs: dawn / morning / afternoon / late afternoon / evening / night of the previous calendar day.
  • Today-so-far epochs: same six epochs of the current day up to the trigger.

See ../data/schema.md § "Time Window Suffixes" for the exact suffix naming.


6. Categorical-value alias map (Setting C harmonisation)

Resolved during Setting C feature alignment so that one-hot categorical columns line up across waves. Applied to the source wave so that it emits the target-wave's vocabulary; documented for each sensor family.

6.1 CALL_CNT (call contact type)

Wave Categories
D1 HOME, MAIN, MOBILE, OTHER, PAGER, VOICE, WORK, + Korean labels 기타, 휴대전화, 휴대폰
D2 HOME, MOBILE, OTHER, UNDEFINED, UNKNOWN, WORK (no MAIN; no Korean)
D3 HOME, MAIN, MOBILE, OTHER, UNDEFINED, UNKNOWN, WORK

Alias map (all → canonical):

  • 기타OTHER
  • 휴대전화, 휴대폰MOBILE
  • PAGER, VOICEOTHER (legacy; drop or alias)

6.2 BAT_PLG# (battery plug state)

  • D1 emits UNKNOWN; D2 / D3 emit UNDEFINED for the same semantic category.
  • Alias: UNKNOWNUNDEFINED.

6.3 WLS# (wireless state)

  • D1: BT_OFF, BT_ON only (2 categories).
  • D2 / D3: 8 categories including transition states (BT_TURNING_OFF, BT_TURNING_ON, WIFI_DISABLED, WIFI_DISABLING, WIFI_ENABLED, WIFI_ENABLING).
  • Action: drop the extra D2/D3-only categories for D1-inclusive Setting C transfers; keep them for D2↔D3 transfers.

6.4 Notification_CAT# (notification category)

  • D1: 16 categories (no NAVIGATION).
  • D2 / D3: 17 categories (NAVIGATION added).
  • Action: drop NAVIGATION when D1 is in the mix; keep for D2↔D3 transfers.

6.5 Location clusters (LOC_CLS#)

  • Cluster IDs are cohort-specific (per-participant k-means / DBSCAN hashes). They do not align across waves and are not harmonisable via aliasing.
  • Action: keep within-wave; drop from Setting C common-feature intersection.

6.6 Feature-group presence matrix (summary)

Feature group D1 D2 D3 Setting C treatment
keyevent_* Drop for D1-inclusive transfers
`WIFI_COS JAC EUC MAN`
`FDI FST FCL FAC`
BT_RSSI buckets Drop for cross-wave transfers; D1-only
BT_classType, BT_DeviceType Drop for D1-inclusive transfers
PIF# participant info Retained but not the focus of this benchmark; removed in some analyses (see paper FeatureAlign report: 7,961 / 10,102 / 10,550 cols after PIF# removal)

The Setting C loader intersects the remaining columns and drops any column not present in every source/target wave.


7. Reproducibility notes

  • All benchmarks use a fixed global seed (documented in ../benchmark/README.md).
  • Fit statistics for preprocessing transformers (means, scales, imputation values) are fit on the training split only and applied to validation/test unchanged.
  • No test-split information leaks into feature selection, normalisation, or model selection.
  • Optuna runs 30 trials per model × task × setting, selected on validation AUROC.

8. Known limitations

  • D1 / D2 Polar H10 coverage is not complete within those waves (sub-period rotation).
  • LOC_CLS# cluster IDs are not aligned across waves; cross-wave location analyses must re-cluster.
  • BAT_LEV and similar numeric features are device-dependent; some Android manufacturers report at different granularities.
  • The binary labelling threshold is a design choice; researchers preferring regression should use the raw values in pkl[3] and re-derive their own targets.