Commit 99b544f

Unify pipeline docs, defaults, and reproducibility workflow
1 parent 3dc1833 commit 99b544f

24 files changed

Lines changed: 6755 additions & 561 deletions

.gitignore

Lines changed: 11 additions & 15 deletions
@@ -15,29 +15,25 @@ Thumbs.db
 # Raw/local data
 data/

+# Agent or local workspace metadata
+AGENTS.md
+
 # Generated pipeline artifacts (large / reproducible)
 analysis_pipeline/derivatives/
 analysis_pipeline/features/
 analysis_pipeline/models/
+analysis_pipeline/runs/

-# Large report artifacts
-analysis_pipeline/reports/confusion_pngs/
-analysis_pipeline/reports/run_logs/
-analysis_pipeline/reports/figures/
-analysis_pipeline/reports/report_assets_*/
-
-# Large intermediate report files
-analysis_pipeline/reports/ml_results*.json
-analysis_pipeline/reports/confusion_highlights*.json
-analysis_pipeline/reports/epoch_manifest.tsv
-analysis_pipeline/reports/trial_table_*.tsv
-
-# Runtime files
-analysis_pipeline/reports/*.pid
-analysis_pipeline/reports/*.log
+# Generated report artifacts
+analysis_pipeline/reports/**

 # Keep folder structure if needed
 !analysis_pipeline/derivatives/.gitkeep
 !analysis_pipeline/features/.gitkeep
 !analysis_pipeline/models/.gitkeep
 !analysis_pipeline/reports/.gitkeep
+
+# Local manuscript/transfer packages
+docs/*.docx
+docs/*.zip
+docs/figure_transfer_*/

AGENTS.md

Lines changed: 0 additions & 107 deletions
This file was deleted.

README.md

Lines changed: 66 additions & 108 deletions
@@ -1,24 +1,21 @@
-# Physiological Workload ML Pipeline (Standalone)
+# Physiological Workload ML Pipeline

-This repository is intentionally separated from the BIDS data-descriptor repository.
+Standalone repository for reproducing the multimodal cognitive-workload pipeline used on the BIDS arithmetic dataset. The codebase covers dataset acquisition, trial-table construction, QC, preprocessing, epoching, unimodal feature extraction, fused-table assembly, split-aware machine learning, confusion analysis, and publication-oriented reporting.

-It contains only the code and documentation needed to:
-1. Download a BIDS dataset snapshot.
-2. Build trial tables and extract multimodal physiological features.
-3. Train and evaluate machine-learning models for cognitive workload (task difficulty) resolution.
+## Repository layout

-## Repository Scope
+- `analysis_pipeline/`: executable stage scripts, pipeline configs, and supporting modules.
+- `analysis_pipeline/config/`: checked-in YAML profiles for reproducible runs.
+- `scripts/`: dataset download, end-to-end execution, and report/manuscript helpers.
+- `docs/`: GitHub-facing documentation plus manuscript handoff material.
+- `data/`: local BIDS dataset root (ignored).
+- `analysis_pipeline/runs/`: run-specific outputs (ignored).

-- `analysis_pipeline/`: staged processing scripts (trial table -> QC -> preprocessing -> epoching -> features -> ML).
-- `analysis_pipeline/config/`: YAML configs for full runs and class-variant runs.
-- `scripts/download_bids.py`: automatic BIDS download helper (OpenNeuro CLI or archive URL).
-- `docs/`: explicit methods notes for journal write-up.
+This repository is intentionally separate from any BIDS descriptor or data-publication repository. Track code, configs, and documentation here; keep raw data and generated outputs local.

-This repo should not be used to publish final BIDS data artifacts. Keep those in your separate data-descriptor repository.
+## Quick start

-## Quick Start
-
-### 1) Environment
+### 1. Create the environment

 ```powershell
 python -m venv .venv
@@ -27,149 +24,110 @@ python -m pip install --upgrade pip
 python -m pip install -r requirements.txt
 ```

-### 2) Download BIDS Dataset (automatic)
+`requirements.txt` covers the classic ML and signal-processing stack. Install PyTorch separately if you plan to run deep Stage 6 models such as `lstm1d`, `gru1d`, `cnn1d`, or `transformer`.
+
+### 2. Acquire the BIDS dataset

-Option A: OpenNeuro dataset ID
+OpenNeuro dataset ID:

 ```powershell
 python .\scripts\download_bids.py `
 --dataset-id dsXXXXXX `
 --target .\data\bids_arithmetic
 ```

-Option B: Direct archive URL (`.zip`, `.tar`, `.tar.gz`, `.tgz`)
+Direct archive URL:

 ```powershell
 python .\scripts\download_bids.py `
 --archive-url https://example.org/your_bids_archive.zip `
 --target .\data\bids_arithmetic
 ```

-One-command end-to-end example:
+One-command download plus pipeline execution:

 ```powershell
 .\scripts\run_end_to_end.ps1 -DatasetId dsXXXXXX -ForceDownload
 ```

-### 3) Run Full Pipeline
-
-```powershell
-python .\analysis_pipeline\run_pipeline.py --config .\analysis_pipeline\config\pipeline.yaml
-```
-
-By default this runs Stage 0->6 and writes outputs under:
-- `analysis_pipeline/derivatives/`
-- `analysis_pipeline/features/`
-- `analysis_pipeline/models/`
-- `analysis_pipeline/reports/`
+### 3. Run a checked-in profile

-### 4) Stage-Focused Commands (feature extraction + ML)
-
-If you want to run up to feature extraction first:
+Default fixed-window profile:

 ```powershell
 python .\analysis_pipeline\run_pipeline.py `
---config .\analysis_pipeline\config\pipeline.yaml `
---only stage0 stage1 stage2 stage3 stage4 stage5
+--config .\analysis_pipeline\config\pipeline_unified_classic_nn_baseline_preproc.yaml
 ```

-Then run ML only:
+Alternative overlap profile:

 ```powershell
 python .\analysis_pipeline\run_pipeline.py `
---config .\analysis_pipeline\config\pipeline.yaml `
---only stage6 stage6_confusions
+--config .\analysis_pipeline\config\pipeline_unified_classic_nn_baseline_overlap3s_50pct_preproc.yaml
 ```

-## Linux -> Windows Handoff (Baseline Reports)
-
-`git pull` alone is not enough to reproduce baseline report assets on Windows. Large ML/report artifacts are intentionally ignored in `.gitignore`:
-
-- `analysis_pipeline/reports/ml_results*.json`
-- `analysis_pipeline/reports/confusion_highlights*.json`
-- `analysis_pipeline/reports/confusion_pngs/`
-- `analysis_pipeline/features/` (entire folder)
-
-If the Linux machine already produced the baseline runs, copy these files from Linux into the same paths on Windows.
-
-Required files to rebuild all baseline confusion reports on Windows:
+The checked-in profiles write under `analysis_pipeline/runs/<profile_name>/`. Both set `outputs.clean_start: true`, so rerunning the same profile replaces that profile's run directory only. If you want to preserve an existing run, copy the YAML and change `outputs.root` or set `clean_start: false`.

-- `analysis_pipeline/reports/ml_results_baseline_*_baseline.json` (classic baseline track, 5 files)
-- `analysis_pipeline/reports/ml_results_baseline_*_baseline_advanced_nn.json` (advanced NN baseline track, 5 files)
+### 4. Run stage subsets

-Expected scenarios in those filenames:
-
-- `baseline_all_bins`
-- `baseline_omit_hardest`
-- `baseline_low_high_omit_hardest`
-- `baseline_grouped_4class_omit_hardest`
-- `baseline_omit_easiest`
-
-If you need to rerun baseline ML on Windows (not only regenerate confusion reports), also copy:
-
-- `analysis_pipeline/features/` (all feature tables/manifests referenced by Stage 6)
-- especially `analysis_pipeline/features/split_manifest_tutorial_baseline.json`
-
-Rebuild confusion reports for all available `ml_results*.json` files (includes baseline + advanced if present):
+Run through feature extraction:

 ```powershell
-Get-ChildItem .\analysis_pipeline\reports\ml_results*.json | ForEach-Object {
-$scenario = $_.BaseName -replace '^ml_results_', ''
-python .\analysis_pipeline\stage6_highlight_confusions.py `
---results-json $_.FullName `
---out-json ("analysis_pipeline/reports/confusion_highlights_{0}.json" -f $scenario) `
---out-md ("analysis_pipeline/reports/confusion_highlights_{0}.md" -f $scenario) `
---metric balanced_accuracy_mean `
---top-k-per-protocol 1 `
---include-all `
---out-png-dir analysis_pipeline/reports/confusion_pngs
-}
+python .\analysis_pipeline\run_pipeline.py `
+--config .\analysis_pipeline\config\pipeline_unified_classic_nn_baseline_preproc.yaml `
+--only stage0 stage1 stage2 stage3 stage4 stage5
 ```

-Then rebuild aggregate report tables/plots used in writeups:
+Run Stage 6 only:

 ```powershell
-python .\analysis_pipeline\stage6_build_report_assets.py `
---results-json-glob "analysis_pipeline/reports/ml_results_baseline*.json" `
---confusion-json-glob "analysis_pipeline/reports/confusion_highlights_baseline*.json" `
---out-dir analysis_pipeline/reports/report_assets_baseline_classic_and_advanced
+python .\analysis_pipeline\run_pipeline.py `
+--config .\analysis_pipeline\config\pipeline_unified_classic_nn_baseline_preproc.yaml `
+--only stage6
 ```

-Quick check that baseline inputs arrived:
+When `--only stage6` is used through the orchestrator, `stage6_confusions` is auto-run unless `--no-auto-stage6-confusions` is passed.

-```powershell
-Get-ChildItem .\analysis_pipeline\reports\ml_results_baseline_*_baseline.json | Measure-Object
-Get-ChildItem .\analysis_pipeline\reports\ml_results_baseline_*_baseline_advanced_nn.json | Measure-Object
-```
+## Pipeline summary

-Each command should report `Count = 5`.
+| Stage | Script | Main purpose | Main outputs |
+| --- | --- | --- | --- |
+| 0 | `build_trial_table.py` | Build the canonical trial table from BIDS events. | `<run_root>/reports/trial_table_bids_arithmetic.tsv` |
+| 1 | `stage1_qc_summary.py` | Summarize modality coverage, dropped samples, and participant QC. | `<run_root>/reports/qc_dataset_summary.json`, figures, subject table |
+| 2 | `stage2_preprocess.py` | Clean EEG, ECG, and pupil streams and write derivatives. | `<run_root>/derivatives/cleaned/`, preprocess logs |
+| 3 | `stage3_epoch_trials.py` | Convert trials into fixed or overlapping epochs with drop accounting. | `<run_root>/derivatives/epochs/`, `epoch_manifest.tsv`, `epoch_summary.json` |
+| 4 | `stage4_extract_features.py` | Extract modality-specific engineered features. | `<run_root>/features/features_eeg.tsv`, `features_ecg.tsv`, `features_pupil.tsv` |
+| 5 | `stage5_build_fused_table.py` | Build unimodal and fused ML tables plus split manifests. | `<run_root>/features/features_fused_tutorial_baseline.tsv`, `split_manifest_tutorial_baseline.json` |
+| 6 | `stage6_train_classic_ml.py` | Benchmark classic and optional deep models across datasets, protocols, and class scenarios. | `<run_root>/reports/ml_results_*.json`, `<run_root>/reports/ml_summary_*.md`, `<run_root>/models/` |
+| 6b | `stage6_highlight_confusions.py` | Curate top confusion matrices from Stage 6 results. | `<run_root>/reports/confusion_highlights_*.json`, markdown, PNGs |
+| 6c | `stage6_build_publication_report.py` | Assemble a publication-facing run summary. | `<run_root>/reports/publication_full_report.md`, `.json` |

-## Stage Summary
+Stage 6 can also emit live confusion PNGs during training and EEG PSD/topomap QC figures when EEG is part of the selected dataset list.

-- Stage 0: canonical trial table from BIDS events.
-- Stage 1: QC summary.
-- Stage 2: preprocessing (EEG/ECG/pupil).
-- Stage 3: epoching from trial windows.
-- Stage 4: feature extraction (unimodal).
-- Stage 5: fused ML table + split manifest.
-- Stage 6: split-aware ML benchmarking.
+## Checked-in profiles

-See `docs/pipeline_methods.md` for explicit methodological details, including how fixed 6-second arithmetic windows are converted into epochs and optional sub-windows.
-For exact feature/model implementation details, see `docs/feature_ml_reference.md`.
+| Profile | File | Intended use |
+| --- | --- | --- |
+| Baseline fixed-window run | `analysis_pipeline/config/pipeline_unified_classic_nn_baseline_preproc.yaml` | Canonical reproducible run: fixed 6 s calculation windows, classic plus deep model sweep, publication report enabled. |
+| Overlap-window run | `analysis_pipeline/config/pipeline_unified_classic_nn_baseline_overlap3s_50pct_preproc.yaml` | Same pipeline family with 3 s windows, 1.5 s step size, and overlap enabled for Stage 3. |

-## Notes for Manuscript Framing
+Both profiles benchmark the `baseline_all_bins`, `baseline_omit_easiest`, `baseline_omit_hardest`, `baseline_low_high_omit_hardest`, and `baseline_grouped_4class_omit_hardest` class scenarios.

-This codebase is designed for ML-psychology style reporting of physiological workload resolution:
-- modality-wise benchmarking (EEG, ECG, pupil, fused),
-- protocol-wise benchmarking (LOSO, group_holdout, within_participant),
-- class-resolution comparisons (binary/4/7/8 classes).
+## Reproducibility notes

-For deep models (`lstm1d`, `gru1d`, `cnn1d`, `transformer`), install PyTorch and run:
+- `run_pipeline.py` expands output placeholders such as `{reports_dir}`, `{features_dir}`, and `{models_dir}` from `outputs.root`.
+- Expected outputs are verified after every step. Use `--no-strict-outputs` only when debugging incomplete runs.
+- Stage 1 strict QC carry-forward is propagated automatically into Stages 2 to 5.
+- `--dry-run` prints planned commands and expected outputs without executing them.
+- Stage 6 resolves both Windows and WSL-style dataset paths stored in split manifests.

-```powershell
-python .\analysis_pipeline\run_pipeline.py --config .\analysis_pipeline\config\pipeline_with_deep_models.yaml
-```
+## Documentation map
+
+- `docs/pipeline_reference.md`: explicit stage-by-stage and config-by-config pipeline reference.
+- `docs/reproducibility.md`: artifact policy, rerun guidance, and Linux-to-Windows handoff instructions.
+- `analysis_pipeline/README.md`: package-level map of stage scripts and outputs.
+- local manuscript handoff material can live under `docs/paper_handoff/` without changing the reproducible pipeline entry points.

 ## License

-CC0 1.0 Universal (see `LICENSE`).
+CC0 1.0 Universal (see `LICENSE`).
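
The Reproducibility notes added to README.md in this commit say that `run_pipeline.py` expands placeholders such as `{reports_dir}`, `{features_dir}`, and `{models_dir}` from `outputs.root`. The Python sketch below shows one way such an expansion could work. It is an illustration only, not the repository's implementation: the helper names `resolve_output_dirs` and `expand_placeholders`, the `{derivatives_dir}` key, and the example run root are assumptions, and the subdirectory names are inferred from the `<run_root>/...` paths in the pipeline summary table.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from placeholder name to subdirectory of the run root.
# The subdirectory names mirror the <run_root>/... paths in the README table;
# "derivatives_dir" is an assumed placeholder name, not confirmed by the commit.
SUBDIRS = {
    "reports_dir": "reports",
    "features_dir": "features",
    "models_dir": "models",
    "derivatives_dir": "derivatives",
}


def resolve_output_dirs(root):
    """Return a dict of placeholder name -> directory path under `root`."""
    base = PurePosixPath(root)
    return {key: str(base / sub) for key, sub in SUBDIRS.items()}


def expand_placeholders(template, root):
    """Fill `{reports_dir}`-style placeholders in `template` for one run root."""
    return template.format(**resolve_output_dirs(root))


if __name__ == "__main__":
    # Hypothetical run root following the analysis_pipeline/runs/<profile_name>/ pattern.
    run_root = "analysis_pipeline/runs/example_profile"
    print(expand_placeholders("{reports_dir}/ml_results_baseline_all_bins.json", run_root))
```

Because every path derives from the single run root in this sketch, pointing a copied profile at a different `outputs.root` would relocate all stage outputs together, which is consistent with the README's advice for preserving an existing run.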

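
The checked-in profiles table in the README.md diff contrasts fixed 6 s calculation windows with an overlap profile that uses 3 s windows and a 1.5 s step. The Python sketch below works through that arithmetic. It assumes the overlap profile slides its windows across the same 6 s calculation window and keeps only windows that fit entirely inside it; the function name `window_starts` is hypothetical and this is not the code in `stage3_epoch_trials.py`.

```python
def window_starts(trial_s, window_s, step_s):
    """Start times (in seconds) of every window that fits fully inside the trial."""
    if window_s <= 0 or step_s <= 0:
        raise ValueError("window_s and step_s must be positive")
    starts = []
    k = 0
    # Small tolerance so a window ending exactly at the trial boundary is kept
    # despite floating-point rounding.
    while k * step_s + window_s <= trial_s + 1e-9:
        starts.append(round(k * step_s, 6))
        k += 1
    return starts


if __name__ == "__main__":
    # Baseline-style settings: one window spanning the whole 6 s trial.
    print(window_starts(6.0, 6.0, 6.0))
    # Overlap-style settings: 3 s windows, 1.5 s step.
    print(window_starts(6.0, 3.0, 1.5))
    # Fractional overlap between consecutive windows: 1 - step / window.
    print(1 - 1.5 / 3.0)
```

Under these assumptions the overlap settings yield three epochs per trial, starting at 0 s, 1.5 s, and 3 s, with 50% overlap between neighbours. Overlapping epochs cut from the same trial are not independent samples, which is one reason split-aware evaluation matters for this profile.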