ROADMAP.md (127 additions, 0 deletions)
@@ -70,3 +70,130 @@ Potential later work includes:
Status:

- Deferred until the current modularization and reproducibility work is stable

---

## Publication Plan

This section records the path from the current thesis to a peer-reviewed paper. It reflects an honest read of the three results: the biomarker branch is a clean reproduction (defensible, not novel), the CT branch is confounded by dataset-of-origin (cancer from Pancreas-CT-CB, control from PANCREAS; ResNet50 embeddings cluster by source at k=2), and fusion is exploratory under synthetic pairing. The publishable contribution is therefore the *bias-aware methodology and the shortcut-learning diagnosis*, not a multimodal performance claim.

### Now: Q2 paper, no new data (near-term, achievable from current results)

A reframed methods / reproducibility paper is publishable today in a decent Q2 venue without collecting anything new. The negative result *is* the contribution.

- **Reframe the narrative** away from "multimodal PDAC detection" toward: *"Near-perfect cross-dataset pancreatic CT classification is a domain-confounding artifact: a diagnostic protocol for detecting it, and evidence that pixel-level mitigation is insufficient."*
- **Lead contributions:** (1) a stepwise bias detection + mitigation pipeline (pixel-mean logistic probe AUC 0.866 -> 0.569 after extreme standardization); (2) the demonstration that the shortcut survives in deep features (K-means-by-source at k=2, silhouette peak, cross-cluster non-identifiability with two institutions); (3) independent reproduction of the Debernardi et al. (2020) urinary panel (AUC 0.944 vs 0.936); (4) a negative-control-based fusion-evaluation framework showing modality dominance under synthetic pairing.
- **Reuse existing figures:** dataset bias check, K-selection (elbow/silhouette), Grad-CAM, final model comparison, biomarker calibration; they already support this framing.
- **Target venues (Q2):** *Diagnostics*, *Journal of Imaging*, *BMC Medical Imaging*, *Computers in Biology and Medicine*, or a reproducibility / negative-results venue.
- **Effort:** weeks of rewriting, no new experiments. This is the recommended first submission.

### Next: four steps required to reach Q1

A Q1 venue (npj Digital Medicine, Medical Image Analysis, IEEE TMI, Radiology: AI) requires new substance, because the field already has 2025 tooling for this problem. Do these in order of leverage:

1. **External multi-centre CT validation (THE blocker).** Re-run CT classification on cohorts where cancer and control are *balanced across sources*, so disease is decoupled from dataset-of-origin. Candidate public sources: MSD Task07 Pancreas, NIH Pancreas-CT, and additional TCIA PDAC collections across different scanner vendors/protocols. Without this, no CT performance claim is defensible.
2. **Domain-adversarial training (gradient reversal).** Add a GRL branch to the ResNet50 backbone that penalizes encoding dataset-of-origin. Evaluate rigorously: show whether it removes embedding-level domain clustering, and quantify the effect on the cancer signal. A rigorous negative result here is still publishable.
3. **Benchmark the bias pipeline against existing 2025 methods** (e.g. ShortKit-ML and related shortcut-detection frameworks) rather than presenting it standalone, to position the contribution against the current state of the art.
4. **Genuine paired CT + biomarker cohort for fusion.** Replace synthetic pairing with real or quasi-paired same-patient data (even 50-100 patients) so fusion can be evaluated as a real clinical question. Hardest step; collaboration-dependent. Consider attention/cross-attention fusion once paired data exists.
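
The gradient-reversal idea in step 2 can be pictured on a toy scalar model. This is a minimal sketch, not the thesis code: it hand-rolls the forward and backward rule of a gradient reversal layer (identity forward, gradient multiplied by `-lam` backward) for a one-weight "backbone" and a one-weight domain head. The names `grl_forward`, `grl_backward`, `domain_step`, and `lam` are hypothetical; a real implementation would rely on the autograd machinery of the training framework and a ResNet50 backbone.

```python
def grl_forward(x):
    """Gradient reversal layer, forward pass: identity."""
    return x


def grl_backward(grad_out, lam):
    """Gradient reversal layer, backward pass: flip and scale the gradient."""
    return -lam * grad_out


def domain_step(w, v, x, source, lam=1.0, lr=0.1):
    """One SGD step on the domain loss 0.5 * (v * f - source)**2, with f = w * x.

    The domain head weight v descends the loss as usual. The backbone
    weight w receives the reversed gradient, so it moves in the direction
    that makes the source harder to predict from the feature f.
    """
    f = w * x                                 # backbone feature
    d = v * grl_forward(f)                    # domain (source) prediction
    err = d - source                          # dL/dd
    grad_v = err * f                          # ordinary gradient for the domain head
    grad_f = err * v                          # gradient arriving at the GRL output
    grad_w = grl_backward(grad_f, lam) * x    # reversed before reaching the backbone
    return w - lr * grad_w, v - lr * grad_v


# Toy values: the domain head currently over-predicts the source label.
w_new, v_new = domain_step(w=1.0, v=1.0, x=2.0, source=1.0, lam=1.0, lr=0.1)

# With lam = 0 the backbone ignores the domain loss entirely.
w_frozen, _ = domain_step(w=1.0, v=1.0, x=2.0, source=1.0, lam=0.0, lr=0.1)
```

Here the domain head weight decreases (ordinary descent) while the backbone weight increases, the opposite of what plain descent on the domain loss would do. In practice `lam` is a tunable trade-off between removing source information and keeping the cancer signal, which is exactly what step 2 proposes to quantify.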

### Parallel option: standalone biomarker screening paper

The biomarker branch is the most translatable component (non-invasive, reproducible). A smaller separate paper could extend it with screening-utility analysis (decision-curve analysis, calibration, high-risk subgroup performance) and the original three-class task. Modest novelty, but a clean clinical-utility angle.

Status:

- Q2 reframe: ready to write (recommended first action)
- Q1 four-step programme: 6-12 months, new data required
## Validation Datasets to Source (for the Q1 external-validation step)

The confound to break: in the thesis, cancer came from one source (Pancreas-CT-CB) and control from another (NIH PANCREAS), so *any* feature separating the two datasets also separated the two classes. The fix is validation cohorts where **cancer and control come from the same multi-centre pipeline**, so class is not tangled with institution. Ranked by usefulness:

1. **PANORAMA** (recommended primary). First public PDAC-detection grand challenge and the largest public PDAC CT dataset to date. Portal-venous contrast-enhanced CT, clinical metadata, segmentation masks for six PDAC-related structures, patient-level likelihood labels, **multi-centre with PDAC and non-PDAC from the same pipeline**, public leaderboard for honest benchmarking. This is the dataset that lets us decouple cancer from dataset-of-origin and also provides masks needed for pancreas-ROI localization. (arXiv 2503.10068)
2. **Medical Segmentation Decathlon (MSD) Task07 Pancreas**. 420 contrast-enhanced CTs with pancreatic lesions (PDAC, PNET, IPMN) from MSKCC, with tumour segmentation masks. Good independent single-source test and provides masks for ROI cropping.
3. **TCIA Pancreas-CT / NIH** (use with care). NIH Pancreas-CT is the *healthy/normal* set that formed the confounding control arm in the thesis; do NOT reuse it as controls against a different-source cancer set or the bias reappears. Useful only as a normal-pancreas reference within a same-source design.
4. **Benchmark comparator (cite, don't validate on):** PANDA, Nature Medicine 2023: non-contrast CT, multi-centre validation on 6,239 patients, AUC 0.986–0.996. Sets the performance bar reviewers expect; reinforces that our contribution should be methodology/honesty, not raw performance.
5. **Published external-validation precedent to benchmark against:** 2025 radiomics PDAC study, internal 95% → external 86.5% accuracy on TCIA/MSD. This is the kind of honest generalization-gap result to reproduce and report.
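
Before training on any candidate cohort, the design confound described above can be checked mechanically from the case manifest. The following is a minimal sketch under stated assumptions: the manifest is reduced to two parallel lists (source name and 0/1 label), the counts and the centre names are made up, and `confounded_sources` is a hypothetical helper, not part of the thesis notebook.

```python
from collections import Counter


def source_class_table(sources, labels):
    """Count cases per (source, label) pair."""
    return Counter(zip(sources, labels))


def confounded_sources(sources, labels):
    """Return the sources that contribute only one class.

    If every source is single-class, source and class are perfectly
    entangled and a classifier can score well by recognising the source.
    """
    classes_by_source = {}
    for s, y in zip(sources, labels):
        classes_by_source.setdefault(s, set()).add(y)
    return sorted(s for s, ys in classes_by_source.items() if len(ys) < 2)


# Thesis-style design: one source per class (illustrative counts only).
thesis_sources = ["Pancreas-CT-CB"] * 4 + ["NIH"] * 4
thesis_labels = [1] * 4 + [0] * 4

# Same-pipeline design: each centre contributes both classes.
mixed_sources = ["centre_A", "centre_A", "centre_B", "centre_B"]
mixed_labels = [1, 0, 1, 0]

print(confounded_sources(thesis_sources, thesis_labels))  # ['NIH', 'Pancreas-CT-CB']
print(confounded_sources(mixed_sources, mixed_labels))    # []
```

An empty result is necessary but not sufficient: a source can contain both classes and still be heavily imbalanced, so the full `source_class_table` is worth reporting alongside it.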
- Patient-level stratified splits (Cell 2.0): leakage control is correct.
- Biomarker branch is genuinely solid and under-sold: single clean source, plus calibration, permutation importance, decision-curve and gain/lift analysis (Cells 3.6–3.7). This is the most paper-ready component.
- Fusion done responsibly: multi-seed + label-mismatch negative controls for both decision- and feature-level fusion (Cells 4.1b, 4.2b).

**The central methodological gap (the bias check is partly circular):**

- Mitigation (`extreme_standardize`, Cell 1.10) forces body-pixel **mean=128, std=40**. The bias check (Cell 1.11) then tests whether a logistic model on **pixel mean/std** can separate classes. Because mitigation forces exactly those statistics equal, the probe necessarily drops to ~random (0.569). The detector and the fix target the *same low-order statistic*, so the "bias removed" conclusion is self-fulfilling.
- It does nothing about the higher-order cues a CNN actually exploits: noise/reconstruction-kernel texture, edge/frequency content, field-of-view and body-shape geometry, contrast-phase and slice-thickness signatures. That is exactly why K-means on ResNet embeddings still recovers the source split after standardization.
- Fix: the bias detector must probe the **learned feature space**, not raw pixel moments (e.g. a domain classifier on embeddings, or a dependence measure such as HSIC between representation and source), and mitigation must act in that space.
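
The circularity argument above can be reproduced on toy data. This is a minimal sketch, not the notebook code: `standardize` is a hypothetical stand-in for `extreme_standardize`, 1-D random signals stand in for CT slices, and lag-1 autocorrelation stands in for the texture cues a CNN could exploit. The two simulated "sources" differ both in brightness and in smoothness; all parameters are arbitrary.

```python
import random
import statistics


def standardize(x, target_mean=128.0, target_std=40.0):
    """Force a sample to a fixed mean and std (stand-in for extreme_standardize)."""
    m, s = statistics.fmean(x), statistics.pstdev(x)
    return [target_mean + target_std * (v - m) / s for v in x]


def lag1_autocorr(x):
    """Lag-1 autocorrelation: a simple texture statistic, unchanged by affine rescaling."""
    m = statistics.fmean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x, x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den


def auc(scores, labels):
    """AUROC as the Mann-Whitney statistic; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_signal(rng, smooth, n=200):
    """Source 1 is smooth (AR(1), phi=0.9) and bright; source 0 is white noise and darker."""
    phi, offset = (0.9, 150.0) if smooth else (0.0, 100.0)
    x, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 10.0)
        x.append(offset + prev)
    return x


rng = random.Random(0)
labels = [1] * 20 + [0] * 20                      # label = source of origin
raw = [make_signal(rng, smooth=bool(y)) for y in labels]
std = [standardize(x) for x in raw]

# Probe on the statistic that mitigation equalises (rounded to absorb float noise).
auc_mean_raw = auc([statistics.fmean(x) for x in raw], labels)
auc_mean_std = auc([round(statistics.fmean(x), 6) for x in std], labels)

# Probe on a statistic that mitigation does not touch.
auc_texture_std = auc([lag1_autocorr(x) for x in std], labels)
```

After standardization the mean probe sits at exactly 0.5 by construction, because every sample now has the same mean, while the texture probe is mathematically unaffected by the rescaling and still separates the sources. A probe that only looks at the equalised statistic therefore cannot certify that the source signal is gone.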

**CT modeling gap (receptive field too global):**

- The "cropped" dataset (`ct_cropped`) is a **whole-body** crop via segmentation, not a pancreas ROI. The ResNet50 still sees global body outline, FOV, and tissue-wide noise texture, all of which are scanner/source fingerprints. Rapid convergence to ceiling AUC (Cell 2.7 history) is itself a tell of trivially separable domain signal.

**Upgrade for Q1 (feature-space debiasing, the current 2025/26 standard):**

- Add a **domain-adversarial branch (gradient reversal)** to the ResNet50 to penalize encoding of dataset-of-origin in the representation.
- Alternatives/complements from the 2025/26 literature: feature disentanglement (latent-space splitting), dependence-minimization (HSIC-style), knowledge distillation from a specialist teacher.
- Acceptance criterion, measured with our *own* K-means/silhouette + embedding domain-classifier diagnostic: the source-aligned clustering that pixel standardization could not remove should collapse, while genuine cancer signal is retained (verified on PANORAMA where class ≠ institution).
- Benchmark the pipeline against a 2025 dependence-measure or disentanglement baseline rather than presenting it standalone.
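
The HSIC-style dependence measure named in the list above can be computed directly on embeddings and source labels. This is a minimal sketch, assuming the biased empirical estimator trace(K H L H) / (n - 1)^2 with a Gaussian kernel on embeddings and a delta kernel on the source label. The six 2-D embeddings are made up and the bandwidth `sigma=1.0` is arbitrary; real ResNet50 embeddings would need a bandwidth heuristic and a permutation test for significance.

```python
import math


def rbf_kernel(X, sigma=1.0):
    """Gaussian kernel matrix on embedding vectors."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(xi, xj)) / (2 * sigma ** 2))
             for xj in X] for xi in X]


def delta_kernel(labels):
    """Kernel on a categorical variable: 1 if same source, else 0."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]


def center(K):
    """Double-centre a kernel matrix (computes H K H with H = I - 11^T / n)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic(X, sources, sigma=1.0):
    """Biased empirical HSIC between embeddings X and source labels."""
    n = len(X)
    Kc = center(rbf_kernel(X, sigma))
    L = delta_kernel(sources)
    return sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n)) / (n - 1) ** 2


sources = [0, 0, 0, 1, 1, 1]
# Hypothetical embeddings: in `leaky` the first coordinate tracks the source,
# in `mixed` both sources cover the same region of the space.
leaky = [[0.0, 0.1], [0.1, 0.9], [0.0, 0.5], [2.0, 0.2], [2.1, 0.8], [2.0, 0.4]]
mixed = [[0.0, 0.1], [2.0, 0.9], [1.0, 0.5], [2.0, 0.2], [0.0, 0.8], [1.0, 0.4]]

hsic_leaky = hsic(leaky, sources)
hsic_mixed = hsic(mixed, sources)
```

The source-aligned embeddings give a clearly larger value than the mixed ones. Used as part of the acceptance criterion, the quantity of interest would be the drop in HSIC between representation and source after debiasing, reported next to the K-means/silhouette and domain-classifier diagnostics rather than instead of them.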

**Architectural note (ROI vs whole-image):** moving to a pancreas-ROI model (localize then classify) is good practice and removes the *easiest* global shortcuts, but it is necessary-not-sufficient: scanner/reconstruction texture lives inside the pancreas tissue too, and no receptive field fixes a data-design confound where one source = all cancer and the other = all control. The decisive fix is same-source class balance (PANORAMA) + feature-space debiasing; ROI cropping is a robustness improvement layered on top, and it requires pancreas masks (available in PANORAMA/MSD, absent in the thesis two-source set).


### PANORAMA access details (added)

- **License:** CC BY-NC 4.0 (non-commercial) - fine for thesis/paper and academic validation with citation; NOT usable in a commercial product without separate permission.
- **Contents:** 2,238 anonymized contrast-enhanced CT scans from two Dutch centres (Radboud UMC + UMC Groningen), plus 194 MSD and 80 NIH cases - unified multi-centre labelled cohort where class is NOT tied to a single source.
- **Masks included:** segmentation masks for six PDAC-related structures - supports the pancreas-ROI localization step the thesis two-source data lacked.
- **Baseline:** official implementation at github.com/DIAGNijmegen/PANORAMA_baseline (benchmark comparator).
- **Caveat:** PANORAMA folds in the NIH cases (same family as the old confounding control set). Use the unified labelled cohort as-is; do NOT extract the NIH subset as a standalone control arm or the dataset-of-origin confound returns.

---

## Q1 Experiment Battery (what reviewers will expect)

Beyond the four core steps, these analyses are near-mandatory for a strong medical-AI submission. The first three turn the CT result from "near-perfect" into "honestly characterized"; the rest are standard rigor.

- **Calibration, not just AUC.** Report Expected Calibration Error (ECE), Brier score, and reliability diagrams. Deployment needs calibrated probabilities. (Biomarker branch already has calibration + decision-curve code in notebook cells 3.6-3.7 - reuse it.)
- **Missing-modality robustness.** Evaluate CT-only, biomarker-only, both-present, and degraded inputs (missing CT, missing biomarker, noisy biomarker, low-quality CT). A fusion model is only interesting if it degrades gracefully.
- **Uncertainty / OOD detection.** Add predictive uncertainty (MC-dropout or deep ensembles) and an out-of-distribution flag. This is the distinctive angle: the same OOD machinery that flags an unseen scanner is what would have caught the original domain shift. "The system knows when it doesn't know."
- **Ablation table.** CT-only / biomarker-only / decision-fusion / feature-fusion / proposed, each with AUROC + CI, so fusion's marginal value (or lack of it) is explicit.
- **Subgroup / fairness reporting.** Performance by source/scanner, sex, and age where metadata allows - this is what makes "bias-aware" demonstrated rather than asserted.
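
For the calibration item in the battery above, the two scalar metrics are short enough to state in code. This is a minimal sketch of one common binary variant (equal-width bins on the predicted positive-class probability), not the code in notebook cells 3.6-3.7; the function names, the bin count, and the four toy predictions are assumptions for illustration.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for a binary classifier, binned on the positive-class probability.

    For each equal-width bin, compare the mean predicted probability with
    the observed positive rate, then average the gaps weighted by bin size.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)    # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        pos_rate = sum(y for _, y in b) / len(b)
        ece += len(b) / len(probs) * abs(mean_p - pos_rate)
    return ece


# Toy predictions: always right, but slightly under-confident.
probs = [0.9, 0.9, 0.1, 0.1]
labels = [1, 1, 0, 0]

brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

On this toy input the classifier ranks perfectly (AUROC of 1) yet has a non-zero ECE, which is the point of reporting calibration separately. With small cohorts the ECE is sensitive to the bin count, so the binning scheme should be stated with the result.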

## Statistical Rigor

- Report **95% confidence intervals** on all headline metrics (DeLong for AUROC; bootstrap for the rest). With small n, point estimates alone will be challenged.
- Note the **power limitation** explicitly given cohort sizes; pre-register the analysis plan where possible.
- Keep **leave-one-site-out / external** as the primary generalization metric, never random splits (mirrors the agroforestry repo's discipline).
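
The bootstrap interval mentioned in the first item can be sketched in a few lines. This is a minimal percentile-bootstrap example under stated assumptions: one row per patient (so resampling rows is resampling patients), accuracy as the example metric, and made-up predictions; `bootstrap_ci` and its defaults are hypothetical, and AUROC would use DeLong as the list states.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of cases where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(metric, y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi


# Hypothetical predictions: 40 cases, 32 correct.
y_true = [1] * 20 + [0] * 20
y_pred = [1] * 16 + [0] * 4 + [0] * 16 + [1] * 4

point = accuracy(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(accuracy, y_true, y_pred)
```

With only 40 cases the interval around the 0.8 point estimate is wide, which illustrates why the power limitation should be stated explicitly. If several slices per patient are scored, the resampling unit must be the patient, not the slice, or the interval will be too narrow.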

## Reporting Standards & Checklists (attach at submission)

- **CLAIM** (Checklist for AI in Medical Imaging) for the CT component.
- A **model card** (already drafted in docs/model_card.md - extend it) and a **data statement** (docs/data_and_ethics.md).

## Target Venue Shortlist (consolidated)

- **Q2, now (reframed shortcut-learning methods paper):** Diagnostics; Journal of Imaging; BMC Medical Imaging; Computers in Biology and Medicine; or a reproducibility / negative-results venue.
- **Q1, after the external-validation + feature-space-debiasing work:** npj Digital Medicine; Medical Image Analysis; IEEE Transactions on Medical Imaging; Radiology: Artificial Intelligence.
- **Biomarker-only screening paper (parallel):** a clinical or screening-oriented journal, leaning on the calibration + decision-curve analysis already implemented.

## Open Decisions To Resolve Before Writing

- Which paper goes first - the Q2 methods reframe (fast, low-risk) or hold for the Q1 swing. Recommendation on record: submit the Q2 reframe first; it banks a publication and de-risks the narrative.
- Whether to pursue a real/quasi-paired CT+biomarker cohort for genuine fusion (collaboration-dependent) or keep fusion as an explicitly exploratory section.
- Scope of feature-space debiasing: domain-adversarial only (minimum) vs. adding disentanglement/dependence-minimization baselines (stronger, more work).