Commit fca7403

Refresh docs and .gitignore; ignore internal agent-context files
1 parent ec1fdbd commit fca7403

3 files changed

Lines changed: 302 additions & 0 deletions

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -73,3 +73,8 @@ figures/Screenshot *.png
7373
# Tooling outputs should remain local
7474
tools/_*.txt
7575
tools/_*.json
76+
77+
# Local agent context (not tracked)
78+
AGENTS.md
79+
CLAUDE.md
80+
docs/preprocessing_audit.md

ROADMAP.md

Lines changed: 127 additions & 0 deletions
@@ -70,3 +70,130 @@ Potential later work includes:
7070
Status:
7171

7272
- Deferred until the current modularization and reproducibility work is stable
73+
74+
---
75+
76+
## Publication Plan
77+
78+
This section records the path from the current thesis to a peer-reviewed paper. It reflects an honest read of the three results: the biomarker branch is a clean reproduction (defensible, not novel), the CT branch is confounded by dataset-of-origin (cancer from Pancreas-CT-CB, control from PANCREAS; ResNet50 embeddings cluster by source at k=2), and fusion is exploratory under synthetic pairing. The publishable contribution is therefore the *bias-aware methodology and the shortcut-learning diagnosis*, not a multimodal performance claim.
79+
80+
### Now: Q2 paper, no new data (near-term, achievable from current results)
81+
82+
A reframed methods / reproducibility paper is publishable today in a decent Q2 venue without collecting anything new. The negative result *is* the contribution.
83+
84+
- **Reframe the narrative** away from "multimodal PDAC detection" toward: *"Near-perfect cross-dataset pancreatic CT classification is a domain-confounding artifact — a diagnostic protocol for detecting it, and evidence that pixel-level mitigation is insufficient."*
85+
- **Lead contributions:** (1) a stepwise bias detection + mitigation pipeline (pixel-mean logistic probe AUC 0.866 -> 0.569 after extreme standardization); (2) the demonstration that the shortcut survives in deep features (K-means-by-source at k=2, silhouette peak, cross-cluster non-identifiability with two institutions); (3) independent reproduction of the Debernardi et al. (2020) urinary panel (AUC 0.944 vs 0.936); (4) a negative-control-based fusion-evaluation framework showing modality dominance under synthetic pairing.
86+
- **Reuse existing figures:** dataset bias check, K-selection (elbow/silhouette), Grad-CAM, final model comparison, biomarker calibration — they already support this framing.
87+
- **Target venues (Q2):** *Diagnostics*, *Journal of Imaging*, *BMC Medical Imaging*, *Computers in Biology and Medicine*, or a reproducibility / negative-results venue.
88+
- **Effort:** weeks of rewriting, no new experiments. This is the recommended first submission.
89+
90+
### Next: four steps required to reach Q1
91+
92+
A Q1 venue (npj Digital Medicine, Medical Image Analysis, IEEE TMI, Radiology: AI) requires new substance, because the field already has 2025 tooling for this problem. Do these in order of leverage:
93+
94+
1. **External multi-centre CT validation (THE blocker).** Re-run CT classification on cohorts where cancer and control are *balanced across sources*, so disease is decoupled from dataset-of-origin. Candidate public sources: MSD Task07 Pancreas, NIH Pancreas-CT, and additional TCIA PDAC collections across different scanner vendors/protocols. Without this, no CT performance claim is defensible.
95+
2. **Domain-adversarial training (gradient reversal).** Add a GRL branch to the ResNet50 backbone that penalizes encoding dataset-of-origin. Evaluate rigorously: show whether it removes embedding-level domain clustering, and quantify the effect on the cancer signal. A rigorous negative result here is still publishable.
96+
3. **Benchmark the bias pipeline against existing 2025 methods** (e.g. ShortKit-ML and related shortcut-detection frameworks) rather than presenting it standalone, to position the contribution against the current state of the art.
97+
4. **Genuine paired CT + biomarker cohort for fusion.** Replace synthetic pairing with real or quasi-paired same-patient data (even 50-100 patients) so fusion can be evaluated as a real clinical question. Hardest step; collaboration-dependent. Consider attention/cross-attention fusion once paired data exists.
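
The gradient-reversal idea in step 2 can be sketched in a few lines. This is a toy illustration only, not the thesis code: the element-wise linear "encoder", the function names, and the default `lam` value are all assumptions, and a real implementation would attach the layer to the ResNet50 features in a deep-learning framework. The layer is the identity on the forward pass and flips the sign of the gradient on the backward pass, so the encoder is pushed to make dataset-of-origin harder to predict, while the domain head itself would still be trained with the ordinary gradient.

```python
import math


def grl_forward(z):
    """Gradient reversal layer, forward pass: identity."""
    return list(z)


def grl_backward(grad, lam):
    """Backward pass: flip the sign of the incoming gradient, scaled by lam."""
    return [-lam * g for g in grad]


def domain_loss_and_grad(w, x, v, source):
    """Cross-entropy of a linear domain head on a toy element-wise encoder.

    z = w * x (hypothetical encoder), logit = v . z, source in {0, 1}.
    Returns the loss and its gradient with respect to the features z.
    """
    z = [wi * xi for wi, xi in zip(w, x)]
    logit = sum(vi * zi for vi, zi in zip(v, grl_forward(z)))
    p = 1.0 / (1.0 + math.exp(-logit))
    loss = -(source * math.log(p) + (1 - source) * math.log(1.0 - p))
    grad_z = [(p - source) * vi for vi in v]
    return loss, grad_z


def encoder_step(w, x, v, source, lr=0.1, lam=1.0):
    """One SGD step on the encoder weights, taken through the reversal layer."""
    _, grad_z = domain_loss_and_grad(w, x, v, source)
    reversed_grad_z = grl_backward(grad_z, lam)
    # Chain rule through z = w * x: dL/dw_i = dL/dz_i * x_i.
    grad_w = [g * xi for g, xi in zip(reversed_grad_z, x)]
    return [wi - lr * gi for wi, gi in zip(w, grad_w)]
```

After one `encoder_step`, the domain loss for the same sample goes up rather than down, which is the intended effect on the encoder. Whether this removes embedding-level source clustering in the real model is exactly what step 2 has to measure.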
98+
99+
### Parallel option: standalone biomarker screening paper
100+
101+
The biomarker branch is the most translatable component (non-invasive, reproducible). A smaller separate paper could extend it with screening-utility analysis (decision-curve analysis, calibration, high-risk subgroup performance) and the original three-class task. Modest novelty, but a clean clinical-utility angle.
102+
103+
Status:
104+
105+
- Q2 reframe: ready to write (recommended first action)
106+
- Q1 four-step programme: 6-12 months, new data required
107+
- Biomarker screening paper: optional parallel track
108+
109+
110+
---
111+
112+
## Validation Datasets to Source (for the Q1 external-validation step)
113+
114+
The confound to break: in the thesis, cancer came from one source (Pancreas-CT-CB) and control from another (NIH PANCREAS), so *any* feature separating the two datasets also separated the two classes. The fix is validation cohorts where **cancer and control come from the same multi-centre pipeline**, so class is not tangled with institution. Ranked by usefulness:
115+
116+
1. **PANORAMA** (recommended primary). The first public PDAC-detection grand challenge and the largest public PDAC CT dataset to date. It provides portal-venous contrast-enhanced CT, clinical metadata, segmentation masks for six PDAC-related structures, patient-level likelihood labels, and a public leaderboard for honest benchmarking, and it is **multi-centre with PDAC and non-PDAC from the same pipeline**. This is the dataset that lets us decouple cancer from dataset-of-origin, and it also provides the masks needed for pancreas-ROI localization. (arXiv 2503.10068)
117+
2. **Medical Segmentation Decathlon (MSD) Task07 Pancreas**. 420 contrast-enhanced CTs with pancreatic lesions (PDAC, PNET, IPMN) from MSKCC, with tumour segmentation masks. A good independent single-source test set that also provides masks for ROI cropping.
118+
3. **TCIA Pancreas-CT / NIH** (use with care). NIH Pancreas-CT is the *healthy/normal* set that formed the confounding control arm in the thesis; do NOT reuse it as controls against a different-source cancer set or the bias reappears. Useful only as a normal-pancreas reference within a same-source design.
119+
4. **Benchmark comparator (cite, don't validate on):** PANDA, Nature Medicine 2023 — non-contrast CT, multi-centre validation on 6,239 patients, AUC 0.986–0.996. Sets the performance bar reviewers expect; reinforces that our contribution should be methodology/honesty, not raw performance.
120+
5. **Published external-validation precedent to benchmark against:** 2025 radiomics PDAC study, internal 95% → external 86.5% accuracy on TCIA/MSD — the kind of honest generalization-gap result to reproduce and report.
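
Before training on any candidate cohort, the confound described above can be checked mechanically. The sketch below is a hypothetical helper (the function name and the `(source, label)` record layout are assumptions, not part of the repository): it flags every source that contributes only one class, which is the pattern that tied class to institution in the thesis data.

```python
from collections import Counter, defaultdict


def single_class_sources(records):
    """Return the sources that contribute scans of only one class.

    `records` is an iterable of (source, label) pairs, one per patient.
    Any source returned here ties class to dataset-of-origin.
    """
    by_source = defaultdict(Counter)
    for source, label in records:
        by_source[source][label] += 1
    return sorted(s for s, counts in by_source.items() if len(counts) < 2)


# Thesis-style design: every source is single-class, so both are flagged.
thesis_like = [("Pancreas-CT-CB", "cancer")] * 3 + [("NIH", "control")] * 3

# Same-pipeline design: each centre supplies both classes, nothing is flagged.
balanced = [
    ("centre_a", "cancer"), ("centre_a", "control"),
    ("centre_b", "cancer"), ("centre_b", "control"),
]
```

An empty result is necessary but not sufficient: class proportions can still differ sharply between sources, so the per-source counts are worth reporting alongside this flag.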
121+
122+
## Notebook Audit Findings (from 01_multimodal_cancer_detection.ipynb)
123+
124+
Concrete strengths and gaps found by reading the actual cells, to guide the rewrite.
125+
126+
**What was done well (keep):**
127+
128+
- Transparent bias *diagnosis*: pixel-mean logistic probe (Cell 1.7), ResNet embedding + K-means/elbow/silhouette domain audit (Cells 1.15–1.23, 1.41–1.45), per-cluster k=2 evaluation, cross-cluster generalization test.
129+
- Patient-level stratified splits (Cell 2.0) — leakage control is correct.
130+
- Biomarker branch is genuinely solid and under-sold: single clean source, plus calibration, permutation importance, decision-curve and gain/lift analysis (Cells 3.6–3.7). This is the most paper-ready component.
131+
- Fusion done responsibly: multi-seed + label-mismatch negative controls for both decision- and feature-level fusion (Cells 4.1b, 4.2b).
132+
133+
**The central methodological gap — the bias check is partly circular:**
134+
135+
- Mitigation (`extreme_standardize`, Cell 1.10) forces body-pixel **mean=128, std=40**. The bias check (Cell 1.11) then tests whether a logistic model on **pixel mean/std** can separate classes. Because mitigation forces exactly those statistics to be equal across sources, the probe necessarily drops to near-chance (AUC 0.569). The detector and the fix target the *same low-order statistic*, so the "bias removed" conclusion is self-fulfilling.
136+
- It does nothing about the higher-order cues a CNN actually exploits: noise/reconstruction-kernel texture, edge/frequency content, field-of-view and body-shape geometry, contrast-phase and slice-thickness signatures. That is exactly why K-means on ResNet embeddings still recovers the source split after standardization.
137+
- Fix: the bias detector must probe the **learned feature space**, not raw pixel moments (e.g. a domain classifier on embeddings, or a dependence measure such as HSIC between representation and source), and mitigation must act in that space.
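
The circularity can be reproduced on toy data. The sketch below is an assumption-laden illustration: `standardize` is a hypothetical stand-in for `extreme_standardize` (the notebook's actual implementation is not shown here), the "scans" are 1-D lists, and lag-1 autocorrelation is used only as a crude proxy for noise or reconstruction texture. After standardization the mean and std are identical by construction, so a mean/std probe has nothing left to see, while the texture statistic is untouched.

```python
import statistics


def standardize(pixels, target_mean=128.0, target_std=40.0):
    """Hypothetical stand-in for `extreme_standardize`: force mean and std."""
    mu = statistics.fmean(pixels)
    sigma = statistics.pstdev(pixels)
    return [(p - mu) / sigma * target_std + target_mean for p in pixels]


def lag1_autocorr(pixels):
    """Lag-1 autocorrelation, a crude proxy for texture (affine-invariant)."""
    mu = statistics.fmean(pixels)
    num = sum((a - mu) * (b - mu) for a, b in zip(pixels, pixels[1:]))
    den = sum((p - mu) ** 2 for p in pixels)
    return num / den


# Two toy "sources" with different texture: a smooth ramp and alternating noise.
smooth = [float(i) for i in range(16)]
rough = [10.0 if i % 2 == 0 else 90.0 for i in range(16)]

smooth_std = standardize(smooth)
rough_std = standardize(rough)

# First-order moments now match, so a probe on them cannot separate sources...
moments = [(statistics.fmean(s), statistics.pstdev(s)) for s in (smooth_std, rough_std)]
# ...but the texture statistic still differs in sign between the two sources.
textures = [lag1_autocorr(s) for s in (smooth_std, rough_std)]
```

Any affine rescaling leaves `lag1_autocorr` unchanged, which is the toy analogue of the higher-order cues surviving pixel standardization.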
138+
139+
**CT modeling gap — receptive field too global:**
140+
141+
- The "cropped" dataset (`ct_cropped`) is a **whole-body** crop via segmentation, not a pancreas ROI. The ResNet50 still sees global body outline, FOV, and tissue-wide noise texture — all scanner/source fingerprints. Rapid convergence to ceiling AUC (Cell 2.7 history) is itself a tell of trivially separable domain signal.
142+
143+
**Upgrade for Q1 (feature-space debiasing — current 2025/26 standard):**
144+
145+
- Add a **domain-adversarial branch (gradient reversal)** to the ResNet50 to penalize encoding of dataset-of-origin in the representation.
146+
- Alternatives/complements from the 2025/26 literature: feature disentanglement (latent-space splitting), dependence-minimization (HSIC-style), knowledge distillation from a specialist teacher.
147+
- Acceptance criterion, measured with our *own* K-means/silhouette + embedding domain-classifier diagnostic: the source-aligned clustering that pixel standardization could not remove should collapse, while genuine cancer signal is retained (verified on PANORAMA where class ≠ institution).
148+
- Benchmark the pipeline against a 2025 dependence-measure or disentanglement baseline rather than presenting it standalone.
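
One way to make the dependence-measure diagnostic concrete is the biased empirical HSIC estimator, `trace(K H L H) / (n - 1)^2`, between embeddings and source labels. The sketch below uses only the standard library so it stays dependency-free; the RBF kernel on embeddings, the delta kernel on source labels, the `gamma` default, and all function names are illustrative assumptions, and a real run would use the ResNet50 embeddings with a vectorized implementation.

```python
import math


def rbf_kernel(rows, gamma=1.0):
    """RBF kernel matrix between embedding vectors."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(r, s)))
             for s in rows] for r in rows]


def delta_kernel(labels):
    """Kernel on categorical source labels: 1 if same source, else 0."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]


def center(K):
    """Double-centre a symmetric kernel matrix (H K H with H = I - 11^T / n)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(embeddings, sources, gamma=1.0):
    """Biased empirical HSIC between embeddings and source labels."""
    n = len(embeddings)
    Kc = center(rbf_kernel(embeddings, gamma))
    L = delta_kernel(sources)
    # trace(K H L H) equals the element-wise sum of (H K H) * L for symmetric L.
    total = sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2


# Toy 1-D embeddings forming two tight clusters.
emb = [[0.0], [0.1], [0.2], [3.0], [3.1], [3.2]]
confounded = hsic(emb, [0, 0, 0, 1, 1, 1])  # clusters align with source
mixed = hsic(emb, [0, 1, 0, 1, 0, 1])       # source spread across clusters
```

A larger value means the representation carries more information about source. For the acceptance criterion above, the quantity of interest is how far this drops after debiasing, reported next to the K-means/silhouette result rather than instead of it.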
149+
150+
**Architectural note (ROI vs whole-image):** moving to a pancreas-ROI model (localize then classify) is good practice and removes the *easiest* global shortcuts, but it is necessary-not-sufficient: scanner/reconstruction texture lives inside the pancreas tissue too, and no receptive field fixes a data-design confound where one source = all cancer and the other = all control. The decisive fix is same-source class balance (PANORAMA) + feature-space debiasing; ROI cropping is a robustness improvement layered on top, and it requires pancreas masks (available in PANORAMA/MSD, absent in the thesis two-source set).
151+
152+
153+
### PANORAMA access details (added)
154+
155+
- **License:** CC BY-NC 4.0 (non-commercial) - fine for thesis/paper and academic validation with citation; NOT usable in a commercial product without separate permission.
156+
- **Download:** Zenodo (v1: zenodo.org/records/11034178 ; v2: zenodo.org/records/13742336), mirrored on TCIA (wiki.cancerimagingarchive.net/display/Public/PANORAMA). Challenge: panorama.grand-challenge.org.
157+
- **Contents:** 2,238 anonymized contrast-enhanced CT scans from two Dutch centres (Radboud UMC + UMC Groningen), plus 194 MSD and 80 NIH cases - unified multi-centre labelled cohort where class is NOT tied to a single source.
158+
- **Masks included:** segmentation masks for six PDAC-related structures - supports the pancreas-ROI localization step the thesis two-source data lacked.
159+
- **Baseline:** official implementation at github.com/DIAGNijmegen/PANORAMA_baseline (benchmark comparator).
160+
- **Caveat:** PANORAMA folds in the NIH cases (same family as the old confounding control set). Use the unified labelled cohort as-is; do NOT extract the NIH subset as a standalone control arm or the dataset-of-origin confound returns.
161+
162+
---
163+
164+
## Q1 Experiment Battery (what reviewers will expect)
165+
166+
Beyond the four core steps, these analyses are near-mandatory for a strong medical-AI submission. The first three turn the CT result from "near-perfect" into "honestly characterized"; the rest are standard rigor.
167+
168+
- **Calibration, not just AUC.** Report Expected Calibration Error (ECE), Brier score, and reliability diagrams. Deployment needs calibrated probabilities. (Biomarker branch already has calibration + decision-curve code in notebook cells 3.6-3.7 - reuse it.)
169+
- **Missing-modality robustness.** Evaluate CT-only, biomarker-only, both-present, and degraded inputs (missing CT, missing biomarker, noisy biomarker, low-quality CT). A fusion model is only interesting if it degrades gracefully.
170+
- **Uncertainty / OOD detection.** Add predictive uncertainty (MC-dropout or deep ensembles) and an out-of-distribution flag. This is the distinctive angle: the same OOD machinery that flags an unseen scanner is what would have caught the original domain shift. "The system knows when it doesn't know."
171+
- **Ablation table.** CT-only / biomarker-only / decision-fusion / feature-fusion / proposed, each with AUROC + CI, so fusion's marginal value (or lack of it) is explicit.
172+
- **Subgroup / fairness reporting.** Performance by source/scanner, sex, and age where metadata allows - this is what makes "bias-aware" demonstrated rather than asserted.
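
For the calibration item, the two scalar metrics are simple enough to sketch directly. This is a minimal illustration and not the code in notebook cells 3.6-3.7: it assumes binary labels, positive-class probabilities, and equal-width bins, with `n_bins=10` as an arbitrary default.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE on the positive-class probability with equal-width bins.

    Each bin contributes |mean predicted probability - observed positive
    rate|, weighted by the fraction of samples that fall in the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_prob - positive_rate)
    return ece
```

With small cohorts the ECE value is sensitive to the number of bins, so the bin count should be reported and the reliability diagram shown alongside it.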
173+
174+
## Statistical Rigor
175+
176+
- Report **95% confidence intervals** on all headline metrics (DeLong for AUROC; bootstrap for the rest). With small n, point estimates alone will be challenged.
177+
- Note the **power limitation** explicitly given cohort sizes; pre-register the analysis plan where possible.
178+
- Keep **leave-one-site-out / external** as the primary generalization metric, never random splits (mirrors the agroforestry repo's discipline).
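
For the metrics where a bootstrap interval is planned, a percentile bootstrap is a reasonable starting point. The sketch below is an assumption, not existing project code: it treats each list entry as one patient (so resampling stays at patient level), uses accuracy as a placeholder metric, and the `n_boot`, `alpha`, and `seed` defaults are illustrative.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `metric`.

    Resamples patients with replacement `n_boot` times and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled metric values.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi
```

Usage would look like `lo, hi = bootstrap_ci(y_true, y_pred, accuracy)`. For rank-based metrics a resample can contain a single class, so a metric passed in here needs to handle or skip that case.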
179+
180+
## Reporting Standards & Checklists (attach at submission)
181+
182+
Q1 clinical-AI venues increasingly require a completed reporting checklist. Target compliance with:
183+
184+
- **TRIPOD-AI** (prediction-model reporting) and/or **STARD-AI** (diagnostic-accuracy studies).
185+
- **CLAIM** (Checklist for AI in Medical Imaging) for the CT component.
186+
- A **model card** (already drafted in docs/model_card.md - extend it) and a **data statement** (docs/data_and_ethics.md).
187+
188+
## Target Venue Shortlist (consolidated)
189+
190+
- **Q2, now (reframed shortcut-learning methods paper):** Diagnostics; Journal of Imaging; BMC Medical Imaging; Computers in Biology and Medicine; or a reproducibility / negative-results venue.
191+
- **Q1, after the external-validation + feature-space-debiasing work:** npj Digital Medicine; Medical Image Analysis; IEEE Transactions on Medical Imaging; Radiology: Artificial Intelligence.
192+
- **Biomarker-only screening paper (parallel):** a clinical or screening-oriented journal, leaning on the calibration + decision-curve analysis already implemented.
193+
194+
## Open Decisions To Resolve Before Writing
195+
196+
- Which paper goes first - the Q2 methods reframe (fast, low-risk) or hold for the Q1 swing. Recommendation on record: submit the Q2 reframe first; it banks a publication and de-risks the narrative.
197+
- Whether to pursue a real/quasi-paired CT+biomarker cohort for genuine fusion (collaboration-dependent) or keep fusion as an explicitly exploratory section.
198+
- Scope of feature-space debiasing: domain-adversarial only (minimum) vs. adding disentanglement/dependence-minimization baselines (stronger, more work).
199+
