This repository contains the code for the thesis:
"A Vision-Language Approach to Multimodal Fact-Checking in Healthcare"
Giuseppe Mastellone — University of Naples Federico II, 2026
The work introduces a three-head classifier for medical fact-checking on the MC-MMH dataset, a novel benchmark that combines MultiCare and MM-Health through a data generation and validation pipeline.
├── dataset_generation/
│ ├── Swapper.py # Stochastic greedy image swap algorithm
│ ├── swap_quality_check.py # Visual quality check for swapped pairs
│ ├── MM_Health_Claim_Generation.ipynb # Claim generation — MM-Health
│ ├── MultiCare_Claim_Generation.ipynb # Claim generation — MultiCare
│ ├── Claim_Validation.ipynb # NLI-based claim validation pipeline
│ └── NEW_DATA_UNIFICATION.py # Merges MC + MMHL into train/val/test splits
│
├── training/
│ ├── embedding-extractor.ipynb # Token-level embedding extraction
│ ├── Multi_Task_and_Modal_classifier.ipynb # Classifier — final configuration
│ ├── Fakeddit-Classifier.ipynb # Training on Fakeddit
│ └── vlm-zero-shot-evaluation.ipynb # VLM zero-shot baseline evaluation
│
└── README.md
MC-MMH (MultiCare–MM-Health) is a multimodal medical fact-checking benchmark with four classes:
| Class | Description |
|---|---|
| TRUE AND CONCORDANT | True claim, consistent image |
| FALSE AND CONCORDANT | False claim, internally consistent |
| FALSE SWAPPED | Claim from a different clinical context (image swapped) |
| FALSE TEXT | Correct image, false textual report |
Statistics: ~3,200 samples — Train 2,226 / Val 478 / Test 477 — 50% TRUE / 50% FALSE.
Swapper.py generates FALSE_SWAPPED samples by pairing MultiCare images with semantically
distant counterparts via a stochastic greedy algorithm based on BiomedCLIP
cosine similarity (τ_min=0.10, τ_max=0.50, top-k=20).
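The pairing step described above can be sketched in plain Python. This is a minimal illustration, not the code in Swapper.py: the function name `stochastic_greedy_swap`, the random visiting order, the rule that each donor is used at most once, and the choice to rank in-band candidates from most to least similar before sampling among the top-k are all assumptions. Embeddings are plain lists of floats here; in the real pipeline they would come from BiomedCLIP.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def stochastic_greedy_swap(embeddings, tau_min=0.10, tau_max=0.50, top_k=20, seed=42):
    """Return {recipient_index: donor_index} for every sample that found a partner.

    Assumed procedure (hypothetical, for illustration only):
    visit samples in random order, keep unused candidates whose similarity
    lies in [tau_min, tau_max], and draw one at random from the top_k of them.
    """
    rng = random.Random(seed)
    order = list(range(len(embeddings)))
    rng.shuffle(order)   # random visiting order
    used = set()         # donors already assigned; each is used at most once
    swaps = {}
    for i in order:
        in_band = []
        for j in range(len(embeddings)):
            if j == i or j in used:
                continue
            sim = cosine(embeddings[i], embeddings[j])
            if tau_min <= sim <= tau_max:
                in_band.append((sim, j))
        if not in_band:
            continue     # no valid partner: the sample is left unswapped
        in_band.sort(reverse=True)               # most similar in-band first
        _, donor = rng.choice(in_band[:top_k])   # random pick among the top_k
        swaps[i] = donor
        used.add(donor)
    return swaps
```

The similarity band keeps swapped pairs from being either near-duplicates (above τ_max) or trivially unrelated (below τ_min); how the actual script orders and samples candidates should be checked against Swapper.py.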
# Visually inspect swap quality after running Swapper.py
python swap_quality_check.py \
--json /path/to/dataset.json \
--report /path/to/swap_report.json \
--images /path/to/images/ \
--n 20 --seed 42

MultiCare_Claim_Generation.ipynb generates claims for TRUE AND CONCORDANT, FALSE TEXT, and FALSE SWAPPED
using MedGemma-4B with class-specific prompts.

MM_Health_Claim_Generation.ipynb generates claims for TRUE AND CONCORDANT and FALSE AND CONCORDANT
from MM-Health articles using MedGemma-4B.
Claim_Validation.ipynb filters generated claims via Retrieve-and-Classify NLI:
- Embedding model: pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb
- NLI model: lighteternal/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-finetuned-mnli
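A retrieve-and-classify filter of this kind can be sketched as below. This is only a rough outline under stated assumptions, not the notebook's code: the helper names `retrieve` and `validate_claim`, the `top_n` parameter, and the acceptance rule (keep a claim if any retrieved premise yields the expected NLI label) are hypothetical. The embedding and NLI models are passed in as plain callables so the sketch does not depend on a specific library API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(claim_vec, sentence_vecs, top_n=3):
    """Indices of the top_n source sentences most similar to the claim."""
    ranked = sorted(
        range(len(sentence_vecs)),
        key=lambda i: cosine(claim_vec, sentence_vecs[i]),
        reverse=True,
    )
    return ranked[:top_n]


def validate_claim(claim, expected_label, sentences, embed, nli, top_n=3):
    """Keep a claim only if some retrieved premise yields the expected NLI label.

    embed: callable mapping a list of strings to a list of vectors (assumed).
    nli:   callable mapping (premise, hypothesis) to a label string (assumed).
    """
    vecs = embed(list(sentences) + [claim])
    claim_vec, sentence_vecs = vecs[-1], vecs[:-1]
    for i in retrieve(claim_vec, sentence_vecs, top_n):
        if nli(sentences[i], claim) == expected_label:
            return True
    return False
```

In practice `embed` would wrap the embedding model listed above and `nli` the NLI model; the label strings returned by the NLI model depend on its configuration and should be checked before use.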
NEW_DATA_UNIFICATION.py merges the validated MultiCare and MM-Health claims into stratified train/val/test splits (70%/15%/15%).
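For reference, a stratified 70/15/15 split can be written in a few lines of standard-library Python. This is a sketch of the general technique, not the script's implementation: the function name `stratified_split`, the seed, and the per-class rounding are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                       # shuffle within each class
        n_train = round(ratios[0] * len(members))
        n_val = round(ratios[1] * len(members))
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]          # remainder goes to test
    return train, val, test
```

Splitting within each class keeps the class distribution of every split close to that of the full dataset. scikit-learn's `train_test_split` with the `stratify` argument, applied twice, is a common alternative.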
The three-head classifier (Multi_Task_and_Modal_classifier.ipynb) is configured as follows:
- Visual encoder: MedSigLIP-448 (T_v=1024, D_v=1152, no CLS token)
- Textual encoder: MedEmbed-base (T_t=256, D_t=768)
- Fusion: CrossAttentionFusion (per head, 4 attention heads)
- IB gate: per-head per-modality soft mask, sigmoid clamped to [0.05, 0.95]
- Heads: FACT (TRUE/FALSE), ALIGN (CONCORDANT/DISCORDANT), TYPE (NONE/TXT_ERR/IMG_SWAP)
- Loss: Focal + KL-IB (β=0.05) + InfoNCE (γ=0.10) + Border penalty (δ=0.1)
- Decision routing: TYPE head override at threshold θ*=0.75 (val-calibrated)
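Two items in the list above lend themselves to a short numeric sketch: the clamped sigmoid gate and the weighted loss. The code below is an illustration under assumptions, not the notebook's implementation: the function names are hypothetical, the loss terms are assumed to combine as a weighted sum with the listed coefficients, and the focal focusing parameter of 2.0 is a commonly used default that the list does not state. The KL-IB, InfoNCE, and border terms are taken as already-computed scalars because their exact definitions are not given here.

```python
import math


def ib_gate(logit, low=0.05, high=0.95):
    """Sigmoid soft mask clamped to [low, high], keeping it away from 0 and 1."""
    if logit >= 0:
        s = 1.0 / (1.0 + math.exp(-logit))
    else:                       # numerically stable branch for negative logits
        e = math.exp(logit)
        s = e / (1.0 + e)
    return min(high, max(low, s))


def focal_loss(p_true, focusing=2.0):
    """Focal loss for one sample, given the probability of the correct class.

    focusing=2.0 is an assumed default, not a value taken from this README.
    """
    return -((1.0 - p_true) ** focusing) * math.log(p_true)


def total_loss(focal, kl_ib, info_nce, border, beta=0.05, gamma=0.10, delta=0.1):
    """Assumed weighted sum: Focal + beta*KL-IB + gamma*InfoNCE + delta*Border."""
    return focal + beta * kl_ib + gamma * info_nce + delta * border
```

With these weights, the auxiliary terms act as light regularizers next to the focal classification loss.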
Results for the proposed classifier:

| Metric | Value |
|---|---|
| F1_fact | 0.709 |
| F1_align | 0.725 |
| F1_type | 0.672 |
| Composite C (val) | 1.021 |
| ECE (test) | 0.095 |
| R@1 Image→Text | 0.828 |
| Median Rank | 1.0 |
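The calibration and retrieval metrics in the table above follow standard definitions, which can be sketched as below. The helper names, the choice of 10 equal-width bins for ECE, and the convention that the matching text for image i sits at index i of the similarity matrix are assumptions; the evaluation code in the notebooks may differ.

```python
import statistics


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[b].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def retrieval_ranks(similarity):
    """1-based rank of the matching text for each image (match assumed at j == i)."""
    return [1 + sum(1 for s in row if s > row[i]) for i, row in enumerate(similarity)]


def recall_at_1(ranks):
    """Fraction of queries whose matching item is ranked first."""
    return sum(r == 1 for r in ranks) / len(ranks)


def median_rank(ranks):
    """Median of the 1-based ranks."""
    return statistics.median(ranks)
```

A median rank of 1.0 means that for at least half of the images the matching text is the top-ranked candidate.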
Comparison with baselines and fusion variants:

| Method | F1_fact |
|---|---|
| Majority class | 0.334 |
| BiomedCLIP zero-shot | 0.486 |
| Unimodal image-only | 0.627 |
| Unimodal text-only | 0.673 |
| Late fusion | 0.692 |
| Proposed | 0.709 |
| Early fusion | 0.740 |
Cross-domain transfer between Fakeddit and MC-MMH:

| Direction | F1_fact | Transfer gap |
|---|---|---|
| Fakeddit → MC-MMH | 0.514 | −0.196 |
| MC-MMH → Fakeddit | 0.304 | −0.405 |
All notebooks support Google Colab and Kaggle environments.
Update the path configuration cell (marked with # ← update) before running.
pip install torch transformers open_clip_torch sentence-transformers \
scikit-learn numpy pandas tqdm pillow huggingface_hub \
accelerate bitsandbytes datasets

A Hugging Face token with access to gated models is required for MedGemma-4B and MedSigLIP-448:
from huggingface_hub import login
login(token="YOUR_HF_TOKEN")