Optional burned-in text / PHI detection (phi-text-detector) #1

Open
luckfamousa wants to merge 1 commit into main from feature/burned-in-text-detection

@luckfamousa
Contributor

What

Tag-based de-identification never touches pixel data, so PHI burned into images (name overlays, ultrasound banners, secondary-capture screenshots) survives. This adds an optional detector to flag such images.

New workspace crate phi-text-detector — a PaddleOCR DB text detector (detection only, no OCR) via ONNX Runtime (ort):

  • onnx_detector: PP-OCR preprocessing (aspect-preserving resize to multiples of 32, ImageNet normalization, NCHW layout) plus the ONNX session, held behind a Mutex so the detector is Send + Sync with a &self detect().
  • db_postprocess: pure-Rust pipeline from probability map to bitmap to connected components to scored, expanded, rescaled boxes (heavily unit-tested on synthetic maps).
  • dicom_render: renders the first frame and extracts screening metadata.
  • image_prefilter: margin crops plus a contrast gate.
  • screening: Safe/Review/Unsafe decision from detection + metadata (BurnedInAnnotation, modality, secondary-capture SOP class); fail-closed.
  • phi-screen CLI: screens images, DICOM files, or directories, with JSON or JSONL output.
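
As a rough illustration of the preprocessing step listed for onnx_detector, the sketch below computes an aspect-preserving target size rounded to multiples of 32 and converts interleaved RGB8 pixels into an ImageNet-normalized CHW buffer. It is not taken from the crate: the function names target_size and to_nchw, the max_side parameter, and the RGB channel order are assumptions made for the example, and the resize itself is left out.

```rust
/// ImageNet per-channel mean and standard deviation (for values scaled to 0..1).
pub const MEAN: [f32; 3] = [0.485, 0.456, 0.406];
pub const STD: [f32; 3] = [0.229, 0.224, 0.225];

/// Aspect-preserving target size: shrink so the longest side is at most
/// `max_side` (hypothetical parameter), then round each side to the nearest
/// multiple of 32, with a floor of 32.
pub fn target_size(width: u32, height: u32, max_side: u32) -> (u32, u32) {
    let longest = width.max(height).max(1) as f32;
    let scale = (max_side as f32 / longest).min(1.0);
    let round32 = |v: u32| -> u32 {
        let scaled = (v as f32 * scale).round() as u32;
        (((scaled + 16) / 32) * 32).max(32)
    };
    (round32(width), round32(height))
}

/// Convert interleaved RGB8 (HWC) into normalized planar f32 (CHW).
/// With a batch of one, this is the NCHW tensor body.
pub fn to_nchw(rgb: &[u8], width: usize, height: usize) -> Vec<f32> {
    assert_eq!(rgb.len(), width * height * 3, "expected interleaved RGB8");
    let plane = width * height;
    let mut out = vec![0.0f32; 3 * plane];
    for (i, px) in rgb.chunks_exact(3).enumerate() {
        for c in 0..3 {
            out[c * plane + i] = (px[c] as f32 / 255.0 - MEAN[c]) / STD[c];
        }
    }
    out
}
```

Rounding to a multiple of 32 keeps the input compatible with the detector's repeated downsampling; the planar buffer can then be wrapped in a tensor of shape [1, 3, H, W].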

Optional integration into the de-id CLI (the requested behavior)

Behind a text-detection feature (off by default, so the core library never pulls in ONNX Runtime):

  • --detect-text off (default): no detection
  • --detect-text warn: flags the image on stderr, still writes the de-identified file
  • --detect-text skip: writes no output for flagged images, exits 3
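
The three modes can be read as a small policy table. The sketch below is one way to express it, not the CLI's actual code: the names DetectText, Outcome, and apply_policy are hypothetical, and two details are assumptions the description does not state (that warn exits 0, and that skip also prints a warning).

```rust
use std::str::FromStr;

/// Value of the --detect-text flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DetectText {
    Off,
    Warn,
    Skip,
}

impl FromStr for DetectText {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "off" => Ok(Self::Off),
            "warn" => Ok(Self::Warn),
            "skip" => Ok(Self::Skip),
            other => Err(format!("invalid --detect-text value: {}", other)),
        }
    }
}

/// What the CLI should do for one input file (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct Outcome {
    pub write_output: bool,
    pub warn_on_stderr: bool,
    pub exit_code: i32,
}

/// Map (mode, detector verdict) to an action. In `Off` mode the detector
/// never runs, so `flagged` is ignored.
pub fn apply_policy(mode: DetectText, flagged: bool) -> Outcome {
    match (mode, flagged) {
        // Assumption: warn mode still exits 0.
        (DetectText::Warn, true) => Outcome { write_output: true, warn_on_stderr: true, exit_code: 0 },
        // Stated in the description: no output, exit code 3.
        // Assumption: a warning is printed here as well.
        (DetectText::Skip, true) => Outcome { write_output: false, warn_on_stderr: true, exit_code: 3 },
        _ => Outcome { write_output: true, warn_on_stderr: false, exit_code: 0 },
    }
}
```

For example, "skip".parse::<DetectText>() yields DetectText::Skip, and apply_policy(DetectText::Skip, true) reports exit code 3 with write_output set to false.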

Model

Not committed. scripts/fetch-model.sh downloads and SHA-256-verifies ppocr_det.onnx (source + hash recorded in models/ppocr_det.metadata.json). CI fetches it.

Tests

  • DB postprocessing, screening rules, prefilter — unit tests.
  • Real ONNX detection on committed fixtures (positive_text.png → text; negative_no_text.png → none).
  • De-id CLI policy (skip exits 3 + no output; warn writes + warns; clean image passes) via BurnedInAnnotation.
  • CI runs the workspace + the text-detection integration; verified locally against rustc/clippy 1.95.
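
To give a sense of what a unit test on a synthetic map exercises, here is a minimal sketch of the core of DB postprocessing: threshold a probability map, group pixels into 4-connected components, and keep the components whose mean probability clears a score threshold. It is an illustration under assumptions, not the crate's implementation: ScoredBox and boxes_from_prob_map are hypothetical names, the connectivity choice is assumed, and box expansion and rescaling to the original image are omitted.

```rust
/// Axis-aligned box in map coordinates (bounds inclusive) with the mean
/// probability of the pixels in its component.
#[derive(Debug, Clone, PartialEq)]
pub struct ScoredBox {
    pub x0: usize,
    pub y0: usize,
    pub x1: usize,
    pub y1: usize,
    pub score: f32,
}

/// `prob` is a row-major `width * height` map. Pixels above `bin_thresh`
/// form the bitmap; components scoring below `box_thresh` are dropped.
pub fn boxes_from_prob_map(
    prob: &[f32],
    width: usize,
    height: usize,
    bin_thresh: f32,
    box_thresh: f32,
) -> Vec<ScoredBox> {
    assert_eq!(prob.len(), width * height, "map must be width * height");
    let mut visited = vec![false; prob.len()];
    let mut boxes = Vec::new();
    for start in 0..prob.len() {
        if visited[start] || prob[start] <= bin_thresh {
            continue;
        }
        // Flood-fill one component with an explicit stack.
        visited[start] = true;
        let mut stack = vec![start];
        let (mut x0, mut y0, mut x1, mut y1) = (width, height, 0usize, 0usize);
        let (mut sum, mut count) = (0.0f32, 0u32);
        while let Some(i) = stack.pop() {
            let (x, y) = (i % width, i / width);
            x0 = x0.min(x);
            y0 = y0.min(y);
            x1 = x1.max(x);
            y1 = y1.max(y);
            sum += prob[i];
            count += 1;
            let neighbours = [
                if x > 0 { Some(i - 1) } else { None },
                if x + 1 < width { Some(i + 1) } else { None },
                if y > 0 { Some(i - width) } else { None },
                if y + 1 < height { Some(i + width) } else { None },
            ];
            for &j in neighbours.iter().flatten() {
                if !visited[j] && prob[j] > bin_thresh {
                    visited[j] = true;
                    stack.push(j);
                }
            }
        }
        let score = sum / count as f32;
        if score >= box_thresh {
            boxes.push(ScoredBox { x0, y0, x1, y1, score });
        }
    }
    boxes
}
```

A synthetic 5x3 map with a strong two-pixel blob in the top-left corner and a weak two-pixel blob on the right edge should produce exactly one box when the thresholds are 0.3 and 0.6 (illustrative values), because the weak blob passes binarization but fails the score check.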

Notes

  • This first increment renders the first frame and screens the full image (margin-first crops are available in image_prefilter but not yet wired as the default scan strategy).
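
For readers unfamiliar with the idea, a margin-first strategy would scan the image borders, where overlays and banners usually sit, before the full frame. The sketch below shows one possible shape for such a prefilter; it does not reflect image_prefilter's real API. Rect, margin_crops, passes_contrast_gate, the fraction parameter, and the min/max-range definition of contrast are all assumptions made for illustration.

```rust
/// Crop rectangle in pixel coordinates (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Rect {
    pub x: u32,
    pub y: u32,
    pub w: u32,
    pub h: u32,
}

/// Top, bottom, left, and right strips, each covering `fraction` of the
/// corresponding image dimension (clamped to at most half the image).
pub fn margin_crops(width: u32, height: u32, fraction: f32) -> [Rect; 4] {
    let f = fraction.clamp(0.0, 0.5);
    let mh = ((height as f32 * f).round() as u32).max(1).min(height);
    let mw = ((width as f32 * f).round() as u32).max(1).min(width);
    [
        Rect { x: 0, y: 0, w: width, h: mh },
        Rect { x: 0, y: height - mh, w: width, h: mh },
        Rect { x: 0, y: 0, w: mw, h: height },
        Rect { x: width - mw, y: 0, w: mw, h: height },
    ]
}

/// Contrast gate: a crop whose grayscale values span less than `min_range`
/// is treated as too flat to contain legible text and can be skipped.
pub fn passes_contrast_gate(gray: &[u8], min_range: u8) -> bool {
    let (lo, hi) = gray
        .iter()
        .fold((u8::MAX, u8::MIN), |(lo, hi), &p| (lo.min(p), hi.max(p)));
    // Check emptiness first so `hi - lo` is never evaluated on the fold's
    // initial values.
    !gray.is_empty() && hi - lo >= min_range
}
```

With a 100x50 image and a fraction of 0.2, margin_crops returns two horizontal strips 10 pixels tall and two vertical strips 20 pixels wide; each strip would be checked against the contrast gate before being passed to the detector.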

🤖 Generated with Claude Code

Tag-based de-identification leaves pixel data untouched, so PHI burned
into images survives. Add a new workspace crate, phi-text-detector, that
flags such images using a PaddleOCR DB text detector (detection only)
run via ONNX Runtime:

- onnx_detector: PP-OCR preprocessing (resize to mult-of-32, ImageNet
  normalize, NCHW) + ONNX Runtime session (session behind a Mutex so the
  detector is Send+Sync with a &self detect()).
- db_postprocess: pure-Rust DB map -> bitmap -> connected components ->
  scored, expanded, rescaled boxes (heavily unit-tested).
- dicom_render: render first frame + extract screening metadata.
- image_prefilter: margin crops + a contrast gate.
- screening: Safe/Review/Unsafe decision from detection + metadata
  (BurnedInAnnotation, modality, secondary capture); fail-closed.
- phi-screen CLI: screen images/DICOM/dirs, JSON / JSONL output.

Integrate optionally into the de-id CLI behind a `text-detection`
feature: --detect-text off|warn|skip (warn flags but still writes; skip
suppresses output and exits 3). Default off, so the core library never
pulls in ONNX Runtime.

The model is fetched + SHA-256-verified by scripts/fetch-model.sh (not
committed); CI fetches it and runs the workspace + feature tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>