Optional burned-in text / PHI detection (phi-text-detector) #1
Open
luckfamousa wants to merge 1 commit into
Tag-based de-identification leaves pixel data untouched, so PHI burned into images survives. Add a new workspace crate, phi-text-detector, that flags such images using a PaddleOCR DB text detector (detection only) run via ONNX Runtime:

- onnx_detector: PP-OCR preprocessing (resize to a multiple of 32, ImageNet normalization, NCHW) + ONNX Runtime session (the session sits behind a Mutex so the detector is Send + Sync with a &self detect()).
- db_postprocess: pure-Rust DB map -> bitmap -> connected components -> scored, expanded, rescaled boxes (heavily unit-tested).
- dicom_render: render first frame + extract screening metadata.
- image_prefilter: margin crops + a contrast gate.
- screening: Safe/Review/Unsafe decision from detection + metadata (BurnedInAnnotation, modality, secondary capture); fail-closed.
- phi-screen CLI: screen images/DICOM/dirs, JSON / JSONL output.

Integrate optionally into the de-id CLI behind a `text-detection` feature: --detect-text off|warn|skip (warn flags but still writes; skip suppresses output and exits 3). Default off, so the core library never pulls in ONNX Runtime.

The model is fetched + SHA-256-verified by scripts/fetch-model.sh (not committed); CI fetches it and runs the workspace + feature tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
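The `onnx_detector` preprocessing named above (resize to a multiple of 32, ImageNet normalization, NCHW) can be illustrated with a minimal sketch. This is not the crate's code: the function names `round_to_32`, `target_size`, and `to_nchw` are hypothetical, the `max_side` limit is an assumed parameter, and the actual pixel resampling is left out. The mean and standard deviation are the usual ImageNet channel statistics. Snapping each side to a multiple of 32 keeps the aspect ratio only approximately.

```rust
/// ImageNet per-channel mean and standard deviation (RGB order, 0..1 scale).
const MEAN: [f32; 3] = [0.485, 0.456, 0.406];
const STD: [f32; 3] = [0.229, 0.224, 0.225];

/// Round a side length to the nearest multiple of 32, never below 32.
fn round_to_32(v: u32) -> u32 {
    (((v + 16) / 32) * 32).max(32)
}

/// Hypothetical sizing rule: shrink so the longer side is at most `max_side`
/// (assumed parameter), then snap both sides to multiples of 32.
fn target_size(w: u32, h: u32, max_side: u32) -> (u32, u32) {
    let longest = w.max(h) as f32;
    let scale = if longest > max_side as f32 {
        max_side as f32 / longest
    } else {
        1.0
    };
    let tw = round_to_32((w as f32 * scale).round() as u32);
    let th = round_to_32((h as f32 * scale).round() as u32);
    (tw, th)
}

/// Convert interleaved RGB8 pixels (HWC) into normalized planar f32 (CHW).
/// With a batch of one, this buffer is the NCHW tensor data.
fn to_nchw(rgb: &[u8], w: usize, h: usize) -> Vec<f32> {
    assert_eq!(rgb.len(), w * h * 3);
    let mut out = vec![0.0f32; 3 * w * h];
    for y in 0..h {
        for x in 0..w {
            for c in 0..3 {
                let v = rgb[(y * w + x) * 3 + c] as f32 / 255.0;
                out[c * w * h + y * w + x] = (v - MEAN[c]) / STD[c];
            }
        }
    }
    out
}
```

Under these assumptions, `target_size(1000, 500, 960)` yields `(960, 480)`, and the output of `to_nchw` holds the full R plane first, then G, then B.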
What
Tag-based de-identification never touches pixel data, so PHI burned into images (name overlays, ultrasound banners, secondary-capture screenshots) survives. This adds an optional detector to flag such images.
New workspace crate
`phi-text-detector`: a PaddleOCR DB text detector (detection only, no OCR) via ONNX Runtime (`ort`):
- `onnx_detector`: PP-OCR preprocessing (aspect-preserving resize to multiples of 32, ImageNet normalization, NCHW) + ONNX session (behind a `Mutex` so the detector is `Send + Sync` with a `&self` `detect()`).
- `db_postprocess`: pure-Rust probability-map → bitmap → connected components → scored/expanded/rescaled boxes (heavily unit-tested on synthetic maps).
- `dicom_render`: render first frame + extract screening metadata.
- `image_prefilter`: margin crops + a contrast gate.
- `screening`: `Safe`/`Review`/`Unsafe` from detection + metadata (`BurnedInAnnotation`, modality, secondary-capture SOP class); fail-closed.
- `phi-screen` CLI: screen images / DICOM / directories, JSON or JSONL.

Optional integration into the de-id CLI (the requested behavior)
Behind a `text-detection` feature (off by default, so the core library never pulls in ONNX Runtime):
- `--detect-text off` (default): no detection
- `--detect-text warn`: flag on stderr, still writes the de-identified file
- `--detect-text skip`: writes no output for flagged images, exits `3`

Model
Not committed. `scripts/fetch-model.sh` downloads and SHA-256-verifies `ppocr_det.onnx` (source + hash recorded in `models/ppocr_det.metadata.json`). CI fetches it.

Tests
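To make concrete what a synthetic-map test of `db_postprocess` could exercise, here is a minimal sketch of the probability-map → bitmap → connected components → scored boxes chain. It is an illustration under assumptions, not the crate's implementation: `TextBox` and `boxes_from_map` are hypothetical names, both thresholds are caller-supplied parameters, 4-connectivity is an arbitrary choice, and the box expansion and rescaling steps are omitted.

```rust
/// Axis-aligned box in probability-map pixel coordinates (inclusive corners),
/// with the mean probability of the pixels in its component.
#[derive(Debug, Clone, PartialEq)]
struct TextBox {
    x0: usize,
    y0: usize,
    x1: usize,
    y1: usize,
    score: f32,
}

/// Threshold a row-major `w * h` probability map, group 4-connected foreground
/// pixels, and keep one box per component whose mean probability is at least
/// `min_score`. Hypothetical signature; expansion and rescaling are omitted.
fn boxes_from_map(
    prob: &[f32],
    w: usize,
    h: usize,
    bin_thresh: f32,
    min_score: f32,
) -> Vec<TextBox> {
    assert_eq!(prob.len(), w * h);
    let mut seen = vec![false; w * h];
    let mut boxes = Vec::new();
    for start in 0..w * h {
        if seen[start] || prob[start] <= bin_thresh {
            continue;
        }
        // Flood-fill one component using an explicit stack.
        let (mut x0, mut y0, mut x1, mut y1) = (w, h, 0usize, 0usize);
        let (mut sum, mut n) = (0.0f32, 0usize);
        let mut stack = vec![start];
        seen[start] = true;
        while let Some(i) = stack.pop() {
            let (x, y) = (i % w, i / w);
            x0 = x0.min(x);
            y0 = y0.min(y);
            x1 = x1.max(x);
            y1 = y1.max(y);
            sum += prob[i];
            n += 1;
            // Visit an in-bounds neighbour if it is unseen foreground.
            let mut push = |j: usize| {
                if !seen[j] && prob[j] > bin_thresh {
                    seen[j] = true;
                    stack.push(j);
                }
            };
            if x > 0 {
                push(i - 1);
            }
            if x + 1 < w {
                push(i + 1);
            }
            if y > 0 {
                push(i - w);
            }
            if y + 1 < h {
                push(i + w);
            }
        }
        let score = sum / n as f32;
        if score >= min_score {
            boxes.push(TextBox { x0, y0, x1, y1, score });
        }
    }
    boxes
}
```

A synthetic test in this style would paint a high-probability rectangle into an otherwise empty map, add an isolated weak pixel, and check that only the rectangle comes back as a box.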
- … `positive_text.png` → text; `negative_no_text.png` → none).
- … `skip` exits 3 + no output; `warn` writes + warns; clean image passes) via `BurnedInAnnotation`.
- … `text-detection` integration; verified locally against rustc/clippy 1.95.

Notes
- … `image_prefilter` but not yet wired as the default scan strategy).

🤖 Generated with Claude Code
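The `--detect-text off|warn|skip` behavior described under "Optional integration" can be summarized with a small sketch. All type and function names here (`DetectText`, `Decision`, `Action`, `action`) are hypothetical and not taken from the PR. The sketch also assumes that `Review` is treated as flagged, that a screening error counts as flagged (in line with the fail-closed wording), and that unflagged or `warn` runs exit normally. Only the exit code `3` for `skip` comes from the description.

```rust
use std::str::FromStr;

/// Hypothetical model of the `--detect-text` flag values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DetectText {
    Off,
    Warn,
    Skip,
}

impl FromStr for DetectText {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "off" => Ok(Self::Off),
            "warn" => Ok(Self::Warn),
            "skip" => Ok(Self::Skip),
            other => Err(format!("invalid --detect-text value: {other}")),
        }
    }
}

/// Screening result, mirroring the Safe/Review/Unsafe decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Decision {
    Safe,
    Review,
    Unsafe,
}

/// What the CLI does with one file (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Write,
    WarnAndWrite,
    Skip { exit_code: i32 },
}

/// Map mode + screening outcome to an action.
/// Assumption: anything other than an explicit `Safe` is flagged, including
/// `Review` and screening errors (fail-closed).
fn action(mode: DetectText, screening: Result<Decision, String>) -> Action {
    let flagged = !matches!(screening, Ok(Decision::Safe));
    match (mode, flagged) {
        // With `off`, detection would not run at all; the result is ignored.
        (DetectText::Off, _) | (_, false) => Action::Write,
        (DetectText::Warn, true) => Action::WarnAndWrite,
        (DetectText::Skip, true) => Action::Skip { exit_code: 3 },
    }
}
```

For instance, `action(DetectText::Skip, Err("model missing".into()))` returns `Action::Skip { exit_code: 3 }` under the fail-closed assumption, while `action(DetectText::Warn, Ok(Decision::Safe))` returns `Action::Write`.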