
feat: add Medical Document Parser using Gemma 4 vision and PyMuPDF #47

Open
Tiioluwani wants to merge 1 commit into Sumanth077:main from Tiioluwani:feat/medical-document-parser

Conversation

Tiioluwani (Contributor) commented May 13, 2026

Description

Adds a Medical Document Parser that ingests lab reports, prescriptions, and clinical notes — classifying each PDF page as text or vision, routing to the appropriate extraction path, and using Gemma 4 with thinking mode to return a unified JSON clinical profile with abnormal and critical values flagged in red.
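A condensed sketch of the classify-and-route step described above. The names (classify_page, TEXT_THRESHOLD) are illustrative stand-ins for the helpers in document_processor.py, not the committed code:

import fitz  # PyMuPDF

TEXT_THRESHOLD = 50  # minimum extractable characters before a page counts as "text"

def classify_page(page: fitz.Page) -> dict:
    text = page.get_text().strip()
    if len(text) >= TEXT_THRESHOLD:
        # Text-rich page: send the extracted text straight to the LLM.
        return {"kind": "text", "text": text}
    # Scanned or image-heavy page: render to PNG for the vision path.
    return {"kind": "vision", "image_bytes": page.get_pixmap().tobytes("png")}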

Summary by CodeRabbit

  • New Features

    • Launched Medical Document Parser—a web application for uploading and analyzing PDF and image medical documents.
    • Automatically extracts clinical data including patient information, lab findings, and imaging results.
    • Prominently displays flagged abnormal and critical values for quick reference.
  • Documentation

    • Added comprehensive setup and usage guide with prerequisites and installation instructions.

coderabbitai (Bot) commented May 13, 2026

📝 Walkthrough

This PR introduces a complete Gradio-based medical document parser that processes PDFs and images, extracts structured clinical information using Google's Gemini LLM, merges multi-page results with intelligent deduplication, and presents findings through a web interface with flagged abnormality highlighting.

Changes

Medical Document Parser with Gemini Integration

Layer / File(s) and summaries:

Clinical Data Schema and Types (multimodal/medical_document_parser/schemas.py)
Pydantic models define PatientInfo, LabFinding (with constrained status values), ImagingResult, and ClinicalProfile, using default factories for mutable collection fields and nested structures.

Document and Image Processing Pipeline (multimodal/medical_document_parser/document_processor.py)
PDF pages are classified by a text-content threshold: text-rich pages become text ProcessedPage objects, while the rest are rendered to PNG as vision pages. Image uploads are normalized to RGB and PNG-encoded. process_upload dispatches between PDF iteration and supported raster formats.

Gemini-Based Clinical Extraction (multimodal/medical_document_parser/llm_extractor.py)
The Gemini client is initialized from the GOOGLE_API_KEY environment variable with system instructions and the ClinicalProfile JSON schema. Per-page extraction branches on page.kind to invoke text-only or vision+text generation, parsing the response with Markdown fence stripping.

Multi-Page Profile Aggregation and Deduplication (multimodal/medical_document_parser/merger.py)
merge_profiles combines patient metadata (preferring existing fields), concatenates findings and signals, deduplicates lab findings by normalized test name (preferring non-normal status), and auto-generates flagged items for abnormal/critical results.

Gradio UI and Document Orchestration (multimodal/medical_document_parser/app.py)
parse_document validates the upload, processes pages, extracts per-page profiles, merges results, and returns a JSON payload plus formatted flagged-items text. build_app constructs the Gradio interface with file upload, button, and dual outputs. Custom CSS highlights flagged findings. __main__ launches with the soft theme.

Configuration, Dependencies, and Documentation (multimodal/medical_document_parser/.env.example, multimodal/medical_document_parser/requirements.txt, multimodal/medical_document_parser/README.md)
Environment variable template, Python dependency manifest (Gradio, google-genai, PyMuPDF, Pillow, python-dotenv, Pydantic), and a comprehensive README covering features, tech stack, installation, environment setup, usage, and module structure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A parser hops through documents with flair,
PDFs and images processed with care,
Gemini whispers clinical truths so clear,
Merging findings, no duplicates near,
Gradio presents the results—flagged and fair! 🏥

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title accurately describes the main feature being added: a Medical Document Parser that uses Gemma 4 vision capabilities and PyMuPDF for PDF processing, which aligns with the changes across multiple new modules and documentation.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



coderabbitai (Bot) left a comment

Actionable comments posted: 7

🧹 Nitpick comments (8)
multimodal/medical_document_parser/requirements.txt (1)

1-6: 🏗️ Heavy lift

Add a lock/constraints file for reproducible installs.

Using only >= bounds makes runtime behavior drift over time. Consider pinning resolved versions (e.g., requirements.lock/constraints) for stable deploys.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/requirements.txt` around lines 1 - 6,
Create a reproducible install by adding a pinned lock/constraints file and
updating the project metadata: generate a constraints or lock file (e.g.,
requirements.lock or constraints.txt) that pins exact versions for the packages
listed in requirements.txt (gradio, google-genai, PyMuPDF, Pillow,
python-dotenv, pydantic), and update CI/deploy/pip install commands to use
--constraint/--requirement against that file; produce the locked file by
resolving current working versions (using pip freeze, pip-tools' pip-compile, or
poetry/pipenv lock) and commit the resulting lock so installs are deterministic.
multimodal/medical_document_parser/llm_extractor.py (2)

48-54: ⚡ Quick win

Add error handling for JSON parsing failures.

If the LLM returns malformed JSON or the regex fails to strip fences correctly, json.loads will raise an exception. This should be caught and return an empty ClinicalProfile rather than crashing the entire document processing pipeline.

🛡️ Proposed fix to add JSON parsing error handling
 def _parse_response_text(text: str) -> ClinicalProfile:
-    cleaned = text.strip()
-    if cleaned.startswith("```"):
-        cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
-        cleaned = re.sub(r"\s*```$", "", cleaned)
-    data = json.loads(cleaned)
-    return ClinicalProfile.model_validate(data)
+    try:
+        cleaned = text.strip()
+        if cleaned.startswith("```"):
+            cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
+            cleaned = re.sub(r"\s*```$", "", cleaned)
+        data = json.loads(cleaned)
+        return ClinicalProfile.model_validate(data)
+    except (json.JSONDecodeError, ValueError) as e:
+        print(f"Failed to parse LLM response: {e}")
+        return ClinicalProfile()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/llm_extractor.py` around lines 48 - 54,
The _parse_response_text function can raise on malformed JSON or failed fence
stripping; wrap the cleaning + json.loads + ClinicalProfile.model_validate calls
in a try/except that catches json.JSONDecodeError and ValueError, log the
parsing failure (use existing logger if available, otherwise a concise message)
and return an empty ClinicalProfile() instead of letting the exception bubble
up; ensure you still strip code fences (the existing re.sub lines) inside the
try so failures there are also caught.

42-44: 💤 Low value

Document cost and latency implications of HIGH thinking level.

The thinking_level=types.ThinkingLevel.HIGH setting may significantly increase API costs and response latency. For a medical document parser processing multi-page documents, this could impact user experience and operational costs. Consider whether HIGH is necessary for all pages or if it should be configurable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/llm_extractor.py` around lines 42 - 44,
The hardcoded thinking_config using
types.ThinkingConfig(thinking_level=types.ThinkingLevel.HIGH) can drive up API
cost and latency; make the thinking level configurable (e.g., expose a parameter
on the LLMExtractor class or its factory function and/or per-page override) and
document the cost/latency tradeoffs in the function/class docstring and README;
update references to thinking_config in llm_extractor.py so the default is a
lower-cost level (e.g., MEDIUM or a named constant) and allow callers to opt
into HIGH when necessary.
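A minimal sketch of that change, assuming the google-genai types already referenced in this review (the PR's real _generation_config also sets system instructions and the ClinicalProfile JSON schema, which this omits):

from google.genai import types

DEFAULT_THINKING_LEVEL = types.ThinkingLevel.HIGH  # the PR's current setting

def _generation_config(
    thinking_level: types.ThinkingLevel = DEFAULT_THINKING_LEVEL,
) -> types.GenerateContentConfig:
    # Callers can pass a cheaper level per page or document; verify that the
    # installed SDK exposes lower levels (e.g., MEDIUM) before defaulting to one.
    return types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level=thinking_level),
    )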
multimodal/medical_document_parser/schemas.py (2)

9-11: ⚡ Quick win

Consider using None for missing values instead of empty strings.

Empty strings as defaults make it difficult to distinguish between "field was not provided" and "field was explicitly set to empty". For medical data, this distinction can be important for downstream validation and clinical decision support.

♻️ Proposed change to use Optional types
 class PatientInfo(BaseModel):
-    name: str = ""
-    age: str = ""
-    date: str = ""
+    name: str | None = None
+    age: str | None = None
+    date: str | None = None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/schemas.py` around lines 9 - 11, Change
the default field values from empty strings to None and use Optional types so
missing vs explicitly empty is distinguishable: update the declarations for
name, age, and date in schemas.py to use Optional[str] = None (and add the
necessary typing import, e.g., Optional) so all three fields are None by default
instead of "" and downstream validation can detect absent values.

11-11: ⚡ Quick win

Use a validated date type for stronger guarantees.

Storing dates as strings bypasses Pydantic's built-in validation. Medical documents often have critical date requirements (e.g., lab result dates, prescription dates).

📅 Proposed change to use datetime type
+from datetime import date
+
 class PatientInfo(BaseModel):
     name: str = ""
     age: str = ""
-    date: str = ""
+    date: date | None = None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/schemas.py` at line 11, The field
currently declared as "date: str = \"\"" should be changed to a validated date
type: replace it with "date: Optional[date] = None" (import date and Optional)
and add a Pydantic validator (e.g., `@validator`('date', pre=True,
allow_reuse=True) def parse_date(cls, v): ...) to parse strings and
reject/normalize invalid values; update any code that relied on empty-string
default to handle None instead. Ensure the new validator lives in the same model
in multimodal/medical_document_parser/schemas.py and use datetime.date
parsing/formatting for consistent validation.
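A hedged sketch of that change using Pydantic v2 APIs (the PR already calls model_validate, so v2 is assumed; the accepted date formats below are illustrative):

from datetime import date, datetime
from pydantic import BaseModel, field_validator

class PatientInfo(BaseModel):
    name: str | None = None
    age: str | None = None
    date: date | None = None

    @field_validator("date", mode="before")
    @classmethod
    def parse_date(cls, v):
        if v in (None, ""):
            return None
        if isinstance(v, date):
            return v
        # Try a couple of formats commonly seen on lab reports.
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                return datetime.strptime(v, fmt).date()
            except ValueError:
                continue
        raise ValueError(f"Unrecognized date format: {v!r}")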
multimodal/medical_document_parser/document_processor.py (2)

73-78: ⚡ Quick win

Add error handling for corrupted or malformed PDFs.

PyMuPDF can raise exceptions when opening corrupted PDF files. Without explicit error handling, users would see generic stack traces instead of helpful error messages.

🛡️ Proposed fix to add error handling
     if suffix == ".pdf":
         pages: list[ProcessedPage] = []
-        with fitz.open(path) as document:
-            for page in document:
-                pages.append(_classify_pdf_page(page))
+        try:
+            with fitz.open(path) as document:
+                for page in document:
+                    pages.append(_classify_pdf_page(page))
+        except Exception as e:
+            raise ValueError(f"Failed to process PDF: {e}") from e
         return pages
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/document_processor.py` around lines 73 -
78, Wrap the PDF handling block that uses fitz.open(path) and iterates pages in
a try/except that catches PyMuPDF-specific exceptions (fitz.FitzError) and a
general Exception fallback; on error, surface a clear message that includes the
file path and the original exception (either by logging via the module logger or
re-raising a ValueError/RuntimeError that wraps the original exception) instead
of letting a raw traceback propagate; ensure the code calling _classify_pdf_page
still runs for valid pages and that the exception path returns or raises a
single, helpful error for the caller.

11-11: 💤 Low value

Document the TEXT_THRESHOLD rationale.

The threshold of 50 characters appears arbitrary. In medical documents, even short pages (e.g., a prescription with patient name, medication, and dosage) may contain critical text that should be processed as text rather than vision. Consider documenting why 50 was chosen or making it configurable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/document_processor.py` at line 11, The
TEXT_THRESHOLD constant (TEXT_THRESHOLD = 50) is undocumented and may be too low
for medical pages; update the code to document the rationale and make it
configurable: add a docstring or inline comment next to TEXT_THRESHOLD
explaining why 50 was chosen (e.g., tradeoff between OCR cost and noise), and
refactor code to allow overriding it via a configuration parameter or
constructor argument (e.g., a config.TXT_THRESHOLD or
DocumentProcessor(text_threshold=...)) so callers can adjust behavior for short
but important pages; ensure all references to TEXT_THRESHOLD in
functions/methods within DocumentProcessor are changed to read from the new
configurable value and include a default of 50.
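A minimal sketch of the configurable-threshold idea under those assumptions (the module may use free functions rather than a DocumentProcessor class; a parameter with a default works either way):

# 50 is a heuristic default balancing vision-path cost against missing short
# but important pages; callers can lower it for prescription-style documents.
TEXT_THRESHOLD = 50

def is_text_page(page, text_threshold: int = TEXT_THRESHOLD) -> bool:
    # True when the page has enough extractable text to skip the vision path.
    return len(page.get_text().strip()) >= text_threshold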
multimodal/medical_document_parser/merger.py (1)

55-62: ⚡ Quick win

Optimize the membership check for flagged_items.

The if label not in merged.flagged_items check on line 61 performs a linear O(n) scan for each lab finding. For documents with many abnormal findings, this could become a performance bottleneck. Since _dedupe_strings already exists (and is called on line 53), a single dedup pass after the loop can replace the per-item scan.

⚡ Proposed optimization

Option 1: Skip the check and rely on deduplication

     for finding in merged.lab_findings:
         if finding.status in {"abnormal", "critical"}:
             label = f"{finding.test}: {finding.value}"
             if finding.normal_range:
                 label += f" (ref: {finding.normal_range})"
             label += f" [{finding.status}]"
-            if label not in merged.flagged_items:
-                merged.flagged_items.append(label)
+            merged.flagged_items.append(label)
+    
+    # Deduplicate again after adding auto-generated items
+    merged.flagged_items = _dedupe_strings(merged.flagged_items)

Option 2: Use a set for O(1) lookup

+    existing_flags = {item.strip().lower() for item in merged.flagged_items}
     for finding in merged.lab_findings:
         if finding.status in {"abnormal", "critical"}:
             label = f"{finding.test}: {finding.value}"
             if finding.normal_range:
                 label += f" (ref: {finding.normal_range})"
             label += f" [{finding.status}]"
-            if label not in merged.flagged_items:
+            if label.strip().lower() not in existing_flags:
                 merged.flagged_items.append(label)
+                existing_flags.add(label.strip().lower())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/merger.py` around lines 55 - 62, The
membership check "if label not in merged.flagged_items" causes O(n^2) behavior;
replace it by maintaining a set local to the loop (e.g., seen_flagged), testing
"label not in seen_flagged" for O(1) lookups while iterating
merged.lab_findings, and adding each appended label to the set so list order is
preserved. Alternatively, drop the per-item check entirely and re-run
_dedupe_strings on merged.flagged_items after the loop (as in Option 1 above);
choose one approach and apply it consistently to the code that builds
merged.flagged_items.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 12eadcc4-ce8d-4fb1-927d-81f395e236a6

📥 Commits

Reviewing files that changed from the base of the PR and between e206b3f and a468a22.

⛔ Files ignored due to path filters (1)
  • multimodal/medical_document_parser/assets/demo.png is excluded by !**/*.png
📒 Files selected for processing (8)
  • multimodal/medical_document_parser/.env.example
  • multimodal/medical_document_parser/README.md
  • multimodal/medical_document_parser/app.py
  • multimodal/medical_document_parser/document_processor.py
  • multimodal/medical_document_parser/llm_extractor.py
  • multimodal/medical_document_parser/merger.py
  • multimodal/medical_document_parser/requirements.txt
  • multimodal/medical_document_parser/schemas.py

Comment on lines +31 to +34
def parse_document(
    upload: str | None,
    progress: gr.Progress = gr.Progress(),
) -> tuple[dict, str]:

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In Python, are default argument expressions evaluated once at function definition time, and is this why linters warn about function calls in defaults (e.g., Ruff B008)?

💡 Result:

Yes, in Python, default argument expressions are evaluated once at function definition time, not on each call. This is explicitly stated in the official Python documentation: "The default values are evaluated at the point of function definition in the defining scope" and "Important warning: The default value is evaluated only once."[1] This behavior stores the defaults in the function's defaults tuple, reusing the same value (or object) across calls.[2][3][4] This is why linters like Ruff warn about function calls in defaults via rule B008: "Any function call that's used in a default argument will only be performed once, at definition time. The returned value will then be reused by all calls to the function, which can lead to unexpected behaviour."[5] For example, calling a function that returns a mutable object (like a list) shares that object across calls, causing mutations to persist unexpectedly. Even non-mutable calls (e.g., time.time or random) yield stale values from definition time.[5] The standard fix is to use None as the default and initialize inside the function body.[1][5]
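The pitfall in a self-contained example (generic Python, independent of this repo):

def append_item(item, bucket=[]):  # default evaluated once, at definition time
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2]  <- same shared list, not a fresh one

def append_item_fixed(item, bucket=None):
    if bucket is None:
        bucket = []  # fresh list per call
    bucket.append(item)
    return bucket

print(append_item_fixed(1))  # [1]
print(append_item_fixed(2))  # [2]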


🏁 Script executed:

cd multimodal/medical_document_parser && head -120 app.py | tail -70

Repository: Sumanth077/Hands-On-AI-Engineering

Length of output: 2416


🌐 Web query:

Gradio 5.0.0 API: How to pass css and theme arguments to Blocks and launch methods? What is the correct pattern?

💡 Result:

Correct pattern in Gradio (including 5.0.0-era and current docs): pass theme and CSS-related arguments to Blocks’ launch method, not to the Blocks constructor. 1) Theme - Use demo.launch(theme=...) Example: with gr.Blocks as demo:... demo.launch(theme=gr.themes.Soft) This is explicitly shown in the Gradio theming guide: “pass the theme= kwarg to the launch method of Blocks” [1]. 2) Custom CSS - Use demo.launch(css=...) for a CSS string - Use demo.launch(css_paths=...) for one or more CSS files (path objects) Example: with gr.Blocks as demo:... demo.launch(css=".gradio-container { background-color: red; }") demo.launch(css_paths=["./styles/base.css", "./styles/overrides.css"]) The “Custom CSS and JS” guide shows passing css= and css_paths= to launch [2]. 3) What changed vs passing to Blocks(...) - Gradio moved styling-related parameters (theme, css, css_paths, js, head, head_paths) from the Blocks constructor to the launch method. The codebase includes a warning message stating that these parameters “have been moved from the Blocks constructor to the launch method in Gradio 6.0: theme, css, css_paths, js, head, head_paths. Please pass these parameters to launch instead.” [3]. So, the “correct pattern” is: - Build UI inside the Blocks context - Call launch(...) with theme= and css=/css_paths= (and any other styling-related args) If you currently have something like gr.Blocks(theme=..., css=...), update it to gr.Blocks + demo.launch(theme=..., css=..., css_paths=...). [3][1][2]


Avoid function-call default for progress (shared default evaluation).

At Line 33, gr.Progress() is evaluated at function definition time, creating a single shared instance. Use None as the default and instantiate inside the function instead to create a fresh instance per call.

Proposed fix
 def parse_document(
     upload: str | None,
-    progress: gr.Progress = gr.Progress(),
+    progress: gr.Progress | None = None,
 ) -> tuple[dict, str]:
+    if progress is None:
+        progress = gr.Progress()
+
     if upload is None:
         raise gr.Error("Please upload a medical PDF or image.")
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 33-33: Do not perform function call gr.Progress in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/app.py` around lines 31 - 34, The
parse_document function currently uses a mutable/shared default by setting
progress: gr.Progress = gr.Progress() at definition time; change the signature
to accept progress: gr.Progress | None = None and inside parse_document check if
progress is None then set progress = gr.Progress() so each call gets a fresh
Progress instance (refer to parse_document and its progress parameter when
making this change).

Comment on lines +65 to +66
with gr.Blocks(title="Medical Document Parser") as demo:
    gr.Markdown(

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For gradio==5.0.0, does gr.Blocks.launch() accept cssandtheme keyword arguments, or must they be passed to gr.Blocks(...) when constructing the app?

💡 Result:

For gradio==5.0.0, css and theme must be passed to gr.Blocks(...) when constructing the app, not to gr.Blocks.launch(). The move of these parameters from the Blocks constructor to launch() occurred in Gradio 6.x as a breaking change[1][2][3]. Gradio 5 docs and examples show them in the constructor[4][5][6][7][8], and the Gradio 6 migration guide contrasts "Before (Gradio 5.x)" with constructor usage vs "After (Gradio 6.x)" with launch[1]. The 5.0.0 release notes do not mention adding them to launch[9]. In later versions (post-6.0), they were temporarily restored to the constructor with warnings[3][10].


🏁 Script executed:

cat -n multimodal/medical_document_parser/app.py | sed -n '60,115p'

Repository: Sumanth077/Hands-On-AI-Engineering

Length of output: 2198


🏁 Script executed:

# Check for CUSTOM_CSS definition and Gradio version requirement
head -30 multimodal/medical_document_parser/app.py

Repository: Sumanth077/Hands-On-AI-Engineering

Length of output: 804


🏁 Script executed:

# Look for requirements or version specs
find . -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" | head -5

Repository: Sumanth077/Hands-On-AI-Engineering

Length of output: 324


🏁 Script executed:

cat multimodal/medical_document_parser/requirements.txt

Repository: Sumanth077/Hands-On-AI-Engineering

Length of output: 180


Move css/theme to gr.Blocks(...); launch() should not receive them.

The code passes css and theme to launch() (line 110), which is incompatible with Gradio 5.x. Per the requirements.txt (gradio>=5.0.0), these parameters must be configured at Blocks construction (line 65). Gradio 5 does not accept these kwargs in launch().

Proposed fix
 def build_app() -> gr.Blocks:
-    with gr.Blocks(title="Medical Document Parser") as demo:
+    with gr.Blocks(
+        title="Medical Document Parser",
+        css=CUSTOM_CSS,
+        theme=gr.themes.Soft(),
+    ) as demo:
         gr.Markdown(
             """
             # Medical Document Parser
@@
 if __name__ == "__main__":
-    build_app().launch(css=CUSTOM_CSS, theme=gr.themes.Soft())
+    build_app().launch()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/app.py` around lines 65 - 66, The Gradio
css and theme kwargs are being passed to demo.launch() but Gradio 5.x requires
them on Blocks construction; update the gr.Blocks(...) call that creates demo
(the gr.Blocks(title="Medical Document Parser") as demo block) to include the
css and theme arguments (e.g., css=..., theme=...) and remove css and theme from
the demo.launch(...) invocation so launch() is called without those kwargs.

Comment on lines +69 to +83
def process_upload(file_path: str | Path) -> list[ProcessedPage]:
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix == ".pdf":
        pages: list[ProcessedPage] = []
        with fitz.open(path) as document:
            for page in document:
                pages.append(_classify_pdf_page(page))
        return pages

    if suffix in {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff"}:
        return [_image_to_page(path.read_bytes())]

    raise ValueError(f"Unsupported file type: {suffix or 'unknown'}")

⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

Add file size validation to prevent memory exhaustion.

The function processes uploaded files without checking their size. Large PDF files or high-resolution images could consume excessive memory when loaded entirely into RAM, potentially causing application crashes or denial-of-service conditions.

🛡️ Proposed fix to add size validation
+MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB
+
 def process_upload(file_path: str | Path) -> list[ProcessedPage]:
     path = Path(file_path)
+    
+    file_size = path.stat().st_size
+    if file_size > MAX_FILE_SIZE:
+        raise ValueError(f"File too large: {file_size / (1024*1024):.1f} MB (max {MAX_FILE_SIZE / (1024*1024):.0f} MB)")
+    
     suffix = path.suffix.lower()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

def process_upload(file_path: str | Path) -> list[ProcessedPage]:
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix == ".pdf":
        pages: list[ProcessedPage] = []
        with fitz.open(path) as document:
            for page in document:
                pages.append(_classify_pdf_page(page))
        return pages

    if suffix in {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff"}:
        return [_image_to_page(path.read_bytes())]

    raise ValueError(f"Unsupported file type: {suffix or 'unknown'}")

After:

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB

def process_upload(file_path: str | Path) -> list[ProcessedPage]:
    path = Path(file_path)

    file_size = path.stat().st_size
    if file_size > MAX_FILE_SIZE:
        raise ValueError(f"File too large: {file_size / (1024*1024):.1f} MB (max {MAX_FILE_SIZE / (1024*1024):.0f} MB)")

    suffix = path.suffix.lower()

    if suffix == ".pdf":
        pages: list[ProcessedPage] = []
        with fitz.open(path) as document:
            for page in document:
                pages.append(_classify_pdf_page(page))
        return pages

    if suffix in {".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tif", ".tiff"}:
        return [_image_to_page(path.read_bytes())]

    raise ValueError(f"Unsupported file type: {suffix or 'unknown'}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/document_processor.py` around lines 69 -
83, process_upload lacks file size checks and can load very large PDFs/images
into memory; add a configurable MAX_UPLOAD_SIZE_BYTES constant and validate
path.stat().st_size against it at the top of process_upload, raising ValueError
if exceeded, before calling fitz.open or path.read_bytes(); keep the checks when
handling both the PDF branch (before looping into _classify_pdf_page) and the
image branch (before invoking _image_to_page) so large files are rejected early.

Comment on lines +57 to +82
def extract_from_page(client: genai.Client, page: ProcessedPage) -> ClinicalProfile:
    config = _generation_config()
    page_context = f"{EXTRACTION_PROMPT}\n\nPage {page.page_number}."

    if page.kind == "text" and page.text:
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=f"{page_context}\n\nDocument text:\n{page.text}",
            config=config,
        )
    elif page.kind == "vision" and page.image_bytes:
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=[
                types.Part.from_bytes(data=page.image_bytes, mime_type="image/png"),
                page_context,
            ],
            config=config,
        )
    else:
        return ClinicalProfile()

    if not response.text:
        return ClinicalProfile()

    return _parse_response_text(response.text)

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Add error handling and timeout for API calls.

The Gemini API calls lack error handling for network failures, rate limits, authentication issues, or timeouts. In production, external API calls should be wrapped with proper exception handling and timeout configuration to prevent indefinite hangs and provide meaningful error messages.

🛡️ Proposed fix to add error handling and timeout
+from google.genai.errors import ClientError
+
 def extract_from_page(client: genai.Client, page: ProcessedPage) -> ClinicalProfile:
     config = _generation_config()
     page_context = f"{EXTRACTION_PROMPT}\n\nPage {page.page_number}."
 
-    if page.kind == "text" and page.text:
-        response = client.models.generate_content(
-            model=MODEL_ID,
-            contents=f"{page_context}\n\nDocument text:\n{page.text}",
-            config=config,
-        )
-    elif page.kind == "vision" and page.image_bytes:
-        response = client.models.generate_content(
-            model=MODEL_ID,
-            contents=[
-                types.Part.from_bytes(data=page.image_bytes, mime_type="image/png"),
-                page_context,
-            ],
-            config=config,
-        )
-    else:
+    try:
+        if page.kind == "text" and page.text:
+            response = client.models.generate_content(
+                model=MODEL_ID,
+                contents=f"{page_context}\n\nDocument text:\n{page.text}",
+                config=config,
+            )
+        elif page.kind == "vision" and page.image_bytes:
+            response = client.models.generate_content(
+                model=MODEL_ID,
+                contents=[
+                    types.Part.from_bytes(data=page.image_bytes, mime_type="image/png"),
+                    page_context,
+                ],
+                config=config,
+            )
+        else:
+            return ClinicalProfile()
+    except ClientError as e:
+        # Log error and return empty profile rather than crashing
+        print(f"API error on page {page.page_number}: {e}")
         return ClinicalProfile()
+    except Exception as e:
+        print(f"Unexpected error on page {page.page_number}: {e}")
+        return ClinicalProfile()

Note: The google-genai SDK may require additional timeout configuration via client initialization or request parameters. Check the SDK documentation for the proper timeout mechanism.
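For the timeout specifically, one possible client-level approach; this assumes types.HttpOptions and its millisecond timeout field exist in the installed google-genai version, which should be verified against the SDK documentation first:

import os
from google import genai
from google.genai import types

# Hypothetical sketch: configure a client-wide request timeout at init.
client = genai.Client(
    api_key=os.environ["GOOGLE_API_KEY"],
    http_options=types.HttpOptions(timeout=60_000),  # 60 s per request, if supported
)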

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/llm_extractor.py` around lines 57 - 82,
Wrap the calls to client.models.generate_content inside extract_from_page with a
try/except that catches network/HTTP/auth/timeouts (SDK-specific exceptions and
a generic Exception), enforce a request timeout via the SDK or by adding a
timeout param from _generation_config (or client init) and return an empty
ClinicalProfile or propagate a clear, logged error when failures occur; ensure
both the text and vision branches use the same guarded call, log or attach the
exception details, and keep _parse_response_text usage unchanged for successful
responses.

Comment on lines +89 to +90
medical-document-parser/
├── app.py # Gradio UI and orchestration

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the real repository path/name in the project tree.

The tree root (medical-document-parser/) doesn’t match the documented repo location (multimodal/medical_document_parser). Aligning names avoids navigation confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/README.md` around lines 89 - 90, The
project tree root in README.md is incorrect—replace the top-level folder name
"medical-document-parser/" with the repository path
"multimodal/medical_document_parser/" so the tree header matches the actual repo
layout and avoids confusion; update the tree line containing
"medical-document-parser/" (above the listed files like app.py) to
"multimodal/medical_document_parser/".

gradio>=5.0.0
google-genai>=1.0.0
PyMuPDF>=1.24.0
Pillow>=10.0.0

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What is the first Pillow version that fixes GHSA-3f63-hfp8-52jq, GHSA-44wm-f244-xhp3, GHSA-j7hp-h8jx-5ppr, GHSA-r73j-pqj5-w3x7, and GHSA-wjx4-4jcj-g98j?

💡 Result:

First fixed Pillow version for each GHSA: - GHSA-3f63-hfp8-52jq: Pillow 10.2.0 [1] - GHSA-44wm-f244-xhp3: Pillow 10.3.0 [2] - GHSA-j7hp-h8jx-5ppr: Pillow 10.0.1 [3] - GHSA-r73j-pqj5-w3x7: Pillow 12.2.0 [4] - GHSA-wjx4-4jcj-g98j: Pillow 12.2.0 [5] (“First fixed” interpreted as the patched version specified by each advisory/summary.)


Bump Pillow to ≥12.2.0 to address critical image parsing vulnerabilities.

The current floor of Pillow>=10.0.0 allows multiple known vulnerabilities (GHSA-3f63-hfp8-52jq, GHSA-44wm-f244-xhp3, GHSA-j7hp-h8jx-5ppr, GHSA-r73j-pqj5-w3x7, GHSA-wjx4-4jcj-g98j) that are particularly relevant since this application parses user-supplied images and PDF-derived content. Pillow 12.2.0 is the first version that patches all identified advisories.

Suggested diff
-Pillow>=10.0.0
+Pillow>=12.2.0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Pillow>=10.0.0
Pillow>=12.2.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/requirements.txt` at line 4, Update the
Pillow dependency in requirements.txt from "Pillow>=10.0.0" to "Pillow>=12.2.0"
to ensure the package version includes fixes for the listed CVEs; locate the
Pillow line in multimodal/medical_document_parser/requirements.txt and change
the version floor to 12.2.0, then run dependency install/check (e.g., pip check
or CI dependency audit) to confirm no conflicts.

Comment on lines +8 to +11
class PatientInfo(BaseModel):
    name: str = ""
    age: str = ""
    date: str = ""

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Flag PII handling and compliance requirements.

This model stores Protected Health Information (PHI) including patient names. The codebase sends this data to Google's Gemini API (as seen in llm_extractor.py). Medical document parsers must comply with HIPAA, GDPR, or similar regulations when handling patient data.

Ensure:

  • Users provide explicit consent for external API transmission
  • Data is encrypted in transit and at rest
  • A Business Associate Agreement (BAA) exists with Google
  • Audit logging tracks all PHI access
  • Consider de-identification or anonymization before LLM processing
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@multimodal/medical_document_parser/schemas.py` around lines 8 - 11, The
PatientInfo model currently holds PHI (name, age, date) and must not be sent to
external LLMs without safeguards: update the PatientInfo class and the code
paths in llm_extractor.py to require explicit user consent (e.g., a boolean
consent flag or signed consent token) before transmission, add an option to
de-identify/anonymize fields via a new method (e.g., PatientInfo.deidentify())
and use that by default, ensure all network calls use TLS and that persisted PHI
is stored encrypted (enable encryption-at-rest config checks), add a runtime
check that a configured BAA_with_google flag or documented BAA approval is
present before calling Gemini, and emit audit log entries (who, what, when,
purpose) whenever PatientInfo is accessed or sent to an external API; implement
these checks centrally where PatientInfo is serialized for llm_extractor.py so
they cannot be bypassed.
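A minimal sketch of the default-on de-identification idea (deidentify is the hypothetical helper named in the prompt above, not existing code; the redaction policy shown is illustrative):

from pydantic import BaseModel

class PatientInfo(BaseModel):
    name: str = ""
    age: str = ""
    date: str = ""

    def deidentify(self) -> "PatientInfo":
        # Return a copy safer to send to an external LLM. Adjust per policy:
        # HIPAA Safe Harbor, for example, also requires generalizing dates
        # to the year and aggregating ages over 89.
        return PatientInfo(name="[REDACTED]", age=self.age, date=self.date)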
