AI-Powered Prior Authorization Form Auto-Fill System

A full-stack healthcare automation application that uses LLMs to automatically complete prior authorization forms from patient clinical data. Built with FastAPI backend and Streamlit frontend, the system leverages OpenAI's structured outputs to extract and validate medical information.

Core Features

Automatic Form Filling: Extracts answers from patient visit notes and demographics using GPT models with structured JSON output
Validation Pass: Optional second LLM call to review and correct initial answers with detailed change tracking (what changed, why, and rationale)
Mock Data Generation: Synthetic patient data and test case generation for development and testing
LLM-as-a-Judge Evaluation: Sophisticated semantic evaluation system that judges answer correctness using clinical knowledge rather than string matching, tracking validation improvements/degradations with confidence levels
Streaming Support: Real-time SSE streaming for progressive form field updates
File Processing: PDF/text upload with clinical note summarization
Caching: Hash-based request caching to avoid redundant API calls

Tech Stack

Backend: FastAPI with Pydantic models
Frontend: Streamlit interactive UI
LLM: OpenAI GPT (gpt-5-mini/gpt-5) with structured outputs
Evaluation: F1 score metrics + LLM judge with confidence scoring

Use Case

Automates the tedious process of filling pharmaceutical prior authorization forms (e.g., Zepbound) by intelligently extracting relevant clinical information from patient records.

This system demonstrates production-ready LLM integration with validation, evaluation pipelines, and quality assurance mechanisms for healthcare automation.

Implementation Highlights

Features Implemented

User dropdown select patient data
User can view or upload to update the existing visitor notes
User can provide feedback before or after to guide the form filling process
Streamlit display works as intended
User interactions work as intended
FastAPI endpoint for answer and file upload work as intended (streaming not fully tested due to organization restraint requiring verified OPENAI_API_KEY)

Optimizations

Actor-Critic with Feedback - Using the validation loop
Eval Pipeline - Testing for F1 score with and without validation loop (Tweaked validation loop after initial test result to include the QA and Feedback notes)
Caching - No more reprocessing when unneeded

Evaluation Results

Initial F1 Score Testing

Upon initial testing, we observed the F1 score eval pipeline outputting results that seemed to disagree with our model main.py FastAPI pipeline output. As a result, the F1 score was quite low:

=================================================== short test summary info ====================================================
FAILED tests/test_answers.py::test_f1_score_without_validation - AssertionError: F1 score too low: 0.077
FAILED tests/test_answers.py::test_f1_score_with_validation - AssertionError: F1 score too low: 0.053
====================================== 2 failed, 1 passed, 1 warning in 701.59s (0:11:41) ======================================

Analysis: We needed to validate the True Answer Dataset to see whether or not the true answer dataset was even correct. We approached this with LLM-as-a-Judge to double check the generated True Answer Dataset before testing with the Non-validated LLM form parsing response and the Validated LLM form parsing response.

LLM-as-a-Judge Results

Generating Mock Expected Answers

python -m tests.generate_mock_data

This generated 10 sets of 30 questions with expected answers based on 10 sample patient data.

Running the Tests

$env:RUN_LIVE_OPENAI="1"; pytest tests/test_answers.py::test_llm_judge_without_validation -v -s
$env:RUN_LIVE_OPENAI="1"; pytest tests/test_answers.py::test_llm_judge_with_validation -v -s

Results:

Without validation: 100% accuracy
With validation: 93.3% accuracy

This is likely the result of the prompt not being optimized for the validation pass right now, and the LLM is confused by the additional context from reading the extra {initial_answer} context clue.

Further optimization needed:

Incorporate special prompt design
Use larger parameter reasoning model rather than gpt-5-mini
Add few-shot examples for edge cases

Detailed Test Logs

Mock Data Generation Log

(engineer-take-home) PS D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum>
 python -m tests.generate_mock_data
Loaded 10 patients and 30 questions

Generating expected answers by calling API...
======================================================================

[1/10] Processing patient: Isaiah Reed
  ✓ Generated 30 expected answers

[2/10] Processing patient: Elizabeth Munoz
  ✓ Generated 30 expected answers

[3/10] Processing patient: Rebecca Edwards
  ✓ Generated 30 expected answers

[4/10] Processing patient: Suzanne Harris
  ✓ Generated 30 expected answers

[5/10] Processing patient: Jeffrey Donovan
  ✓ Generated 30 expected answers

[6/10] Processing patient: Madison Cook
  ✓ Generated 30 expected answers

[7/10] Processing patient: Michelle Andrews
  ✓ Generated 30 expected answers

[8/10] Processing patient: Michelle Dougherty
  ✓ Generated 30 expected answers

[9/10] Processing patient: Anthony Chaney
  ✓ Generated 30 expected answers

[10/10] Processing patient: James Aguilar
  ✓ Generated 30 expected answers

======================================================================
✅ Generated 10 mock test cases
📁 Saved to: D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\tests\mock_test_data.json

⚠️  NOTE: These are API-generated answers, not human-validated ground truth.
   For production use, have clinical experts review and validate these answers.

LLM Judge WITH Validation Test Log

(engineer-take-home) PS D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum>
 $env:RUN_LIVE_OPENAI="1"; pytest tests/test_answers.py::test_llm_judge_with_validation -v -s
=============================== test session starts ===============================
platform win32 -- Python 3.11.5, pytest-8.4.0, pluggy-1.6.0 -- D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum
configfile: pyproject.toml
plugins: anyio-4.9.0, Faker-37.4.0, logfire-3.18.0
collected 1 item

tests/test_answers.py::test_llm_judge_with_validation
============================================================
Testing LLM Judge WITH Validation Pass
============================================================

[1] Testing patient: Isaiah Reed

[2] Testing patient: Elizabeth Munoz

------------------------------------------------------------
Total Questions: 60
Initial Accuracy: 93.3%
Final Accuracy: 93.3%
Validation Improvements: 0
Net Improvement: +0.0%
============================================================
PASSED

================================ warnings summary =================================
.venv\Lib\site-packages\pydantic\_internal\_fields.py:198
  D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\.venv\Lib\site-packages\pydantic\_internal\_fields.py:198: UserWarning: Field name "validate" in "AnswerInput" shadows an attribute in parent "BaseModel"
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================== 1 passed, 1 warning in 402.39s (0:06:42) =====================

LLM Judge WITHOUT Validation Test Log

(engineer-take-home) PS D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum>
 $env:RUN_LIVE_OPENAI="1"; pytest tests/test_answers.py::test_llm_judge_without_validation -v -s
=============================== test session starts ===============================
platform win32 -- Python 3.11.5, pytest-8.4.0, pluggy-1.6.0 -- D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum
configfile: pyproject.toml
plugins: anyio-4.9.0, Faker-37.4.0, logfire-3.18.0
collected 1 item

tests/test_answers.py::test_llm_judge_without_validation
============================================================
Testing LLM Judge WITHOUT Validation Pass
============================================================

[1] Testing patient: Isaiah Reed

[2] Testing patient: Elizabeth Munoz

------------------------------------------------------------
Total Questions: 60
Correct Answers: 60
Accuracy: 100.0%
High-Confidence Accuracy: 100.0%
============================================================
PASSED

================================ warnings summary =================================
.venv\Lib\site-packages\pydantic\_internal\_fields.py:198
  D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\.venv\Lib\site-packages\pydantic\_internal\_fields.py:198: UserWarning: Field name "validate" in "AnswerInput" shadows an attribute in parent "BaseModel"
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================== 1 passed, 1 warning in 110.80s (0:01:50) =====================

AI Tooling and Design Considerations

Development Tools

Augment VS Code Extension - AI-powered coding assistant
Feeding Web Documentation as Context:
- OpenAI Structured Outputs
- OpenAI PDF Files

Design Process

Whiteboarding with hand-drawn low fidelity design into section-by-section implementation manually prompted:

Each section and written checkpoint is jotted down for clarity of thought and structured implementation.

Evaluation design tailored to the context of the sample data and the AI Engineer take home interview descriptions.

Automated Testing with Augment

Augment can be smartly prompted to run automated testing until all "live smoke tests using real API keys" pass:
- Set $env:YOUR_KEY in the Augment CLI terminal
- This specific prompt and key setting allow Augment to run the test-fix-refine-retest loop for hours
- Note: This feature works wonderfully with message-per-month pricing but would be expensive with credit-based pricing
- Future to-do: Create live smoke tests with True Answer Dataset and validate with tests generated using real APIs integrated into FastAPI backend after careful scenario-based test designs

Example design for a different application (notebook application with automated agent for research tasks):

Future Considerations

Scalability

Visitor Note RAG Implementation for scaling to hundreds/thousands of visitor notes accumulated over a person's lifespan:
- Async per form question per visitor note validation pipeline
- Enhance correctness and speed of completion
- Each question may have top 100K visitor notes suitable for the response
- Order them based on datetime and batch into grouped chunks (10 notes per chunk)
- Come up with preliminary decision and rationale
- Concatenate the 10 batched decision and rationale responses
- Final response prioritizing the most recent preliminary decisions

Production Assumptions

Production Environment Considerations:
- HIPAA Compliance: Production LLM Model used is isolated within provider environment that can guarantee HIPAA Business Associate Agreement
- Golden Dataset: Production application and form filling pipeline deployment is tested and evaluated for Accuracy and F1 score with Golden True Answer Dataset that is historically submitted and human verified
- Ensemble Methods: Production environment result can be further enhanced with ensembling method where we mesh multiple LLMs' responses and calculate weighted confidence score to see which model scores highest for each question
- F1 Score Insights: Can give insight into which question and which patient/medical background the LLM model performs best upon
- Re-evaluation Pipeline: Use another validation pipeline with overall high F1 score LLM model to judge over the decision and rationale
- Observability: Pydantic Logfire can help with observation over the LLMs' performance

Additional Enhancements

Additionally, we added the validated_findings field for the LLM response that has the LLM validation to display under the rationale field for each form question and highlight the form question if the answer were to be modified as a result of the additional validation loop.

Getting Started

For detailed setup and usage instructions, please refer to:

Project Discussion & Evaluation Note

Context

During ongoing collaboration and live discussion, the following clarification outlines the design intent and rationale behind the LLM-as-a-Judge evaluation architecture.

For the automated prior authorization form-filling task, it is both efficient and semantically appropriate to:

Reduce evaluation complexity by replacing the F1 score with an Accuracy score
Adopt LLM-as-a-Judge for result validation — allowing the system to determine whether each output is correct in meaning, not just by string equivalence

This approach addresses real-world semantic variation in clinical text. For example, if the expected answer is "Zedbound 10mg" and the model outputs "10mg of Zedbound", the LLM-as-a-Judge system correctly validates it as equivalent, whereas naive string-matching would fail.

Implementation Reference

The updated implementation, evaluation tests, and working demo are available below:

GitHub Repository: HomenShum/LLM-test-suite Automated Testing with LLM for Classification, LLM-as-a-Judge, Context Verification & Pruning, and Agent Scaffold Systems
Live Demo: https://llm-test-suite-cafecorner.streamlit.app/

The current implementation introduces a working variant of the evaluation pipeline using LLM-as-a-Judge and replaces the initial string-matching logic with semantic validation for production-grade testing.

Visual Reference

A visual sketch of the proposed evaluation flow illustrates how validation, reasoning, and iterative feedback improve the evaluation loop.

Commit History

fef8669 - Add validation features, LLM judge evaluation, and comprehensive documentation
82d356e - feat: Add LLM-as-a-Judge evaluation and fix validation metadata
8fb1969 - feat: Implement prior authorization auto-fill with validation loop and F1 evaluation
336a1e2 - initial note
6e2b2ba - Initial commit

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
app		app
docs		docs
sample_data		sample_data
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
DESIGN_SPECS.md		DESIGN_SPECS.md
Dockerfile		Dockerfile
FOLDER_STRUCTURES.md		FOLDER_STRUCTURES.md
README.md		README.md
VALIDATION_FEATURES_SUMMARY.md		VALIDATION_FEATURES_SUMMARY.md
homen_shum_notes.md		homen_shum_notes.md
openai_diagnostic.py		openai_diagnostic.py
openai_structured_ouput_official_doc.md		openai_structured_ouput_official_doc.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
streamlit_app.py		streamlit_app.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI-Powered Prior Authorization Form Auto-Fill System

Core Features

Tech Stack

Use Case

Implementation Highlights

Features Implemented

Optimizations

Evaluation Results

Initial F1 Score Testing

LLM-as-a-Judge Results

Generating Mock Expected Answers

Running the Tests

Detailed Test Logs

AI Tooling and Design Considerations

Development Tools

Design Process

Automated Testing with Augment

Future Considerations

Scalability

Production Assumptions

Additional Enhancements

Getting Started

Project Discussion & Evaluation Note

Context

Implementation Reference

Visual Reference

Commit History

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AI-Powered Prior Authorization Form Auto-Fill System

Core Features

Tech Stack

Use Case

Implementation Highlights

Features Implemented

Optimizations

Evaluation Results

Initial F1 Score Testing

LLM-as-a-Judge Results

Generating Mock Expected Answers

Running the Tests

Detailed Test Logs

AI Tooling and Design Considerations

Development Tools

Design Process

Automated Testing with Augment

Future Considerations

Scalability

Production Assumptions

Additional Enhancements

Getting Started

Project Discussion & Evaluation Note

Context

Implementation Reference

Visual Reference

Commit History

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages