A full-stack healthcare automation application that uses LLMs to automatically complete prior authorization forms from patient clinical data. Built with a FastAPI backend and a Streamlit frontend, the system uses OpenAI's structured outputs to extract and validate medical information.
- Automatic Form Filling: Extracts answers from patient visit notes and demographics using GPT models with structured JSON output
- Validation Pass: Optional second LLM call to review and correct initial answers with detailed change tracking (what changed, why, and rationale)
- Mock Data Generation: Synthetic patient data and test case generation for development and testing
- LLM-as-a-Judge Evaluation: Semantic evaluation that judges answer correctness using clinical knowledge rather than string matching, and tracks validation improvements and degradations with confidence levels
- Streaming Support: Real-time SSE streaming for progressive form field updates
- File Processing: PDF/text upload with clinical note summarization
- Caching: Hash-based request caching to avoid redundant API calls
- Backend: FastAPI with Pydantic models
- Frontend: Streamlit interactive UI
- LLM: OpenAI GPT (gpt-5-mini/gpt-5) with structured outputs
- Evaluation: F1 score metrics + LLM judge with confidence scoring
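The streaming feature listed above sends form-field updates as Server-Sent Events. The sketch below shows one way such frames could be produced. The function name `sse_events`, the payload keys, and the `done` event are assumptions made for illustration, not the project's actual code.

```python
import json
from typing import Iterable, Iterator


def sse_events(field_updates: Iterable[dict]) -> Iterator[str]:
    """Yield one Server-Sent Events frame per form-field update."""
    for update in field_updates:
        # An SSE frame is "data: <payload>" terminated by a blank line.
        yield f"data: {json.dumps(update)}\n\n"
    # Hypothetical end-of-stream marker so the client knows when to stop.
    yield "event: done\ndata: {}\n\n"


# Hypothetical field updates; the real payload shape may differ.
updates = [
    {"question_id": "q1", "answer": "Yes"},
    {"question_id": "q2", "answer": "Zepbound 10mg"},
]
frames = list(sse_events(updates))
```

In FastAPI, a generator like this could be wrapped in `StreamingResponse(sse_events(updates), media_type="text/event-stream")`.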
Automates the tedious process of filling pharmaceutical prior authorization forms (e.g., Zepbound) by intelligently extracting relevant clinical information from patient records.
This system demonstrates production-ready LLM integration with validation, evaluation pipelines, and quality assurance mechanisms for healthcare automation.
- User selects patient data from a dropdown
- User can view the existing visitor notes or upload a file to update them
- User can provide feedback before or after form filling to guide the process
- Streamlit display works as intended
- User interactions work as intended
- FastAPI endpoints for answers and file upload work as intended (streaming not fully tested because of an organization restriction requiring a verified OPENAI_API_KEY)
- Actor-Critic with Feedback - Using the validation loop
- Eval Pipeline - Testing for F1 score with and without validation loop (Tweaked validation loop after initial test result to include the QA and Feedback notes)
- Caching - No more reprocessing when unneeded
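As a rough illustration of hash-based request caching, the sketch below derives a cache key from a canonical JSON serialization of the request. The names `cache_key` and `answer_form`, the in-memory dict, and the payload fields are assumptions; the project's own cache may be structured differently.

```python
import hashlib
import json


def cache_key(payload: dict) -> str:
    """Return a stable SHA-256 hex digest for a JSON-serializable request."""
    # sort_keys and fixed separators make the serialization canonical,
    # so the order of keys in the dict does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


_cache: dict = {}
calls = 0  # counts how often the (simulated) expensive call actually runs


def answer_form(payload: dict) -> dict:
    """Return the cached result when an identical request was seen before."""
    global calls
    key = cache_key(payload)
    if key not in _cache:
        calls += 1  # stands in for the expensive LLM call
        _cache[key] = {"answers": f"result-{calls}"}
    return _cache[key]


first = answer_form({"patient_id": "p1", "validate": True})
# Same request with a different key order: served from the cache.
second = answer_form({"validate": True, "patient_id": "p1"})
```

Any change to the request body (for example, an updated note or new feedback) produces a different digest, so stale results are not reused.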
Upon initial testing, we observed that the F1 score eval pipeline produced results that seemed to disagree with the output of our main.py FastAPI pipeline. As a result, the F1 score was quite low:
=================================================== short test summary info ====================================================
FAILED tests/test_answers.py::test_f1_score_without_validation - AssertionError: F1 score too low: 0.077
FAILED tests/test_answers.py::test_f1_score_with_validation - AssertionError: F1 score too low: 0.053
====================================== 2 failed, 1 passed, 1 warning in 701.59s (0:11:41) ======================================
Analysis: We first needed to check whether the True Answer Dataset itself was correct. We used LLM-as-a-Judge to double-check the generated True Answer Dataset before comparing it against the non-validated and validated LLM form-parsing responses.
python -m tests.generate_mock_data
This generated 10 sets of 30 questions with expected answers, one set for each of 10 sample patients.
$env:RUN_LIVE_OPENAI="1"; pytest tests/test_answers.py::test_llm_judge_without_validation -v -s
$env:RUN_LIVE_OPENAI="1"; pytest tests/test_answers.py::test_llm_judge_with_validation -v -s
Results:
- Without validation: 100% accuracy
- With validation: 93.3% accuracy
The lower accuracy with validation is likely because the prompt for the validation pass is not yet optimized, and the LLM is confused by the additional {initial_answer} context.
Further optimization needed:
- Design a dedicated prompt for the validation pass
- Use a larger reasoning model rather than gpt-5-mini
- Add few-shot examples for edge cases
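The accuracy figures reported above can be derived from per-question judge verdicts. The sketch below is a minimal version of that calculation. The `Verdict` class, its confidence labels, and the sample data are illustrative assumptions, chosen only so that 56 correct answers out of 60 give the 93.3% figure.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    correct: bool
    confidence: str  # hypothetical labels: "high", "medium", "low"


def accuracy(verdicts: list) -> float:
    """Fraction of answers the judge marked correct."""
    if not verdicts:
        return 0.0
    return sum(v.correct for v in verdicts) / len(verdicts)


def high_confidence_accuracy(verdicts: list) -> float:
    """Accuracy restricted to verdicts the judge labeled high confidence."""
    return accuracy([v for v in verdicts if v.confidence == "high"])


# Illustrative data only, not the judge's actual per-question output.
verdicts = [Verdict(True, "high")] * 56 + [Verdict(False, "medium")] * 4
print(f"Accuracy: {accuracy(verdicts):.1%}")
print(f"High-confidence accuracy: {high_confidence_accuracy(verdicts):.1%}")
```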
Mock Data Generation Log
(engineer-take-home) PS D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum>
python -m tests.generate_mock_data
Loaded 10 patients and 30 questions
Generating expected answers by calling API...
======================================================================
[1/10] Processing patient: Isaiah Reed
✓ Generated 30 expected answers
[2/10] Processing patient: Elizabeth Munoz
✓ Generated 30 expected answers
[3/10] Processing patient: Rebecca Edwards
✓ Generated 30 expected answers
[4/10] Processing patient: Suzanne Harris
✓ Generated 30 expected answers
[5/10] Processing patient: Jeffrey Donovan
✓ Generated 30 expected answers
[6/10] Processing patient: Madison Cook
✓ Generated 30 expected answers
[7/10] Processing patient: Michelle Andrews
✓ Generated 30 expected answers
[8/10] Processing patient: Michelle Dougherty
✓ Generated 30 expected answers
[9/10] Processing patient: Anthony Chaney
✓ Generated 30 expected answers
[10/10] Processing patient: James Aguilar
✓ Generated 30 expected answers
======================================================================
✅ Generated 10 mock test cases
📁 Saved to: D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\tests\mock_test_data.json
⚠️ NOTE: These are API-generated answers, not human-validated ground truth.
For production use, have clinical experts review and validate these answers.
LLM Judge WITH Validation Test Log
(engineer-take-home) PS D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum>
$env:RUN_LIVE_OPENAI="1"; pytest tests/test_answers.py::test_llm_judge_with_validation -v -s
=============================== test session starts ===============================
platform win32 -- Python 3.11.5, pytest-8.4.0, pluggy-1.6.0 -- D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum
configfile: pyproject.toml
plugins: anyio-4.9.0, Faker-37.4.0, logfire-3.18.0
collected 1 item
tests/test_answers.py::test_llm_judge_with_validation
============================================================
Testing LLM Judge WITH Validation Pass
============================================================
[1] Testing patient: Isaiah Reed
[2] Testing patient: Elizabeth Munoz
------------------------------------------------------------
Total Questions: 60
Initial Accuracy: 93.3%
Final Accuracy: 93.3%
Validation Improvements: 0
Net Improvement: +0.0%
============================================================
PASSED
================================ warnings summary =================================
.venv\Lib\site-packages\pydantic\_internal\_fields.py:198
D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\.venv\Lib\site-packages\pydantic\_internal\_fields.py:198: UserWarning: Field name "validate" in "AnswerInput" shadows an attribute in parent "BaseModel"
warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================== 1 passed, 1 warning in 402.39s (0:06:42) =====================
LLM Judge WITHOUT Validation Test Log
(engineer-take-home) PS D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum>
$env:RUN_LIVE_OPENAI="1"; pytest tests/test_answers.py::test_llm_judge_without_validation -v -s
=============================== test session starts ===============================
platform win32 -- Python 3.11.5, pytest-8.4.0, pluggy-1.6.0 -- D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum
configfile: pyproject.toml
plugins: anyio-4.9.0, Faker-37.4.0, logfire-3.18.0
collected 1 item
tests/test_answers.py::test_llm_judge_without_validation
============================================================
Testing LLM Judge WITHOUT Validation Pass
============================================================
[1] Testing patient: Isaiah Reed
[2] Testing patient: Elizabeth Munoz
------------------------------------------------------------
Total Questions: 60
Correct Answers: 60
Accuracy: 100.0%
High-Confidence Accuracy: 100.0%
============================================================
PASSED
================================ warnings summary =================================
.venv\Lib\site-packages\pydantic\_internal\_fields.py:198
D:\VSCode Projects\Develop_Health\take_home_test\Homen-Shum\.venv\Lib\site-packages\pydantic\_internal\_fields.py:198: UserWarning: Field name "validate" in "AnswerInput" shadows an attribute in parent "BaseModel"
warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================== 1 passed, 1 warning in 110.80s (0:01:50) =====================
- Augment VS Code Extension - AI-powered coding assistant
- Feeding Web Documentation as Context:
- Whiteboarding with hand-drawn low fidelity design into section-by-section implementation manually prompted:
Each section and written checkpoint is jotted down for clarity of thought and structured implementation.
- Evaluation design tailored to the context of the sample data and the AI Engineer take home interview descriptions.
- Augment can be smartly prompted to run automated testing until all "live smoke tests using real API keys" pass:
- Set $env:YOUR_KEY in the Augment CLI terminal
- This specific prompt and key setting allow Augment to run the test-fix-refine-retest loop for hours
- Note: This feature works wonderfully with message-per-month pricing but would be expensive with credit-based pricing
- Future to-do: Create live smoke tests with True Answer Dataset and validate with tests generated using real APIs integrated into FastAPI backend after careful scenario-based test designs
Example design for a different application (notebook application with automated agent for research tasks):
- Visitor Note RAG Implementation for scaling to hundreds/thousands of visitor notes accumulated over a person's lifespan:
- Async per form question per visitor note validation pipeline
- Enhance correctness and speed of completion
- Each question may have top 100K visitor notes suitable for the response
- Order them based on datetime and batch into grouped chunks (10 notes per chunk)
- Come up with preliminary decision and rationale
- Concatenate the 10 batched decision and rationale responses
- Final response prioritizing the most recent preliminary decisions
- Production Environment Considerations:
- HIPAA Compliance: The production LLM runs in an isolated provider environment that is covered by a HIPAA Business Associate Agreement
- Golden Dataset: The production application and form-filling pipeline are evaluated for accuracy and F1 score against a Golden True Answer Dataset built from historically submitted, human-verified forms
- Ensemble Methods: Production results can be further improved by ensembling: combine multiple LLMs' responses and compute a weighted confidence score to see which model scores highest for each question
- F1 Score Insights: Per-question and per-patient F1 scores can show which questions and which patient or medical backgrounds the LLM handles best
- Re-evaluation Pipeline: Use a second validation pipeline, run by an LLM with a high overall F1 score, to judge the decision and rationale
- Observability: Pydantic Logfire can help monitor the LLMs' performance
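The visitor-note batching steps outlined above (order by datetime, group into chunks of 10, produce a preliminary decision per chunk, concatenate, then finalize with priority on the most recent) could be sketched as follows. The `Note` class and function names are hypothetical, and `preliminary_decision` is a placeholder for an LLM call.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Note:
    written_at: datetime
    text: str


def chunk_notes(notes: list, size: int = 10) -> list:
    """Sort notes oldest-first and group them into chunks of `size`."""
    ordered = sorted(notes, key=lambda n: n.written_at)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


async def preliminary_decision(question: str, chunk: list) -> str:
    """Placeholder for an LLM call returning a decision and rationale."""
    await asyncio.sleep(0)  # yields control, as a real network call would
    return f"{len(chunk)} notes, latest {chunk[-1].written_at:%Y-%m-%d}"


async def answer_question(question: str, notes: list) -> dict:
    chunks = chunk_notes(notes)
    # Run the per-chunk calls concurrently; gather preserves chunk order.
    prelim = await asyncio.gather(
        *(preliminary_decision(question, c) for c in chunks)
    )
    combined = "\n".join(prelim)
    # Placeholder final step: chunks are oldest-first, so the last
    # preliminary decision covers the most recent notes.
    return {"combined": combined, "final": prelim[-1]}


# 25 synthetic notes supplied newest-first to show that ordering is enforced.
notes = [
    Note(datetime(2020, 1, 1) + timedelta(days=i), f"note {i}")
    for i in reversed(range(25))
]
result = asyncio.run(
    answer_question("Has the patient tried diet and exercise?", notes)
)
```

In a real pipeline the final step would likely be another LLM call over `combined` rather than simply selecting the last entry.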
Additionally, we added a validated_findings field to the LLM response. When the validation pass runs, its findings are displayed under the rationale field for each form question, and the form question is highlighted if the answer was modified by the additional validation loop.
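A minimal sketch of how a response item carrying `validated_findings` might be represented and rendered is shown below. Only `validated_findings` and the rationale field come from the description above; the `FormAnswer` class, the `initial_answer` field, and the `[MODIFIED]` marker are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormAnswer:
    question: str
    answer: str
    rationale: str
    validated_findings: Optional[str] = None  # set only when validation ran
    initial_answer: Optional[str] = None      # hypothetical pre-validation answer

    @property
    def was_modified(self) -> bool:
        """True when validation replaced the initial answer."""
        return (
            self.initial_answer is not None
            and self.initial_answer != self.answer
        )


def render(item: FormAnswer) -> str:
    """Render one form question, flagging answers changed by validation."""
    marker = "[MODIFIED] " if item.was_modified else ""
    lines = [
        f"{marker}{item.question}: {item.answer}",
        f"  Rationale: {item.rationale}",
    ]
    if item.validated_findings:
        lines.append(f"  Validated findings: {item.validated_findings}")
    return "\n".join(lines)


# Illustrative values only.
item = FormAnswer(
    question="Is the patient's BMI 30 or greater?",
    answer="Yes",
    rationale="Visit note records a BMI of 32.",
    validated_findings="Initial answer overlooked the recorded BMI.",
    initial_answer="No",
)
text = render(item)
```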
For detailed setup and usage instructions, please refer to:
- Design Specifications
- Folder Structure
- Validation Features Summary
- LLM Judge Guide
- Evaluation Findings
The following clarification, which came out of ongoing collaboration and live discussion, outlines the design intent and rationale behind the LLM-as-a-Judge evaluation architecture.
For the automated prior authorization form-filling task, it is both efficient and semantically appropriate to:
- Reduce evaluation complexity by replacing the F1 score with an Accuracy score
- Adopt LLM-as-a-Judge for result validation — allowing the system to determine whether each output is correct in meaning, not just by string equivalence
This approach addresses real-world semantic variation in clinical text.
For example, if the expected answer is "Zepbound 10mg" and the model outputs "10mg of Zepbound", the LLM-as-a-Judge system correctly validates it as equivalent, whereas naive string matching would fail.
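To make the contrast concrete, the sketch below shows a naive string comparison rejecting the reordered answer, alongside a hypothetical prompt builder for the judge. The function names and prompt wording are assumptions, not the project's actual judge prompt, and the LLM call itself is omitted.

```python
def exact_match(expected: str, actual: str) -> bool:
    """Naive comparison: case-insensitive string equality."""
    return expected.strip().lower() == actual.strip().lower()


def build_judge_prompt(question: str, expected: str, actual: str) -> str:
    """Build a hypothetical prompt asking an LLM to judge semantic equivalence."""
    return (
        "You are a clinical reviewer. Decide whether the model answer means "
        "the same thing as the expected answer.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Model answer: {actual}\n"
        'Reply with JSON: {"correct": true or false, '
        '"confidence": "high" | "medium" | "low", "reason": "..."}'
    )


expected = "Zepbound 10mg"
actual = "10mg of Zepbound"

matched = exact_match(expected, actual)  # False: word order differs
prompt = build_judge_prompt("What medication and dose is requested?", expected, actual)
```

The prompt would then be sent to the judge model, whose structured reply supplies the per-question verdict and confidence level.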
The updated implementation, evaluation tests, and working demo are available below:
- GitHub Repository: HomenShum/LLM-test-suite (Automated Testing with LLM for Classification, LLM-as-a-Judge, Context Verification & Pruning, and Agent Scaffold Systems)
- Live Demo: https://llm-test-suite-cafecorner.streamlit.app/
The current implementation introduces a working variant of the evaluation pipeline using LLM-as-a-Judge and replaces the initial string-matching logic with semantic validation for production-grade testing.
A visual sketch of the proposed evaluation flow illustrates how validation, reasoning, and iterative feedback improve the evaluation loop.
- fef8669 - Add validation features, LLM judge evaluation, and comprehensive documentation
- 82d356e - feat: Add LLM-as-a-Judge evaluation and fix validation metadata
- 8fb1969 - feat: Implement prior authorization auto-fill with validation loop and F1 evaluation
- 336a1e2 - initial note
- 6e2b2ba - Initial commit
