PulsePredict: Adverse Medical Event Prediction from Doctor-Patient Calls

Problem Statement

Adverse medical events often arise from miscommunication or overlooked information during patient-physician interactions. Identifying these events proactively could significantly improve patient safety and healthcare outcomes. However, spoken medical conversations are rarely analyzed automatically at scale to flag potential adverse outcomes.

Deployed Link - PulsePredict App

Objective

Develop an end-to-end system that:

- Transcribes doctor-patient audio conversations
- Extracts key medical entities
- Predicts the likelihood of an adverse medical event based on the conversation

Proposed Solution

PulsePredict follows a multi-stage processing pipeline:

1. Audio Transcription: Convert each audio call to text with OpenAI’s Whisper model.
2. Medical Entity Extraction: Use AWS Comprehend Medical to extract relevant medical entities (symptoms, medications, diagnoses, etc.) from the transcripts.
3. Labeling for Adverse Events: Match the extracted entities against a curated FAERS (FDA Adverse Event Reporting System) dataset to identify and label potential adverse medical events.
4. Feature Engineering: Derive structured features from the labeled medical entities for model training.
5. Adverse Event Prediction: Train a machine learning model on these features to predict the probability of an adverse medical event occurring as a result of the doctor-patient interaction.

Workflow

The following diagram illustrates the complete pipeline for predicting adverse medical events from audio-based medical conversations:

Pipeline Scripts

A breakdown of the core scripts used in this project:

- predict_from_audio.py: Orchestrates the complete pipeline, from audio input to final prediction using the trained model.
- train_model.py: Trains the machine learning model using features extracted from labeled medical conversations.
- evaluate_model.py: Evaluates the trained model and benchmarks its performance against a rule-based baseline system.
- utils/: A directory containing helper functions for:
  - Loading FAERS adverse event data
  - Labeling entities with known adverse events
  - Parsing and cleaning entities from AWS Comprehend Medical

How It Works

1. Transcribe medical calls using Whisper
2. Extract medical entities using AWS Comprehend Medical
3. Label entities based on the FAERS adverse events database
4. Engineer features from labeled entities
5. Train a classifier to predict whether an adverse event occurred
6. Run end-to-end predictions on new audio
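
The prediction-time steps above can be chained as in the following sketch. It is an illustration only, not the actual contents of predict_from_audio.py: the `PredictionPipeline` class, its field names, and the shape of the returned dictionary are all assumptions. Each stage is passed in as a plain callable so the real Whisper, Comprehend Medical, and model calls can be swapped in or stubbed out.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Entity = Dict[str, Any]


@dataclass
class PredictionPipeline:
    """Hypothetical wrapper that runs the stages in order for one audio file."""

    transcribe: Callable[[str], str]                   # audio path -> transcript
    extract_entities: Callable[[str], List[Entity]]    # transcript -> entities
    label_entities: Callable[[List[Entity]], List[Entity]]  # adds adverse-event labels
    build_features: Callable[[List[Entity]], Dict[str, float]]
    predict_proba: Callable[[Dict[str, float]], float]  # features -> probability

    def run(self, audio_path: str) -> Dict[str, Any]:
        transcript = self.transcribe(audio_path)
        entities = self.extract_entities(transcript)
        labeled = self.label_entities(entities)
        features = self.build_features(labeled)
        probability = self.predict_proba(features)
        # Keep intermediate outputs so each stage can be inspected or tested.
        return {
            "audio_path": audio_path,
            "transcript": transcript,
            "entities": labeled,
            "features": features,
            "probability": probability,
        }
```

Passing the stages in as arguments keeps the orchestration logic testable without AWS credentials or a Whisper model on disk.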

Demo Video

Video Link - PulsePredict Video

Tech Stack

Layer	Technology
Transcription	OpenAI Whisper
NLP	AWS Comprehend Medical
ML Model	Scikit-learn (Random Forest)
Backend	Python
Front End	Streamlit
Deployment	Streamlit

Testing

The project includes two types of testing:

UI Automation Testing
- Tests built with Playwright and Pytest
- Validates UI elements: titles, input fields, and buttons
- Runs in a headless browser (Chromium)
- Ensures UI consistency and responsiveness
Manual Testing
- Covers backend pipeline from audio input to prediction
- Tests transcription, entity extraction, and adverse event detection
- Includes edge case handling (e.g. missing/corrupted files)
- Helps verify functional correctness of each module

Test Documents

Challenges Faced & Solutions

1. Audio Transcription

Challenge: Batch processing of .mp3 files with Whisper.
Issues: Transcription was a manual step, and running Whisper on CPU produced performance warnings.
Solution: Developed batch_transcribe.py and predict_from_audio.py for automated transcription.
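
A minimal sketch of what batch transcription could look like, assuming the openai-whisper package and a folder of .mp3 files. This is not the actual batch_transcribe.py: the function names, the `"base"` model size, and the one-.txt-per-call output layout are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Dict, List

AUDIO_EXTENSIONS = {".mp3"}  # assumption: calls are stored as .mp3 files


def find_audio_files(directory: str) -> List[Path]:
    """Return audio files in `directory`, sorted for a stable processing order."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


def load_whisper_transcriber(model_name: str = "base") -> Callable[[str], str]:
    """Build a transcriber backed by openai-whisper (imported lazily)."""
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)

    def transcribe(path: str) -> str:
        # fp16=False requests full precision, which is what CPU inference uses.
        return model.transcribe(path, fp16=False)["text"].strip()

    return transcribe


def batch_transcribe(
    audio_dir: str, out_dir: str, transcribe: Callable[[str], str]
) -> Dict[str, Path]:
    """Transcribe every audio file and write one .txt transcript per call."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: Dict[str, Path] = {}
    for audio_path in find_audio_files(audio_dir):
        text = transcribe(str(audio_path))
        target = out / f"{audio_path.stem}.txt"
        target.write_text(text, encoding="utf-8")
        written[audio_path.name] = target
    return written
```

Typical usage would be `batch_transcribe("audio_calls", "transcripts", load_whisper_transcriber())`. The model is loaded once and reused for every file, which matters when running on CPU.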

2. Entity Extraction

Challenge: Structured medical data extraction via AWS Comprehend Medical.
Issues: Missing AWS credentials.
Solution: Configured AWS CLI and used batch_entity_extraction.py.
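
For reference, entity extraction with boto3 might look like the sketch below. It assumes credentials are already configured (for example via `aws configure`) and uses the `detect_entities_v2` operation of the `comprehendmedical` client. The helper names, the default region, the 0.5 score cutoff, and the output field names are assumptions, not the contents of batch_entity_extraction.py.

```python
from typing import Any, Dict, List


def detect_entities(text: str, region_name: str = "us-east-1") -> Dict[str, Any]:
    """Call Comprehend Medical; requires configured AWS credentials."""
    import boto3  # imported lazily so parse_entities works without AWS access

    client = boto3.client("comprehendmedical", region_name=region_name)
    return client.detect_entities_v2(Text=text)


def parse_entities(
    response: Dict[str, Any], min_score: float = 0.5
) -> List[Dict[str, Any]]:
    """Keep the fields used downstream and drop low-confidence entities."""
    parsed = []
    for entity in response.get("Entities", []):
        if entity.get("Score", 0.0) < min_score:
            continue  # assumption: low-confidence entities are discarded
        parsed.append({
            "text": entity["Text"].strip().lower(),
            "category": entity["Category"],
            "type": entity["Type"],
            "score": round(entity["Score"], 3),
            "traits": [t["Name"] for t in entity.get("Traits", [])],
        })
    return parsed
```

Separating the API call from the parsing step means the parser can be exercised against saved responses without incurring AWS charges.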

3. Data Preprocessing

Challenge: Noisy transcripts reduced NLP performance.
Solution: Built batch_preprocess_transcripts.py to clean transcripts using a filler word list.
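
Filler-word removal can be sketched as follows. The word list here is a hypothetical example, not the list used in batch_preprocess_transcripts.py, and the regular expression is one possible approach.

```python
import re

# Hypothetical filler list; the project's actual list may differ.
FILLER_WORDS = ["um", "uh", "er", "you know", "i mean"]

# Longest phrases first so multi-word fillers are tried before single words.
# \b keeps matches on whole words, so "er" does not match inside "fever".
_FILLER_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(w) for w in sorted(FILLER_WORDS, key=len, reverse=True))
    + r")\b,?\s*",
    re.IGNORECASE,
)


def clean_transcript(text: str) -> str:
    """Remove filler words, then tidy the whitespace left behind."""
    cleaned = _FILLER_PATTERN.sub("", text)
    cleaned = re.sub(r"\s+", " ", cleaned)            # collapse repeated spaces
    cleaned = re.sub(r"\s+([,.?!])", r"\1", cleaned)  # no space before punctuation
    return cleaned.strip()


print(clean_transcript("So um I took the aspirin and uh felt dizzy."))
# → So I took the aspirin and felt dizzy.
```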

4. Adverse Event Labeling

Challenge: Matching entities with FAERS data.
Issues: Complex CSV format; exact matching missed valid entity matches.
Solution: Cleaned FAERS data and used partial/lowercase matching in label_entities.py.
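
The partial, lowercase matching described above might be implemented along these lines. This is a sketch under assumptions: the `reaction` column name, the `is_adverse_event` and `matched_term` fields, and the minimum length for partial matches are all hypothetical and not taken from label_entities.py.

```python
import csv
from typing import Any, Dict, Iterable, List, Optional, Set


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so comparisons are consistent."""
    return " ".join(term.lower().split())


def load_faers_terms(csv_path: str, column: str = "reaction") -> Set[str]:
    """Read adverse event terms from one CSV column (column name is assumed)."""
    terms: Set[str] = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            term = normalize(row.get(column) or "")
            if term:
                terms.add(term)
    return terms


def match_adverse_event(
    entity_text: str, faers_terms: Iterable[str], min_partial_len: int = 4
) -> Optional[str]:
    """Return the matching FAERS term, trying exact then substring matching."""
    terms = set(faers_terms)
    entity = normalize(entity_text)
    if not entity:
        return None
    if entity in terms:
        return entity
    for term in sorted(terms):  # sorted so the result is deterministic
        shorter, longer = sorted((entity, term), key=len)
        # Skip very short strings to limit accidental substring hits.
        if len(shorter) >= min_partial_len and shorter in longer:
            return term
    return None


def label_entities(
    entities: List[Dict[str, Any]], faers_terms: Iterable[str]
) -> List[Dict[str, Any]]:
    """Attach an adverse-event flag and the matched term to each entity."""
    terms = set(faers_terms)
    labeled = []
    for entity in entities:
        match = match_adverse_event(entity["text"], terms)
        labeled.append(
            {**entity, "is_adverse_event": match is not None, "matched_term": match}
        )
    return labeled
```

Substring matching trades precision for recall: "headache" will match "severe headache", but unrelated terms that happen to contain one another will match too, so the labeled output is worth spot-checking.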

5. Feature Engineering

Challenge: Poor model performance due to weak features.
Solution: Added meaningful features (e.g., adverse_event_ratio) and rebuilt feature_engineering.py.
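
As an illustration, `adverse_event_ratio` can be read as the share of extracted entities that were labeled as adverse events. The sketch below computes it alongside a few count features. Apart from `adverse_event_ratio`, the feature names and the input field names (`is_adverse_event`, `category`) are assumptions rather than the actual output of feature_engineering.py.

```python
from collections import Counter
from typing import Any, Dict, List


def build_features(labeled_entities: List[Dict[str, Any]]) -> Dict[str, float]:
    """Turn one conversation's labeled entities into a flat feature dict."""
    total = len(labeled_entities)
    adverse = sum(1 for e in labeled_entities if e.get("is_adverse_event"))
    categories = Counter(e.get("category", "UNKNOWN") for e in labeled_entities)
    return {
        "entity_count": total,
        "adverse_event_count": adverse,
        # Guard against division by zero for conversations with no entities.
        "adverse_event_ratio": adverse / total if total else 0.0,
        # Category names follow Comprehend Medical's entity categories.
        "medication_count": categories.get("MEDICATION", 0),
        "condition_count": categories.get("MEDICAL_CONDITION", 0),
    }
```

A ratio is less sensitive to call length than a raw count, since long conversations naturally yield more entities.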

6. Model Training & Evaluation

Challenge: Model overfitting and poor generalization.
Solution: Balanced the dataset by adding negative samples, and evaluated both the model and a rule-based baseline.
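
To make the comparison concrete, the sketch below scores two sets of predictions with the same metrics. The rule shown (flag a conversation when any entity matched a known adverse event) is an assumed stand-in for the project's rule-based baseline, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def rule_based_predict(features: Dict[str, float], threshold: float = 0.0) -> int:
    """Assumed baseline: flag the call if the adverse-event ratio exceeds a threshold."""
    return int(features["adverse_event_ratio"] > threshold)


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = adverse)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def compare(
    y_true: Sequence[int], model_pred: Sequence[int], baseline_pred: Sequence[int]
) -> Dict[str, Dict[str, float]]:
    """Score the model and the baseline on the same held-out labels."""
    return {
        "model": binary_metrics(y_true, model_pred),
        "rule_based": binary_metrics(y_true, baseline_pred),
    }
```

Reporting precision and recall alongside accuracy matters here because accuracy alone can look strong on an imbalanced dataset even when the classifier misses most adverse events.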

7. Model Bias Fix

Challenge: Dataset bias towards positive samples.
Solution: Added negative samples and improved feature diversity for better model performance.

8. Streamlit Deployment

Challenge: Interface bugs and missing dependencies.
Solution: Installed necessary libraries and finalized medical_streamlit_app_updated.py.

9. GitHub Cleanup

Challenge: Uploaded unnecessary files, missing .gitignore.
Solution: Added .gitignore, removed unused scripts, and updated project documentation.

Screenshots

Future Improvements

- Incorporate time-aware features such as event sequences and timestamps.
- Use larger Whisper models to improve transcription accuracy.
- Fine-tune domain-specific NLP models like BioBERT for better entity extraction.
- Expand the FAERS dataset to cover more entity types and adverse events.
