AutoQA: Rubric Reliability & Predictive Usefulness Analysis 🤖📊

📌 Project Overview

This repository contains a comprehensive data analysis project designed to evaluate an Automated Quality Assurance (AutoQA) system for educational lessons. The goal is to convert lesson transcripts into reliable, evidence-based teaching quality signals using Large Language Models (LLMs) and validate whether those signals predict meaningful student outcomes (Retention and Attendance).

This project demonstrates a closed-loop improvement workflow (Detect → Route → Fix → Verify) to scale teaching QA globally.

🎯 Business Objectives

Measurement Reliability (Block 1): Evaluate if AI models (GPT-4o vs. GPT-5.1) can measure pedagogical rubrics consistently, avoiding the "Agreement Trap" (high simple agreement due to severe class imbalance).
Predictive Usefulness (Block 2): Determine if the reliable AutoQA signals actually impact the company's bottom line—specifically predicting m1_retained (Month 1 Retention) and next_lesson_attended.

🛠️ Methodology & Tech Stack

Language: Python
Libraries: Pandas (Data manipulation, groupby aggregations), Matplotlib & Seaborn (Data visualization), Scikit-Learn (Cohen's Kappa, Accuracy metrics).
Techniques Used:
- Statistical Reliability: Calculated Cohen's Kappa ($\kappa$) to measure true inter-rater reliability between LLMs.
- Prompt/Rubric Engineering: Rewrote ambiguous rubric items applying strict COUNT ONLY IF / DO NOT COUNT boundaries to make them calibration-grade and machine-readable.
- Predictive Analytics: Applied cross-validation between model reliability ($\kappa$) and mean percentage lifts to discover true leading indicators of student engagement.

🚀 Key Findings

The "Agreement Trap": Discovered that items with 99%+ simple agreement were actually statistically useless (Cohen's Kappa = 0.00) due to models defaulting to the majority class on autopilot.
Top Predictive Signals:
- item_3: Showed a massive +25.7% lift in Month 1 retention when performed correctly by the tutor.
- item_14: Demonstrated an 18.2% lift in next lesson attendance, paired with a solid moderate Kappa (0.54), making it a highly reliable and economically impactful signal.
Operational Strategy: Designed a one-to-many automated workflow to route aggregated insights to Tutor Development, triggering async training campaigns instead of unscalable 1-on-1 interventions.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
docs		docs
AutoQA.ipynb		AutoQA.ipynb
README.md		README.md
autoqa_output_gpt4o_240.csv		autoqa_output_gpt4o_240.csv
autoqa_output_gpt51_240.csv		autoqa_output_gpt51_240.csv
autoqa_specialist_test_task_instruction.md		autoqa_specialist_test_task_instruction.md
outcomes_240.csv		outcomes_240.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AutoQA: Rubric Reliability & Predictive Usefulness Analysis 🤖📊

📌 Project Overview

🎯 Business Objectives

🛠️ Methodology & Tech Stack

🚀 Key Findings

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AutoQA: Rubric Reliability & Predictive Usefulness Analysis 🤖📊

📌 Project Overview

🎯 Business Objectives

🛠️ Methodology & Tech Stack

🚀 Key Findings

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages