DefectGuard is a computer vision MLOps system for automated manufacturing defect detection.
It combines model training, experiment tracking, model packaging, API serving, browser-based inspection, monitoring, orchestration, and local deployment into one end-to-end system.
The end-to-end flow: capture -> validate -> train -> promote -> serve -> inspect -> monitor.
- Uses YOLOv8 for visual defect detection on manufacturing-style image data
- Tracks experiments, artifacts, and model versions with MLflow
- Validates input data before training with Great Expectations
- Serves predictions through a FastAPI inference API and lightweight frontend
- Logs predictions for monitoring and drift analysis
- Generates drift reports with Evidently
- Orchestrates workflows with Prefect
- Reproduces pipeline stages with DVC
- Ships with Docker Compose, Nginx, Prometheus, and Grafana for local platform operations
- Includes automated tests and GitHub Actions CI
The platform is organized into five layers:
- Data layer: dataset download, manifest validation, dataset config, DVC stages
- Training layer: YOLOv8 training, MLflow tracking, evaluation, registry packaging
- Serving layer: FastAPI API, prediction abstraction, browser frontend
- Monitoring layer: JSONL prediction logs, Evidently drift reports, Prometheus metrics, Grafana dashboards
- Operations layer: Prefect flows, Docker images, Docker Compose stack, Nginx reverse proxy, CI
- YOLOv8 training via `scripts/train.py`
- MLflow experiment logging for params, metrics, and artifacts
- MLflow PyFunc packaging for standardized model serving
- Optional registry promotion using model version tags
- Quality gate based on mAP@0.5
- Champion-vs-challenger promotion logic for Production stage decisions (see the sketch after this list)
- Prediction log capture in JSONL format
- Reference-vs-current drift reporting with Evidently
- Prometheus service metrics
- Grafana datasource provisioning
- Local restart policies and healthchecks in Docker Compose
- Automated API tests with lightweight dummy predictor mode
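To illustrate the champion-vs-challenger logic, here is a minimal sketch using the MLflow client. The metric name `map50` and the strict-improvement rule are assumptions; the authoritative decision lives in `scripts/train.py`:

```python
# Hypothetical champion-vs-challenger check. Assumes the candidate run
# logged its mAP@0.5 under the metric name "map50".
import os

from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = os.environ.get("MLFLOW_MODEL_NAME", "defect-yolo")

def should_promote(challenger_version: str) -> bool:
    """Promote only if the challenger beats the current Production champion."""
    challenger = client.get_model_version(model_name, challenger_version)
    challenger_map50 = client.get_run(challenger.run_id).data.metrics["map50"]

    champions = client.get_latest_versions(model_name, stages=["Production"])
    if not champions:  # no champion yet: the challenger wins by default
        return True
    champion_map50 = client.get_run(champions[0].run_id).data.metrics["map50"]
    return challenger_map50 > champion_map50
```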
Core stack:
- YOLOv8 / Ultralytics
- NumPy
- Pandas
- Pillow
- Great Expectations
- DVC
- MLflow Tracking
- MLflow PyFunc
- MLflow Model Registry
- Prefect
- Evidently
- FastAPI
- Uvicorn
- Nginx
- Docker
- Docker Compose
- Prometheus
- Grafana
- pytest
- Ruff
- GitHub Actions
Prerequisites:
- Python 3.11
- pip
- Docker and Docker Compose for containerized local runs
Install dependencies:

```
pip3 install -r requirements.txt -r requirements-mlops.txt -r requirements-dev.txt
```

Set the project import path and point the API to a local YOLO weights file:

```
export PYTHONPATH=src
export MODEL_PATH=/absolute/path/to/best.pt
uvicorn api.main:app --reload
```

Open:
- UI: http://127.0.0.1:8000/
- Swagger docs: http://127.0.0.1:8000/docs
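With the server running, you can exercise the API from Python. The `/predict` route and multipart field name below are assumptions; check the Swagger docs for the actual contract:

```python
# Hypothetical client call; verify the route and field name in /docs.
import requests

with open("sample.png", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8000/predict",
        files={"file": ("sample.png", f, "image/png")},
        timeout=30,
    )
resp.raise_for_status()
print(resp.json())  # detections: boxes, classes, confidence scores
```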
Start the full stack:
```
docker compose up --build
```

Service endpoints:
- Nginx entrypoint and UI: http://127.0.0.1:8080
- API direct access: http://127.0.0.1:8000
- MLflow UI: http://127.0.0.1:5000
- Prometheus UI: http://127.0.0.1:9090
- Grafana UI: http://127.0.0.1:3000 (login: admin/admin)
Download and extract the dataset:
```
python3 scripts/download_mvtec_ad.py --out data/raw/mvtec_ad
```

Important note:
- MVTec AD is released under CC BY-NC-SA 4.0; review the license terms before any non-demo use
Training expects a YOLO dataset YAML. A placeholder example is available at `data/dataset.yaml`.
Run training with MLflow tracking:
```
export MLFLOW_TRACKING_URI=http://127.0.0.1:5000
export MLFLOW_EXPERIMENT_NAME=defect-detection
export MLFLOW_MODEL_NAME=defect-yolo
PYTHONPATH=src python3 scripts/train.py --data data/dataset.yaml
```
Enable a production-style quality gate:

```
export MLFLOW_TRACKING_URI=http://127.0.0.1:5000
export MLFLOW_MODEL_NAME=defect-yolo
export ENFORCE_GATE=1
export MIN_MAP50=0.85
export PROMOTE_MODEL=1
PYTHONPATH=src python3 scripts/train.py --data data/dataset.yaml
```
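For intuition, the gate reduces to a small check over these variables. A sketch follows; the authoritative logic is in `scripts/train.py`:

```python
# Illustrative gate check; scripts/train.py implements the real one.
import os

def passes_quality_gate(map50: float) -> bool:
    """Fail the run if gating is on and mAP@0.5 is below the threshold."""
    if os.environ.get("ENFORCE_GATE", "0") != "1":
        return True  # gate disabled: always pass
    return map50 >= float(os.environ.get("MIN_MAP50", "0.85"))
```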
Validate the training manifest before model training:

```
PYTHONPATH=src python3 scripts/validate_data.py --manifest data/manifest.csv
```

What this checks:
- required manifest schema
- null safety for image paths
- file existence on disk
- optional label file presence
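These checks amount to a few assertions over the manifest. A pandas-based sketch follows; the real script uses Great Expectations, and the column names `image_path` / `label_path` are assumptions:

```python
# Simplified stand-in for scripts/validate_data.py; column names assumed.
import os

import pandas as pd

def validate_manifest(path: str = "data/manifest.csv") -> None:
    df = pd.read_csv(path)
    # Required manifest schema
    assert {"image_path", "label_path"}.issubset(df.columns), "missing columns"
    # Null safety for image paths
    assert df["image_path"].notna().all(), "null image paths"
    # File existence on disk
    assert df["image_path"].map(os.path.exists).all(), "missing image files"
    # Optional label file presence (warn rather than fail)
    labels = df["label_path"].dropna()
    missing = labels[~labels.map(os.path.exists)]
    if not missing.empty:
        print(f"warning: {len(missing)} label files not found")
```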
The repository defines a simple, reproducible DVC pipeline in `dvc.yaml`.
Current stages:
- download_mvtec_ad
- validate
- train
Run the pipeline:
```
dvc repro
```

Runtime and model monitoring are both included.
- Prometheus scrapes `/metrics`
- Grafana reads from Prometheus
- FastAPI exposes request count and latency metrics (sketched below)
- The API logs predictions to `data/predictions.jsonl`
- `scripts/set_reference_predictions.py` creates a baseline snapshot
- `scripts/drift_report.py` compares baseline vs current behavior
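The request count and latency metrics can be produced with `prometheus_client`; here is a minimal sketch, where the metric names are illustrative rather than necessarily the ones the API exports:

```python
# Minimal FastAPI + prometheus_client instrumentation sketch.
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrape target

REQUESTS = Counter("api_requests_total", "Total API requests", ["path"])
LATENCY = Histogram("api_request_latency_seconds", "Request latency", ["path"])

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUESTS.labels(path=request.url.path).inc()
    LATENCY.labels(path=request.url.path).observe(time.perf_counter() - start)
    return response
```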
Generate a drift report:
```
python3 scripts/set_reference_predictions.py
python3 scripts/drift_report.py
```
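At its core, the report compares the reference and current logs. A rough sketch of what `scripts/drift_report.py` does, assuming Evidently's `Report` API (pre-0.5) and flat JSONL records:

```python
# Sketch only; the real report generation lives in scripts/drift_report.py.
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

reference = pd.read_json("data/reference_predictions.jsonl", lines=True)
current = pd.read_json("data/predictions.jsonl", lines=True)

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("reports/drift_report.html")
```

Prefect flows are provided for: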
- retraining flow: validation -> training
- monitoring flow: reference snapshot -> drift report
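The retraining flow roughly chains the earlier commands. A minimal sketch, assuming Prefect 2.x and the script entrypoints shown above; `pipelines/prefect_flow.py` is authoritative:

```python
# Illustrative retraining flow that shells out to the existing scripts.
import os
import subprocess

from prefect import flow, task

ENV = {**os.environ, "PYTHONPATH": "src"}

@task
def validate() -> None:
    subprocess.run(
        ["python3", "scripts/validate_data.py", "--manifest", "data/manifest.csv"],
        check=True, env=ENV,
    )

@task
def train() -> None:
    subprocess.run(
        ["python3", "scripts/train.py", "--data", "data/dataset.yaml"],
        check=True, env=ENV,
    )

@flow
def retraining_flow() -> None:
    validate()
    train()

if __name__ == "__main__":
    retraining_flow()
```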
Run the default flow entrypoint:
```
python3 pipelines/prefect_flow.py
```

Configuration is controlled through environment variables:
- Training and promotion: `MLFLOW_EXPERIMENT_NAME`, `MIN_MAP50`, `ENFORCE_GATE`, `PROMOTE_MODEL`
- Dataset download: `MVTEC_AD_URL`, `MVTEC_AD_ARCHIVE`, `MVTEC_AD_OUT`
During normal usage, the platform creates outputs such as:
- `runs/` from YOLO training
- `data/predictions.jsonl`
- `data/reference_predictions.jsonl`
- `reports/validation.ok`
- `reports/drift_report.html`
- MLflow run and model artifacts
If you use the MVTec AD helper flow, make sure dataset usage follows the original dataset license.
