Author: Ticona Perales Guillermo Sebastian
This repository provides an MLOps workflow for classifying HuffPost news categories with two models:
- BASELINE: EmbeddingBag (neural bag-of-words) + small MLP
- MAIN: TinyBERT fine-tuning (huawei-noah/TinyBERT_General_4L_312D)
It includes reproducible training, inference, MLflow logging, Hydra configs, DVC-based data management, and ONNX export.
The task is to predict the news category (for example POLITICS, ENTERTAINMENT, SPORTS, TECH, TRAVEL) from a HuffPost headline and an optional short description. This enables automatic tagging for search, recommendations, and content analytics.
Input (inference):
- headline: str (required)
- short_description: str (optional, may be empty)
Output:
- category: str (one of the predefined classes)
- optional top_k probabilities per class
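As an illustration of the output described above, the following minimal sketch turns raw model scores into top-k class probabilities with a softmax. The function name `top_k_probabilities` and the three-class label list are assumptions made for the example, not the project's actual code.

```python
import math


def top_k_probabilities(logits, labels, k=3):
    """Return the k most likely (label, probability) pairs from raw scores."""
    # Subtract the largest logit before exponentiating for numerical stability.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort classes by probability, highest first, and keep the first k.
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


labels = ["POLITICS", "SPORTS", "TRAVEL"]  # illustrative subset of classes
top = top_k_probabilities([2.0, 0.5, -1.0], labels, k=2)
for rank, (label, prob) in enumerate(top, start=1):
    print(f"top{rank}: {label} ({prob:.4f})")
```

The predicted category is simply the first entry of the ranked list.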
Primary metrics:
- Accuracy
- Macro F1-score (important due to class imbalance)
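To see why macro F1 matters under class imbalance, here is a small dependency-free sketch of both metrics. The toy labels are invented for illustration; a library implementation such as scikit-learn's would normally be used in practice.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class.
y_true = ["POLITICS"] * 9 + ["TECH"]
y_pred = ["POLITICS"] * 10

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

In this toy case accuracy is 0.9 while macro F1 is below 0.5, because the minority class is never predicted and contributes an F1 of zero.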
Expected performance on the full dataset depends on training budget and hardware. The baseline serves as a lower bound, while TinyBERT is expected to outperform it when trained for enough epochs.
We split the dataset into train / validation / test = 70% / 15% / 15% with stratification by category and a fixed seed (42). Split indices are saved for reproducibility.
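The split described above can be sketched with the standard library alone. This is an illustrative version under simple assumptions (per-class shuffling with a seeded generator, rounding of per-class counts); the function name `stratified_split` is hypothetical and the project's own implementation may differ.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15), seed=42):
    """Return (train, val, test) index lists that keep per-class proportions.

    fractions holds the train and validation shares; the rest goes to test.
    """
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):  # sorted for a deterministic class order
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


labels = ["POLITICS"] * 100 + ["TECH"] * 20  # toy, imbalanced label column
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Persisting the three index lists (for example as JSON) is what lets later runs reuse exactly the same partition.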
We use the HuffPost News Category dataset (Rishabh Misra). The implementation downloads from Hugging Face:
The dataset contains ~200k news entries with fields like headline, short_description, and category. The raw files are ~80–90 MB. Known issues include class imbalance and overlapping labels (for example WORLDPOST vs WORLD NEWS).
Alternate source:
Baseline:
- Lowercasing + punctuation removal tokenization
- Neural bag-of-words with nn.EmbeddingBag + small MLP head
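The baseline preprocessing can be sketched as follows. It assumes punctuation is replaced by spaces and that unknown words map to id 0; the toy vocabulary and the helper names `tokenize` and `encode_batch` are hypothetical. The flat id list plus per-text offsets is the input layout that `nn.EmbeddingBag` accepts for 1-D input.

```python
import re


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def encode_batch(texts, vocab, unk_id=0):
    """Flatten token ids and record where each text starts in the flat list."""
    flat_ids, offsets = [], []
    for text in texts:
        offsets.append(len(flat_ids))
        flat_ids.extend(vocab.get(tok, unk_id) for tok in tokenize(text))
    return flat_ids, offsets


# Toy vocabulary for illustration only.
vocab = {"<unk>": 0, "senate": 1, "passes": 2, "budget": 3, "team": 4, "wins": 5}
ids, offsets = encode_batch(["Senate passes budget!", "Team wins, again."], vocab)
print(ids, offsets)
```

With PyTorch installed, `ids` and `offsets` would be converted to long tensors and passed to the embedding layer, which pools each text's token embeddings into one vector for the MLP head.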
Main:
- TinyBERT fine-tuning (huawei-noah/TinyBERT_General_4L_312D)
- Headline + optional short description concatenated with a separator token
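A minimal sketch of the input construction for the main model, assuming the BERT-style `[SEP]` token as the separator and a plain string join. The helper name `build_input_text` is hypothetical. Hugging Face tokenizers can also take the description as a second text segment and insert the separator themselves, which may be what the project does.

```python
def build_input_text(headline, short_description="", sep_token="[SEP]"):
    """Join headline and optional description into one model input string."""
    headline = headline.strip()
    description = (short_description or "").strip()
    if not headline:
        raise ValueError("headline is required")
    if not description:
        # No description: the headline alone is the input.
        return headline
    return f"{headline} {sep_token} {description}"


print(build_input_text("Example headline", "Example short description"))
print(build_input_text("Example headline"))
```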
Both models are trained with PyTorch Lightning and logged with MLflow.
The project provides CLI-based inference and MLflow Serving packaging for local deployment and integration.
poetry install
poetry run pre-commit install
poetry run pre-commit run -a
Notes:
- MLflow tracking defaults to http://127.0.0.1:8080 and can be changed via logging.tracking_uri.
- DVC is optional. If a remote is not configured, the code will fall back to downloading data from Hugging Face.
DVC installation (Windows):
pip install dvc
Initialize DVC in the repo:
dvc init
Baseline model:
poetry run python -m huffpost_classifier.cli_train model=baseline_embedding_bag
TinyBERT model:
poetry run python -m huffpost_classifier.cli_train model=bert_finetune
GPU training (if available):
poetry run python -m huffpost_classifier.cli_train model=bert_finetune trainer.accelerator=gpu trainer.devices=1
Quick run (small subset):
poetry run python -m huffpost_classifier.cli_train model=baseline_embedding_bag data.limit_train_samples=20000 data.limit_val_samples=5000 data.limit_test_samples=5000 trainer.max_epochs=5
Artifacts are saved under artifacts/<run_name>/ and plots under plots/<run_name>/.
If you have multiple runs, set infer.artifacts_dir=artifacts/<run_name> to target a specific run.
Rows with empty headlines are dropped, and a class-weighted loss is applied to address imbalance. The baseline additionally uses lowercase + punctuation-removal tokenization and weighted sampling.
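One common way to derive such weights is inverse class frequency. The sketch below assumes that scheme (the project may use a different formula); the helper names are hypothetical. In PyTorch, per-class weights are typically passed as the `weight` argument of `nn.CrossEntropyLoss`, and per-sample weights to `torch.utils.data.WeightedRandomSampler`.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights; a perfectly balanced dataset gives 1.0 each."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def sample_weights(labels):
    """Per-row weights so rare classes are drawn more often by a weighted sampler."""
    weights = class_weights(labels)
    return [weights[label] for label in labels]


labels = ["POLITICS"] * 8 + ["TECH"] * 2  # toy, imbalanced label column
weights = class_weights(labels)
print(weights)
```

Here the rare class gets four times the weight of the frequent one, matching the 8:2 imbalance.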
Export ONNX for the latest run:
poetry run python -m huffpost_classifier.cli_export model=baseline_embedding_bag
poetry run python -m huffpost_classifier.cli_export model=bert_finetune
If you want to export a specific run:
poetry run python -m huffpost_classifier.cli_export model=baseline_embedding_bag infer.artifacts_dir=artifacts/<run_name>
Artifacts layout:
- artifacts/<run_name>/baseline/: best.ckpt, model.pt, vocab.json, label_map.json
- artifacts/<run_name>/bert/: best.ckpt, model/, tokenizer/, label_map.json
- artifacts/<run_name>/onnx/: baseline.onnx, bert.onnx
Single example:
poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.headline="Example headline" infer.short_description="Example short description"
TinyBERT example:
poetry run python -m huffpost_classifier.cli_infer model=bert_finetune infer.headline="Example headline" infer.short_description="Example short description"
JSONL input:
poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.input_path=data/examples/infer_example.jsonl
Test-split inference (default when no input is provided):
poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.max_rows=5
Limit rows for quick testing:
poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.input_path=data/examples/infer_example.jsonl infer.max_rows=5
Write predictions to a JSONL file:
poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.input_path=data/examples/infer_example.jsonl infer.max_rows=5 infer.output_path=outputs/predictions.jsonl
stdin JSONL input:
type data\examples\infer_example.jsonl | poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.use_stdin=true
Expected output format:
prediction[0]: POLITICS
top1: POLITICS (0.8123)
top2: WORLD (0.0644)
top3: BUSINESS (0.0412)
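For the JSONL and stdin modes above, each input line is one JSON object. The sketch below assumes the keys mirror the inference inputs (`headline`, `short_description`); the actual schema of data/examples/infer_example.jsonl is not shown here, so treat the field names and the helpers `to_jsonl` and `parse_jsonl` as assumptions.

```python
import json


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def parse_jsonl(text):
    """Parse JSON Lines, requiring a headline and defaulting the description."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        if not row.get("headline"):
            raise ValueError("each record needs a non-empty 'headline'")
        row.setdefault("short_description", "")
        rows.append(row)
    return rows


records = [
    {"headline": "Example headline", "short_description": "Example short description"},
    {"headline": "Headline without a description"},
]
text = to_jsonl(records)
rows = parse_jsonl(text)
print(text, end="")
```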
To use a DVC remote (optional):
dvc init
The training and inference commands first attempt dvc pull via the DVC Python API and fall back to a dataset download if the remote is unavailable.
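The pull-then-fallback behaviour can be pictured with a generic sketch. The two callables stand in for the project's real DVC pull and Hugging Face download steps, which are not shown here; `ensure_data` and the stub functions are hypothetical names.

```python
def ensure_data(pull_from_dvc, download_from_hub):
    """Try the DVC remote first; fall back to a direct download on any failure.

    Both arguments are zero-argument callables supplied by the caller.
    Returns the name of the source that provided the data.
    """
    try:
        pull_from_dvc()
        return "dvc"
    except Exception as exc:  # e.g. no remote configured or network error
        print(f"DVC pull failed ({exc}); downloading dataset instead")
        download_from_hub()
        return "download"


def failing_pull():
    # Stub that simulates a missing DVC remote.
    raise RuntimeError("no remote configured")


def fake_download():
    # Stub that simulates a successful dataset download.
    print("dataset downloaded")


source = ensure_data(failing_pull, fake_download)
print(source)
```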
After the first data download, track local data with DVC:
dvc add data/raw data/processed data/splits data/examples
git add data/*.dvc
Package the latest model into an MLflow pyfunc bundle:
poetry run python -m huffpost_classifier.cli_package_mlflow model=baseline_embedding_bag
Serve the model locally using the current environment:
poetry run mlflow models serve -m artifacts/<run_name>/mlflow_baseline --env-manager local --port 5001
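Once the server is running, it can be queried over HTTP. The sketch below assumes an MLflow 2.x scoring server, which accepts a `dataframe_split` JSON body on the `/invocations` endpoint, and assumes the packaged model expects `headline` and `short_description` columns. The column names and the response shape depend on how the pyfunc bundle was built, so treat both as assumptions.

```python
import json
import urllib.request


def build_payload(rows):
    """Build a dataframe_split request body from a list of input dicts."""
    columns = ["headline", "short_description"]  # assumed model input columns
    data = [[r["headline"], r.get("short_description", "")] for r in rows]
    return {"dataframe_split": {"columns": columns, "data": data}}


def predict(rows, url="http://127.0.0.1:5001/invocations"):
    """POST the rows to a locally served model and return the decoded JSON reply."""
    body = json.dumps(build_payload(rows)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Building the payload needs no running server; predict() does.
payload = build_payload([{"headline": "Example headline"}])
print(json.dumps(payload, indent=2))
```

Calling `predict([{"headline": "Example headline"}])` requires the serve command above to be running on port 5001.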