news-topic-classifier

HuffPost News Category Classifier (MLOps)

Author: Ticona Perales Guillermo Sebastian

This repository provides a MLOps workflow for classifying HuffPost news categories with two models:

BASELINE: EmbeddingBag (neural bag-of-words) + small MLP
MAIN: TinyBERT fine-tuning (huawei-noah/TinyBERT_General_4L_312D)

It includes reproducible training, inference, MLflow logging, Hydra configs, DVC-based data management, and ONNX export.

Project overview

Problem statement

Predict the news category (for example POLITICS, ENTERTAINMENT, SPORTS, TECH, TRAVEL) from a HuffPost headline with an optional short description. This enables automatic tagging for search, recommendations, and content analytics.

Input and output data format

Input (inference):

headline: str (required)
short_description: str (optional, may be empty)

Output:

category: str (one of the predefined classes)
optional top_k probabilities per class

Metrics

Primary metrics:

Accuracy
Macro F1-score (important due to class imbalance)

Expected performance on the full dataset depends on training budget and hardware. The baseline serves as a lower bound, while TinyBERT is expected to outperform it when trained for enough epochs.

Validation

We split the dataset into train / validation / test = 70% / 15% / 15% with stratification by category and a fixed seed (42). Split indices are saved for reproducibility.

Data

We use the HuffPost News Category dataset (Rishabh Misra). The implementation downloads from Hugging Face:

https://huggingface.co/datasets/heegyu/news-category-dataset

The dataset contains ~200k news entries with fields like headline, short_description, and category. The raw files are ~80–90 MB. Known issues include class imbalance and overlapping labels (for example WORLDPOST vs WORLD NEWS).

Alternate source:

https://www.kaggle.com/datasets/rmisra/news-category-dataset

Modeling

Baseline:

Lowercasing + punctuation removal tokenization
Neural bag-of-words with nn.EmbeddingBag + small MLP head

Main:

TinyBERT fine-tuning (huawei-noah/TinyBERT_General_4L_312D)
Headline + optional short description concatenated with a separator token

Both models are trained with PyTorch Lightning and logged with MLflow.

Deployment

The project provides CLI-based inference and MLflow Serving packaging for local deployment and integration.

Setup

poetry install
poetry run pre-commit install
poetry run pre-commit run -a

Notes:

MLflow tracking defaults to http://127.0.0.1:8080 and can be changed via logging.tracking_uri.
DVC is optional. If a remote is not configured, the code will fall back to downloading data from Hugging Face.

DVC installation (Windows):

pip install dvc

Initialize DVC in the repo:

dvc init

Train

Baseline model:

poetry run python -m huffpost_classifier.cli_train model=baseline_embedding_bag

TinyBERT model:

poetry run python -m huffpost_classifier.cli_train model=bert_finetune

GPU training (if available):

poetry run python -m huffpost_classifier.cli_train model=bert_finetune trainer.accelerator=gpu trainer.devices=1

Quick run (small subset):

poetry run python -m huffpost_classifier.cli_train model=baseline_embedding_bag data.limit_train_samples=20000 data.limit_val_samples=5000 data.limit_test_samples=5000 trainer.max_epochs=5

Artifacts are saved under artifacts/<run_name>/ and plots under plots/<run_name>/. If you have multiple runs, set infer.artifacts_dir=artifacts/<run_name> to target a specific run. The baseline uses lowercase + punctuation removal tokenization, drops rows with empty headlines, and applies class-weighted loss (plus weighted sampling for baseline) to address imbalance.

Production preparation

Export ONNX for the latest run:

poetry run python -m huffpost_classifier.cli_export model=baseline_embedding_bag
poetry run python -m huffpost_classifier.cli_export model=bert_finetune

If you want to export a specific run:

poetry run python -m huffpost_classifier.cli_export model=baseline_embedding_bag infer.artifacts_dir=artifacts/<run_name>

Artifacts layout:

artifacts/<run_name>/baseline/: best.ckpt, model.pt, vocab.json, label_map.json
artifacts/<run_name>/bert/: best.ckpt, model/, tokenizer/, label_map.json
artifacts/<run_name>/onnx/: baseline.onnx, bert.onnx

Infer

Single example:

poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.headline="Example headline" infer.short_description="Example short description"

TinyBERT example:

poetry run python -m huffpost_classifier.cli_infer model=bert_finetune infer.headline="Example headline" infer.short_description="Example short description"

JSONL input:

poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.input_path=data/examples/infer_example.jsonl

Test-split inference (default when no input is provided):

poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.max_rows=5

Limit rows for quick testing:

poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.input_path=data/examples/infer_example.jsonl infer.max_rows=5

Write predictions to a JSONL file:

poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.input_path=data/examples/infer_example.jsonl infer.max_rows=5 infer.output_path=outputs/predictions.jsonl

stdin JSONL input:

type data\examples\infer_example.jsonl | poetry run python -m huffpost_classifier.cli_infer model=baseline_embedding_bag infer.use_stdin=true

Expected output format:

prediction[0]: POLITICS
  top1: POLITICS (0.8123)
  top2: WORLD (0.0644)
  top3: BUSINESS (0.0412)

DVC notes

To use a DVC remote (optional):

dvc init

The training and inference commands first attempt dvc pull via the DVC Python API and fall back to a dataset download if the remote is unavailable.

After the first data download, track local data with DVC:

dvc add data/raw data/processed data/splits data/examples
git add data/*.dvc

Inference server (MLflow Serving)

Package the latest model into an MLflow pyfunc bundle:

poetry run python -m huffpost_classifier.cli_package_mlflow model=baseline_embedding_bag

Serve the model locally using the current environment:

poetry run mlflow models serve -m artifacts/<run_name>/mlflow_baseline --env-manager local --port 5001

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.dvc		.dvc
configs		configs
data		data
huffpost_classifier		huffpost_classifier
.dvcignore		.dvcignore
.flake8		.flake8
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

news-topic-classifier

HuffPost News Category Classifier (MLOps)

Project overview

Problem statement

Input and output data format

Metrics

Validation

Data

Modeling

Deployment

Setup

Train

Production preparation

Infer

DVC notes

Inference server (MLflow Serving)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

news-topic-classifier

HuffPost News Category Classifier (MLOps)

Project overview

Problem statement

Input and output data format

Metrics

Validation

Data

Modeling

Deployment

Setup

Train

Production preparation

Infer

DVC notes

Inference server (MLflow Serving)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages