Distributed Fraud Detection with XGBoost and Dask on Amazon SageMaker

Train an XGBoost fraud detection model at scale using distributed multi-GPU training with Dask on the SageMaker XGBoost Deep Learning Container in Algorithm mode.

Overview

Fraud detection systems process millions of transactions and must retrain frequently as attack patterns evolve. This tutorial demonstrates how to use SageMaker's built-in XGBoost algorithm with Dask-based distributed GPU training to handle large-scale, imbalanced fraud datasets efficiently.

What you'll learn:

Use the SageMaker XGBoost Deep Learning Container (DLC) in Algorithm mode (no custom training script needed)
Generate a realistic synthetic fraud dataset with class imbalance
Run distributed training with Dask across multiple GPUs and instances
Handle class imbalance with scale_pos_weight
Partition data correctly for Dask-based training

Why distributed GPU training?

Train on datasets with millions of rows in minutes instead of hours
Dask utilizes all GPUs across one or more instances
Cost-effective - shorter training time can offset the higher hourly price of GPU instances
Available since XGBoost 1.5-1 on SageMaker

Prerequisites

AWS account with SageMaker permissions
AWS CLI configured
Python 3.8+ with boto3, sagemaker, pandas, scikit-learn installed
An S3 bucket for training data and model artifacts

Files

run_tutorial.py - End-to-end orchestration: synthetic data generation, training, deployment, inference, cleanup

Quick Start

1. Install Dependencies

pip install boto3 sagemaker pandas scikit-learn

2. Set Environment Variables

export SAGEMAKER_ROLE="arn:aws:iam::<account-id>:role/<SageMakerExecutionRole>"
export S3_BUCKET="<your-s3-bucket>"

3. Run the Tutorial

# Single multi-GPU instance (recommended starting point)
python run_tutorial.py \
  --role "$SAGEMAKER_ROLE" \
  --bucket "$S3_BUCKET" \
  --instance-type ml.g5.12xlarge \
  --instance-count 1

# Scale out: 2 multi-GPU instances
python run_tutorial.py \
  --role "$SAGEMAKER_ROLE" \
  --bucket "$S3_BUCKET" \
  --instance-type ml.g5.12xlarge \
  --instance-count 2 \
  --num-samples 2000000

Command Line Options

--role - SageMaker execution role ARN (required)
--bucket - S3 bucket for data and artifacts (required)
--region - AWS region (default: us-west-2)
--image-uri - XGBoost container image URI (default: auto-generated for region)
--instance-type - Training instance type (default: ml.g5.12xlarge)
--instance-count - Number of training instances (default: 1)
--deploy-instance-type - Endpoint instance type (default: ml.m5.large)
--num-samples - Number of synthetic transactions (default: 500000)
--fraud-rate - Fraction of fraudulent transactions (default: 0.02)
--num-round - Number of XGBoost boosting rounds (default: 200)
--max-depth - Maximum tree depth (default: 8)
--skip-deploy - Skip deployment and inference
--skip-cleanup - Skip endpoint cleanup

Step-by-Step Walkthrough

Step 1: Generate Synthetic Fraud Data

The script generates a realistic imbalanced dataset mimicking credit card fraud:

30 numerical features (transaction amount, velocity, distance, etc.)
~2% fraud rate (configurable)
Default: 500K transactions, scalable to millions

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500_000,
    n_features=30,
    n_informative=15,
    n_redundant=5,
    weights=[0.98, 0.02],  # 2% fraud rate
    random_state=42,
)

Step 2: Partition Data for Dask

Dask reads each file as a partition, with one Dask worker per GPU. The number of data files should exceed the total GPU count.

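# For ml.g5.12xlarge (4 GPUs) × 2 instances = 8 GPUs
# Create 16 partitions (2× GPU count)
gpus_per_instance = 4  # ml.g5.12xlarge
instance_count = 2
num_gpus = gpus_per_instance * instance_count
num_partitions = num_gpus * 2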

Important: Dask distributed training only supports CSV and Parquet formats. LIBSVM and PROTOBUF will cause the training job to fail.
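The partitioning described above can be sketched with the standard library alone. The helper name `write_csv_partitions` and the `part-00000.csv` file naming are hypothetical and are not taken from run_tutorial.py. The sketch assumes the rows are already in memory with the label in the first column, and it writes no header row, which is the CSV layout the SageMaker XGBoost algorithm expects.

```python
import csv
from pathlib import Path


def write_csv_partitions(rows, out_dir, num_partitions):
    """Split rows into num_partitions CSV files of near-equal size.

    Each row is a sequence with the label first, followed by the features.
    Files are written without a header row.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Spread the remainder over the first files so sizes differ by at most one row.
    base, extra = divmod(len(rows), num_partitions)
    paths, start = [], 0
    for i in range(num_partitions):
        end = start + base + (1 if i < extra else 0)
        path = out_dir / f"part-{i:05d}.csv"
        with path.open("w", newline="") as f:
            csv.writer(f).writerows(rows[start:end])
        paths.append(path)
        start = end
    return paths
```

Keeping the files close to the same size follows the equal-partition advice in the best practices section. For datasets too large to hold in memory, a streaming or Dask-based writer would be the more likely choice.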

Step 3: Launch Distributed GPU Training

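from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# xgboost_image_uri, role, train_s3_uri, and val_s3_uri are assumed to be defined earlier
estimator = Estimator(
    image_uri=xgboost_image_uri,  # XGBoost 3.0-5
    role=role,
    instance_count=2,
    instance_type="ml.g5.12xlarge",
    hyperparameters={
        "objective": "binary:logistic",
        "num_round": 200,
        "max_depth": 8,
        "eta": 0.1,
        "tree_method": "gpu_hist",
        "scale_pos_weight": 49,  # ratio of negatives to positives
        "eval_metric": "auc",
        "use_dask_gpu_training": "true",
    },
)

# FullyReplicated - Dask handles data distribution internally
# content_type is set explicitly because Dask training does not accept the LIBSVM format
train_input = TrainingInput(s3_data=train_s3_uri, content_type="text/csv", distribution="FullyReplicated")
val_input = TrainingInput(s3_data=val_s3_uri, content_type="text/csv", distribution="FullyReplicated")
estimator.fit({"train": train_input, "validation": val_input})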

Key hyperparameters for distributed GPU training:

tree_method: gpu_hist - enables GPU-accelerated histogram-based training
use_dask_gpu_training: "true" - enables Dask multi-GPU coordination
scale_pos_weight: 49 - ratio of negatives to positives, compensating for the 2% fraud rate (98/2 = 49)
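The value 49 comes from the nominal 98:2 class split. Because the realized split in a generated or real dataset can differ slightly from the nominal one, the ratio can also be computed from the training labels. The sketch below assumes labels are encoded as 0 (legitimate) and 1 (fraud); `compute_scale_pos_weight` is a hypothetical helper name.

```python
def compute_scale_pos_weight(labels):
    """Return the negatives-to-positives ratio for binary 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive (fraud) examples in labels")
    return negatives / positives


# 98 legitimate and 2 fraudulent transactions reproduce the tutorial's ratio.
example_labels = [0] * 98 + [1] * 2
print(compute_scale_pos_weight(example_labels))  # 49.0
```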

Step 4: Deploy and Test

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",  # CPU is fine for inference
)
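Once the endpoint is up, each prediction for the binary:logistic objective is a fraud probability that still has to be turned into a decision. The sketch below is an assumption about how that post-processing might look: it assumes the response body is text containing probabilities separated by commas or newlines, and both the helper name `parse_scores` and the 0.5 threshold are hypothetical.

```python
def parse_scores(body, threshold=0.5):
    """Parse probabilities from a response body and flag likely fraud.

    Accepts bytes or str; values may be separated by commas or newlines.
    Returns a list of (score, is_fraud) tuples.
    """
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    tokens = text.replace("\n", ",").split(",")
    scores = [float(tok) for tok in tokens if tok.strip()]
    return [(score, score >= threshold) for score in scores]
```

With a 2% fraud rate, a threshold other than 0.5 is often a better fit. Choosing it from a precision-recall curve on the validation set is one common approach.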

Step 5: Clean Up

predictor.delete_endpoint()

Instance Selection Guide

Instance	GPUs	GPU Memory	Best For
ml.g5.xlarge	1 × A10G	24 GB	Small datasets, testing
ml.g5.12xlarge	4 × A10G	96 GB	Medium datasets (recommended)
ml.g5.24xlarge	4 × A10G	96 GB	Large datasets, more CPU/RAM

XGBoost 3.0-5 note: P3 instances are not supported. Use G4dn or G5 family.

Dask Training Best Practices

File count: Create more files than total GPUs (instance_count × GPUs per instance). Too few files underutilizes GPUs; too many degrades performance.
File format: Use CSV or Parquet only. Parquet column names must be strings.
Distribution: Set distribution="FullyReplicated" or omit it. Do not use ShardedByS3Key.
No pipe mode: Dask does not support pipe mode input.
File sizes: Aim for roughly equal-sized partitions for balanced GPU utilization.
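Some of these rules can be checked before launching a job. The following is a minimal sketch, not part of run_tutorial.py: `check_dask_inputs` is a hypothetical helper, and it assumes the data files carry a `.csv` or `.parquet` extension in their S3 keys.

```python
from pathlib import PurePosixPath

SUPPORTED_SUFFIXES = {".csv", ".parquet"}


def check_dask_inputs(file_keys, instance_count, gpus_per_instance):
    """Return a list of problems found in a set of training file keys."""
    problems = []
    total_gpus = instance_count * gpus_per_instance

    # The file count should exceed the total number of GPUs.
    if len(file_keys) <= total_gpus:
        problems.append(
            f"{len(file_keys)} files for {total_gpus} GPUs; create more files than GPUs"
        )

    # Only CSV and Parquet are supported for Dask training.
    unsupported = [
        key for key in file_keys
        if PurePosixPath(key).suffix.lower() not in SUPPORTED_SUFFIXES
    ]
    if unsupported:
        problems.append(f"unsupported file formats: {unsupported}")

    return problems
```

An empty list means the file count and formats look consistent with the guidance above. The check does not cover file sizes or input mode.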

XGBoost Version Comparison

Feature	1.5-1	1.7-1	3.0-5
Dask multi-GPU	✅	✅	✅
GPU instance support	P2, P3, G4dn, G5	P3, G4dn, G5	G4dn, G5
SageMaker Debugger	✅	✅	❌

Cost Estimate

Configuration	Total GPUs	Training Time (500K rows)	Approximate Cost
1 × ml.g5.xlarge	1 GPU	~8 min	~$0.14
1 × ml.g5.12xlarge	4 GPUs	~3 min	~$0.28
2 × ml.g5.12xlarge	8 GPUs	~2 min	~$0.37

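GPU training is faster than CPU training and is often more cost-effective despite the higher per-instance price. For a 500K-row job, the estimates above show that adding GPUs shortens training time while increasing total cost, so scaling out pays off mainly on larger datasets.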
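Estimates like these follow from simple arithmetic: hourly price times instance count times training time. The sketch below shows that calculation. The hourly rate in the example is a placeholder, not a quoted AWS price, and `estimate_training_cost` is a hypothetical helper.

```python
def estimate_training_cost(hourly_rate_usd, instance_count, minutes):
    """Estimate training cost in USD from the per-instance hourly rate."""
    return hourly_rate_usd * instance_count * minutes / 60.0


# Placeholder rate of 6.00 USD/hour per instance, 2 instances, 10 minutes.
print(estimate_training_cost(6.0, 2, 10))  # 2.0
```

Actual billing also depends on region and current pricing, and billed time can include more than the training loop itself, so this figure is best treated as a rough lower bound.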
