Train an XGBoost fraud detection model at scale using distributed multi-GPU training with Dask on the SageMaker XGBoost Deep Learning Container in Algorithm mode.
Fraud detection systems process millions of transactions and must retrain frequently as attack patterns evolve. This tutorial demonstrates how to use SageMaker's built-in XGBoost algorithm with Dask-based distributed GPU training to handle large-scale, imbalanced fraud datasets efficiently.
What you'll learn:
- Use the SageMaker XGBoost DLC in Algorithm mode (no custom training script needed)
- Generate a realistic synthetic fraud dataset with class imbalance
- Run distributed multi-GPU training with Dask on one or more instances
- Handle class imbalance with `scale_pos_weight`
- Partition data correctly for Dask-based training
Why distributed GPU training?
- Train on datasets with millions of rows in minutes instead of hours
- Dask utilizes all GPUs across one or more instances
- Cost-effective - shorter training runs can lower total compute cost
- Available since XGBoost 1.5-1 on SageMaker
Prerequisites:
- AWS account with SageMaker permissions
- AWS CLI configured
- Python 3.8+ with `boto3`, `sagemaker`, `pandas`, and `scikit-learn` installed
- An S3 bucket for training data and model artifacts
- `run_tutorial.py` - End-to-end orchestration: synthetic data generation, training, deployment, inference, cleanup
```bash
pip install boto3 sagemaker pandas scikit-learn
```
```bash
export SAGEMAKER_ROLE="arn:aws:iam::<account-id>:role/<SageMakerExecutionRole>"
export S3_BUCKET="<your-s3-bucket>"
```
```bash
# Single multi-GPU instance (recommended starting point)
python run_tutorial.py \
  --role "$SAGEMAKER_ROLE" \
  --bucket "$S3_BUCKET" \
  --instance-type ml.g5.12xlarge \
  --instance-count 1

# Scale out: 2 multi-GPU instances
python run_tutorial.py \
  --role "$SAGEMAKER_ROLE" \
  --bucket "$S3_BUCKET" \
  --instance-type ml.g5.12xlarge \
  --instance-count 2 \
  --num-samples 2000000
```
- `--role` - SageMaker execution role ARN (required)
- `--bucket` - S3 bucket for data and artifacts (required)
- `--region` - AWS region (default: us-west-2)
- `--image-uri` - XGBoost container image URI (default: auto-generated for region)
- `--instance-type` - Training instance type (default: ml.g5.12xlarge)
- `--instance-count` - Number of training instances (default: 1)
- `--deploy-instance-type` - Endpoint instance type (default: ml.m5.large)
- `--num-samples` - Number of synthetic transactions (default: 500000)
- `--fraud-rate` - Fraction of fraudulent transactions (default: 0.02)
- `--num-round` - Number of XGBoost boosting rounds (default: 200)
- `--max-depth` - Maximum tree depth (default: 8)
- `--skip-deploy` - Skip deployment and inference
- `--skip-cleanup` - Skip endpoint cleanup
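The options above map naturally onto an `argparse` parser. The following is a minimal sketch of how such a parser could look, using only the flag names and defaults documented here. It is not the actual contents of `run_tutorial.py`; `build_parser` and the example ARN and bucket name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented run_tutorial.py options."""
    p = argparse.ArgumentParser(description="Distributed GPU XGBoost fraud tutorial")
    p.add_argument("--role", required=True, help="SageMaker execution role ARN")
    p.add_argument("--bucket", required=True, help="S3 bucket for data and artifacts")
    p.add_argument("--region", default="us-west-2")
    # None stands in for "auto-generate the image URI for the region".
    p.add_argument("--image-uri", default=None)
    p.add_argument("--instance-type", default="ml.g5.12xlarge")
    p.add_argument("--instance-count", type=int, default=1)
    p.add_argument("--deploy-instance-type", default="ml.m5.large")
    p.add_argument("--num-samples", type=int, default=500_000)
    p.add_argument("--fraud-rate", type=float, default=0.02)
    p.add_argument("--num-round", type=int, default=200)
    p.add_argument("--max-depth", type=int, default=8)
    p.add_argument("--skip-deploy", action="store_true")
    p.add_argument("--skip-cleanup", action="store_true")
    return p


# Example: the "scale out" invocation with placeholder role and bucket values.
args = build_parser().parse_args(
    [
        "--role", "arn:aws:iam::123456789012:role/ExampleRole",
        "--bucket", "example-bucket",
        "--instance-count", "2",
        "--num-samples", "2000000",
    ]
)
print(args.instance_type, args.instance_count, args.num_samples)
```

Flags that are not passed fall back to the documented defaults, so a run with only `--role` and `--bucket` trains on a single ml.g5.12xlarge instance.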
The script generates a realistic imbalanced dataset mimicking credit card fraud:
- 30 numerical features (transaction amount, velocity, distance, etc.)
- ~2% fraud rate (configurable)
- Default: 500K transactions, scalable to millions
```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500_000,
    n_features=30,
    n_informative=15,
    n_redundant=5,
    weights=[0.98, 0.02],  # 2% fraud rate
    random_state=42,
)
```
Dask reads each file as a partition, with one Dask worker per GPU. The number of data files should exceed the total GPU count.
```python
# For ml.g5.12xlarge (4 GPUs) × 2 instances = 8 GPUs
num_gpus = 4 * 2
# Create 16 partitions (2× GPU count)
num_partitions = num_gpus * 2
```
Important: Dask distributed training supports only CSV and Parquet input. LIBSVM and PROTOBUF cause the training job to fail.
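As an illustration of the partitioning step, the sketch below splits a DataFrame into near-equal CSV files, one per Dask partition. The helper `write_partitions` and the `part-NNNN.csv` naming are hypothetical. The layout (label in the first column, no header row) is assumed from the built-in XGBoost CSV convention; uploading the files to S3 is left out.

```python
import tempfile
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd


def write_partitions(df: pd.DataFrame, out_dir: Path, num_partitions: int) -> List[Path]:
    """Write df as num_partitions headerless CSV files of near-equal size."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Split row positions (not the frame itself) into near-equal chunks.
    chunks = np.array_split(np.arange(len(df)), num_partitions)
    paths = []
    for i, idx in enumerate(chunks):
        path = out_dir / f"part-{i:04d}.csv"
        df.iloc[idx].to_csv(path, header=False, index=False)
        paths.append(path)
    return paths


# Small demo frame: label first, then three features.
rng = np.random.default_rng(42)
demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
demo.insert(0, "label", (rng.random(1000) < 0.02).astype(int))

num_gpus = 4 * 2               # 2 × ml.g5.12xlarge
num_partitions = num_gpus * 2  # 16 files

with tempfile.TemporaryDirectory() as tmp:
    files = write_partitions(demo, Path(tmp), num_partitions)
    rows_per_file = [len(f.read_text().splitlines()) for f in files]

print(len(files), min(rows_per_file), max(rows_per_file))
```

Splitting by row position keeps file sizes within one row of each other, which helps keep GPU utilization balanced.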
```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri=xgboost_image_uri,  # XGBoost 3.0-5
    role=role,
    instance_count=2,
    instance_type="ml.g5.12xlarge",
    hyperparameters={
        "objective": "binary:logistic",
        "num_round": 200,
        "max_depth": 8,
        "eta": 0.1,
        "tree_method": "gpu_hist",
        "scale_pos_weight": 49,  # ratio of negatives to positives
        "eval_metric": "auc",
        "use_dask_gpu_training": "true",
    },
)

# FullyReplicated - Dask handles data distribution internally
train_input = TrainingInput(s3_data=train_s3_uri, distribution="FullyReplicated")
# val_input is built the same way from the validation data prefix
estimator.fit({"train": train_input, "validation": val_input})
```
Key hyperparameters for distributed GPU training:
- `tree_method: gpu_hist` - enables GPU-accelerated histogram-based training
- `use_dask_gpu_training: "true"` - enables Dask multi-GPU coordination
- `scale_pos_weight: 49` - compensates for the 2% fraud rate (98/2 = 49)
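The value 49 follows from the nominal 98/2 class split. When the fraud rate is configurable, it can be safer to derive the weight from the labels themselves. A minimal sketch, where `compute_scale_pos_weight` is a hypothetical helper rather than part of XGBoost or SageMaker:

```python
import numpy as np


def compute_scale_pos_weight(y) -> float:
    """Return (number of negatives) / (number of positives) for binary labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("labels contain no positive (fraud) samples")
    return n_neg / n_pos


# 98 legitimate and 2 fraudulent transactions, matching the 2% example.
labels = np.array([0] * 98 + [1] * 2)
weight = compute_scale_pos_weight(labels)
print(weight)
```

On the synthetic data, the observed ratio may differ slightly from 49, because `make_classification` flips a small fraction of labels by default (`flip_y=0.01`).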
```python
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",  # CPU is fine for inference
)
```
```python
predictor.delete_endpoint()
```
| Instance | GPUs | GPU Memory | Best For |
|---|---|---|---|
| ml.g5.xlarge | 1 × A10G | 24 GB | Small datasets, testing |
| ml.g5.12xlarge | 4 × A10G | 96 GB | Medium datasets (recommended) |
| ml.g5.24xlarge | 4 × A10G | 96 GB | Large datasets, more CPU/RAM |
XGBoost 3.0-5 note: P3 instances are not supported. Use G4dn or G5 family.
- File count: Create more files than total GPUs (instance_count × GPUs per instance). Too few files underutilizes GPUs; too many degrades performance.
- File format: Use CSV or Parquet only. Parquet column names must be strings.
- Distribution: Set `distribution="FullyReplicated"` or omit it. Do not use `ShardedByS3Key`.
- No pipe mode: Dask does not support pipe mode input.
- File sizes: Aim for roughly equal-sized partitions for balanced GPU utilization.
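The checks in this list can be run against a file listing before a training job is launched. The sketch below is one possible pre-flight check, not part of SageMaker: `check_partition_layout` is a hypothetical helper, and the `max_size_ratio` threshold of 1.5 is an arbitrary assumption for "roughly equal-sized".

```python
from typing import Dict, List

SUPPORTED_SUFFIXES = (".csv", ".parquet")


def check_partition_layout(
    file_sizes: Dict[str, int],
    total_gpus: int,
    max_size_ratio: float = 1.5,  # assumed tolerance for "roughly equal"
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    unsupported = [n for n in file_sizes if not n.lower().endswith(SUPPORTED_SUFFIXES)]
    if unsupported:
        problems.append(f"unsupported format (use CSV or Parquet): {unsupported}")
    if len(file_sizes) <= total_gpus:
        problems.append(
            f"{len(file_sizes)} files for {total_gpus} GPUs; create more files than GPUs"
        )
    sizes = list(file_sizes.values())
    if sizes and min(sizes) == 0:
        problems.append("at least one file is empty")
    elif sizes and max(sizes) / min(sizes) > max_size_ratio:
        problems.append("file sizes are unbalanced")
    return problems


# 16 equal CSV files for 8 GPUs: no problems expected.
good = {f"part-{i:04d}.csv": 1_000_000 for i in range(16)}
# Too few files, one LIBSVM file, and one file four times larger than the rest.
bad = {"a.csv": 100, "b.csv": 100, "c.libsvm": 100, "d.csv": 400}

print(check_partition_layout(good, total_gpus=8))
for problem in check_partition_layout(bad, total_gpus=8):
    print(problem)
```

The file names and sizes would typically come from an S3 listing of the training prefix; here they are hard-coded to keep the example self-contained.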
| Feature | 1.5-1 | 1.7-1 | 3.0-5 |
|---|---|---|---|
| Dask multi-GPU | ✅ | ✅ | ✅ |
| GPU instance support | P2, P3, G4dn, G5 | P3, G4dn, G5 | G4dn, G5 |
| SageMaker Debugger | ✅ | ✅ | ❌ |
| Configuration | Total GPUs | Training Time (500K rows) | Approximate Cost |
|---|---|---|---|
| 1 × ml.g5.xlarge | 1 GPU | ~8 min | ~$0.14 |
| 1 × ml.g5.12xlarge | 4 GPUs | ~3 min | ~$0.28 |
| 2 × ml.g5.12xlarge | 8 GPUs | ~2 min | ~$0.37 |
GPU training is faster and often more cost-effective than CPU despite higher per-instance cost.
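The cost column follows from cost = hourly rate × instance count × training time. The sketch below back-calculates the hourly rate implied by each row of the table. These are derived from the approximate figures above, not official prices; `job_cost` and `implied_hourly_rate` are hypothetical helpers.

```python
def job_cost(hourly_rate_usd: float, instance_count: int, minutes: float) -> float:
    """Training cost in USD for a job billed per instance-hour."""
    return hourly_rate_usd * instance_count * minutes / 60


def implied_hourly_rate(cost_usd: float, instance_count: int, minutes: float) -> float:
    """Per-instance hourly rate implied by a total cost and a duration."""
    return cost_usd * 60 / (instance_count * minutes)


# (configuration, instance count, minutes, approximate cost) from the table.
rows = [
    ("1 × ml.g5.xlarge", 1, 8, 0.14),
    ("1 × ml.g5.12xlarge", 1, 3, 0.28),
    ("2 × ml.g5.12xlarge", 2, 2, 0.37),
]

rates = [implied_hourly_rate(cost, count, minutes) for _, count, minutes, cost in rows]
for (name, _, _, _), rate in zip(rows, rates):
    print(f"{name}: about ${rate:.2f} per instance-hour")
```

With these figures, adding GPUs shortens training but raises the cost per job, so the larger configurations trade money for turnaround time on a 500K-row dataset.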