MongoDB Production → Staging Sync - Runbook

Quick Reference

Purpose: Clone production MongoDB data to staging with PII anonymization
Duration: 7-10 minutes
Staging Downtime: ~4-6 minutes (MongoDB is stopped in job 2 and restarted in job 6)
Production Impact: None (read-only operation)
Frequency: On-demand or weekly schedule

Execution Steps

Method 1: GitHub Actions UI (Recommended)

Step 1: Navigate to Workflow

1. Open GitHub repository
2. Click "Actions" tab
3. Select "Production to Staging DB Sync" workflow
4. Click "Run workflow" button (top right)

Step 2: Configure Options

Branch: main (or desired branch)
Anonymize PII data: ✓ Yes (checked) ← IMPORTANT for production data
Target environment: staging

Click "Run workflow"
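
The same options can probably be passed from a terminal with the GitHub CLI. The sketch below only builds and prints the command so it can be reviewed first. The input names `anonymize` and `environment` are assumptions, not taken from the workflow file; check the `workflow_dispatch` inputs before using it.

```shell
# Sketch: build the gh command that mirrors the UI options above.
# ASSUMPTION: the input names "anonymize" and "environment" are hypothetical.
build_sync_command() {
  ref="${1:-main}"
  anonymize="${2:-true}"
  environment="${3:-staging}"
  printf 'gh workflow run "Production to Staging DB Sync" --ref %s -f anonymize=%s -f environment=%s\n' \
    "$ref" "$anonymize" "$environment"
}

# Print the command for review; nothing is executed here.
build_sync_command main true staging
```

Keeping the anonymize default at `true` means a forgotten argument never produces an unanonymized copy.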

Step 3: Monitor Execution

Expected duration: 7-10 minutes

Watch for 9 jobs to complete in sequence:
1. ✅ Setup and Validate Environment     (~30 sec)
2. ✅ Stop MongoDB on Staging            (~5 sec)
3. ✅ Create Snapshot from Production    (~2-3 min)
4. ✅ Create and Swap EBS Volumes        (~1-2 min)
5. ✅ Mount Volume on Staging            (~10 sec)
6. ✅ Start MongoDB and Verify Data      (~20 sec)
7. ✅ Anonymize PII Data                 (~30 sec)
8. ✅ Cleanup Resources                  (~10 sec)
9. ✅ Final Summary                      (~10 sec)

All jobs should show green checkmarks ✅

Post-Execution Verification

Automatic Checks (Built into Workflow)

The workflow automatically verifies:

✅ Volume mounted successfully
✅ MongoDB started and responsive
✅ Documents counted (matches production)
✅ Anonymization completed
✅ Old resources cleaned up

Manual Verification (Optional)

1. Check README Status

# View updated README.md in repository
grep -A 5 "Last Restore Status" README.md

Expected:
| Environment | Last Restored     | Status    | Documents | Anonymized | Duration |
| Staging     | 2026-03-03 XX:XX  | ✅ Success | 10        | ✅ Yes      | 8m 34s   |

2. Verify Staging Database

# SSH to staging
ssh ec2-user@<staging-ip>

# Check MongoDB is running
sudo systemctl status mongod
# Expected: active (running)

# Count documents
mongosh --quiet --eval "db.getSiblingDB('userdb').users.countDocuments()"
# Expected: Same count as production

# Verify anonymization
mongosh --quiet --eval "
  const userdb = db.getSiblingDB('userdb');  // avoid redeclaring the built-in db
  const sample = userdb.users.findOne();
  print('Name: ' + sample.name);       // Should be: User <ID>
  print('Email: ' + sample.email);     // Should be: user<ID>@anonymized.local
  print('SSN: ' + sample.ssn);         // Should be: XXX-XX-XXXX
"
# Expected: All PII fields anonymized

# Check volume mount
df -h | grep mongodb
# Expected: /dev/nvme1n1 mounted on /data/mongodb with data
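
The expected values above can be checked mechanically rather than by eye. This is a minimal sketch that matches one field value against the pattern listed for it. It assumes `<ID>` is a non-empty run of letters and digits, which the runbook does not state; adjust the patterns if the anonymization script uses a different ID format.

```shell
# Usage: is_anonymized FIELD VALUE
# Returns 0 when VALUE matches the anonymized pattern for FIELD.
# ASSUMPTION: <ID> is a non-empty run of letters and digits.
is_anonymized() {
  field="$1"
  value="$2"
  case "$field" in
    name)  pattern='^User [A-Za-z0-9]+$' ;;
    email) pattern='^user[A-Za-z0-9]+@anonymized\.local$' ;;
    ssn)   pattern='^XXX-XX-XXXX$' ;;
    *)     echo "unknown field: $field" >&2; return 2 ;;
  esac
  printf '%s\n' "$value" | grep -Eq "$pattern"
}
```

For example, `is_anonymized email "$EMAIL" || echo "email still looks real"` flags a value that slipped through.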

Expected Duration for Each Step

| Step  | Job Name           | Expected Time | Notes                                   |
| 1     | Setup & Validation | ~30 seconds   | Fast: Only API calls                    |
| 2     | Stop MongoDB       | ~5 seconds    | Direct systemctl command                |
| 3     | Create Snapshot    | 2-3 minutes   | Longest step - depends on data size     |
| 4     | Swap Volumes       | 1-2 minutes   | Volume creation + attach/detach         |
| 5     | Mount Volume       | ~10 seconds   | Fast: Local filesystem operation        |
| 6     | Start & Verify     | ~20 seconds   | MongoDB startup + ping loop             |
| 7     | Anonymize          | ~30 seconds   | Depends on document count               |
| 8     | Cleanup            | ~10 seconds   | Async deletes                           |
| 9     | Summary            | ~10 seconds   | Git commit + push                       |
| TOTAL | All Jobs           | 7-10 minutes  | End-to-end, including time between jobs |

Staging Downtime: Step 2 (stop) to Step 6 (start) = ~4-6 minutes, since the snapshot and volume swap run while MongoDB is stopped
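
A quick way to sanity-check these figures is to sum the per-job estimates from the table. The sketch below does that for all nine jobs and for jobs 2 through 6, the window in which MongoDB is stopped. It counts job run time only, so time spent between jobs is not included.

```shell
# Per-job estimates in seconds as "min max", one line per job, in job order 1..9.
estimates='30 30
5 5
120 180
60 120
10 10
20 20
30 30
10 10
10 10'

# Usage: sum_jobs FIRST LAST   (1-based, inclusive)
sum_jobs() {
  printf '%s\n' "$estimates" | awk -v first="$1" -v last="$2" '
    NR >= first && NR <= last { lo += $1; hi += $2 }
    END { printf "%d-%d seconds\n", lo, hi }'
}

sum_jobs 1 9   # all jobs: prints 295-415 seconds
sum_jobs 2 6   # MongoDB stopped: prints 215-335 seconds
```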

Rollback Procedures

Scenario 1: Workflow Fails During Snapshot Creation

Impact: None - production unchanged, staging unchanged
Action: None required - simply retry workflow

Scenario 2: Workflow Fails During Volume Swap

Impact: Staging down, production unaffected
Symptoms: Job 4 shows ❌

Recovery:

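# Get old staging volume ID from workflow logs (pass the ID of the failed run)
OLD_VOLUME_ID=$(gh run view <run-id> --log | grep "Old Volume" | awk '{print $4}')

# Reattach old volume and wait until it is in use
aws ec2 attach-volume \
  --volume-id "$OLD_VOLUME_ID" \
  --instance-id i-05661b198eb8d9b0a \
  --device /dev/sdf
aws ec2 wait volume-in-use --volume-ids "$OLD_VOLUME_ID"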

# SSH to staging
ssh ec2-user@<staging-ip>

# Mount and start
sudo mount /data/mongodb
sudo systemctl start mongod

# Verify
mongosh --eval "db.adminCommand('ping')"
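
Picking the volume ID out of the logs by field position depends on the exact log layout. A sketch of a more defensive variant is below: it matches the `vol-...` token directly and fails loudly when nothing is found, so an empty ID never reaches `aws ec2 attach-volume`. The "Old Volume" marker comes from the command above; the rest of the log line format is an assumption.

```shell
# Reads workflow log text on stdin and prints the first volume ID found on
# a line containing "Old Volume". Returns 1 if no ID is present.
# ASSUMPTION: the ID appears on the same line as the "Old Volume" marker.
extract_volume_id() {
  id="$(grep 'Old Volume' | grep -oE 'vol-[0-9a-f]+' | head -n 1)"
  if [ -z "$id" ]; then
    echo "no volume ID found in log" >&2
    return 1
  fi
  printf '%s\n' "$id"
}
```

Usage: `OLD_VOLUME_ID=$(gh run view <run-id> --log | extract_volume_id) || exit 1`.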

Scenario 3: Anonymization Fails

Impact: Staging has PRODUCTION data (non-anonymized)
Symptoms: Job 7 shows ❌
Action: CRITICAL - Block staging access immediately

# Block external access (if not already blocked)
aws ec2 revoke-security-group-ingress \
  --group-id sg-staging \
  --protocol tcp \
  --port 27017 \
  --cidr 0.0.0.0/0

# SSH to staging and manually run anonymization
ssh ec2-user@<staging-ip>
mongosh /home/ec2-user/anonymize_data.js

# Verify anonymization worked
mongosh --eval "db.getSiblingDB('userdb').users.findOne()"
# Check: No real PII data visible

# If anonymization still fails, rollback to old volume (Scenario 2)

Scenario 4: Mount Fails After Volume Swap

Impact: Staging down, volume attached but not mounted
Symptoms: Job 5 shows ❌

Recovery:

# SSH to staging
ssh ec2-user@<staging-ip>

# Check device exists
lsblk
# Look for nvme1n1 or similar

# Manually run mount script
sudo bash /home/ec2-user/mount.sh

# If mount script fails, check filesystem
sudo file -s /dev/nvme1n1
# Expected: XFS filesystem

# Try manual mount
sudo mount -t xfs /dev/nvme1n1 /data/mongodb

# If still fails, check dmesg for errors
dmesg | tail -20

# Last resort: Format and restore from snapshot
# (Contact team lead before doing this)

Scenario 5: Complete Rollback (Return to Previous State)

When: Any major failure, data corruption, or extended issues
Impact: Staging reverted to pre-sync state

# 1. Get old volume ID from AWS console or previous workflow run
#    Tag: Environment=staging, look for detached volumes

OLD_VOLUME=vol-XXXXXXXXX  # From AWS console
NEW_VOLUME=vol-YYYYYYYYY  # Currently attached (failed)

# 2. Stop MongoDB on staging
ssh ec2-user@<staging-ip> "sudo systemctl stop mongod; sudo umount /data/mongodb"

# 3. Detach new (failed) volume
aws ec2 detach-volume --volume-id "$NEW_VOLUME"
aws ec2 wait volume-available --volume-ids "$NEW_VOLUME"  # Wait for detachment

# 4. Reattach old volume
aws ec2 attach-volume \
  --volume-id "$OLD_VOLUME" \
  --instance-id i-05661b198eb8d9b0a \
  --device /dev/sdf
aws ec2 wait volume-in-use --volume-ids "$OLD_VOLUME"  # Wait for attachment

# 5. Mount and start
ssh ec2-user@<staging-ip> "sudo mount /data/mongodb && sudo systemctl start mongod"

# 6. Verify
ssh ec2-user@<staging-ip> "mongosh --eval 'db.adminCommand(\"ping\")'"

# 7. Clean up failed volume
aws ec2 delete-volume --volume-id $NEW_VOLUME
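
Detach and attach are asynchronous, so each step should confirm the volume state before the next one runs. Where a built-in waiter does not fit, a small polling helper is one option. This is a generic sketch, not part of the workflow: it reruns a command until it prints the wanted value or the attempts run out.

```shell
# Usage: wait_for WANTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, DELAY_SECONDS apart, and returns 0 as
# soon as its output equals WANTED. Returns 1 if that never happens.
wait_for() {
  wanted="$1"
  attempts="$2"
  delay="$3"
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    current="$("$@" 2>/dev/null)"
    if [ "$current" = "$wanted" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for '$wanted' (last: '$current')" >&2
  return 1
}
```

Usage: `wait_for available 20 3 aws ec2 describe-volumes --volume-ids "$NEW_VOLUME" --query 'Volumes[0].State' --output text` polls for up to about a minute before giving up.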

Troubleshooting Guide

Issue: Self-Hosted Runner Offline

Symptoms: Workflow shows "Waiting for a runner..."

Diagnosis:

# SSH to staging instance
ssh ec2-user@<staging-ip>

# Check runner service
sudo systemctl status gha

Fix:

# Restart runner service
sudo systemctl restart gha

# Or re-register runner (if needed)
cd /home/ec2-user/actions-runner
sudo ./svc.sh stop
sudo ./svc.sh uninstall
./config.sh remove --token <TOKEN>   # config.sh refuses to run under sudo
# Re-register using GitHub UI instructions

Issue: Workflow Stuck on "Create Snapshot"

Symptoms: Job 3 runs for >10 minutes

Diagnosis:

# Check snapshot status
SNAPSHOT_ID=$(gh run view --log | grep "Snapshot created" | awk '{print $4}')
aws ec2 describe-snapshots --snapshot-ids $SNAPSHOT_ID

Fix:

If snapshot shows "error" state: Cancel workflow, retry
If snapshot shows "pending" for >15 min: AWS issue, contact support
Normal time: 2-3 minutes for 20GB
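
The rules above can be written as a small decision helper. This sketch takes the snapshot state and the minutes elapsed and prints the suggested action. The action names are made up for illustration; the thresholds come from the rules above.

```shell
# Usage: snapshot_action STATE ELAPSED_MINUTES
# STATE is the snapshot state reported by EC2 (pending, completed, error, ...).
snapshot_action() {
  state="$1"
  elapsed="$2"
  case "$state" in
    completed) echo "continue" ;;
    error)     echo "cancel-and-retry" ;;
    pending)
      if [ "$elapsed" -gt 15 ]; then
        echo "contact-aws-support"
      else
        echo "keep-waiting"
      fi
      ;;
    *) echo "unknown-state" ;;
  esac
}
```

The state can be read with `aws ec2 describe-snapshots --snapshot-ids "$SNAPSHOT_ID" --query 'Snapshots[0].State' --output text`.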

Issue: MongoDB Won't Start After Restore

Symptoms: Job 6 fails, MongoDB status shows "failed"

Diagnosis:

ssh ec2-user@<staging-ip>

# Check MongoDB logs
sudo journalctl -u mongod -n 100
sudo tail -n 50 /var/log/mongodb/mongod.log

# Check permissions
ls -la /data/mongodb
# Expected: owned by mongod:mongod

Fix:

# Fix ownership if needed
sudo chown -R mongod:mongod /data/mongodb

# Fix MongoDB config if corrupted
sudo cp /etc/mongod.conf.backup /etc/mongod.conf

# Restart
sudo systemctl start mongod
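
`ls -la` only shows the top level of the data directory, so a wrongly owned file deeper in the tree can go unnoticed. A minimal sketch for listing everything under a directory that is not owned by the expected user:

```shell
# Usage: not_owned_by DIR USER
# Prints every path under DIR whose owner is not USER (name or numeric UID).
# Empty output means ownership is consistent.
not_owned_by() {
  find "$1" ! -user "$2" -print
}
```

On staging this would be called as `not_owned_by /data/mongodb mongod`, likely under sudo so that every subdirectory is readable.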

Issue: Document Count Mismatch

Symptoms: Staging shows different count than production

Diagnosis:

# Check production count
ssh ec2-user@<production-ip>
mongosh --eval "db.getSiblingDB('userdb').users.countDocuments()"

# Check staging count
ssh ec2-user@<staging-ip>
mongosh --eval "db.getSiblingDB('userdb').users.countDocuments()"

Possible Causes:

Staging has old data (sync failed silently)
Production was actively writing during snapshot (expected - snapshot is point-in-time)
Anonymization deleted documents (shouldn't happen)

Fix: Re-run sync workflow
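
Before re-running, it helps to know the direction and size of the gap, since a small shortfall on staging is consistent with production writes after the snapshot. This sketch compares the two counts collected above; the message wording is illustrative only.

```shell
# Usage: compare_counts PRODUCTION_COUNT STAGING_COUNT
# Returns 0 on a match, 1 on a mismatch, 2 on invalid input.
compare_counts() {
  prod="$1"
  staging="$2"
  for n in "$prod" "$staging"; do
    case "$n" in
      ''|*[!0-9]*) echo "error: counts must be non-negative integers" >&2; return 2 ;;
    esac
  done
  if [ "$prod" -eq "$staging" ]; then
    echo "match: $prod documents"
  elif [ "$prod" -gt "$staging" ]; then
    echo "staging is behind production by $((prod - staging)) documents"
    return 1
  else
    echo "staging is ahead of production by $((staging - prod)) documents"
    return 1
  fi
}
```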

Criteria for Success

✅ All 9 workflow jobs show green checkmarks
✅ README updated with new restore timestamp
✅ Staging MongoDB is running and responsive
✅ Document count matches production
✅ PII fields are anonymized (if enabled)
✅ Total duration under 15 minutes
✅ No AWS errors or rate limits hit
✅ Old snapshot and volume deleted
