MongoDB Production → Staging Sync - Runbook

Quick Reference

Purpose: Clone production MongoDB data to staging with PII anonymization
Duration: 7-10 minutes
Staging Downtime: ~3.5-5.5 minutes (MongoDB is stopped in job 2 and restarted in job 6)
Production Impact: None (read-only operation)
Frequency: On-demand or weekly schedule


Execution Steps

Method 1: GitHub Actions UI (Recommended)

Step 1: Navigate to Workflow

1. Open GitHub repository
2. Click "Actions" tab
3. Select "Production to Staging DB Sync" workflow
4. Click "Run workflow" button (top right)

Step 2: Configure Options

Branch: main (or desired branch)
Anonymize PII data: ✓ Yes (checked) ← IMPORTANT for production data
Target environment: staging

Click "Run workflow"
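The same run can probably be started from a terminal with the GitHub CLI. The sketch below only builds and prints the command so it can be reviewed before running. The input names `anonymize` and `environment` are assumptions, not taken from the workflow file; check its `workflow_dispatch` inputs for the real names.

```shell
# Build (but do not run) the gh command that triggers the sync workflow.
# ASSUMPTION: "anonymize" and "environment" are hypothetical input names.
build_sync_command() {
  sync_ref="${1:-main}"        # branch to run the workflow from
  sync_anonymize="${2:-true}"  # keep true for production data
  printf 'gh workflow run "Production to Staging DB Sync" --ref %s -f anonymize=%s -f environment=staging\n' \
    "$sync_ref" "$sync_anonymize"
}

# Print the command for review; paste it into the shell to actually run it.
build_sync_command main true
```

Printing the command first keeps an accidental run with anonymization disabled from being a single keystroke away.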

Step 3: Monitor Execution

Expected duration: 7-10 minutes

Watch for 9 jobs to complete in sequence:
1. ✅ Setup and Validate Environment     (~30 sec)
2. ✅ Stop MongoDB on Staging            (~5 sec)
3. ✅ Create Snapshot from Production    (~2-3 min)
4. ✅ Create and Swap EBS Volumes        (~1-2 min)
5. ✅ Mount Volume on Staging            (~10 sec)
6. ✅ Start MongoDB and Verify Data      (~20 sec)
7. ✅ Anonymize PII Data                 (~30 sec)
8. ✅ Cleanup Resources                  (~10 sec)
9. ✅ Final Summary                      (~10 sec)

All jobs should show green checkmarks ✅

Post-Execution Verification

Automatic Checks (Built into Workflow)

The workflow automatically verifies:

  • ✅ Volume mounted successfully
  • ✅ MongoDB started and responsive
  • ✅ Documents counted (matches production)
  • ✅ Anonymization completed
  • ✅ Old resources cleaned up

Manual Verification (Optional)

1. Check README Status

# View updated README.md in repository
grep -A 5 "Last Restore Status" README.md

Expected:
| Environment | Last Restored     | Status    | Documents | Anonymized | Duration |
| Staging     | 2026-03-03 XX:XX  | ✅ Success | 10        | ✅ Yes      | 8m 34s   |

2. Verify Staging Database

# SSH to staging
ssh ec2-user@<staging-ip>

# Check MongoDB is running
sudo systemctl status mongod
# Expected: active (running)

# Count documents
mongosh --quiet --eval "db.getSiblingDB('userdb').users.countDocuments()"
# Expected: Same count as production

# Verify anonymization
mongosh --quiet --eval "
  // Use a separate name so the shell's built-in 'db' global is not shadowed
  const userdb = db.getSiblingDB('userdb');
  const sample = userdb.users.findOne();
  print('Name: ' + sample.name);       // Should be: User <ID>
  print('Email: ' + sample.email);     // Should be: user<ID>@anonymized.local
  print('SSN: ' + sample.ssn);         // Should be: XXX-XX-XXXX
"
# Expected: All PII fields anonymized

# Check volume mount
df -h | grep mongodb
# Expected: /dev/nvme1n1 mounted on /data/mongodb with data
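The anonymization check can also be done mechanically instead of by eye. The following is a minimal sketch that tests three values against the formats listed above (`User <ID>`, `user<ID>@anonymized.local`, `XXX-XX-XXXX`). The function names are illustrative and are not part of the workflow.

```shell
# Each helper returns 0 when the value matches the anonymized format.
is_anonymized_name()  { case "$1" in "User "?*) return 0 ;; *) return 1 ;; esac; }
is_anonymized_email() { case "$1" in user?*@anonymized.local) return 0 ;; *) return 1 ;; esac; }
is_anonymized_ssn()   { [ "$1" = "XXX-XX-XXXX" ]; }

# check_sample NAME EMAIL SSN
# Reports which fields still look like real data, without echoing the values.
check_sample() {
  sample_status=0
  is_anonymized_name "$1"  || { echo "name not anonymized";  sample_status=1; }
  is_anonymized_email "$2" || { echo "email not anonymized"; sample_status=1; }
  is_anonymized_ssn "$3"   || { echo "ssn not anonymized";   sample_status=1; }
  return "$sample_status"
}

# Example with values copied from the mongosh output:
# check_sample "User 42" "user42@anonymized.local" "XXX-XX-XXXX" && echo "sample looks anonymized"
```

The helper deliberately prints only the field name on failure, so real PII does not end up in terminal scrollback or logs.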

Expected Duration for Each Step

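| Step  | Job Name           | Expected Time | Notes                                                         |
|-------|--------------------|---------------|---------------------------------------------------------------|
| 1     | Setup & Validation | ~30 seconds   | Fast: only API calls                                          |
| 2     | Stop MongoDB       | ~5 seconds    | Direct systemctl command                                      |
| 3     | Create Snapshot    | 2-3 minutes   | Longest step; depends on data size                            |
| 4     | Swap Volumes       | 1-2 minutes   | Volume creation + attach/detach                               |
| 5     | Mount Volume       | ~10 seconds   | Fast: local filesystem operation                              |
| 6     | Start & Verify     | ~20 seconds   | MongoDB startup + ping loop                                   |
| 7     | Anonymize          | ~30 seconds   | Depends on document count                                     |
| 8     | Cleanup            | ~10 seconds   | Async deletes                                                 |
| 9     | Summary            | ~10 seconds   | Git commit + push                                             |
| TOTAL | All Jobs           | 7-10 minutes  | End-to-end; the step estimates alone sum to about 5-7 minutes |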

Staging Downtime: Step 2 (stop) through Step 6 (start) ≈ 3.5-5.5 minutes, summing the estimates above (5 s + 2-3 min + 1-2 min + 10 s + 20 s)
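The totals can be sanity-checked by adding up the per-job estimates. This is a small sketch using the minimum and maximum figures from the table; the numbers are estimates from this runbook, not measurements.

```shell
# Per-job estimates in seconds (minimum and maximum), in job order 1..9.
MIN_SECS="30 5 120 60 10 20 30 10 10"
MAX_SECS="30 5 180 120 10 20 30 10 10"

# sum_range FIRST LAST VALUES...
# Adds up the values for jobs FIRST..LAST (1-based, inclusive).
sum_range() {
  range_first=$1; range_last=$2; shift 2
  range_total=0; range_i=0
  for range_secs in "$@"; do
    range_i=$((range_i + 1))
    if [ "$range_i" -ge "$range_first" ] && [ "$range_i" -le "$range_last" ]; then
      range_total=$((range_total + range_secs))
    fi
  done
  echo "$range_total"
}

# The lists are left unquoted on purpose so each number becomes one argument.
echo "All jobs: $(sum_range 1 9 $MIN_SECS)-$(sum_range 1 9 $MAX_SECS) s"
echo "Jobs 2-6: $(sum_range 2 6 $MIN_SECS)-$(sum_range 2 6 $MAX_SECS) s"
```

Jobs 2 through 6 are the window in which MongoDB on staging is stopped, so the second line is the estimated downtime.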


Rollback Procedures

Scenario 1: Workflow Fails During Snapshot Creation

Impact: None - production unchanged, staging unchanged
Action: None required - simply retry workflow

Scenario 2: Workflow Fails During Volume Swap

Impact: Staging down, production unaffected
Symptoms: Job 4 shows ❌

Recovery:

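# Get old staging volume ID from workflow logs
# Replace <run-id> with the failed run's ID (see: gh run list).
# The awk field index depends on the log line layout; adjust if the ID is not in field 4.
OLD_VOLUME_ID=$(gh run view <run-id> --log | grep "Old Volume" | awk '{print $4}')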

# Reattach old volume
aws ec2 attach-volume \
  --volume-id $OLD_VOLUME_ID \
  --instance-id i-05661b198eb8d9b0a \
  --device /dev/sdf

# SSH to staging
ssh ec2-user@<staging-ip>

# Mount and start
sudo mount /data/mongodb
sudo systemctl start mongod

# Verify
mongosh --eval "db.adminCommand('ping')"
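MongoDB may need a few seconds after `systemctl start` before it answers a ping, so a single check can fail spuriously. A generic retry helper is one way to handle that; the sketch below is an illustration and not the workflow's own ping loop.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  retry_attempts=$1; retry_delay=$2; shift 2
  retry_n=1
  while ! "$@"; do
    if [ "$retry_n" -ge "$retry_attempts" ]; then
      echo "gave up after $retry_n attempts: $*" >&2
      return 1
    fi
    retry_n=$((retry_n + 1))
    sleep "$retry_delay"
  done
}

# Example (run on the staging host): up to 10 pings, 3 seconds apart.
# retry 10 3 mongosh --quiet --eval "db.adminCommand('ping')"
```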

Scenario 3: Anonymization Fails

Impact: Staging has PRODUCTION data (non-anonymized)
Symptoms: Job 7 shows ❌
Action: CRITICAL - Block staging access immediately

# Block external access (if not already blocked)
aws ec2 revoke-security-group-ingress \
  --group-id sg-staging \
  --protocol tcp \
  --port 27017 \
  --cidr 0.0.0.0/0

# SSH to staging and manually run anonymization
ssh ec2-user@<staging-ip>
mongosh /home/ec2-user/anonymize_data.js

# Verify anonymization worked
mongosh --eval "db.getSiblingDB('userdb').users.findOne()"
# Check: No real PII data visible

# If anonymization still fails, roll back to the old volume (Scenario 2)

Scenario 4: Mount Fails After Volume Swap

Impact: Staging down, volume attached but not mounted
Symptoms: Job 5 shows ❌

Recovery:

# SSH to staging
ssh ec2-user@<staging-ip>

# Check device exists
lsblk
# Look for nvme1n1 or similar

# Manually run mount script
sudo bash /home/ec2-user/mount.sh

# If mount script fails, check filesystem
sudo file -s /dev/nvme1n1
# Expected: XFS filesystem

# Try manual mount
sudo mount -t xfs /dev/nvme1n1 /data/mongodb

# If still fails, check dmesg for errors
dmesg | tail -20

# Last resort: Format and restore from snapshot
# (Contact team lead before doing this)

Scenario 5: Complete Rollback (Return to Previous State)

When: Any major failure, data corruption, or extended issues
Impact: Staging reverted to pre-sync state

# 1. Get old volume ID from AWS console or previous workflow run
#    Tag: Environment=staging, look for detached volumes

OLD_VOLUME=vol-XXXXXXXXX  # From AWS console
NEW_VOLUME=vol-YYYYYYYYY  # Currently attached (failed)

# 2. Stop MongoDB on staging (unmount only if the stop succeeded)
ssh ec2-user@<staging-ip> "sudo systemctl stop mongod && sudo umount /data/mongodb"

# 3. Detach new (failed) volume
aws ec2 detach-volume --volume-id "$NEW_VOLUME"
aws ec2 wait volume-available --volume-ids "$NEW_VOLUME"  # Block until detachment completes

# 4. Reattach old volume
aws ec2 attach-volume \
  --volume-id $OLD_VOLUME \
  --instance-id i-05661b198eb8d9b0a \
  --device /dev/sdf
sleep 10  # Wait for attachment

# 5. Mount and start
ssh ec2-user@<staging-ip> "sudo mount /data/mongodb && sudo systemctl start mongod"

# 6. Verify
ssh ec2-user@<staging-ip> "mongosh --eval 'db.adminCommand(\"ping\")'"

# 7. Clean up failed volume
aws ec2 delete-volume --volume-id $NEW_VOLUME
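Because the last step deletes a volume, it is worth validating both variables before running any of the commands above. The following is a minimal pre-flight sketch; the function names are illustrative. It assumes EBS volume IDs are `vol-` followed by 8 or 17 lowercase hex characters.

```shell
# Return 0 if the argument looks like an EBS volume ID.
is_volume_id() {
  printf '%s\n' "$1" | grep -Eq '^vol-([0-9a-f]{8}|[0-9a-f]{17})$'
}

# preflight_rollback OLD_VOLUME NEW_VOLUME
# Refuses placeholders, malformed IDs, and identical IDs.
preflight_rollback() {
  pf_old=$1; pf_new=$2
  is_volume_id "$pf_old" || { echo "OLD_VOLUME is not a volume ID: $pf_old" >&2; return 1; }
  is_volume_id "$pf_new" || { echo "NEW_VOLUME is not a volume ID: $pf_new" >&2; return 1; }
  [ "$pf_old" != "$pf_new" ] || { echo "OLD_VOLUME and NEW_VOLUME are the same" >&2; return 1; }
}

# Example:
# preflight_rollback "$OLD_VOLUME" "$NEW_VOLUME" || exit 1
```

This catches the two mistakes most likely to cause damage here: leaving a placeholder such as `vol-XXXXXXXXX` in place, and setting both variables to the same volume so that step 7 deletes the one just reattached.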

Troubleshooting Guide

Issue: Self-Hosted Runner Offline

Symptoms: Workflow shows "Waiting for a runner..."

Diagnosis:

# SSH to staging instance
ssh ec2-user@<staging-ip>

# Check runner service
sudo systemctl status gha

Fix:

# Restart runner service
sudo systemctl restart gha

# Or re-register runner (if needed)
cd /home/ec2-user/actions-runner
sudo ./svc.sh stop
sudo ./svc.sh uninstall
./config.sh remove --token <TOKEN>  # Run as the runner user; config.sh refuses to run under sudo
# Re-register using GitHub UI instructions

Issue: Workflow Stuck on "Create Snapshot"

Symptoms: Job 3 runs for >10 minutes

Diagnosis:

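# Check snapshot status
# Replace <run-id> with the stuck run's ID (see: gh run list).
# The awk field index depends on the log line layout; adjust if the ID is not in field 4.
SNAPSHOT_ID=$(gh run view <run-id> --log | grep "Snapshot created" | awk '{print $4}')
aws ec2 describe-snapshots --snapshot-ids "$SNAPSHOT_ID"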

Fix:

  • If snapshot shows "error" state: Cancel workflow, retry
  • If snapshot shows "pending" for >15 min: AWS issue, contact support
  • Normal time: 2-3 minutes for 20GB
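The decision rules above can be written down as a small helper that maps a snapshot state and elapsed time to the next action. This is a sketch with an illustrative function name; the thresholds come from the list above.

```shell
# snapshot_action STATE MINUTES_ELAPSED
# Prints the recommended next step for a snapshot in the given state.
snapshot_action() {
  snap_state=$1; snap_minutes=$2
  case "$snap_state" in
    completed) echo "ok: snapshot finished" ;;
    error)     echo "cancel the workflow and retry" ;;
    pending)
      if [ "$snap_minutes" -gt 15 ]; then
        echo "pending too long: contact AWS support"
      else
        echo "wait: snapshot still in progress"
      fi ;;
    *)         echo "unexpected state: $snap_state" ;;
  esac
}

# Example, reading the state with the AWS CLI:
# STATE=$(aws ec2 describe-snapshots --snapshot-ids "$SNAPSHOT_ID" \
#           --query 'Snapshots[0].State' --output text)
# snapshot_action "$STATE" 12
```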

Issue: MongoDB Won't Start After Restore

Symptoms: Job 6 fails, MongoDB status shows "failed"

Diagnosis:

ssh ec2-user@<staging-ip>

# Check MongoDB logs
sudo journalctl -u mongod -n 100
sudo tail -n 50 /var/log/mongodb/mongod.log

# Check permissions
ls -la /data/mongodb
# Expected: owned by mongod:mongod

Fix:

# Fix ownership if needed
sudo chown -R mongod:mongod /data/mongodb

# Fix MongoDB config if corrupted
sudo cp /etc/mongod.conf.backup /etc/mongod.conf

# Restart
sudo systemctl start mongod

Issue: Document Count Mismatch

Symptoms: Staging shows different count than production

Diagnosis:

# Check production count
ssh ec2-user@<production-ip>
mongosh --eval "db.getSiblingDB('userdb').users.countDocuments()"

# Check staging count
ssh ec2-user@<staging-ip>
mongosh --eval "db.getSiblingDB('userdb').users.countDocuments()"

Possible Causes:

  • Staging has old data (sync failed silently)
  • Production was actively writing during snapshot (expected - snapshot is point-in-time)
  • Anonymization deleted documents (shouldn't happen)

Fix: Re-run sync workflow


Criteria for Success

  • All 9 workflow jobs show green checkmarks
  • README updated with new restore timestamp
  • Staging MongoDB is running and responsive
  • Document count matches production
  • PII fields are anonymized (if enabled)
  • Total duration under 15 minutes
  • No AWS errors or rate limits hit
  • Old snapshot and volume deleted
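The duration criterion can be checked against the value in the README status table. The sketch below assumes the duration is written as `<minutes>m <seconds>s`, as in the `8m 34s` example in this runbook; the function names are illustrative.

```shell
# duration_to_seconds "8m 34s"
# Converts a "<minutes>m <seconds>s" string to seconds.
duration_to_seconds() {
  dur_mins=${1%%m*}            # text before the first "m"
  dur_rest=${1#*m}             # text after the first "m"
  dur_secs=${dur_rest%s}       # drop the trailing "s"
  dur_secs=${dur_secs# }       # drop a single leading space
  dur_mins=${dur_mins#0}       # strip one leading zero so "08" is not read as octal
  dur_secs=${dur_secs#0}
  echo $(( ${dur_mins:-0} * 60 + ${dur_secs:-0} ))
}

# within_limit DURATION
# Returns 0 if the duration is under the 15-minute success criterion.
within_limit() {
  [ "$(duration_to_seconds "$1")" -lt $((15 * 60)) ]
}

# Example:
# within_limit "8m 34s" && echo "duration OK"
```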