Purpose: Clone production MongoDB data to staging with PII anonymization
Duration: 7-10 minutes
Staging Downtime: ~3.5-5.5 minutes (from stopping MongoDB until it is started again)
Production Impact: None (read-only operation)
Frequency: On-demand or weekly schedule
1. Open GitHub repository
2. Click "Actions" tab
3. Select "Production to Staging DB Sync" workflow
4. Click "Run workflow" button (top right)
5. Set the workflow options:
   - Branch: main (or desired branch)
   - Anonymize PII data: ✓ Yes (checked) ← IMPORTANT for production data
   - Target environment: staging
6. Click "Run workflow"
Expected duration: 7-10 minutes
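The same run can also be started from a terminal with the GitHub CLI. This is a minimal sketch, assuming `gh` is installed and authenticated against the repository. The input names `anonymize_pii` and `target_environment` are hypothetical and must be replaced with the `workflow_dispatch` inputs actually declared in the workflow file. The `GH` variable exists only so the command can be dry-run.

```shell
# Trigger the sync workflow from the command line.
# ASSUMPTION: the input names below are placeholders, not confirmed by this runbook.
run_sync() (
  branch="${1:-main}"        # branch to run from
  anonymize="${2:-true}"     # keep true for production data
  target="${3:-staging}"     # target environment
  "${GH:-gh}" workflow run "Production to Staging DB Sync" \
    --ref "$branch" \
    -f anonymize_pii="$anonymize" \
    -f target_environment="$target"
)

# Dry run (prints the arguments instead of calling gh):
#   GH=echo run_sync
# Real run:
#   run_sync main true staging
```

Defaulting `anonymize` to `true` means a plain `run_sync` never copies raw PII to staging by accident.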
Watch for 9 jobs to complete in sequence:
1. ✅ Setup and Validate Environment (~30 sec)
2. ✅ Stop MongoDB on Staging (~5 sec)
3. ✅ Create Snapshot from Production (~2-3 min)
4. ✅ Create and Swap EBS Volumes (~1-2 min)
5. ✅ Mount Volume on Staging (~10 sec)
6. ✅ Start MongoDB and Verify Data (~20 sec)
7. ✅ Anonymize PII Data (~30 sec)
8. ✅ Cleanup Resources (~10 sec)
9. ✅ Final Summary (~10 sec)
All jobs should show green checkmarks ✅
The workflow automatically verifies:
- ✅ Volume mounted successfully
- ✅ MongoDB started and responsive
- ✅ Documents counted (matches production)
- ✅ Anonymization completed
- ✅ Old resources cleaned up
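The anonymization check can also be done by hand as a pattern test on one sampled document. This sketch assumes the anonymized formats are `User <ID>`, `user<ID>@anonymized.local`, and `XXX-XX-XXXX`, and that `<ID>` is alphanumeric (an assumption; adjust the patterns if IDs look different). The helper name `is_anonymized` is illustrative and not part of the workflow.

```shell
# Return 0 when name, email and SSN all match the anonymized formats, 1 otherwise.
# ASSUMPTION: <ID> consists of letters and digits only.
is_anonymized() {
  printf '%s\n' "$1" | grep -Eq '^User [A-Za-z0-9]+$' || return 1
  printf '%s\n' "$2" | grep -Eq '^user[A-Za-z0-9]+@anonymized\.local$' || return 1
  [ "$3" = "XXX-XX-XXXX" ]
}

# Example:
#   is_anonymized "User 42" "user42@anonymized.local" "XXX-XX-XXXX" && echo "anonymized"
```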
# View updated README.md in repository
grep -A 5 "Last Restore Status" README.md
Expected:
| Environment | Last Restored | Status | Documents | Anonymized | Duration |
| Staging | 2026-03-03 XX:XX | ✅ Success | 10 | ✅ Yes | 8m 34s |
# SSH to staging
ssh ec2-user@<staging-ip>
# Check MongoDB is running
sudo systemctl status mongod
# Expected: active (running)
# Count documents
mongosh --quiet --eval "db.getSiblingDB('userdb').users.countDocuments()"
# Expected: Same count as production
# Verify anonymization
mongosh --quiet --eval "
// Use a separate name so the global db handle is not shadowed
const userDb = db.getSiblingDB('userdb');
const sample = userDb.users.findOne();
print('Name: ' + sample.name); // Should be: User <ID>
print('Email: ' + sample.email); // Should be: user<ID>@anonymized.local
print('SSN: ' + sample.ssn); // Should be: XXX-XX-XXXX
"
# Expected: All PII fields anonymized
# Check volume mount
df -h | grep mongodb
# Expected: /dev/nvme1n1 mounted on /data/mongodb with data
| Step | Job Name | Expected Time | Notes |
|---|---|---|---|
| 1 | Setup & Validation | ~30 seconds | Fast: Only API calls |
| 2 | Stop MongoDB | ~5 seconds | Direct systemctl command |
| 3 | Create Snapshot | 2-3 minutes | Longest step - depends on data size |
| 4 | Swap Volumes | 1-2 minutes | Volume creation + attach/detach |
| 5 | Mount Volume | ~10 seconds | Fast: Local filesystem operation |
| 6 | Start & Verify | ~20 seconds | MongoDB startup + ping loop |
| 7 | Anonymize | ~30 seconds | Depends on document count |
| 8 | Cleanup | ~10 seconds | Async deletes |
| 9 | Summary | ~10 seconds | Git commit + push |
| TOTAL | All Jobs | 7-10 minutes | End-to-end; the job estimates alone sum to ~5-7 minutes |
Staging Downtime: Step 2 (stop) through Step 6 (start) = ~3.5-5.5 minutes, based on the step estimates above
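The per-job estimates in the table can be added up to sanity-check the total and the downtime window. This is a minimal sketch using only the figures from the table, with ranges converted to seconds (2-3 minutes becomes 120-180, and so on). Time spent waiting for a runner or starting each job is not modeled.

```shell
# Per-job estimates from the table, in seconds: "job min max"
estimates='1 30 30
2 5 5
3 120 180
4 60 120
5 10 10
6 20 20
7 30 30
8 10 10
9 10 10'

# Sum the min and max columns for jobs in the inclusive range [first, last].
sum_range() {
  printf '%s\n' "$estimates" | awk -v first="$1" -v last="$2" '
    $1 >= first && $1 <= last { lo += $2; hi += $3 }
    END { printf "%d-%d seconds\n", lo, hi }'
}

sum_range 1 9   # all jobs            -> 295-415 seconds
sum_range 2 6   # stop through start  -> 215-335 seconds
```

The job estimates alone therefore come to roughly 5-7 minutes, and jobs 2 through 6 to roughly 3.5-5.5 minutes.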
Impact: None - production unchanged, staging unchanged
Action: None required - simply retry workflow
Impact: Staging down, production unaffected
Symptoms: Job 4 shows ❌
Recovery:
# Get old staging volume ID from workflow logs
# (replace <run-id> with the failed run; the awk field depends on the log line format)
OLD_VOLUME_ID=$(gh run view <run-id> --log | grep "Old Volume" | awk '{print $4}')
# Reattach old volume
aws ec2 attach-volume \
--volume-id "$OLD_VOLUME_ID" \
--instance-id i-05661b198eb8d9b0a \
--device /dev/sdf
# SSH to staging
ssh ec2-user@<staging-ip>
# Mount and start
sudo mount /data/mongodb
sudo systemctl start mongod
# Verify
mongosh --eval "db.adminCommand('ping')"
Impact: Staging has PRODUCTION data (non-anonymized)
Symptoms: Job 7 shows ❌
Action: CRITICAL - Block staging access immediately
# Block external access (if not already blocked)
aws ec2 revoke-security-group-ingress \
--group-id sg-staging \
--protocol tcp \
--port 27017 \
--cidr 0.0.0.0/0
# SSH to staging and manually run anonymization
ssh ec2-user@<staging-ip>
mongosh /home/ec2-user/anonymize_data.js
# Verify anonymization worked
mongosh --eval "db.getSiblingDB('userdb').users.findOne()"
# Check: No real PII data visible
# If anonymization still fails, roll back to the old volume (Scenario 2, the Job 4 recovery above)
Impact: Staging down, volume attached but not mounted
Symptoms: Job 5 shows ❌
Recovery:
# SSH to staging
ssh ec2-user@<staging-ip>
# Check device exists
lsblk
# Look for nvme1n1 or similar
# Manually run mount script
sudo bash /home/ec2-user/mount.sh
# If mount script fails, check filesystem
sudo file -s /dev/nvme1n1
# Expected: XFS filesystem
# Try manual mount
sudo mount -t xfs /dev/nvme1n1 /data/mongodb
# If still fails, check dmesg for errors
dmesg | tail -20
# Last resort: Format and restore from snapshot
# (Contact team lead before doing this)
When: Any major failure, data corruption, or extended issues
Impact: Staging reverted to pre-sync state
# 1. Get old volume ID from AWS console or previous workflow run
# Tag: Environment=staging, look for detached volumes
OLD_VOLUME=vol-XXXXXXXXX # From AWS console
NEW_VOLUME=vol-YYYYYYYYY # Currently attached (failed)
# 2. Stop MongoDB on staging
ssh ec2-user@<staging-ip> "sudo systemctl stop mongod && sudo umount /data/mongodb"
# 3. Detach new (failed) volume
aws ec2 detach-volume --volume-id "$NEW_VOLUME"
aws ec2 wait volume-available --volume-ids "$NEW_VOLUME" # Block until detachment completes
# 4. Reattach old volume
aws ec2 attach-volume \
--volume-id $OLD_VOLUME \
--instance-id i-05661b198eb8d9b0a \
--device /dev/sdf
sleep 10 # Wait for attachment
# 5. Mount and start
ssh ec2-user@<staging-ip> "sudo mount /data/mongodb && sudo systemctl start mongod"
# 6. Verify
ssh ec2-user@<staging-ip> "mongosh --eval 'db.adminCommand(\"ping\")'"
# 7. Clean up failed volume
aws ec2 delete-volume --volume-id $NEW_VOLUME
Symptoms: Workflow shows "Waiting for a runner..."
Diagnosis:
# SSH to staging instance
ssh ec2-user@<staging-ip>
# Check runner service
sudo systemctl status gha
Fix:
# Restart runner service
sudo systemctl restart gha
# Or re-register runner (if needed)
cd /home/ec2-user/actions-runner
sudo ./svc.sh stop
sudo ./svc.sh uninstall
./config.sh remove --token <TOKEN> # Run without sudo; config.sh refuses to run as root by default
# Re-register using GitHub UI instructions
Symptoms: Job 3 runs for >10 minutes
Diagnosis:
# Check snapshot status
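# (replace <run-id> with the running workflow; the awk field depends on the log line format)
SNAPSHOT_ID=$(gh run view <run-id> --log | grep "Snapshot created" | awk '{print $4}')
aws ec2 describe-snapshots --snapshot-ids "$SNAPSHOT_ID"
Fix: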
- If snapshot shows "error" state: Cancel workflow, retry
- If snapshot shows "pending" for >15 min: AWS issue, contact support
- Normal time: 2-3 minutes for 20GB
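The decision rules above can be written as a small helper that maps a snapshot state and the elapsed time to an action. This is a sketch only: the function name `snapshot_action` is illustrative, and the `completed` and fallback branches are additions that the list does not cover.

```shell
# Map a snapshot state and elapsed minutes to the action from the list above.
snapshot_action() {
  state="$1"
  elapsed_min="$2"
  case "$state" in
    error)
      echo "cancel workflow and retry" ;;
    pending)
      if [ "$elapsed_min" -gt 15 ]; then
        echo "contact AWS support"
      else
        echo "keep waiting"
      fi ;;
    completed)
      echo "ok" ;;                       # not in the list above; added for completeness
    *)
      echo "unknown state: $state" ;;    # any other state: inspect manually
  esac
}

# Example, using the state reported by AWS:
#   STATE=$(aws ec2 describe-snapshots --snapshot-ids "$SNAPSHOT_ID" \
#     --query 'Snapshots[0].State' --output text)
#   snapshot_action "$STATE" 12
```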
Symptoms: Job 6 fails, MongoDB status shows "failed"
Diagnosis:
ssh ec2-user@<staging-ip>
# Check MongoDB logs
sudo journalctl -u mongod -n 100
sudo tail -n 50 /var/log/mongodb/mongod.log
# Check permissions
ls -la /data/mongodb
# Expected: owned by mongod:mongod
Fix:
# Fix ownership if needed
sudo chown -R mongod:mongod /data/mongodb
# Fix MongoDB config if corrupted
sudo cp /etc/mongod.conf.backup /etc/mongod.conf
# Restart
sudo systemctl start mongod
Symptoms: Staging shows different count than production
Diagnosis:
# Check production count
ssh ec2-user@<production-ip>
mongosh --eval "db.getSiblingDB('userdb').users.countDocuments()"
# Check staging count
ssh ec2-user@<staging-ip>
mongosh --eval "db.getSiblingDB('userdb').users.countDocuments()"
Possible Causes:
- Staging has old data (sync failed silently)
- Production was actively writing during snapshot (expected - snapshot is point-in-time)
- Anonymization deleted documents (shouldn't happen)
Fix: Re-run sync workflow
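Before re-running, it can help to record how far apart the two counts are, since a small gap is expected when production was writing during the snapshot. This is a minimal sketch. The helper name `compare_counts` is illustrative, and the two numbers are assumed to come from the `countDocuments()` commands above.

```shell
# Compare production and staging document counts and report the difference.
compare_counts() {
  prod="$1"
  staging="$2"
  if [ "$prod" -eq "$staging" ]; then
    echo "match"
  else
    echo "mismatch: production=$prod staging=$staging diff=$((prod - staging))"
  fi
}

# Example:
#   compare_counts 12 10
```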
✅ All 9 workflow jobs show green checkmarks
✅ README updated with new restore timestamp
✅ Staging MongoDB is running and responsive
✅ Document count matches production
✅ PII fields are anonymized (if enabled)
✅ Total duration under 15 minutes
✅ No AWS errors or rate limits hit
✅ Old snapshot and volume deleted
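The duration criterion can be checked against the Duration column of the README status table. This is a sketch assuming the duration is written as in the sample row (`8m 34s`). The helper names are illustrative.

```shell
# Convert a duration such as "8m 34s" to seconds.
duration_seconds() {
  printf '%s\n' "$1" | awk '{
    total = 0
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^[0-9]+m$/) total += 60 * substr($i, 1, length($i) - 1)
      else if ($i ~ /^[0-9]+s$/) total += substr($i, 1, length($i) - 1)
    }
    print total
  }'
}

# Succeed when the duration is under the 15-minute limit.
under_limit() {
  [ "$(duration_seconds "$1")" -lt $((15 * 60)) ]
}

# Example:
#   under_limit "8m 34s" && echo "duration OK"
```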