Complete guide to setting up automated health monitoring and backups for AutoMem on Railway.
AutoMem includes three layers of data protection:
- Persistent Volumes - Railway volumes for FalkorDB data
- Dual Storage - Data stored in both FalkorDB (graph) and Qdrant (vectors)
- Automated Backups - Scheduled exports to compressed JSON + optional S3 upload
flowchart TB
subgraph layer1 [Layer 1: Persistent Volumes]
RailwayVol[Railway Volume Snapshots<br/>Automatic every 24h<br/>One-click restore]
end
subgraph layer2 [Layer 2: Dual Storage]
FalkorDB[(FalkorDB<br/>Graph Database<br/>Canonical record)]
Qdrant[(Qdrant<br/>Vector Database<br/>Semantic search)]
FalkorDB <-->|Redundancy| Qdrant
end
subgraph layer3 [Layer 3: Automated Backups]
Script[Backup Scripts<br/>Compressed JSON]
Local[Local Backups<br/>./backups/]
S3[S3 Cloud Storage<br/>Cross-region]
GitHub[GitHub Actions<br/>Free tier]
Script --> Local
Script --> S3
GitHub --> Script
end
layer1 --> layer2
layer2 --> layer3
Recovery[Recovery Options]
RailwayVol -.->|Quick restore| Recovery
Qdrant -.->|Rebuild FalkorDB| Recovery
Local -.->|Full restore| Recovery
S3 -.->|Disaster recovery| Recovery
The health_monitor.py script continuously monitors system health and can automatically trigger recovery.
Option 1: Deploy as Railway Service (Recommended)
Create a new Railway service for continuous monitoring:
# In Railway dashboard
1. Create new service from GitHub repo
2. Set Dockerfile path: scripts/Dockerfile.health-monitor (we'll create this)
3. Configure environment variables (same as main service)
4. Deploy
Option 2: Run as Cron Job
# One-time health check (safe)
railway run --service automem python scripts/health_monitor.py --once
# Alert-only monitoring (no auto-recovery)
railway run --service automem python scripts/health_monitor.py --interval 300
# With Slack webhook alerts
railway run --service automem python scripts/health_monitor.py \
--interval 300 \
--webhook https://hooks.slack.com/services/YOUR/WEBHOOK/URL
Set these environment variables on your monitoring service:
# Required (same as main service)
FALKORDB_HOST=falkordb.railway.internal
FALKORDB_PORT=6379
FALKORDB_PASSWORD=<your-password>
QDRANT_URL=<your-qdrant-url>
QDRANT_API_KEY=<your-qdrant-key>
AUTOMEM_API_URL=https://your-automem-deployment.up.railway.app
# Optional monitoring settings
HEALTH_MONITOR_DRIFT_THRESHOLD=5 # Warning at 5% drift
HEALTH_MONITOR_CRITICAL_THRESHOLD=50 # Critical at 50% drift
HEALTH_MONITOR_WEBHOOK=<slack-webhook> # Alert webhook
Enable automatic recovery when data loss is detected:
python scripts/health_monitor.py \
--auto-recover \
--interval 300 \
--critical-threshold 50
Already configured! If you're using Railway, your FalkorDB service has automatic volume backups enabled.
Features:
- ✅ Automatic snapshots (default: every 24 hours)
- ✅ One-click restore from Railway dashboard
- ✅ Included with Railway Pro (no extra cost)
- ✅ Instant volume snapshots
Access backups:
- Railway Dashboard → falkordb service
- Click "Backups" tab
- View backup history and schedule
- Click "Restore" to recover from any snapshot
Limitations:
- Only backs up FalkorDB (not Qdrant)
- Platform-locked (can't export/download)
- Use for quick recovery; combine with script backups for full protection
For portable backups that cover both databases, use the backup_automem.py script, which exports both FalkorDB and Qdrant to compressed JSON files:
# Basic backup to ./backups/
python scripts/backup_automem.py
# Backup with cleanup (keep last 7)
python scripts/backup_automem.py --cleanup --keep 7
# Custom backup directory
python scripts/backup_automem.py --backup-dir /mnt/backups
Upload backups to S3 for disaster recovery:
# Install AWS SDK
pip install boto3
# Configure AWS credentials (Railway secrets)
export AWS_ACCESS_KEY_ID=<your-key>
export AWS_SECRET_ACCESS_KEY=<your-secret>
export AWS_DEFAULT_REGION=us-east-1
# Backup with S3 upload
python scripts/backup_automem.py \
--s3-bucket my-automem-backups \
--cleanup --keep 7
Recommended: GitHub Actions (Free)
GitHub Actions is the simplest way to automate backups - free and doesn't consume Railway resources.
Setup (5 minutes):
1. Workflow file already exists: .github/workflows/backup.yml
2. Enable TCP Proxy on FalkorDB (required for external access):
   ⚠️ Critical: GitHub Actions runners are external to Railway's network. They cannot use internal hostnames like falkordb.railway.internal. You must enable TCP Proxy for external connectivity.
   - Railway Dashboard → falkordb service
   - Settings → Networking → Enable TCP Proxy
   - Note the public endpoint: monorail.proxy.rlwy.net:12345 (example)
flowchart LR
subgraph railway [Railway Network]
FalkorDB[(FalkorDB<br/>:6379)]
TCPProxy[TCP Proxy<br/>monorail.proxy.rlwy.net:12345]
end
subgraph external [External]
GHA[GitHub Actions]
Local[Local Dev]
end
GHA -->|Public Internet| TCPProxy
Local -->|Public Internet| TCPProxy
TCPProxy -->|Internal| FalkorDB
3. Add GitHub secrets:
   - Go to: GitHub repo → Settings → Secrets and variables → Actions
   - Add these secrets:
FALKORDB_HOST = monorail.proxy.rlwy.net # TCP Proxy domain (NOT .railway.internal!)
FALKORDB_PORT = 12345 # TCP Proxy port (NOT 6379!)
FALKORDB_PASSWORD = (from Railway FalkorDB service variables)
QDRANT_URL = (your Qdrant Cloud URL)
QDRANT_API_KEY = (your Qdrant Cloud API key)
   - Optional for S3: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
   Where to find TCP Proxy details:
   - Railway Dashboard → falkordb service → Settings → Networking
   - Look for "TCP Proxy" section → shows RAILWAY_TCP_PROXY_DOMAIN and RAILWAY_TCP_PROXY_PORT
4. Push and test:
git push origin main
   - Go to Actions tab → "AutoMem Backup" → Run workflow
Runs every 6 hours automatically. Free tier: 2000 minutes/month.
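As a rough check that a 6-hour schedule fits the free tier, the arithmetic can be sketched as below. The 30-day month and the function name are assumptions made for illustration; the 2000-minute allowance and the 6-hour interval come from the text above.

```python
def actions_budget(free_minutes: int = 2000, interval_hours: int = 6, days: int = 30):
    """Return (runs per month, minutes available per run) for a fixed schedule.

    Assumes a 30-day month and that no other workflows in the account
    consume the same free-tier allowance.
    """
    runs_per_month = (24 // interval_hours) * days
    minutes_per_run = free_minutes / runs_per_month
    return runs_per_month, minutes_per_run


runs, per_run = actions_budget()
print(f"{runs} runs/month, about {per_run:.1f} minutes available per run")
```

Under these assumptions each backup run can take a little over 16 minutes before the allowance is exhausted.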
If the backup workflow fails with a connection error, GitHub Actions can't connect to FalkorDB. Common causes:
- TCP Proxy not enabled: Enable it in Railway Dashboard → falkordb → Settings → Networking
- Wrong host/port in secrets: Must use TCP proxy endpoint, not internal hostname
- Firewall: Railway TCP Proxy should be accessible from anywhere, but check if your Railway plan has restrictions
Verify your setup:
# Test from your local machine (should work if TCP proxy is enabled)
redis-cli -h monorail.proxy.rlwy.net -p 12345 -a YOUR_PASSWORD ping
# Should return: PONG
# If redis-cli is not installed:
# macOS: brew install redis
# Ubuntu: sudo apt-get install redis-tools
# Windows: choco install redis
Debug checklist:
- TCP Proxy is enabled on FalkorDB service
- FALKORDB_HOST secret uses TCP proxy domain (e.g., monorail.proxy.rlwy.net)
- FALKORDB_PORT secret uses TCP proxy port (e.g., 12345), NOT 6379
- FALKORDB_PASSWORD matches the password in FalkorDB service variables
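The host and port items in this checklist can also be checked mechanically before saving the secrets. The following is a minimal sketch, not part of AutoMem: the function name and messages are hypothetical, and it only encodes the two rules stated above (no .railway.internal hostname, no internal port 6379).

```python
from typing import List

INTERNAL_SUFFIX = ".railway.internal"  # internal hostnames are unreachable from GitHub Actions
INTERNAL_PORT = 6379                   # FalkorDB's internal port, not the TCP proxy port


def check_external_endpoint(host: str, port: int) -> List[str]:
    """Return a list of problems found in a FALKORDB_HOST / FALKORDB_PORT pair."""
    problems = []
    if host.strip().lower().endswith(INTERNAL_SUFFIX):
        problems.append("FALKORDB_HOST is an internal hostname; use the TCP proxy domain")
    if port == INTERNAL_PORT:
        problems.append("FALKORDB_PORT is the internal port 6379; use the TCP proxy port")
    return problems


# Example values taken from this guide
print(check_external_endpoint("monorail.proxy.rlwy.net", 12345))
print(check_external_endpoint("falkordb.railway.internal", 6379))
```

An empty list means both values look like an external endpoint; it does not prove the proxy is reachable, so the redis-cli ping above is still the real test.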
Advanced: Railway Backup Service
For Railway Pro users who want backups running on Railway:
The scripts/Dockerfile.backup file exists and runs backups every 6 hours in a loop. However, deploying it requires the Railway CLI:
cd /path/to/automem
railway link
railway up --service backup-service
Then configure in Railway dashboard:
- Set Builder to Dockerfile
- Dockerfile Path: scripts/Dockerfile.backup
- Add environment variables (same as the AutoMem API service)
Cost: ~$1-2/month
Recommendation: Use GitHub Actions instead unless you have specific requirements for Railway-hosted backups.
flowchart TD
Start{What data<br/>is lost?}
Start -->|Only FalkorDB| QdrantCheck{Is Qdrant<br/>intact?}
Start -->|Only Qdrant| BackupCheck1{Have recent<br/>backups?}
Start -->|Both databases| BackupCheck2{Have recent<br/>backups?}
Start -->|None just testing| NoAction[No action needed]
QdrantCheck -->|Yes| RecoverQdrant[Use recover_from_qdrant.py<br/>⚡ Fastest 5-10 min]
QdrantCheck -->|No| BackupCheck3{Have backups?}
BackupCheck1 -->|Yes| RestoreQdrant[Restore Qdrant from backup<br/>Then rebuild FalkorDB<br/>⏱️ 15-30 min]
BackupCheck1 -->|No| DataLoss1[⚠️ Partial data loss<br/>Rebuild from remaining data]
BackupCheck2 -->|Yes Railway| RailwayRestore[Railway volume restore<br/>FalkorDB only<br/>⏱️ 5 min]
BackupCheck2 -->|Yes S3/Local| FullRestore[Full restore from backups<br/>1. Restore Qdrant<br/>2. Rebuild FalkorDB<br/>⏱️ 20-40 min]
BackupCheck2 -->|No| DataLoss2[⚠️ Complete data loss<br/>Start fresh]
BackupCheck3 -->|Yes| FullRestore
BackupCheck3 -->|No| DataLoss3[⚠️ Complete data loss<br/>Start fresh]
RecoverQdrant --> Verify[Verify data integrity<br/>Check memory count]
RestoreQdrant --> Verify
RailwayRestore --> Verify
FullRestore --> Verify
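The decision flow above can be written down as a small function, which is handy in a runbook or an alert handler. This is a sketch under stated assumptions: the function name, argument names, and backup labels are hypothetical; the "only FalkorDB lost but Qdrant not intact" branch is folded into the "both lost" case; and a Railway snapshot is assumed not to help when Qdrant is lost, since it only covers FalkorDB.

```python
from typing import Optional

VALID_BACKUPS = (None, "railway", "s3", "local")


def choose_recovery(falkordb_lost: bool, qdrant_lost: bool, backup: Optional[str] = None) -> str:
    """Map a data-loss situation to the recovery path from the flowchart."""
    if backup not in VALID_BACKUPS:
        raise ValueError(f"unknown backup type: {backup!r}")
    if not falkordb_lost and not qdrant_lost:
        return "No action needed"
    if falkordb_lost and not qdrant_lost:
        # Qdrant is intact, so the graph can be rebuilt from it
        return "Run recover_from_qdrant.py"
    if qdrant_lost and not falkordb_lost:
        if backup in ("s3", "local"):
            return "Restore Qdrant from backup, then rebuild FalkorDB"
        return "Partial data loss: rebuild from remaining data"
    # Both databases lost
    if backup == "railway":
        return "Railway volume restore (FalkorDB only)"
    if backup in ("s3", "local"):
        return "Full restore from backups: restore Qdrant, then rebuild FalkorDB"
    return "Complete data loss: start fresh"


print(choose_recovery(falkordb_lost=True, qdrant_lost=False))
print(choose_recovery(falkordb_lost=True, qdrant_lost=True, backup="s3"))
```

Whichever path is chosen, the flowchart ends the same way: verify data integrity and check the memory count.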
If FalkorDB data is lost but Qdrant is intact:
railway run --service automem python scripts/recover_from_qdrant.py
This rebuilds the FalkorDB graph from Qdrant vectors and payloads.
If both FalkorDB and Qdrant are lost, restore from backup:
# Download from S3
aws s3 cp s3://my-automem-backups/qdrant/qdrant_20251005_143000.json.gz ./restore/
# Extract
gunzip restore/qdrant_20251005_143000.json.gz
# Restore to Qdrant
python scripts/restore_from_backup.py restore/qdrant_20251005_143000.json
# Then restore FalkorDB from Qdrant
python scripts/recover_from_qdrant.py
Note: restore_from_backup.py does not exist yet; we'll create it if you need it.
Check system health via API:
curl https://your-automem-deployment.up.railway.app/health | jq
Response:
{
"status": "healthy",
"falkordb": "connected",
"qdrant": "connected",
"graph": "memories",
"timestamp": "2025-10-05T14:45:00Z"
}
Monitor your services:
- Metrics: CPU, memory, network usage
- Logs: Real-time log streaming
- Deployments: Build history and status
- Health Checks: Automated uptime monitoring
Set up external monitoring with:
- UptimeRobot - Free HTTP monitoring
   - Monitor: https://your-automem-deployment.up.railway.app/health
   - Alert when status != "healthy"
- Better Uptime - Advanced monitoring
   - HTTP checks + keyword monitoring
   - SMS/Slack/Email alerts
- Grafana Cloud - Full observability
   - Custom dashboards
   - Metrics aggregation
   - Log correlation
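If you prefer a self-hosted check over one of these services, the same "alert when status != healthy" rule can be sketched in a few lines. This assumes the /health response has the fields shown in the sample response earlier in this guide (status, falkordb, qdrant); the function name is hypothetical.

```python
import json


def is_healthy(body: str) -> bool:
    """Return True only if the /health JSON body reports a fully healthy system."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False  # a non-JSON body counts as unhealthy
    if not isinstance(data, dict):
        return False
    return (
        data.get("status") == "healthy"
        and data.get("falkordb") == "connected"
        and data.get("qdrant") == "connected"
    )


sample = '{"status": "healthy", "falkordb": "connected", "qdrant": "connected", "graph": "memories"}'
print(is_healthy(sample))
```

The body can come from any HTTP client; treating unparseable responses as unhealthy keeps the check conservative.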
- Health checks: Every 5 minutes (alert-only)
- Backups: Every 24 hours, keep 7 days
- Recovery: Manual trigger
- Health checks: Every 2 minutes (with auto-recovery)
- Backups: Every 6 hours, keep 14 days + S3
- Recovery: Automatic on critical drift
- Health checks: Every 30 seconds (with auto-recovery)
- Backups: Every 1 hour, keep 30 days + S3 + cross-region replication
- Recovery: Automatic with alerts
# Get webhook URL from Slack App settings
# https://api.slack.com/messaging/webhooks
python scripts/health_monitor.py \
--webhook https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXX
# Discord webhooks work the same as Slack
python scripts/health_monitor.py \
--webhook https://discord.com/api/webhooks/123456789/abcdefg
The health monitor sends JSON payloads:
{
"level": "critical",
"title": "Data Loss Detected",
"message": "FalkorDB has 52.3% drift from Qdrant",
"details": {
"drift_percent": 52.3,
"falkordb_count": 420,
"qdrant_count": 884
},
"timestamp": "2025-10-05T14:45:00Z",
"system": "AutoMem Health Monitor"
}
- ✅ Main API service
- ✅ FalkorDB service with 1GB volume
- ❌ Not enough resources for monitoring service
- ✅ Main API service (~$5)
- ✅ FalkorDB service (~$10)
- ✅ Health monitoring service (~$2)
- ✅ Backup service (~$1)
- Total: ~$18/month
- Railway Pro for main services (~$15)
- GitHub Actions for backups (free)
- UptimeRobot for monitoring (free)
- Total: ~$15/month
- Storage: ~$0.023/GB/month (Standard)
- Requests: ~$0.005/1000 PUTs
- Example: 100MB backup every 6 hours = ~$0.30/month
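The example figure can be reproduced with a short calculation. This is a sketch under assumptions not stated in the guide: every backup is retained for a full 30-day month, 1 GB is counted as 1000 MB, and each backup costs one PUT request. The function name is hypothetical; the prices are the ones listed above.

```python
def estimate_s3_monthly_cost(
    backup_mb: float = 100,
    interval_hours: float = 6,
    days: int = 30,
    storage_per_gb: float = 0.023,   # Standard storage, per GB-month
    put_per_1000: float = 0.005,     # per 1000 PUT requests
) -> float:
    """Rough monthly S3 cost in USD for fixed-size periodic backups."""
    uploads = (24 / interval_hours) * days      # number of backups per month
    stored_gb = uploads * backup_mb / 1000      # total retained, assuming nothing is deleted
    storage_cost = stored_gb * storage_per_gb
    request_cost = uploads / 1000 * put_per_1000
    return storage_cost + request_cost


print(f"~${estimate_s3_monthly_cost():.2f}/month")
```

With the defaults this comes to roughly $0.28, in line with the ~$0.30 estimate; storage dominates and request charges are negligible at this volume.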
Problem: FalkorDB and Qdrant counts don't match
Causes:
- In-flight writes during check (normal, <1% drift)
- Failed writes to one store (>5% drift - warning)
- Data loss event (>50% drift - critical)
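To make these bands concrete, here is a minimal sketch of a drift calculation and classification. The formula (absolute difference relative to the larger count) and the function names are assumptions for illustration; health_monitor.py may compute drift differently. The 5% and 50% defaults mirror the thresholds above.

```python
def drift_percent(falkordb_count: int, qdrant_count: int) -> float:
    """Difference between the two stores as a percentage of the larger count."""
    largest = max(falkordb_count, qdrant_count)
    if largest == 0:
        return 0.0  # both stores empty: nothing to compare
    return abs(falkordb_count - qdrant_count) / largest * 100


def classify_drift(percent: float, warning: float = 5.0, critical: float = 50.0) -> str:
    """Map a drift percentage to "ok", "warning", or "critical"."""
    if percent > critical:
        return "critical"
    if percent > warning:
        return "warning"
    return "ok"


drift = drift_percent(falkordb_count=90, qdrant_count=100)
print(f"{drift:.1f}% drift -> {classify_drift(drift)}")
```

Small, short-lived drift from in-flight writes falls in the "ok" band; only sustained drift above the warning threshold needs attention.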
Solution:
# Check health details
python scripts/health_monitor.py --once
# If critical, run recovery
python scripts/recover_from_qdrant.py
Problem: Backup script fails with connection error
Solution:
# Test connections
curl https://your-automem-deployment.up.railway.app/health
# Check credentials
echo $FALKORDB_PASSWORD
echo $QDRANT_API_KEY
# Try manual backup
python scripts/backup_automem.py
Problem: Backup created but S3 upload failed
Solution:
# Check AWS credentials
aws s3 ls s3://my-automem-backups/
# Test upload manually
aws s3 cp backups/falkordb/latest.json.gz s3://my-automem-backups/test/
# Check boto3 installation
python -c "import boto3; print(boto3.__version__)"
- Set up health monitoring service on Railway
- Configure Slack/Discord webhook alerts
- Schedule automated backups (every 6 hours)
- Test recovery process in staging environment
- Set up S3 bucket with versioning enabled
- Configure cross-region replication (optional)
Questions? Check the main Railway deployment guide: RAILWAY_DEPLOYMENT.md