AutoMem Monitoring & Backups

Complete guide to setting up automated health monitoring and backups for AutoMem on Railway.

Overview

AutoMem includes three layers of data protection:

Persistent Volumes - Railway volumes for FalkorDB data
Dual Storage - Data stored in both FalkorDB (graph) and Qdrant (vectors)
Automated Backups - Scheduled exports to compressed JSON + optional S3 upload

flowchart TB
    subgraph layer1 [Layer 1: Persistent Volumes]
        RailwayVol[Railway Volume Snapshots<br/>Automatic every 24h<br/>One-click restore]
    end

    subgraph layer2 [Layer 2: Dual Storage]
        FalkorDB[(FalkorDB<br/>Graph Database<br/>Canonical record)]
        Qdrant[(Qdrant<br/>Vector Database<br/>Semantic search)]

        FalkorDB <-->|Redundancy| Qdrant
    end

    subgraph layer3 [Layer 3: Automated Backups]
        Script[Backup Scripts<br/>Compressed JSON]
        Local[Local Backups<br/>./backups/]
        S3[S3 Cloud Storage<br/>Cross-region]
        GitHub[GitHub Actions<br/>Free tier]

        Script --> Local
        Script --> S3
        GitHub --> Script
    end

    layer1 --> layer2
    layer2 --> layer3

    Recovery[Recovery Options]

    RailwayVol -.->|Quick restore| Recovery
    Qdrant -.->|Rebuild FalkorDB| Recovery
    Local -.->|Full restore| Recovery
    S3 -.->|Disaster recovery| Recovery

Health Monitoring

The health_monitor.py script continuously monitors system health and can automatically trigger recovery.

Quick Start

Option 1: Deploy as Railway Service (Recommended)

Create a new Railway service for continuous monitoring:

# In Railway dashboard
1. Create new service from GitHub repo
2. Set Dockerfile path: scripts/Dockerfile.health-monitor (we'll create this)
3. Configure environment variables (same as main service)
4. Deploy

Option 2: Run as Cron Job

# One-time health check (safe)
railway run --service automem python scripts/health_monitor.py --once

# Alert-only monitoring, checking every 300 seconds (no auto-recovery)
railway run --service automem python scripts/health_monitor.py --interval 300

# With Slack webhook alerts
railway run --service automem python scripts/health_monitor.py \
  --interval 300 \
  --webhook https://hooks.slack.com/services/YOUR/WEBHOOK/URL

Configuration

Set these environment variables on your monitoring service:

# Required (same as main service)
FALKORDB_HOST=falkordb.railway.internal
FALKORDB_PORT=6379
FALKORDB_PASSWORD=<your-password>
QDRANT_URL=<your-qdrant-url>
QDRANT_API_KEY=<your-qdrant-key>
AUTOMEM_API_URL=https://your-automem-deployment.up.railway.app

# Optional monitoring settings
HEALTH_MONITOR_DRIFT_THRESHOLD=5          # Warning at 5% drift
HEALTH_MONITOR_CRITICAL_THRESHOLD=50      # Critical at 50% drift
HEALTH_MONITOR_WEBHOOK=<slack-webhook>    # Alert webhook
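
As a rough illustration of how the two thresholds might be applied, the sketch below computes drift as the absolute difference between the two store counts, relative to the larger count. The formula and the function names are assumptions made for illustration; health_monitor.py may calculate drift differently.

```python
import os


def drift_percent(falkordb_count: int, qdrant_count: int) -> float:
    """Assumed formula: absolute count difference relative to the larger store."""
    larger = max(falkordb_count, qdrant_count)
    if larger == 0:
        return 0.0  # both stores are empty, so there is nothing to compare
    return abs(falkordb_count - qdrant_count) / larger * 100.0


def classify_drift(falkordb_count: int, qdrant_count: int,
                   warning: float = 5.0, critical: float = 50.0) -> str:
    """Map a drift percentage onto 'ok', 'warning', or 'critical'."""
    drift = drift_percent(falkordb_count, qdrant_count)
    if drift >= critical:
        return "critical"
    if drift >= warning:
        return "warning"
    return "ok"


# Read thresholds from the environment, falling back to the example values above.
WARNING = float(os.environ.get("HEALTH_MONITOR_DRIFT_THRESHOLD", "5"))
CRITICAL = float(os.environ.get("HEALTH_MONITOR_CRITICAL_THRESHOLD", "50"))

for counts in [(1000, 998), (900, 1000), (400, 1000)]:
    print(counts, classify_drift(*counts, warning=WARNING, critical=CRITICAL))
```

Comparing against the larger count keeps the percentage between 0 and 100 regardless of which store lost data.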

Auto-Recovery (Use with Caution!)

Enable automatic recovery when data loss is detected:

python scripts/health_monitor.py \
  --auto-recover \
  --interval 300 \
  --critical-threshold 50

⚠️ Warning: With --auto-recover, the monitor runs the recovery script on its own whenever drift reaches the critical threshold. Enable it only if you trust the system to self-heal.

Automated Backups

Railway Volume Backups (Built-in) ✅

Already configured! If you're using Railway, your FalkorDB service has automatic volume backups enabled.

Features:

✅ Automatic snapshots (default: every 24 hours)
✅ One-click restore from Railway dashboard
✅ Included with Railway Pro (no extra cost)
✅ Instant volume snapshots

Access backups:

Railway Dashboard → falkordb service
Click "Backups" tab
View backup history and schedule
Click "Restore" to recover from any snapshot

Limitations:

Only backs up FalkorDB (not Qdrant)
Platform-locked (can't export/download)
Use for quick recovery; combine with script backups for full protection

Script-Based Backups

For portable backups that cover both databases, use the backup_automem.py script:

Local Backups (Development)

The backup_automem.py script exports both FalkorDB and Qdrant to compressed JSON files:

# Basic backup to ./backups/
python scripts/backup_automem.py

# Backup with cleanup (keep last 7)
python scripts/backup_automem.py --cleanup --keep 7

# Custom backup directory
python scripts/backup_automem.py --backup-dir /mnt/backups
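
The effect of --cleanup --keep 7 can be pictured as a retention pass over the backup directory. The sketch below is an assumption about how such pruning could work (newest files kept by modification time, with a hypothetical `*.json.gz` naming pattern); it is not taken from backup_automem.py.

```python
from pathlib import Path


def prune_backups(backup_dir: str, keep: int = 7, pattern: str = "*.json.gz") -> list[Path]:
    """Delete all but the `keep` newest files matching `pattern`; return the removed paths."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    files = sorted(
        Path(backup_dir).glob(pattern),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = files[keep:]
    for path in stale:
        path.unlink()
    return stale
```

Sorting by modification time rather than by file name avoids depending on any particular timestamp format in the names.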

Cloud Backups (Production)

Upload backups to S3 for disaster recovery:

# Install AWS SDK
pip install boto3

# Configure AWS credentials (Railway secrets)
export AWS_ACCESS_KEY_ID=<your-key>
export AWS_SECRET_ACCESS_KEY=<your-secret>
export AWS_DEFAULT_REGION=us-east-1

# Backup with S3 upload
python scripts/backup_automem.py \
  --s3-bucket my-automem-backups \
  --cleanup --keep 7
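
For readers who want to see what the upload step amounts to, here is a minimal sketch built on the boto3 S3 client's `upload_file` method. The function name `upload_backup` and the key layout are hypothetical; backup_automem.py may organise objects differently.

```python
from pathlib import Path


def upload_backup(s3_client, local_path: str, bucket: str, prefix: str = "") -> str:
    """Upload one backup file and return the object key that was written.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3"),
    which reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION
    from the environment.
    """
    name = Path(local_path).name
    clean_prefix = prefix.strip("/")
    key = f"{clean_prefix}/{name}" if clean_prefix else name
    s3_client.upload_file(local_path, bucket, key)  # Filename, Bucket, Key
    return key


# Usage (requires boto3 and valid credentials):
#   import boto3
#   upload_backup(boto3.client("s3"),
#                 "backups/qdrant/qdrant_20251005_143000.json.gz",
#                 "my-automem-backups", prefix="qdrant")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real bucket.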

Automated Script Backups

Recommended: GitHub Actions (Free)

GitHub Actions is the simplest way to automate backups: it is free and does not consume Railway resources.

Setup (5 minutes):

Workflow file already exists: .github/workflows/backup.yml
Enable TCP Proxy on FalkorDB (required for external access):

⚠️ Critical: GitHub Actions runners are external to Railway's network. They cannot use internal hostnames like falkordb.railway.internal. You must enable TCP Proxy for external connectivity.
- Railway Dashboard → falkordb service
- Settings → Networking → Enable TCP Proxy
- Note the public endpoint: monorail.proxy.rlwy.net:12345 (example)
```
flowchart LR
    subgraph railway [Railway Network]
        FalkorDB[(FalkorDB<br/>:6379)]
        TCPProxy[TCP Proxy<br/>monorail.proxy.rlwy.net:12345]
    end

    subgraph external [External]
        GHA[GitHub Actions]
        Local[Local Dev]
    end

    GHA -->|Public Internet| TCPProxy
    Local -->|Public Internet| TCPProxy
    TCPProxy -->|Internal| FalkorDB
```

Add GitHub secrets:

Go to: GitHub repo → Settings → Secrets and variables → Actions

Add these secrets:

FALKORDB_HOST         = monorail.proxy.rlwy.net   # TCP Proxy domain (NOT .railway.internal!)
FALKORDB_PORT         = 12345                      # TCP Proxy port (NOT 6379!)
FALKORDB_PASSWORD     = (from Railway FalkorDB service variables)
QDRANT_URL            = (your Qdrant Cloud URL)
QDRANT_API_KEY        = (your Qdrant Cloud API key)

Optional for S3: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION

Where to find TCP Proxy details:

Railway Dashboard → falkordb service → Settings → Networking
Look for "TCP Proxy" section → shows RAILWAY_TCP_PROXY_DOMAIN and RAILWAY_TCP_PROXY_PORT

Push and test:
```
git push origin main
```
- Go to Actions tab → "AutoMem Backup" → Run workflow

The workflow runs automatically every 6 hours. GitHub's free tier includes 2,000 Actions minutes per month.

Troubleshooting GitHub Actions Backup

Error: "Connection reset by peer" (error 104)

This means GitHub Actions can't connect to FalkorDB. Common causes:

TCP Proxy not enabled: Enable it in Railway Dashboard → falkordb → Settings → Networking
Wrong host/port in secrets: Must use TCP proxy endpoint, not internal hostname
Firewall: Railway TCP Proxy should be accessible from anywhere, but check if your Railway plan has restrictions

Verify your setup:

# Test from your local machine (should work if TCP proxy is enabled)
redis-cli -h monorail.proxy.rlwy.net -p 12345 -a YOUR_PASSWORD ping
# Should return: PONG

# If redis-cli is not installed:
# macOS:   brew install redis
# Ubuntu:  sudo apt-get install redis-tools
# Windows: choco install redis

Debug checklist:

TCP Proxy is enabled on FalkorDB service
FALKORDB_HOST secret uses TCP proxy domain (e.g., monorail.proxy.rlwy.net)
FALKORDB_PORT secret uses TCP proxy port (e.g., 12345), NOT 6379
FALKORDB_PASSWORD matches the password in FalkorDB service variables
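
The checklist above lends itself to a quick pre-flight check. The following sketch flags the most common misconfigurations: an internal hostname, the internal port 6379, or an empty value. The function name is hypothetical and the check is only a heuristic; it does not open a connection.

```python
def find_proxy_config_problems(host: str, port: int, password: str) -> list[str]:
    """Return human-readable problems with external FalkorDB connection settings."""
    problems = []
    if not host:
        problems.append("FALKORDB_HOST is empty")
    elif host.endswith(".railway.internal"):
        problems.append("FALKORDB_HOST is an internal hostname; use the TCP Proxy domain")
    if port == 6379:
        problems.append("FALKORDB_PORT is 6379; use the TCP Proxy port instead")
    if not password:
        problems.append("FALKORDB_PASSWORD is empty")
    return problems


# Values as they would be misconfigured, then as the checklist expects them.
print(find_proxy_config_problems("falkordb.railway.internal", 6379, "secret"))
print(find_proxy_config_problems("monorail.proxy.rlwy.net", 12345, "secret"))
```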

Advanced: Railway Backup Service

For Railway Pro users who want backups running on Railway:

⚠️ Note: Railway's UI makes Dockerfile configuration complex. This method is for advanced users.

scripts/Dockerfile.backup already exists and runs backups every 6 hours in a loop. However, deploying it requires the Railway CLI:

cd /path/to/automem
railway link
railway up --service backup-service

Then configure in Railway dashboard:

Set Builder to Dockerfile
Dockerfile Path: scripts/Dockerfile.backup
Add environment variables (same as the AutoMem API service)

Cost: ~$1-2/month

Recommendation: Use GitHub Actions instead unless you have specific requirements for Railway-hosted backups.

Backup Restoration

flowchart TD
    Start{What data<br/>is lost?}

    Start -->|Only FalkorDB| QdrantCheck{Is Qdrant<br/>intact?}
    Start -->|Only Qdrant| BackupCheck1{Have recent<br/>backups?}
    Start -->|Both databases| BackupCheck2{Have recent<br/>backups?}
    Start -->|None just testing| NoAction[No action needed]

    QdrantCheck -->|Yes| RecoverQdrant[Use recover_from_qdrant.py<br/>⚡ Fastest 5-10 min]
    QdrantCheck -->|No| BackupCheck3{Have backups?}

    BackupCheck1 -->|Yes| RestoreQdrant[Restore Qdrant from backup<br/>Then rebuild FalkorDB<br/>⏱️ 15-30 min]
    BackupCheck1 -->|No| DataLoss1[⚠️ Partial data loss<br/>Rebuild from remaining data]

    BackupCheck2 -->|Yes Railway| RailwayRestore[Railway volume restore<br/>FalkorDB only<br/>⏱️ 5 min]
    BackupCheck2 -->|Yes S3/Local| FullRestore[Full restore from backups<br/>1. Restore Qdrant<br/>2. Rebuild FalkorDB<br/>⏱️ 20-40 min]
    BackupCheck2 -->|No| DataLoss2[⚠️ Complete data loss<br/>Start fresh]

    BackupCheck3 -->|Yes| FullRestore
    BackupCheck3 -->|No| DataLoss3[⚠️ Complete data loss<br/>Start fresh]

    RecoverQdrant --> Verify[Verify data integrity<br/>Check memory count]
    RestoreQdrant --> Verify
    RailwayRestore --> Verify
    FullRestore --> Verify

Restore from Qdrant (Fastest)

If FalkorDB data is lost but Qdrant is intact:

railway run --service automem python scripts/recover_from_qdrant.py

This rebuilds the FalkorDB graph from Qdrant vectors and payloads.

Restore from Backup Files

If both FalkorDB and Qdrant are lost, restore from backup:

# Download from S3
aws s3 cp s3://my-automem-backups/qdrant/qdrant_20251005_143000.json.gz ./restore/

# Extract
gunzip restore/qdrant_20251005_143000.json.gz

# Restore to Qdrant
python scripts/restore_from_backup.py restore/qdrant_20251005_143000.json

# Then rebuild FalkorDB from the restored Qdrant data
python scripts/recover_from_qdrant.py

Note: restore_from_backup.py may not exist in the repository yet; we'll create it if you need it.
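
Because the backups are gzip-compressed JSON, they can also be inspected from Python without running gunzip first. This is a minimal sketch using only the standard library. The structure inside the file is not documented here, so the hypothetical `load_backup` function simply returns whatever JSON value it finds.

```python
import gzip
import json


def load_backup(path: str):
    """Read a .json.gz backup and return the decoded JSON value."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_backup("restore/qdrant_20251005_143000.json.gz")` would read the file downloaded in the first step above, before it is extracted.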

Monitoring Dashboards

Built-in Health Endpoint

Check system health via API:

curl https://your-automem-deployment.up.railway.app/health | jq

Response:

{
  "status": "healthy",
  "falkordb": "connected",
  "qdrant": "connected",
  "graph": "memories",
  "timestamp": "2025-10-05T14:45:00Z"
}
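
The same check can be scripted. The sketch below fetches the endpoint with the standard library and treats the response as healthy only when all three fields match the sample above. Which other values the fields can take is not documented here, so treating anything else as unhealthy is an assumption, as are the function names.

```python
import json
import urllib.request


def is_healthy(payload: dict) -> bool:
    """True when the overall status is healthy and both stores report connected."""
    return (
        payload.get("status") == "healthy"
        and payload.get("falkordb") == "connected"
        and payload.get("qdrant") == "connected"
    )


def fetch_health(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage:
#   payload = fetch_health("https://your-automem-deployment.up.railway.app")
#   print("healthy" if is_healthy(payload) else "unhealthy")
```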

Railway Dashboard

Monitor your services:

Metrics: CPU, memory, network usage
Logs: Real-time log streaming
Deployments: Build history and status
Health Checks: Automated uptime monitoring

External Monitoring (Optional)

Set up external monitoring with:

UptimeRobot - Free HTTP monitoring
- Monitor: https://your-automem-deployment.up.railway.app/health
- Alert when status != "healthy"
Better Uptime - Advanced monitoring
- HTTP checks + keyword monitoring
- SMS/Slack/Email alerts
Grafana Cloud - Full observability
- Custom dashboards
- Metrics aggregation
- Log correlation

Backup Schedule Recommendations

For Personal Use

Health checks: Every 5 minutes (alert-only)
Backups: Every 24 hours, keep 7 days
Recovery: Manual trigger

For Team Use

Health checks: Every 2 minutes (with auto-recovery)
Backups: Every 6 hours, keep 14 days + S3
Recovery: Automatic on critical drift

For Production Use

Health checks: Every 30 seconds (with auto-recovery)
Backups: Every 1 hour, keep 30 days + S3 + cross-region replication
Recovery: Automatic with alerts

Alerting Integrations

Slack Webhook

# Get webhook URL from Slack App settings
# https://api.slack.com/messaging/webhooks

python scripts/health_monitor.py \
  --webhook https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXX

Discord Webhook

# Discord webhooks work the same as Slack
python scripts/health_monitor.py \
  --webhook https://discord.com/api/webhooks/123456789/abcdefg

Custom Webhook

The health monitor sends JSON payloads:

{
  "level": "critical",
  "title": "Data Loss Detected",
  "message": "FalkorDB has 52.3% drift from Qdrant",
  "details": {
    "drift_percent": 52.3,
    "falkordb_count": 420,
    "qdrant_count": 884
  },
  "timestamp": "2025-10-05T14:45:00Z",
  "system": "AutoMem Health Monitor"
}
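
A custom receiver only needs to parse this JSON. As a small illustration, the hypothetical helper below turns a payload into a one-line summary that could be logged or forwarded; it relies only on the `level`, `title`, and `message` fields shown above and substitutes placeholders if any are missing.

```python
import json


def summarize_alert(raw_body: str) -> str:
    """Turn a health-monitor JSON payload into a one-line summary."""
    alert = json.loads(raw_body)
    level = str(alert.get("level", "unknown")).upper()
    title = alert.get("title", "(no title)")
    message = alert.get("message", "")
    return f"[{level}] {title}: {message}"
```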

Cost Estimates

Railway (Hobby Plan - $5/month)

✅ Main API service
✅ FalkorDB service with 1GB volume
❌ Not enough resources for monitoring service

Railway (Pro Plan - $20/month)

✅ Main API service (~$5)
✅ FalkorDB service (~$10)
✅ Health monitoring service (~$2)
✅ Backup service (~$1)
Total: ~$18/month

Railway + External Services (Hybrid)

Railway Pro for main services (~$15)
GitHub Actions for backups (free)
UptimeRobot for monitoring (free)
Total: ~$15/month

AWS S3 Backup Costs

Storage: ~$0.023/GB/month (Standard)
Requests: ~$0.005/1000 PUTs
Example: 100MB backup every 6 hours = ~$0.30/month
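
The example figure can be reproduced with a short calculation. The sketch below assumes that roughly 30 days of backups stay in the bucket, which the text above does not state, and uses decimal gigabytes for simplicity; the function name is hypothetical.

```python
STORAGE_USD_PER_GB_MONTH = 0.023  # S3 Standard storage price quoted above
PUT_USD_PER_1000 = 0.005          # PUT request price quoted above


def monthly_s3_cost(backup_mb: float, backups_per_day: int, days_retained: int = 30) -> float:
    """Estimate the monthly cost when every backup in the retention window is stored."""
    backups_stored = backups_per_day * days_retained
    stored_gb = backup_mb * backups_stored / 1000  # decimal GB, rough estimate
    storage_cost = stored_gb * STORAGE_USD_PER_GB_MONTH
    request_cost = backups_per_day * 30 / 1000 * PUT_USD_PER_1000
    return storage_cost + request_cost


# 100 MB every 6 hours = 4 backups per day
print(round(monthly_s3_cost(100, 4), 2))
```

Under these assumptions, storage dominates: 120 retained backups occupy about 12 GB, while the PUT requests add well under a cent. With a shorter retention window the monthly cost drops proportionally.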

Troubleshooting

Health Monitor Shows Drift

Problem: FalkorDB and Qdrant counts don't match

Causes:

In-flight writes during check (normal, <1% drift)
Failed writes to one store (>5% drift - warning)
Data loss event (>50% drift - critical)

Solution:

# Check health details
python scripts/health_monitor.py --once

# If critical, run recovery
python scripts/recover_from_qdrant.py

Backup Failed

Problem: Backup script fails with connection error

Solution:

# Test connections
curl https://your-automem-deployment.up.railway.app/health

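# Check that credentials are set (without printing the secrets)
[ -n "$FALKORDB_PASSWORD" ] && echo "FALKORDB_PASSWORD is set" || echo "FALKORDB_PASSWORD is missing"
[ -n "$QDRANT_API_KEY" ] && echo "QDRANT_API_KEY is set" || echo "QDRANT_API_KEY is missing"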

# Try manual backup
python scripts/backup_automem.py

S3 Upload Failed

Problem: Backup created but S3 upload failed

Solution:

# Check AWS credentials
aws s3 ls s3://my-automem-backups/

# Test upload manually
aws s3 cp backups/falkordb/latest.json.gz s3://my-automem-backups/test/

# Check boto3 installation
python -c "import boto3; print(boto3.__version__)"

Next Steps

Set up health monitoring service on Railway
Configure Slack/Discord webhook alerts
Schedule automated backups (every 6 hours)
Test recovery process in staging environment
Set up S3 bucket with versioning enabled
Configure cross-region replication (optional)

Questions? Check the main Railway deployment guide: RAILWAY_DEPLOYMENT.md
