Technical architecture and design of the predictive infrastructure monitoring system.
Tachyon Argus is a machine learning system that predicts server failures 8 hours in advance using a Temporal Fusion Transformer (TFT) model with cascading failure detection and continuous learning.
┌─────────────────────────────────────────────────────────────────┐
│ Tachyon Argus System │
│ Predictive Infrastructure Monitoring │
│ with Cascading Failure Detection │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Training │ │ Inference │ │ Dashboard │
│ Pipeline │ │ Daemon │ │ (Optional) │
│ │ │ │ │ │
│ - Fleet feats │ │ - REST API │ │ - Plotly Dash │
│ - TFT model │ │ - Predictions │ │ - Real-time │
│ - Multi-tgt │ │ - Cascade Det │ │ - Alerts │
│ - Streaming │ │ - Drift Mon │ │ - XAI │
└───────────────┘ └───────────────┘ └───────────────┘
The production service that makes predictions and detects cascading failures.
Responsibilities:
- Accept metrics via REST API
- Maintain rolling window of recent data
- Run TFT model predictions (multi-target)
- Calculate risk scores and alerts
- Detect cascading failures via correlation analysis
- Monitor model drift and trigger retraining
- Serve predictions via API
Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ TFT Inference Daemon │
│ (Port 8000) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Rolling │ │ TFT │ │ Risk │ │
│ │ Window │ -> │ Model │ -> │ Scoring │ │
│ │ (6000 pts) │ │ (Multi-Tgt) │ │ + Alerts │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ↑ │ │ │
│ │ ▼ ▼ │
│ POST /feed/data ┌──────────────┐ GET /predictions │
│ │ │ Correlation │ │
│ │ │ Detector │─────► GET /cascade/status │
│ │ │ (Cascade) │ │
│ │ └──────────────┘ │
│ │ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Drift │ │ Auto │ │ XAI │ │
│ │ Monitor │───>│ Retrainer │ │ Explainers │ │
│ │ (PER/DSS) │ │ (Trigger) │ │ SHAP/Attn/CF │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ GET /drift/status GET /admin/training-status │
│ │
└─────────────────────────────────────────────────────────────────┘
Key Features:
- Rolling Window: Maintains last 6000 data points (~8 hours for 45 servers)
- Multi-Target Prediction: Predicts CPU, memory, I/O wait, swap, load simultaneously
- Risk Scoring: Converts predictions to 0-100 risk scores
- Cascading Failure Detection: Cross-server correlation analysis
- Drift Monitoring: Automatic detection of model degradation
- Auto-Retraining: Drift-triggered model updates
- XAI Explainability: SHAP, Attention visualization, Counterfactuals
- Hot Reload: Swap models without restart
Trains the TFT prediction model with fleet-level features.
Training Modes:
| Mode | Description | Use Case |
|---|---|---|
| Standard | Load all data into memory | Small datasets |
| Streaming | Process 2-hour chunks | Large datasets |
| Incremental | Continue from existing model | Add more epochs |
Fleet-Level Feature Engineering:
┌─────────────────────────────────────────────────────────────────┐
│ Fleet Feature Engineering │
├─────────────────────────────────────────────────────────────────┤
│ │
│ For each timestamp, compute: │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Fleet Averages (per metric): │ │
│ │ - fleet_cpu_user_pct_mean (fleet avg CPU) │ │
│ │ - fleet_cpu_user_pct_std (fleet variability) │ │
│ │ - fleet_cpu_user_pct_max (fleet peak) │ │
│ │ - fleet_mem_used_pct_mean (fleet avg memory) │ │
│ │ - fleet_cpu_iowait_pct_mean (fleet avg I/O wait) │ │
│ │ - fleet_load_average_mean (fleet avg load) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Fleet Stress Indicators: │ │
│ │ - fleet_pct_high_cpu (% servers with CPU > 80%) │ │
│ │ - fleet_pct_high_mem (% servers with mem > 85%) │ │
│ │ - fleet_pct_high_iowait (% servers with IO > 15%) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Per-Server Anomaly Signals: │ │
│ │ - cpu_vs_fleet (server CPU - fleet avg CPU) │ │
│ │ - mem_vs_fleet (server mem - fleet avg mem) │ │
│ │ - iowait_vs_fleet (server I/O - fleet avg I/O) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Total: 18 fleet-level features for cascading detection │
│ │
└─────────────────────────────────────────────────────────────────┘
Streaming Training Flow:
┌─────────────────────────────────────────────────────────────────┐
│ Streaming Training │
├─────────────────────────────────────────────────────────────────┤
│ │
│ chunk_01.parquet -> [Fleet Features] -> [Train] -> [Free Mem] │
│ ↓ │
│ chunk_02.parquet -> [Fleet Features] -> [Train] -> [Free Mem] │
│ ↓ │
│ chunk_03.parquet -> [Fleet Features] -> [Train] -> [Free Mem] │
│ ↓ │
│ ... ... │
│ ↓ │
│ chunk_N.parquet -> [Fleet Features] -> [Train] -> [Save] │
│ │
│ Checkpoint saved every 5 chunks for resume capability │
│ │
└─────────────────────────────────────────────────────────────────┘
NEW - Detects cascading failures by monitoring cross-server correlations.
How It Works:
┌─────────────────────────────────────────────────────────────────┐
│ Cascading Failure Detection │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Track metrics per server over sliding window │
│ │
│ 2. Detect anomalies per server (z-score > 2.0) │
│ │
│ 3. Calculate cross-server correlation: │
│ - If servers show similar patterns → high correlation │
│ - High correlation + anomalies → CASCADE DETECTED │
│ │
│ 4. Generate cascade alert with: │
│ - Affected servers list │
│ - Severity (critical/high/medium/low) │
│ - Correlation score │
│ - Actionable recommendations │
│ │
│ Example Cascade Detection: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Time 10:00 - All servers normal │ │
│ │ Time 10:15 - Server A: CPU spike (anomaly) │ │
│ │ Time 10:20 - Server B,C,D: CPU spikes (correlated) │ │
│ │ Time 10:25 - CASCADE DETECTED: 4 servers, 87% corr │ │
│ │ → Alert: "Check shared infrastructure" │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Tracks model performance and triggers retraining when drift is detected.
Drift Metrics:
| Metric | Description | Threshold |
|---|---|---|
| PER | Prediction Error Rate | 10% |
| DSS | Distribution Shift Score | 20% |
| FDS | Feature Drift Score | 15% |
| Anomaly Rate | Unusual pattern rate | 5% |
Drift Detection Flow:
┌─────────────────────────────────────────────────────────────────┐
│ Drift Detection & Auto-Retraining │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Prediction Made │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Drift │ Track: prediction errors, feature shifts │
│ │ Monitor │ distribution changes, anomalies │
│ └──────────────┘ │
│ │ │
│ ▼ (Every hour) │
│ ┌──────────────┐ │
│ │ Calculate │ Combined Score = PER×0.4 + DSS×0.3 + │
│ │ Metrics │ FDS×0.2 + Anomaly×0.1 │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Threshold │ If any metric > threshold: │
│ │ Check │ → needs_retraining = True │
│ └──────────────┘ │
│ │ │
│ ▼ (If drift detected) │
│ ┌──────────────┐ │
│ │ Auto │ - 5 epoch incremental training │
│ │ Retrainer │ - Non-blocking (background thread) │
│ │ Trigger │ - 24h cooldown between triggers │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Model │ Hot-reload new model without restart │
│ │ Reload │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Generates realistic synthetic server metrics.
Features:
- 7 server profiles (Database, ML, API, ETL, etc.)
- Realistic workload patterns (daily cycles, spikes)
- Failure scenarios for training
- Time-partitioned Parquet output
Shared utilities used across components:
| Module | Purpose |
|---|---|
alert_levels.py |
Alert thresholds and colors |
server_encoder.py |
Hash-based server ID encoding |
data_validator.py |
Schema validation |
gpu_profiles.py |
GPU detection and configuration |
historical_store.py |
Parquet-based data storage |
nordiq_metrics.py |
Metrics definitions |
correlation_detector.py |
NEW Cross-server cascade detection |
drift_monitor.py |
Model drift tracking |
auto_retrainer.py |
Drift-triggered retraining |
data_buffer.py |
Data accumulation for training |
metrics_generator.py
│
▼
┌─────────────────┐
│ training/ │
│ ├── partitioned/│ (2-hour Parquet chunks)
│ └── manifest │
└─────────────────┘
│
▼
tft_trainer.py
│ (Fleet feature engineering)
▼
┌─────────────────┐
│ models/ │
│ ├── model.safetensors │
│ ├── config.json │
│ └── dataset_params.pkl│ (Includes fleet feature encoders)
└─────────────────┘
Your Monitoring System
│
▼ POST /feed/data
┌─────────────────────────────────────────────────────────────────┐
│ Inference Daemon │
│ │
│ 1. Validate schema (16 required fields) │
│ 2. Add to rolling window (6000 points max) │
│ 3. Update correlation detector (cascade check) │
│ 4. Run TFT model prediction (96 steps = 8 hours, multi-target) │
│ 5. Calculate risk scores (0-100) │
│ 6. Determine alert levels (critical/warning/degraded/healthy) │
│ 7. Update drift monitor (track prediction accuracy) │
│ 8. Store in historical data (Parquet) │
│ 9. Check drift threshold (hourly) → trigger retraining if needed│
│ │
└─────────────────────────────────────────────────────────────────┘
│
├──────────────────────────────────────┐
▼ ▼
GET /predictions/current GET /cascade/status
┌────────────────────────┐ ┌────────────────────────┐
│ Response: │ │ Response: │
│ { │ │ { │
│ "predictions": {...},│ │ "cascade_detected": │
│ "summary": {...}, │ │ true, │
│ "cascade_alert": { │ │ "correlation_score": │
│ "severity": "high",│ │ 0.87, │
│ "affected": 5 │ │ "affected_servers": │
│ } │ │ ["srv1", "srv2"] │
│ } │ │ } │
└────────────────────────┘ └────────────────────────┘
The Temporal Fusion Transformer (TFT) is designed for multi-horizon forecasting with interpretability.
- Variable Selection Networks: Learn which input features are important (including fleet features)
- LSTM Encoder/Decoder: Capture temporal patterns
- Multi-Head Attention: Focus on relevant time periods
- Quantile Outputs: Predict confidence intervals
| Parameter | Value |
|---|---|
| Hidden Size | 128 |
| Attention Heads | 8 |
| LSTM Layers | 2 |
| Dropout | 0.1 |
| Input Window | 96 steps (8 hours) |
| Prediction Horizon | 96 steps (8 hours) |
| Input Features | 14 server + 18 fleet = 32 |
| Target Metrics | 5 (multi-target) |
| Total Parameters | ~120K |
The model now predicts multiple metrics simultaneously:
| Target | Description |
|---|---|
cpu_user_pct |
User CPU utilization |
cpu_iowait_pct |
I/O wait percentage |
mem_used_pct |
Memory utilization |
swap_used_pct |
Swap utilization |
load_average |
System load |
18 new features for cascading failure detection:
| Feature Category | Features | Purpose |
|---|---|---|
| Fleet Averages | fleet_*_mean (4) |
Cross-server baseline |
| Fleet Variability | fleet_*_std (4) |
Detect synchronized changes |
| Fleet Peaks | fleet_*_max (4) |
Track fleet-wide extremes |
| Stress Indicators | fleet_pct_high_* (3) |
% of fleet in stress |
| Server Deviation | *_vs_fleet (3) |
Individual vs fleet |
The model uses 7 server profiles for transfer learning:
| Profile | Prefix | Characteristics |
|---|---|---|
| Database | ppdb |
High memory, I/O heavy |
| ML Compute | ppml |
GPU utilization, batch patterns |
| Web API | ppapi |
Request patterns, connection pools |
| Conductor | ppcond |
Orchestration, periodic jobs |
| ETL/Ingest | ppetl |
Batch processing, overnight runs |
| Risk Analytics | pprisk |
Compute-heavy, market hours |
| Generic | other | General workload |
Predictions are converted to actionable risk scores:
TFT Predictions (96 timesteps × 5 targets)
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Risk Score Calculation │
│ │
│ 1. Forecast analysis: │
│ - Peak values across forecast horizon │
│ - Rate of change (acceleration) │
│ - Time to threshold breach │
│ │
│ 2. Multi-metric aggregation: │
│ - CPU (weight: 0.25) │
│ - Memory (weight: 0.30) │
│ - Disk (weight: 0.20) │
│ - Network (weight: 0.15) │
│ - Connections (weight: 0.10) │
│ │
│ 3. Profile adjustment: │
│ - Database servers: higher memory weight │
│ - ML servers: higher CPU/GPU weight │
│ │
│ 4. CASCADE BONUS (NEW): │
│ - If cascade detected: +20 to all affected servers │
│ - Elevates "green" servers in cascade to "warning" │
│ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Alert Levels │
│ │
│ Score 80-100: CRITICAL (Red) - Immediate action required │
│ Score 60-79: WARNING (Orange) - Investigate soon │
│ Score 50-59: DEGRADED (Yellow) - Monitor closely │
│ Score 0-49: HEALTHY (Green) - Normal operation │
│ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Tachyon Argus API (FastAPI) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Core Endpoints: │
│ GET /health - Health check (no auth) │
│ GET /status - Daemon status (no auth) │
│ POST /feed/data - Feed metrics data │
│ GET /predictions/current - Get all predictions │
│ GET /alerts/active - Get active alerts │
│ GET /explain/{server} - XAI explanation │
│ │
│ Cascade & Drift Endpoints (NEW): │
│ GET /cascade/status - Cascade detection status │
│ GET /cascade/health - Fleet health (correlation-based) │
│ GET /drift/status - Model drift metrics │
│ GET /drift/report - Human-readable drift report │
│ │
│ Historical Endpoints: │
│ GET /historical/summary - Summary statistics │
│ GET /historical/alerts - Alert history │
│ GET /historical/server/{name} - Server history │
│ GET /historical/export/{table} - CSV export │
│ │
│ Admin Endpoints: │
│ GET /admin/models - List available models │
│ POST /admin/reload-model - Hot reload model │
│ POST /admin/trigger-training - Start training │
│ GET /admin/training-status - Training progress │
│ │
└─────────────────────────────────────────────────────────────────┘
API key authentication via header:
X-API-Key: your-api-key
Rate limits:
/feed/data: 60/minute/predictions/*: 30/minute/cascade/*: 30/minute/drift/*: 30/minute/explain/*: 30/minute
┌─────────────────────────────────────────────────────────────────┐
│ Continuous Learning Loop │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Metrics │ Your monitoring system sends metrics │
│ │ Ingestion │ │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Data │ Accumulates metrics for retraining │
│ │ Buffer │ 60-day retention, auto-rotation │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Drift │───>│ Auto │ Trigger if: │
│ │ Monitor │ │ Retrainer │ - PER > 10% │
│ └──────────────┘ └──────────────┘ - DSS > 20% │
│ │ - FDS > 15% │
│ │ - Anomaly > 5% │
│ ▼ │
│ ┌──────────────┐ │
│ │ Training │ 5 epochs, incremental │
│ │ Pipeline │ Non-blocking │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Model │ Hot reload, no restart │
│ │ Reload │ │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ Model adapts to new patterns │
│ │
└─────────────────────────────────────────────────────────────────┘
Shows which features drove a prediction:
{
"top_features": [
{"feature": "mem_pct", "importance": 0.85, "stars": "★★★★★"},
{"feature": "fleet_pct_high_cpu", "importance": 0.72, "stars": "★★★★☆"},
{"feature": "cpu_pct", "importance": 0.62, "stars": "★★★☆☆"}
]
}Shows which time periods the model focused on:
{
"focus_periods": ["Last 30 minutes", "2 hours ago"],
"analysis": "Model focused on recent memory spike and fleet-wide CPU increase"
}Shows what-if scenarios:
{
"scenarios": [
{
"action": "Reduce memory by 15%",
"impact": "Risk drops from 72.5 to 45.2",
"recommendation": "Consider restarting memory-heavy processes"
}
]
}Time-partitioned Parquet files:
training/server_metrics_partitioned/
├── chunk_20251201_00.parquet # 2-hour chunk
├── chunk_20251201_02.parquet
├── ...
└── chunk_manifest.json
models/tft_model_YYYYMMDD_HHMMSS/
├── model.safetensors # Model weights (SafeTensors format)
├── config.json # Model configuration
├── dataset_parameters.pkl # Data encoders (includes fleet features)
├── server_mapping.json # Server hash mapping
└── training_info.json # Training metadata
Parquet-based historical store:
data_buffer/
├── alerts_YYYYMMDD.parquet
├── environment_YYYYMMDD.parquet
└── drift_metrics.json # Current drift status
| Layer | Technology |
|---|---|
| ML Framework | PyTorch 2.0+, PyTorch Forecasting |
| Model Format | SafeTensors |
| API Framework | FastAPI |
| Data Format | Parquet (Apache Arrow) |
| Dashboard | Plotly Dash |
| GPU Support | CUDA, cuDNN |
| Training | PyTorch Lightning |
- API Authentication: Key-based auth via X-API-Key header
- Rate Limiting: Prevents abuse (slowapi)
- Input Validation: Schema validation on all inputs
- No PII: Server names can be hashed for privacy
- Handles 100+ servers on single GPU
- ~85ms inference latency
- 6000-point rolling window
- Horizontal scaling via load balancer
- Shared model storage
- Redis for state coordination
Argus/
├── src/
│ ├── daemons/
│ │ ├── tft_inference_daemon.py # Main inference service
│ │ ├── metrics_generator_daemon.py # Demo data generator
│ │ └── adaptive_retraining_daemon.py
│ ├── training/
│ │ ├── main.py # Training CLI
│ │ └── tft_trainer.py # Training engine (fleet features)
│ ├── generators/
│ │ └── metrics_generator.py # Data generation
│ └── core/
│ ├── alert_levels.py
│ ├── server_encoder.py
│ ├── data_validator.py
│ ├── gpu_profiles.py
│ ├── historical_store.py
│ ├── nordiq_metrics.py
│ ├── correlation_detector.py # NEW: Cascade detection
│ ├── drift_monitor.py # Drift tracking
│ ├── auto_retrainer.py # Auto-retraining
│ ├── data_buffer.py # Data accumulation
│ ├── config/
│ │ └── model_config.py # Model hyperparameters
│ └── explainers/
│ ├── shap_explainer.py
│ ├── attention_visualizer.py
│ └── counterfactual_generator.py
├── models/ # Trained models (gitignored)
├── training/ # Training data (gitignored)
├── checkpoints/ # Training checkpoints (gitignored)
└── data_buffer/ # Historical data (gitignored)