Tachyon Argus - API Reference

Complete REST API documentation for the TFT Inference Daemon.

Base URL: http://localhost:8000

Authentication

All endpoints except /health and /status require API key authentication.

Header:

X-API-Key: your-api-key

Environment Variable:

export TACHYON_API_KEY="your-api-key"
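As a minimal sketch, a client could read the key from the same environment variable and build the header from it. The helper name `auth_headers` is hypothetical, and reading `TACHYON_API_KEY` on the client side is an assumption; the reference only shows the export and the header name.

```python
import os


def auth_headers(env=None):
    """Build the X-API-Key header from TACHYON_API_KEY.

    Assumption: the client reads the same variable the reference exports.
    Pass a dict as `env` to override os.environ (useful in tests).
    """
    env = os.environ if env is None else env
    api_key = env.get("TACHYON_API_KEY", "")
    if not api_key:
        raise RuntimeError("TACHYON_API_KEY is not set")
    return {"X-API-Key": api_key}
```

Failing early on a missing key avoids sending requests that would only come back as 401.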

Health & Status

GET /health

Health check endpoint (no authentication required).

Response:

{
  "status": "healthy",
  "service": "tft_inference_daemon",
  "running": true
}

GET /status

Daemon operational status (no authentication required).

Response:

{
  "model_loaded": true,
  "model_path": "models/tft_model_20251215",
  "rolling_window_size": 2880,
  "servers_tracked": 45,
  "last_prediction": "2025-01-15T10:05:00",
  "uptime_seconds": 3600,
  "warmup_complete": true
}
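The fields above can be used to wait for the daemon before requesting predictions. The sketch below is illustrative only: `is_ready` and `wait_until_ready` are hypothetical names, and `fetch_status` stands for any callable that returns the /status JSON as a dict.

```python
import time


def is_ready(status):
    """True when /status reports a loaded model and a finished warmup."""
    return bool(status.get("model_loaded")) and bool(status.get("warmup_complete"))


def wait_until_ready(fetch_status, attempts=10, delay_seconds=5.0, sleep=time.sleep):
    """Poll fetch_status() up to `attempts` times; return True once ready."""
    for _ in range(attempts):
        if is_ready(fetch_status()):
            return True
        sleep(delay_seconds)
    return False
```

The attempt count and delay are arbitrary defaults, not values taken from the daemon.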

Data Ingestion

POST /feed/data

Feed metrics data to the inference engine.

Rate Limit: 60 requests/minute

Request Body:

{
  "records": [
    {
      "timestamp": "2025-01-15T10:00:00",
      "server_name": "ppdb001",
      "status": "healthy",
      "cpu_user_pct": 25.3,
      "cpu_sys_pct": 8.2,
      "cpu_iowait_pct": 2.1,
      "cpu_idle_pct": 64.4,
      "java_cpu_pct": 15.5,
      "mem_used_pct": 72.1,
      "swap_used_pct": 5.2,
      "disk_usage_pct": 45.8,
      "net_in_mb_s": 12.5,
      "net_out_mb_s": 8.3,
      "back_close_wait": 5,
      "front_close_wait": 3,
      "load_average": 2.4,
      "uptime_days": 45
    }
  ]
}

Required Fields:

| Field | Type | Range | Description |
|-------|------|-------|-------------|
| timestamp | string | ISO 8601 | e.g., "2025-01-15T10:00:00" |
| server_name | string | - | Unique server identifier |
| status | string | enum | See status values below |
| cpu_user_pct | float | 0-100 | User CPU percentage |
| cpu_sys_pct | float | 0-100 | System CPU percentage |
| cpu_iowait_pct | float | 0-100 | I/O wait percentage |
| cpu_idle_pct | float | 0-100 | Idle CPU percentage |
| java_cpu_pct | float | 0-100 | Java process CPU |
| mem_used_pct | float | 0-100 | Memory usage percentage |
| swap_used_pct | float | 0-100 | Swap usage percentage |
| disk_usage_pct | float | 0-100 | Disk usage percentage |
| net_in_mb_s | float | 0+ | Network in (MB/s) |
| net_out_mb_s | float | 0+ | Network out (MB/s) |
| back_close_wait | int | 0+ | Backend CLOSE_WAIT |
| front_close_wait | int | 0+ | Frontend CLOSE_WAIT |
| load_average | float | 0+ | System load average |
| uptime_days | int | 0-365 | Days since reboot |

Valid Status Values:

  • healthy
  • critical_issue
  • heavy_load
  • idle
  • maintenance
  • morning_spike
  • offline
  • recovery

Response (Success):

{
  "status": "accepted",
  "records_received": 45,
  "tick": 1234,
  "rolling_window_size": 2880,
  "warmup_complete": true
}

Response (Warmup):

{
  "status": "warming_up",
  "records_received": 45,
  "tick": 50,
  "warmup_complete": false,
  "progress": "45% (need 100 records minimum)"
}

Predictions

GET /predictions/current

Get current predictions for all servers.

Rate Limit: 30 requests/minute

Response:

{
  "predictions": {
    "ppdb001": {
      "risk_score": 72.5,
      "profile": "Database",
      "alert": {
        "level": "warning",
        "score": 72.5,
        "color": "#FFA500",
        "emoji": "🟠",
        "label": "🟠 Warning"
      },
      "display_metrics": {
        "cpu": "45.2%",
        "memory": "78.1%",
        "disk_io": "23.4 MB/s"
      },
      "forecast": {
        "cpu_pct": [45.2, 46.1, 47.3],
        "mem_pct": [78.1, 78.4, 79.2],
        "timestamps": ["2025-01-15T10:00:00", "2025-01-15T10:05:00", "2025-01-15T10:10:00"]
      }
    }
  },
  "summary": {
    "total_servers": 45,
    "critical": 2,
    "warning": 5,
    "degraded": 3,
    "healthy": 35,
    "fleet_health": 87.2,
    "prob_30m": 0.15,
    "prob_8h": 0.42
  },
  "alerts": [
    {
      "server_name": "ppdb003",
      "level": "critical",
      "risk_score": 92.1,
      "message": "Memory exhaustion predicted in 45 minutes"
    }
  ],
  "timestamp": "2025-01-15T10:05:00"
}

Error (Warmup):

{
  "error": "insufficient_data",
  "message": "Need at least 100 records, have 45",
  "predictions": {}
}
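A common use of this payload is to pick out the servers that need attention. The sketch below filters the `predictions` map by alert level and sorts by risk score. The name `servers_at_or_above` is hypothetical, and the severity ordering is an assumption based on the level names that appear in this reference.

```python
# Assumed severity order, lowest to highest, based on the level names in this document.
LEVEL_ORDER = {"healthy": 0, "degraded": 1, "warning": 2, "critical": 3}


def servers_at_or_above(payload, min_level="warning"):
    """Return (server_name, risk_score) pairs at or above min_level, highest risk first."""
    threshold = LEVEL_ORDER[min_level]
    hits = []
    for name, prediction in payload.get("predictions", {}).items():
        level = prediction.get("alert", {}).get("level", "healthy")
        if LEVEL_ORDER.get(level, 0) >= threshold:
            hits.append((name, prediction.get("risk_score", 0.0)))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Because the warmup error response still carries an empty `predictions` object, the same function returns an empty list in that case instead of failing.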

Alerts

GET /alerts/active

Get currently active alerts.

Rate Limit: 30 requests/minute

Response:

{
  "timestamp": "2025-01-15T10:05:00",
  "count": 3,
  "alerts": [
    {
      "server_name": "ppdb003",
      "level": "critical",
      "risk_score": 92.1,
      "profile": "Database",
      "message": "Memory exhaustion predicted",
      "time_to_issue": "45 minutes"
    }
  ]
}

Explainability (XAI)

GET /explain/{server_name}

Get XAI explanation for a server's prediction.

Rate Limit: 30 requests/minute

Parameters:

  • server_name (path): Server identifier (e.g., "ppdb001")

Response:

{
  "server_name": "ppdb001",
  "prediction": {
    "risk_score": 72.5,
    "alert_level": "warning"
  },
  "shap": {
    "top_features": [
      {"feature": "mem_pct", "importance": 0.85, "stars": "★★★★★"},
      {"feature": "cpu_pct", "importance": 0.62, "stars": "★★★☆☆"},
      {"feature": "disk_io", "importance": 0.34, "stars": "★★☆☆☆"}
    ]
  },
  "attention": {
    "focus_periods": ["Last 30 minutes", "2 hours ago"],
    "analysis": "Model focused on recent memory spike"
  },
  "counterfactuals": [
    {
      "scenario": "Reduce memory by 15%",
      "impact": "Risk drops from 72.5 to 45.2",
      "recommendation": "Consider restarting memory-heavy processes"
    }
  ]
}
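Since `server_name` is a path parameter, it is safer to percent-encode it when building the URL. The sketch below does that and also pulls the ranked SHAP features out of the response. Both helper names are hypothetical.

```python
from urllib.parse import quote


def explain_url(base_url, server_name):
    """Build the /explain/{server_name} URL, percent-encoding the path segment."""
    return f"{base_url.rstrip('/')}/explain/{quote(server_name, safe='')}"


def top_features(explanation, limit=3):
    """Return (feature, importance) pairs from shap.top_features, most important first."""
    features = explanation.get("shap", {}).get("top_features", [])
    ranked = sorted(features, key=lambda item: item.get("importance", 0.0), reverse=True)
    return [(item["feature"], item["importance"]) for item in ranked[:limit]]
```

Sorting on the client is a precaution; the reference does not say whether the daemon always returns the features in ranked order.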

Historical Data

GET /historical/summary

Get summary statistics for executive reporting.

Parameters:

  • time_range (query): 30m, 1h, 8h, 1d, 1w, 1M (default: 1d)

Response:

{
  "success": true,
  "time_range": "1d",
  "total_alerts": 47,
  "alerts_by_level": {
    "critical": 5,
    "warning": 22,
    "degraded": 20
  },
  "avg_resolution_time_minutes": 23.4,
  "incidents_prevented": 3,
  "fleet_health_avg": 85.2
}

GET /historical/alerts

Get alert events for a time range.

Parameters:

  • time_range (query): 30m, 1h, 8h, 1d, 1w, 1M (default: 1h)
  • server_name (query, optional): Filter by server

Response:

{
  "success": true,
  "time_range": "8h",
  "count": 15,
  "alerts": [
    {
      "timestamp": "2025-01-15T08:30:00",
      "server_name": "ppdb001",
      "event_type": "escalated",
      "previous_level": "warning",
      "new_level": "critical",
      "risk_score": 85.3,
      "resolved_at": "2025-01-15T09:15:00",
      "resolution_duration_minutes": 45
    }
  ]
}
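The query parameters for this endpoint can be assembled with the standard library. The sketch below validates `time_range` against the values listed above before building the URL. The function name is hypothetical, and whether the daemon treats `time_range` as case-sensitive is an assumption (the sketch takes `1M` literally).

```python
from urllib.parse import urlencode

# Values listed for the time_range parameter in this reference.
TIME_RANGES = ("30m", "1h", "8h", "1d", "1w", "1M")


def historical_alerts_url(base_url, time_range="1h", server_name=None):
    """Build the /historical/alerts URL with an optional server_name filter."""
    if time_range not in TIME_RANGES:
        raise ValueError(f"time_range must be one of {', '.join(TIME_RANGES)}")
    params = {"time_range": time_range}
    if server_name:
        params["server_name"] = server_name
    return f"{base_url.rstrip('/')}/historical/alerts?{urlencode(params)}"
```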

GET /historical/server/{server_name}

Get detailed history for a specific server.

Parameters:

  • server_name (path): Server identifier
  • time_range (query): 30m, 1h, 8h, 1d, 1w, 1M (default: 1d)

Response:

{
  "success": true,
  "server_name": "ppdb001",
  "time_range": "1d",
  "alert_count": 3,
  "avg_risk_score": 42.5,
  "max_risk_score": 85.3,
  "time_in_warning_pct": 15.2,
  "time_in_critical_pct": 2.1,
  "risk_trend": "improving"
}

GET /historical/environment

Get environment health snapshots over time.

Parameters:

  • time_range (query): 30m, 1h, 8h, 1d, 1w, 1M (default: 1h)

Response:

{
  "success": true,
  "time_range": "1h",
  "count": 12,
  "snapshots": [
    {
      "timestamp": "2025-01-15T09:00:00",
      "total_servers": 45,
      "critical_count": 1,
      "warning_count": 4,
      "healthy_count": 40,
      "fleet_health": 88.9
    }
  ]
}

GET /historical/export/{table}

Export historical data as CSV.

Parameters:

  • table (path): alerts or environment
  • time_range (query): 30m, 1h, 8h, 1d, 1w, 1M (default: 1d)

Response:

{
  "success": true,
  "table": "alerts",
  "time_range": "1w",
  "csv_data": "timestamp,server_name,event_type,...\n2025-01-15T08:30:00,ppdb001,...",
  "filename": "argus_alerts_1w_20250115_103000.csv"
}
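The CSV arrives as a string inside the JSON body, so it has to be unpacked before use. The sketch below parses `csv_data` into rows and optionally writes it to disk under the suggested `filename`. Both function names are hypothetical.

```python
import csv
import io
from pathlib import Path


def parse_export(payload):
    """Parse csv_data from an export response into a list of dict rows."""
    if not payload.get("success"):
        raise ValueError("export response did not report success")
    return list(csv.DictReader(io.StringIO(payload["csv_data"])))


def save_export(payload, directory="."):
    """Write csv_data to `directory` using the server-suggested filename."""
    # Keep only the final path component so the name cannot escape `directory`.
    target = Path(directory) / Path(payload["filename"]).name
    with open(target, "w", encoding="utf-8", newline="") as handle:
        handle.write(payload["csv_data"])
    return target
```

Opening the file with `newline=""` keeps the line endings in `csv_data` exactly as the daemon sent them.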

Administration

GET /admin/models

List available models.

Response:

{
  "models": [
    {
      "name": "tft_model_20251215_143022",
      "path": "models/tft_model_20251215_143022",
      "created": "2025-12-15T14:30:22",
      "size_mb": 2.4,
      "is_current": true
    }
  ],
  "current_model": "tft_model_20251215_143022"
}

POST /admin/reload-model

Hot-reload a model without restarting the daemon.

Request Body (optional):

{
  "model_path": "models/tft_model_20251215_143022"
}

If no model_path is provided, the daemon reloads the latest model.

Response:

{
  "success": true,
  "model": "tft_model_20251215_143022",
  "message": "Model reloaded successfully"
}

POST /admin/trigger-training

Trigger model retraining.

Parameters:

  • epochs (query, optional): Number of epochs (default: 10)
  • incremental (query, optional): Continue from existing model (default: true)

Response:

{
  "success": true,
  "training_id": "train_20250115_103000",
  "message": "Training started",
  "epochs": 10,
  "incremental": true
}

GET /admin/training-status

Get current training status.

Response:

{
  "training_active": true,
  "training_id": "train_20250115_103000",
  "progress": {
    "epoch": 5,
    "total_epochs": 10,
    "percent": 50,
    "eta_minutes": 15
  }
}
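For logging or a status line, the progress object can be reduced to a single string. The sketch below is illustrative: `describe_training` is a hypothetical name, and the response shape when no training is running is an assumption, since the reference only shows the active case.

```python
def describe_training(status):
    """Summarize a /admin/training-status payload as one line of text."""
    if not status.get("training_active"):
        # Assumption: training_active is false or absent when nothing is running.
        return "no training in progress"
    progress = status.get("progress", {})
    return (
        f"{status.get('training_id')}: "
        f"epoch {progress.get('epoch')}/{progress.get('total_epochs')} "
        f"({progress.get('percent')}%), about {progress.get('eta_minutes')} min left"
    )
```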

GET /admin/training-stats

Get training statistics.

Response:

{
  "last_training": "2025-01-15T08:00:00",
  "total_trainings": 15,
  "avg_training_time_minutes": 45,
  "models_produced": 12
}

POST /admin/cancel-training

Cancel the running training job.

Response:

{
  "success": true,
  "message": "Training cancelled"
}

Cascade & Drift Detection

GET /cascade/status

Get current cascading failure detection status.

Response:

{
  "current_status": {
    "cascade_detected": false,
    "timestamp": "2025-01-15T10:05:00",
    "total_servers": 45,
    "servers_with_anomalies": 3,
    "anomaly_rate": 0.067,
    "correlation_score": 0.42,
    "cascades": []
  },
  "tracking": {
    "servers": 45,
    "metrics_tracked": ["cpu_user_pct", "mem_used_pct", "cpu_iowait_pct", "load_average", "swap_used_pct"],
    "window_size": 100
  },
  "recent_events": [],
  "event_count": 0,
  "thresholds": {
    "correlation": 0.7,
    "cascade_servers": 3,
    "anomaly_z_score": 2.0
  }
}

GET /cascade/health

Get fleet health score based on cross-server correlations.

Response:

{
  "health_score": 85.2,
  "status": "healthy",
  "correlation_score": 0.25,
  "anomaly_rate": 0.04,
  "anomalous_servers": 2,
  "total_servers": 45,
  "cascade_risk": "low"
}

Health Status Levels:

| Status | Health Score | Description |
|--------|--------------|-------------|
| healthy | 80-100 | Normal operation |
| degraded | 60-79 | Minor issues detected |
| warning | 40-59 | Significant problems |
| critical | 0-39 | Cascading failure likely |
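The bands above can be reproduced on the client, for example to label historical scores. The sketch below uses `bisect` over the lower bounds. The name `health_status` is hypothetical, and treating each lower bound as inclusive for fractional scores (so 79.5 is degraded) is an assumption, since the table only lists whole-number ranges.

```python
from bisect import bisect_right

# Lower bounds of the warning, degraded and healthy bands from the table above.
BOUNDS = (40, 60, 80)
STATUSES = ("critical", "warning", "degraded", "healthy")


def health_status(score):
    """Map a 0-100 health score to its status band."""
    if not 0 <= score <= 100:
        raise ValueError("health score must be between 0 and 100")
    return STATUSES[bisect_right(BOUNDS, score)]
```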

GET /drift/status

Get current model drift detection status.

Response:

{
  "current_metrics": {
    "per": 0.05,
    "dss": 0.12,
    "fds": 0.08,
    "anomaly_rate": 0.02,
    "combined_score": 0.07,
    "needs_retraining": false,
    "timestamp": "2025-01-15T10:05:00"
  },
  "trends": {
    "per": "decreasing",
    "dss": "stable",
    "fds": "increasing",
    "anomaly_rate": "stable"
  },
  "window_size": 1000,
  "data_points_tracked": 5000,
  "recommendation": "OK",
  "thresholds": {
    "per_threshold": 0.10,
    "dss_threshold": 0.20,
    "fds_threshold": 0.15,
    "anomaly_threshold": 0.05
  }
}

Drift Metrics:

| Metric | Description | Threshold |
|--------|-------------|-----------|
| PER | Prediction Error Rate | 10% |
| DSS | Distribution Shift Score | 20% |
| FDS | Feature Drift Score | 15% |
| Anomaly Rate | Unusual prediction patterns | 5% |
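To see which metrics are over their limits, a client can compare `current_metrics` against `thresholds` from the /drift/status response. The sketch below is an assumption about how to read that payload: `exceeded_metrics` is a hypothetical name, the comparison is strictly greater-than, and the daemon's own rule for setting `needs_retraining` is not documented here.

```python
# Maps each metric key in current_metrics to its key in thresholds.
METRIC_TO_THRESHOLD = {
    "per": "per_threshold",
    "dss": "dss_threshold",
    "fds": "fds_threshold",
    "anomaly_rate": "anomaly_threshold",
}


def exceeded_metrics(drift_status):
    """Return the names of drift metrics whose value is above its threshold."""
    metrics = drift_status.get("current_metrics", {})
    thresholds = drift_status.get("thresholds", {})
    exceeded = []
    for metric, threshold_key in METRIC_TO_THRESHOLD.items():
        value = metrics.get(metric)
        limit = thresholds.get(threshold_key)
        if value is not None and limit is not None and value > limit:
            exceeded.append(metric)
    return exceeded
```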

GET /drift/report

Get human-readable drift detection report.

Response:

{
  "success": true,
  "report": "============================================================\nDRIFT DETECTION REPORT\n..."
}

Error Responses

401 Unauthorized

{
  "detail": "Invalid or missing API key"
}

422 Validation Error

{
  "detail": [
    {
      "loc": ["body", "records", 0, "cpu_user_pct"],
      "msg": "value is not a valid float",
      "type": "type_error.float"
    }
  ]
}

429 Rate Limited

{
  "detail": "Rate limit exceeded. Try again in 60 seconds."
}
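One way for a client to handle a 429 is to wait and retry. The sketch below is transport-agnostic: `send` is any zero-argument callable returning `(status_code, body)`, and `call_with_retry` is a hypothetical name. The 60-second default comes from the sample message above; whether the daemon also sends a `Retry-After` header is not stated in this reference.

```python
import time


def call_with_retry(send, max_attempts=3, wait_seconds=60.0, sleep=time.sleep):
    """Call send() and retry while it reports HTTP 429.

    Returns the last (status_code, body) pair, which may still be a 429
    if every attempt was rate limited.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status_code, body = send()
        if status_code != 429 or attempt == max_attempts:
            return status_code, body
        sleep(wait_seconds)
```

Passing `sleep` as a parameter keeps the helper testable without real delays.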

500 Internal Server Error

{
  "detail": "Internal server error",
  "error": "Model not loaded"
}

Rate Limits

| Endpoint | Limit |
|----------|-------|
| POST /feed/data | 60/minute |
| GET /predictions/* | 30/minute |
| GET /alerts/* | 30/minute |
| GET /explain/* | 30/minute |
| GET /historical/* | 30/minute |
| POST /admin/* | 10/minute |
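A client can stay under these limits by spacing its own requests. The sketch below spreads calls evenly, for example one every 2 seconds for a 30/minute limit. The class name `Throttle` is hypothetical, and even spacing is a conservative assumption: the reference does not say whether the daemon counts requests in fixed or sliding windows.

```python
import time


class Throttle:
    """Space calls so that at most `per_minute` happen in any minute."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the slot after it."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Calling `throttle.wait()` immediately before each request to an endpoint group is enough; one `Throttle` per row of the table keeps the groups independent.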

Alert Levels

| Level | Score Range | Color | Hex |
|-------|-------------|-------|-----|
| Critical | 80-100 | Red | #FF0000 |
| Warning | 60-79 | Orange | #FFA500 |
| Degraded | 50-59 | Yellow | #FFD700 |
| Healthy | 0-49 | Green | #00FF00 |
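For dashboards that only store the raw risk score, the table above can be applied on the client. The sketch below maps a score to its level and hex color. The name `alert_level` is hypothetical, and treating each lower bound as inclusive for fractional scores is an assumption.

```python
# (lower bound, level, hex color), highest band first, taken from the table above.
ALERT_LEVELS = (
    (80, "Critical", "#FF0000"),
    (60, "Warning", "#FFA500"),
    (50, "Degraded", "#FFD700"),
    (0, "Healthy", "#00FF00"),
)


def alert_level(score):
    """Return (level, hex_color) for a 0-100 risk score."""
    if not 0 <= score <= 100:
        raise ValueError("risk score must be between 0 and 100")
    for lower_bound, level, color in ALERT_LEVELS:
        if score >= lower_bound:
            return level, color
```

With the sample score of 72.5 from GET /predictions/current this yields the Warning level and `#FFA500`, matching the `alert` object shown in that response.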

SDK Examples

Python

import requests

API_URL = "http://localhost:8000"
API_KEY = "your-api-key"
HEADERS = {"X-API-Key": API_KEY}

# Get predictions
response = requests.get(f"{API_URL}/predictions/current", headers=HEADERS)
predictions = response.json()

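# Feed data (each record must include every field listed under "Required Fields")
records = []  # placeholder: add record dicts here
data = {"records": records}
response = requests.post(f"{API_URL}/feed/data", json=data, headers=HEADERS)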

JavaScript

const API_URL = 'http://localhost:8000';
const API_KEY = 'your-api-key';
const headers = { 'X-API-Key': API_KEY };

// Get predictions
const response = await fetch(`${API_URL}/predictions/current`, { headers });
const predictions = await response.json();

cURL

# Get predictions
curl -H "X-API-Key: your-api-key" http://localhost:8000/predictions/current

# Feed data
curl -X POST http://localhost:8000/feed/data \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{"records": [...]}'