Monitoring Guide

Guide for monitoring the TelemetryFlow platform itself — health checks, metrics, logging, and alerting.

Monitoring Architecture

TelemetryFlow is an observability platform that can also monitor itself (self-monitoring). The TFO Agent collects platform metrics, the Collector processes them, and the Backend stores them in ClickHouse for visualization in TFO Viz.

flowchart TD
    subgraph "TelemetryFlow Platform (Self-Monitoring)"
        BE["TFO Backend<br/>:8080"]
        COL["TFO Collector<br/>:4317/4318"]
        AG["TFO Agent"]
        CH[("ClickHouse")]
        VIZ["TFO Viz"]
    end

    subgraph "External Monitoring (Optional)"
        PROM["Prometheus<br/>ServiceMonitor"]
        GRAFANA["Grafana"]
        ALERTM["Alertmanager"]
    end

    subgraph "Infrastructure Metrics"
        BE_METRICS["Backend<br/>/metrics"]
        COL_METRICS["Collector<br/>:8889/metrics"]
        PG_EXP["Postgres Exporter"]
        CH_EXP["ClickHouse Exporter"]
        RD_EXP["Redis Exporter"]
        NT_EXP["NATS Exporter"]
    end

    AG -->|"Scrape host/container"| COL
    BE_METRICS -->|"Prometheus"| AG
    COL_METRICS -->|"Prometheus"| AG
    PG_EXP & CH_EXP & RD_EXP & NT_EXP -->|"Prometheus"| AG

    AG -->|"OTLP"| COL
    COL -->|"OTLP HTTP"| BE
    BE --> CH
    VIZ -->|"Query"| BE

    AG -->|"Scrape (if ServiceMonitor)"| PROM
    PROM --> GRAFANA
    PROM --> ALERTM

    style BE fill:#e8f5e9
    style COL fill:#fff3e0
    style CH fill:#fce4ec
    style PROM fill:#e1f5fe

Health Check Endpoints

Application Services

Service	Endpoint	Method	Expected	Interval	Purpose
TFO Backend	`/health/live`	GET	200	15s	Liveness — process is alive
TFO Backend	`/health/ready`	GET	200	10s	Readiness — can serve traffic
TFO Collector	`:13133/health`	GET	200	10s	Collector health check extension
TFO Agent	`:13133/`	GET	200	15s	Agent health check
TFO Viz	`/`	GET	200	15s	Frontend serving

Infrastructure Services

Service	Endpoint	Method	Expected	Purpose
PostgreSQL	`pg_isready -U <user> -d <db>`	Exec	exit 0	Database accepting connections
ClickHouse	`:8123/ping`	GET	`Ok.`	Server responding
Redis	`redis-cli -a <pass> ping`	Exec	`PONG`	Server responding
NATS	`:8222/healthz`	GET	200	JetStream healthy
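
The HTTP checks in the two tables above can be polled from a small script. The following is a minimal sketch, not part of TelemetryFlow: the `localhost` hostnames, the check names, and the helpers `http_status` and `run_checks` are assumptions made for illustration, and only the ports and paths are taken from the tables. It compares status codes only, so it does not verify the `Ok.` body that ClickHouse returns.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Tuple

# Hypothetical targets: hostnames are assumed for a local setup;
# ports and paths are copied from the health check tables.
HTTP_CHECKS: Dict[str, Tuple[str, int]] = {
    "backend-live": ("http://localhost:8080/health/live", 200),
    "backend-ready": ("http://localhost:8080/health/ready", 200),
    "collector": ("http://localhost:13133/health", 200),
    "clickhouse": ("http://localhost:8123/ping", 200),
    "nats": ("http://localhost:8222/healthz", 200),
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if it cannot connect."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return 0


def run_checks(
    checks: Dict[str, Tuple[str, int]],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each check name to True when the observed status matches."""
    return {
        name: fetch(url) == expected
        for name, (url, expected) in checks.items()
    }
```

Calling `run_checks(HTTP_CHECKS)` performs real requests. Passing a different `fetch` function makes the logic easy to exercise without a running stack.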

Docker Compose Health Checks

All infrastructure services in Docker Compose have built-in health checks:

postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U telemetryflow -d telemetryflow"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 30s

clickhouse:
  healthcheck:
    test: ["CMD-SHELL", "wget --spider --tries 1 http://localhost:8123/ping"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 30s

redis:
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 15s

nats:
  healthcheck:
    test: ["CMD-SHELL", "wget --spider --tries 1 http://localhost:8222/healthz"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 15s

Metrics Endpoints

Service	Endpoint	Format	Description
TFO Collector	`:8889/metrics`	Prometheus	Collector pipeline metrics (spans, metrics, logs processed)
TFO Backend	`/metrics`	Prometheus	Application metrics (request rate, latency, errors)
TFO Agent	`:2025/metrics`	Prometheus	Agent collection metrics
PostgreSQL Exporter	`:9187/metrics`	Prometheus	Database metrics (connections, replication, locks)
ClickHouse Exporter	`:9090/metrics`	Prometheus	Database metrics (queries, parts, merges)
Redis Exporter	`:9121/metrics`	Prometheus	Cache metrics (hits, misses, memory, connections)
NATS Exporter	`:7777/metrics`	Prometheus	JetStream metrics (messages, bytes, consumers)

Key Metrics to Monitor

Metric	Source	Alert Threshold
`otelcol_receiver_accepted_spans`	Collector	Sustained zero = pipeline issue
`otelcol_exporter_sent_spans`	Collector	Sustained zero = export failure
`otelcol_processor_refused_spans`	Collector	> 0 = processor overloaded
`otelcol_processor_dropped_spans`	Collector	> 0 = dropping data
`http_server_request_duration_seconds`	Backend	p99 > 5s
`http_server_active_requests`	Backend	Sustained high = overload
`pg_stat_activity_count`	PostgreSQL	Near `max_connections`
`clickhouse_queries`	ClickHouse	High error rate
`redis_connected_clients`	Redis	Near `maxclients`
`redis_memory_used_bytes`	Redis	Near `maxmemory`
`nats_jetstream_messages`	NATS	Unbounded growth = consumer lag
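
To spot-check the Collector thresholds above without a Prometheus server, the text returned by a `/metrics` endpoint can be parsed directly. This is a rough sketch under stated assumptions: the sample payload and its label values are invented for illustration, and `sum_by_name` and `pipeline_warnings` are hypothetical helpers rather than TelemetryFlow tooling. Because these metrics are cumulative counters, a non-zero value means data was refused or dropped at some point since the process started, which is a coarser signal than a `rate()` expression.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative payload in the Prometheus text format; values are made up.
SAMPLE = """\
# HELP otelcol_processor_dropped_spans Number of spans dropped.
# TYPE otelcol_processor_dropped_spans counter
otelcol_processor_dropped_spans{processor="batch"} 3
otelcol_processor_dropped_spans{processor="memory_limiter"} 2
otelcol_processor_refused_spans{processor="memory_limiter"} 0
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 1200
"""

# Counters from the table that should stay at zero.
MUST_BE_ZERO = (
    "otelcol_processor_refused_spans",
    "otelcol_processor_dropped_spans",
)


def sum_by_name(exposition: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comments."""
    totals: Dict[str, float] = defaultdict(float)
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            name = line[: line.index("{")]
            # The value follows the closing brace of the label set.
            rest = line[line.rindex("}") + 1:].split()
        else:
            name, *rest = line.split()
        totals[name] += float(rest[0])
    return dict(totals)


def pipeline_warnings(totals: Dict[str, float]) -> List[str]:
    """Return the zero-expected counters that are currently above zero."""
    return [name for name in MUST_BE_ZERO if totals.get(name, 0.0) > 0]
```

For example, `pipeline_warnings(sum_by_name(SAMPLE))` flags only the dropped-spans counter, since the refused-spans counter in the sample is still zero.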

Log Aggregation

Docker Compose

# View all logs
docker compose logs -f

# Specific service logs
docker compose logs -f backend
docker compose logs -f tfo-collector
docker compose logs -f tfo-agent

# Last 100 lines
docker compose logs --tail 100 backend

# With timestamps
docker compose logs -t backend

# Logs are stored in volumes
ls volumes/backend/logs/
ls volumes/clickhouse/logs/

Kubernetes

# Pod logs
kubectl logs -f deployment/tfo-backend -n telemetryflow
kubectl logs -f deployment/tfo-collector -n telemetryflow

# DaemonSet logs (kubectl picks one agent pod; pass a pod name to target a specific node)
kubectl logs -f daemonset/tfo-agent -n telemetryflow

# Previous container (crash)
kubectl logs deployment/tfo-backend --previous -n telemetryflow

# All pods with label
kubectl logs -l app=tfo-backend -n telemetryflow --all-containers

Structured Logging

TelemetryFlow services output structured JSON logs:

{
  "timestamp": "2026-05-30T12:00:00.000Z",
  "level": "info",
  "service": "tfo-backend",
  "message": "Request processed",
  "method": "GET",
  "path": "/api/v1/metrics",
  "statusCode": 200,
  "durationMs": 45
}
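
Because the logs are structured, they can be filtered by field instead of by text search. The sketch below is one way to do that, assuming each log entry is emitted as a single JSON object per line (the example above is pretty-printed for readability). The helper name `slow_or_failed` and the 1000 ms default are illustrative choices, not TelemetryFlow defaults.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def slow_or_failed(
    lines: Iterable[str], max_ms: float = 1000
) -> Iterator[Dict[str, Any]]:
    """Yield entries with a 5xx statusCode or durationMs above max_ms."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip non-JSON output such as startup banners.
            continue
        if not isinstance(entry, dict):
            continue
        status = entry.get("statusCode")
        duration = entry.get("durationMs")
        failed = isinstance(status, int) and status >= 500
        slow = isinstance(duration, (int, float)) and duration > max_ms
        if failed or slow:
            yield entry
```

The function accepts any iterable of lines, so it can read from an open log file or from `sys.stdin` when log output is piped into a script.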

Alerting Configuration

Kubernetes Probes (Built-in)

# Already configured in Helm values
tfoBackend:
  healthChecks:
    liveness:
      enabled: true
      httpGet:
        path: /health/live
        port: http
      initialDelaySeconds: 30
      periodSeconds: 15
      failureThreshold: 3
    readiness:
      enabled: true
      httpGet:
        path: /health/ready
        port: http
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
    startup:
      enabled: true
      httpGet:
        path: /health/live
        port: http
      initialDelaySeconds: 5
      failureThreshold: 30
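
The probe settings above translate into time windows that are worth checking against how long the Backend actually takes to start. The arithmetic below is a sketch: it assumes the startup probe falls back to the Kubernetes default `periodSeconds` of 10 because none is set, and it ignores `timeoutSeconds`, so the results are approximate. The `Probe` class is a hypothetical helper.

```python
from dataclasses import dataclass

# Kubernetes uses 10 seconds when periodSeconds is omitted.
K8S_DEFAULT_PERIOD_SECONDS = 10


@dataclass(frozen=True)
class Probe:
    failure_threshold: int
    period_seconds: int = K8S_DEFAULT_PERIOD_SECONDS
    initial_delay_seconds: int = 0

    def failure_window(self) -> int:
        """Seconds of consecutive failures before Kubernetes acts."""
        return self.failure_threshold * self.period_seconds

    def startup_budget(self) -> int:
        """Initial delay plus the failure window."""
        return self.initial_delay_seconds + self.failure_window()


# Values copied from the Helm snippet above.
liveness = Probe(failure_threshold=3, period_seconds=15, initial_delay_seconds=30)
readiness = Probe(failure_threshold=3, period_seconds=10, initial_delay_seconds=10)
startup = Probe(failure_threshold=30, initial_delay_seconds=5)

print(liveness.failure_window())   # 45: restart after about 45 s of failures
print(readiness.failure_window())  # 30: removed from endpoints after about 30 s
print(startup.startup_budget())    # 305: about 5 minutes allowed for startup
```

If the Backend needs longer than the startup budget, for example during a large migration, raising `failureThreshold` on the startup probe is usually safer than loosening the liveness probe.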

Prometheus Alert Rules (Example)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: telemetryflow-alerts
  namespace: telemetryflow
spec:
  groups:
    - name: telemetryflow
      rules:
        - alert: TelemetryFlowBackendDown
          expr: up{job="tfo-backend"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "TFO Backend is down"
            description: "Backend has been unreachable for 2 minutes"

        - alert: TelemetryFlowCollectorDroppingSpans
          expr: rate(otelcol_processor_dropped_spans[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Collector dropping spans"

        - alert: TelemetryFlowPostgresConnectionsHigh
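          # 80 assumes the PostgreSQL default max_connections of 100; sum() totals all databases and states
          expr: sum(pg_stat_activity_count) > 80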
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PostgreSQL connections approaching limit"

        - alert: TelemetryFlowClickHouseDiskSpace
          expr: clickhouse_disk_space_available_bytes < 10737418240
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "ClickHouse has less than 10 GiB free"

        - alert: TelemetryFlowRedisMemoryHigh
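          # The second clause skips instances with maxmemory unset (0), where the ratio would be +Inf
          expr: (redis_memory_used_bytes / redis_memory_max_bytes > 0.9) and (redis_memory_max_bytes > 0)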
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis using > 90% of max memory"
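
Two of the thresholds in these rules are easier to adjust once the arithmetic behind them is explicit. The sketch below reproduces the ClickHouse byte threshold and the Redis memory ratio in plain Python. The function name `redis_memory_high` is hypothetical, and the zero guard reflects the fact that Redis treats a `maxmemory` of 0 as "no limit", which would otherwise make the ratio a division by zero.

```python
GIB = 1024 ** 3

# 10 GiB expressed in bytes, matching the ClickHouse disk rule.
CLICKHOUSE_MIN_FREE_BYTES = 10 * GIB


def redis_memory_high(
    used_bytes: float, max_bytes: float, ratio: float = 0.9
) -> bool:
    """Return True when used memory exceeds the given share of maxmemory."""
    if max_bytes <= 0:
        # maxmemory 0 means unlimited, so there is no limit to approach.
        return False
    return used_bytes / max_bytes > ratio


print(CLICKHOUSE_MIN_FREE_BYTES)            # 10737418240
print(redis_memory_high(950, 1000))         # True
print(redis_memory_high(500, 0))            # False
```

To alert at a different level, such as 20 GiB free or 80% memory use, change the multiplier or the `ratio` argument and copy the resulting number into the rule expression.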

Enabling ServiceMonitors

# In Helm values
monitoring:
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s
    honorLabels: true

  exporters:
    redis:
      enabled: true
    nats:
      enabled: true
    postgres:
      enabled: true
    clickhouse:
      enabled: true

Dashboard Setup

TFO Viz (Built-in)

Access the built-in dashboard at http://localhost:8080 (or your configured URL). It provides:

- Infrastructure overview (hosts, containers, pods)
- Application performance metrics
- Log search and filtering
- Distributed trace visualization
- Alert management

Grafana (External)

If using an external Prometheus + Grafana stack:

1. Add your Prometheus server as a data source in Grafana. Prometheus scrapes the exporter endpoints; Grafana queries Prometheus.
2. Import dashboards or create custom panels for:
   - Backend request rate, latency, error rate
   - Collector pipeline throughput and drops
   - Database performance and connections
   - Cache hit rates and memory usage
   - NATS consumer lag
