Architecture Overview

System architecture for TelemetryFlow Hermes — a multi-agent AI incident response pipeline integrated with the TelemetryFlow Observability Platform.

High-Level Architecture

graph TB
    subgraph "TelemetryFlow Platform"
        TFO_API["TFO API<br/>/api/v2/*"]
        TFO_LLM["LLM Module<br/>Chat · Insights · Providers"]
        TFO_CH["ClickHouse<br/>Metrics · Logs · Traces · Exemplars"]
        TFO_ALERT["Alert Engine<br/>33 Alert Rules"]
        TFO_AUTH["Auth Module<br/>JWT · API Keys · RBAC"]
    end

    subgraph "Hermes Agent Framework"
        direction TB
        H_CORE["AIAgent Core<br/>ReAct Loop"]
        H_SOUL["SOUL.md<br/>Identity"]
        H_MEM["Memory<br/>Tier 1 · Tier 2 · Tier 3"]
        H_SKILLS["Skills<br/>29 Bundled · 18 Categories"]
        H_PLUGIN["TelemetryFlow Plugin<br/>40 Tools"]
    end

    subgraph "Agent Team (Adversarial Scientists)"
        TRIAGE["Triage<br/>Paranoid Gatekeeper<br/>glm-5.1"]
        INV["Investigator<br/>Hostile Scientist<br/>claude-sonnet-4-5"]
        REV["Reviewer<br/>Professional Skeptic<br/>glm-5.1"]
        REM["Remediator<br/>Cautious Pragmatist<br/>glm-5.1"]
    end

    subgraph "Human Interface"
        TG["Telegram<br/>4 Bot Tokens"]
        HUMAN["On-Call Engineer"]
    end

    TFO_ALERT -->|"webhook"| TRIAGE
    TRIAGE -->|"delegate"| INV
    INV -->|"delegate"| REV
    REV -->|"delegate"| REM
    REM -->|"approval request"| TG
    TG -->|"approve/reject"| HUMAN

    H_CORE --> H_SOUL
    H_CORE --> H_MEM
    H_CORE --> H_SKILLS
    H_CORE --> H_PLUGIN

    H_PLUGIN -->|"REST API"| TFO_API
    TFO_API --> TFO_LLM
    TFO_API --> TFO_CH
    TFO_API --> TFO_AUTH

    INV -.->|"query_metrics<br/>search_logs<br/>list_traces<br/>get_exemplars"| TFO_CH
    REV -.->|"read-only verification"| TFO_CH

Data Flow — Incident Response Pipeline

sequenceDiagram
    participant TF as TelemetryFlow
    participant TR as Triage Agent
    participant INV as Investigator Agent
    participant REV as Reviewer Agent
    participant REM as Remediator Agent
    participant H as Human (Telegram)

    TF->>TR: Alert webhook fires
    activate TR
    TR->>TR: Assume alert is lying
    TR->>TR: Classify severity (zero hallucination)
    TR->>TR: Check MEMORY.md for known patterns
    TR->>INV: Delegate investigation (with challenge if INCOMPLETE)
    deactivate TR

    activate INV
    INV->>TF: query_metrics (metrics_1m)
    INV->>TF: search_logs (otel_logs)
    INV->>TF: list_traces (otel_traces)
    INV->>TF: get_exemplars (memory_usage)
    INV->>INV: Form hypotheses + falsify each (guilty until proven innocent)
    INV->>INV: Document dead hypotheses
    INV->>REV: Delegate review (surviving hypothesis + evidence)
    deactivate INV

    activate REV
    REV->>TF: Verify evidence (read-only)
    REV->>REV: Hunt for disconfirming evidence
    REV->>REV: Verdict: CONFIRMED / NEEDS_MORE_EVIDENCE / REJECTED
    REV->>REM: CONFIRMED hypothesis
    deactivate REV

    activate REM
    REM->>REM: Gate 1: Root cause confirmed?
    REM->>REM: Gate 2: Proportional response?
    REM->>REM: Blast radius analysis
    REM->>H: Telegram: Alert + Evidence + Proposed Action
    H->>REM: Approve / Reject
    REM->>TF: Execute remediation
    REM->>REM: Post-action verification (mandatory)
    deactivate REM

Component Architecture

graph LR
    subgraph "Project Files"
        direction TB
        CFG["config.yaml<br/>Default agent config"]
        ENV[".env<br/>API keys & secrets"]
        SOUL["SOUL.md<br/>Agent identity"]
        MEM["memories/<br/>MEMORY.md · USER.md"]
        PROFILES["profiles/<br/>triage · investigator · reviewer · remediator"]
        SKILLS["skills/<br/>observability/ · database-monitoring/"]
        PLUGIN["plugins/telemetryflow/<br/>plugin.yaml + 40 tools"]
        CRON["cron/<br/>jobs.json (6 scheduled tasks)"]
        HOOKS["hooks/<br/>pre-investigation · post-remediation · on-alert-fired"]
        SEC["security/<br/>clickhouse-readonly.sql"]
        SCRIPTS["scripts/<br/>install · setup · verify · deploy"]
    end

    subgraph "TFO Platform Monolith"
        direction TB
        LLM_MOD["backend/src/modules/llm/"]
        AUTH_MOD["backend/src/modules/auth/"]
        CH_MOD["ClickHouse client"]
    end

    CFG --> PROFILES
    ENV --> PLUGIN
    PLUGIN -->|"tfo_request()"| LLM_MOD
    PLUGIN -->|"clickhouse_query()"| CH_MOD
    HOOKS --> SCRIPTS
    SEC --> CH_MOD

Technology Stack

Layer	Technology	Purpose
Agent Framework	Hermes Agent (Nous Research)	Self-improving AI agent with memory, skills, multi-agent
LLM Providers	Anthropic, Zhipu (OpenCode Go), OpenRouter, Ollama	Model inference for different agent roles
Observability	TelemetryFlow Platform	Metrics, Logs, Traces, Exemplars in ClickHouse
Database	ClickHouse	Columnar storage for all telemetry signals
Plugin Runtime	Python 3 (stdlib only)	40 tools using `urllib`, `json`, `sys`
Communication	Telegram Bot API	Human-in-the-loop notifications
Container Orchestration	Kubernetes	Pod/deployment management for remediation
Security	JWT, API Keys (tfk/tfs), AES-256-GCM	Authentication and encryption

Directory Structure

telemetryflow-hermes/
├── config.yaml                          # Default Hermes agent configuration
├── SOUL.md                              # Default agent identity
├── .env.example                         # API key template (3 auth methods)
├── Makefile                             # init, configure, deploy, verify, docker, clean targets
├── pyproject.toml                       # Python project config (pytest, ruff, coverage)
├── Dockerfile                           # Multi-stage Docker (python:3.13-slim-trixie)
├── docker-compose.yaml                  # 4 profiles: core, monitoring, tools, all
├── run-container.sh                     # Build, tag, push, compose orchestration
│
├── profiles/                            # Multi-agent team (4 profiles)
│   ├── triage/                          # Paranoid gatekeeper
│   │   ├── config.yaml                  #   glm-5.1, max_turns=30, readonly
│   │   ├── SOUL.md                      #   Paranoid gatekeeper identity
│   │   └── memories/                    #   MEMORY.md + USER.md
│   ├── investigator/                    # Hostile scientist
│   │   ├── config.yaml                  #   claude-sonnet-4-5, max_turns=45
│   │   ├── SOUL.md                      #   Hostile scientist identity
│   │   └── memories/
│   ├── reviewer/                        # Professional skeptic
│   │   ├── config.yaml                  #   glm-5.1, max_turns=20, readonly
│   │   ├── SOUL.md                      #   Professional skeptic identity
│   │   └── memories/
│   └── remediator/                      # Cautious pragmatist
│       ├── config.yaml                  #   glm-5.1, max_turns=15, require_approval
│       ├── SOUL.md                      #   Cautious pragmatist identity
│       └── memories/
│
├── skills/                              # 29 bundled skills (18 categories)
│   ├── monitoring/                      #   8 skills (uptime, vm, agent, k8s, service-map, network-map, ...)
│   ├── observability/                   #   9 skills (k8s-pod-debug, payments-api-oom-rca, ...)
│   ├── database-monitoring/             #   2 skills (slow-query-detection, qan-analysis)
│   ├── alerting/                        #   alert-management
│   ├── dashboard/                       #   dashboard-management
│   ├── reporting/                       #   report-automation
│   ├── retention/                       #   retention-management
│   ├── audit/                           #   audit-compliance
│   ├── subscription/                    #   subscription-management
│   ├── tenancy/                         #   tenancy-administration
│   ├── iam/                             #   iam-administration
│   ├── sso/                             #   sso-configuration
│   ├── query/                           #   tfql-query
│   ├── ai-intelligence/                 #   ai-intelligence
│   └── ...                              #   18 categories total
│
├── plugins/                             # TelemetryFlow plugin
│   └── telemetryflow/
│       ├── plugin.yaml                  # v1.2.0 — 40 tools
│       └── tools/                       # 40 Python tools (stdlib only)
│           ├── _shared.py               # TFO API helpers, 74 ContextTypes, 15 ProviderTypes
│           │
│           │ # ── Core Telemetry (5) ──
│           ├── query_metrics.py
│           ├── search_logs.py
│           ├── list_traces.py
│           ├── get_exemplars.py
│           ├── query_correlations.py
│           │
│           │ # ── Monitoring (8) ──
│           ├── check_k8s.py
│           ├── check_infra.py
│           ├── check_uptime.py
│           ├── check_vm.py
│           ├── check_agent.py
│           ├── check_service_map.py
│           ├── check_network_map.py
│           ├── check_db_monitoring.py
│           │
│           │ # ── AI & LLM (7) ──
│           ├── chat_with_context.py
│           ├── stream_chat.py
│           ├── manage_conversation.py
│           ├── generate_insight.py
│           ├── query_llm_usage.py
│           ├── manage_provider.py
│           ├── query_ai_intelligence.py
│           │
│           │ # ── Platform (8) ──
│           ├── query_platform.py
│           ├── query_account.py
│           ├── query_audit.py
│           ├── query_subscription.py
│           ├── manage_dashboards.py
│           ├── manage_alerts.py
│           ├── manage_reports.py
│           ├── manage_data_masking.py
│           │
│           │ # ── Infrastructure (6) ──
│           ├── manage_retention.py
│           ├── manage_tenancy.py
│           ├── manage_iam.py
│           ├── manage_sso.py
│           ├── query_tfql.py
│           │
│           │ # ── Remediation (3+1) ⚠ requires_approval ──
│           ├── scale_deployment.py       # ⚠
│           ├── restart_pod.py            # ⚠
│           ├── rollback_deploy.py        # ⚠
│           └── update_alert.py           # ⚠
│           │
│           │ # ── RCA & Postmortem (3) ──
│           ├── generate_rca_report.py    # 5W analysis, Jira/Trello
│           ├── generate_postmortem.py    # Lessons learned, timeline
│           └── generate_rca_template.py  # Blank template for manual review
│
├── tests/                               # 521 tests, 99% coverage
│   ├── conftest.py                      # Shared fixtures
│   ├── mocks/                           # MockTFOApi, response factories
│   ├── unit/                            # 37 tool test files
│   └── integration/                     # Pipeline integration tests
│
├── cron/                                # Scheduled tasks
│   ├── jobs.json                        # 6 cron jobs
│   └── output/                          # Cron run outputs
│
├── scripts/                             # Deployment scripts
│   ├── install.sh                       # Hermes Agent installer
│   ├── setup-profiles.sh                # Create 4 agent profiles
│   ├── setup-telegram.sh                # Configure Telegram gateways
│   ├── verify-pipeline.sh               # End-to-end pipeline verification
│   └── deploy-air-gapped.sh             # Air-gapped deployment
│
├── security/                            # Database security
│   ├── clickhouse-readonly.sql          # Read-only user (20 tables)
│   └── setup-readonly-user.sh           # Automated setup script
│
├── hooks/                               # Lifecycle hooks
│   ├── on-alert-fired.sh                # Alert enrichment before triage
│   ├── pre-investigation.sh             # Investigation context logging
│   └── post-remediation.sh              # Remediation outcome tracking
│
├── .github/workflows/                   # GitHub Actions CI/CD
│   ├── ci.yml                           #   lint → test-unit → test-integration → security → coverage
│   ├── docker.yml                       #   Multi-platform Docker build (amd64/arm64)
│   └── release.yml                      #   Tag-triggered release
├── .gitlab-ci.yml                       # GitLab CI/CD pipeline
│
└── docs/                                # Documentation wiki (28+ pages)
    ├── README.md                        # Wiki index
    ├── getting-started.md
    ├── architecture.md
    ├── tfo-hermes.md                    # Marp presentation
    ├── agents/                          # Agent docs
    ├── tools/                           # Tool overview + reference
    ├── skills/                          # Skill overview + reference
    ├── api/                             # Auth, LLM module, context types
    ├── deployment/                      # Standard, Docker, air-gapped
    ├── security/                        # Security overview + ClickHouse
    ├── configuration/                   # Environment variables reference
    └── operations/                      # Cron, hooks, troubleshooting

Key Design Decisions

1. Python stdlib only (no pip dependencies)

All 40 plugin tools use only Python standard library (urllib, json, sys, os). No requests, httpx, or external packages. This maximizes portability and eliminates supply chain risk.

2. All queries go through TFO API

ClickHouse queries are routed through POST /api/v2/telemetry/query, not direct connections. This ensures:

Authentication and authorization via TFO's auth guards
Workspace-scoped data isolation
Audit logging of all queries
Rate limiting per organization

3. Cost-optimized model selection

Agent	Model	Cost/Incident	Why
Triage	glm-5.1 (OpenCode Go)	~$0.01	Simple classification, fast response
Investigator	claude-sonnet-4-5	~$0.05-0.15	Complex reasoning, evidence correlation
Reviewer	glm-5.1	~$0.03-0.08	Verification, not creative reasoning
Remediator	glm-5.1	~$0.01-0.03	Action proposal, not investigation

Total: ~$0.10-0.27/incident (vs ~$0.39 with Claude-only)

4. Adversarial scientist architecture with separate contexts

All agents operate as adversarial scientists who assume inputs are wrong and actively falsify hypotheses. The Reviewer agent runs in a completely separate context — it only sees the evidence and hypothesis, not the Investigator's thought process or dead hypotheses. This prevents:

Confirmation bias (defending conclusions)
Anchoring bias (overweighting initial findings)
Sunk cost bias (continuing a failed approach)
Conspiracy bias (accepting pre-falsified narratives)

Verdicts are strictly CONFIRMED / NEEDS_MORE_EVIDENCE / REJECTED — "Looks good to me" is banned.

5. Human-in-the-loop for all mutations (three-gate model)

The Remediator enforces three gates before any action:

Confirmed root cause — Reviewer verdict must be CONFIRMED
Proportional response — action must fit the incident severity
Human approval — explicit Telegram approval required

Four tools require explicit human approval via Telegram:

scale_deployment — changes replicas
restart_pod — kills running pods
rollback_deploy — reverts deployments
update_alert — modifies alert rules

The Remediator has a 600-second approval timeout with auto-escalation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture Overview

High-Level Architecture

Data Flow — Incident Response Pipeline

Component Architecture

Technology Stack

Directory Structure

Key Design Decisions

1. Python stdlib only (no pip dependencies)

2. All queries go through TFO API

3. Cost-optimized model selection

4. Adversarial scientist architecture with separate contexts

5. Human-in-the-loop for all mutations (three-gate model)

FilesExpand file tree

architecture.md

Latest commit

History

architecture.md

File metadata and controls

Architecture Overview

High-Level Architecture

Data Flow — Incident Response Pipeline

Component Architecture

Technology Stack

Directory Structure

Key Design Decisions

1. Python stdlib only (no pip dependencies)

2. All queries go through TFO API

3. Cost-optimized model selection

4. Adversarial scientist architecture with separate contexts

5. Human-in-the-loop for all mutations (three-gate model)