HA And Failover Topology

The primary validated deployment contract in this repository remains a single-node host-first deployment. In addition, the repository now validates a customer-operated manual active-passive failover drill evidence surface for a two-host reference pattern. This document explains what is primary today, what HA proof is now validated, and what ACP still does not automate.

This is topology and claim-boundary guidance, not runtime automation. Do not describe ACP as automatic failover, cluster orchestration, or ACP-managed PostgreSQL replication.

Current Supported Topology

The current supported production topology is:

one Debian 12+ or Ubuntu 24.04+ host
one Docker/Compose runtime
one LiteLLM gateway
one PostgreSQL instance
optional overlays on the same host: Caddy TLS, Presidio DLP, LibreChat UI, offline mode
one canonical secrets file: /etc/ai-control-plane/secrets.env
automated local backups via ai-control-plane-backup.timer (default: daily, keep 7)
recovery by restoring host access, restoring the database, re-applying the host deployment, and re-running health checks

The tracked Ansible playbook, deploy/ansible/playbooks/gateway_host.yml, still converges one gateway host at a time. The validated two-host reference pattern reuses that single-host convergence surface against each host before and after a manual promotion. ACP still does not ship a cluster playbook, replication controller, fencing controller, or automatic traffic-cutover controller.

Clients
   |
   v
[ optional Caddy TLS ]
   |
   v
[ LiteLLM gateway ] <----> [ PostgreSQL ]
        \______________________________/
             same host, same storage,
             same operator recovery path

Any selected overlays share that same host-level failure domain.

What "High Availability" And "Failover" Mean Here

For this repository:

High availability means service remains available across a host or database failure without rebuilding the primary service from backup on that same node.
Failover means client traffic and authoritative state move to a secondary host or secondary database role after a failure.
Recovery means operators restore service after a failure by restarting services, restoring data, re-applying the deployment, or rebuilding a host.

Truthful boundary for the current supported topology:

Service restarts on the same host are recovery, not failover.
Backup/restore workflows are disaster recovery, not automatic HA.
The current supported host-first contract does not provide automatic failover for host loss, local storage loss, or database loss.
The current supported contract is single-node operation plus documented recovery.

Failure Domains In The Current Supported Topology

Failure domain	What fails together	Expected impact	Supported response today	Automatic failover?
LiteLLM process/container failure	Gateway request path on the host	Gateway requests fail until service is restarted or re-converged	`make health`, `./scripts/acpctl.sh doctor`, `./scripts/acpctl.sh host apply ...`	No
PostgreSQL process/container failure	Gateway state and request handling	The gateway can become unavailable or unhealthy because its database is unavailable	DB restore/status workflows, then re-apply and re-verify	No
Docker daemon / Compose runtime failure	All ACP containers on the host	Full control-plane outage on that host	Restore Docker health, re-run host workflow, verify health/smoke	No
Host OS / VM / hardware failure	Gateway, database, overlays, systemd timers, local artifacts	Full service outage	Restore a host, restore secrets, restore the database, re-apply the deployment	No
Local storage failure or corruption	Database files, repo checkout, local backups, local cert/runtime state	Outage plus possible data loss	Restore from a known-good backup copy and re-converge	No
Customer DNS / load balancer / network failure	External reachability	Clients may lose access even if ACP services are healthy	Customer-operated DNS/LB/network recovery	No

Backup, RPO, And RTO Truth

The default repo-backed protection mechanisms are scheduled backups and documented restore drills.

Default automated backup cadence: daily
Default retention: keep newest 7 backups
Default backup location in the tracked runtime: demo/backups/

Truthful implications:

With the default daily backup timer, the worst-case reference RPO is up to 24 hours of data loss since the last successful backup.
That RPO is a reference based on current timer defaults, not a universal SLA or guarantee.
If backups exist only on the same host/storage that failed, a disk or host-loss event can destroy both the live database and the local backup artifacts. In that case, effective data loss can be total.
Off-host backup copies and retention remain customer-owned. The repo now validates a staged off-host copy through an explicit manifest plus scratch-restore drill, with truthful staged-local and separate-host evidence labeling, but it still does not automate replication transport into S3, NFS, rsync targets, or other customer storage, and neither drill mode is HA or failover automation.
The repo does not currently publish a fixed minute-based RTO. Recovery time depends on spare-host availability, operator response, backup accessibility, network cutover, and environment-specific constraints.
The validated recovery sequence is: restore host access -> restore /etc/ai-control-plane/secrets.env -> restore the database -> re-apply the deployment -> verify with make health and make prod-smoke.

See DISASTER_RECOVERY.md for the supported restore workflow.

Customer-Owned Availability Responsibilities

The current host-first support boundary does not own these HA building blocks:

off-host backup copy/replication and longer-term retention remain customer-owned; ACP validates a staged recovery copy but does not perform the transport
replacement-host provisioning or standby-host readiness
DNS failover, load balancer health checks, VIP ownership, or traffic-manager cutover
network routing, firewall policy, and external reachability
PostgreSQL replication, fencing, and split-brain prevention in any multi-host design
monitoring/on-call procedures for customer infrastructure
customer-specific failover drills and acceptance criteria

Validated Active-Passive Reference Pattern: Manual Failover With PostgreSQL Replication

The next credible HA pattern is now validated as a customer-operated manual drill evidence surface.

Validated components now present in the repository:

typed failover-drill contract validation in internal/ha
operator command surface: acpctl host failover-drill
make entrypoint: make ha-failover-drill
repeatable private evidence bundling under demo/logs/evidence/ha-failover-drill/
operator runbook: HA_FAILOVER_RUNBOOK.md
example two-host inventory: deploy/ansible/inventory/hosts.ha.example.yml
production-only readiness-gate wiring in demo/config/readiness_evidence.yaml

Manual customer-operated active-passive failover proof only. ACP validates the drill contract and archives evidence for replication readiness, fencing, promotion, traffic cutover, and post-cutover checks. ACP does not automate PostgreSQL replication, promotion, or customer-owned DNS/load-balancer/VIP cutover.

Why this is the validated next step:

it preserves one authoritative PostgreSQL writer
it reduces the single-host outage gap without pretending ACP ships automatic HA
it fits the existing host-first operating model by converging one host at a time
it keeps customer-owned fencing and traffic cutover explicit

Reference pattern:

                    +-------------------------------+
Clients ----------->| Customer-owned DNS / LB / VIP |
                    +-------------------------------+
                               |
                               v
                     +-------------------+
                     | Active ACP host   |
                     | LiteLLM + Caddy   |
                     | PostgreSQL primary|
                     +-------------------+
                               |
                     streaming replication
                               |
                               v
                     +-------------------+
                     | Passive ACP host  |
                     | warm standby ACP  |
                     | PostgreSQL replica|
                     +-------------------+

Recommended characteristics:

Active host: serves client traffic and runs the PostgreSQL primary.
Passive host: remains provisioned and ready, with PostgreSQL receiving streaming replication and ACP deployment assets staged.
Customer-owned cutover: DNS, load balancer, or virtual IP determines which host receives traffic.
Promotion discipline: on primary loss, fence the failed primary first, promote the standby database, start and verify the standby gateway path, then swing traffic.
Secrets discipline: the canonical secrets material required for promotion must exist on both hosts through a secure customer-owned distribution process.
Backup discipline: keep backups off-host even when replication exists; replication protects availability, not accidental deletion or corruption.

Important boundary:

This pattern is a validated customer-operated manual failover-drill evidence surface in the current repository.
ACP still does not automate multi-host failover, PostgreSQL promotion, fencing, split-brain prevention, or customer-owned cutover.
Do not present this pattern as automatic failover or a managed cutover controller.

Decision Guide

Requirement	Truthful answer today
"Can the supported host-first deployment survive a single-host outage without downtime?"	The primary deployment topology is still single-node, so not by itself.
"Do scheduled backups equal HA?"	No. They improve recovery, not failover.
"Is disaster recovery supported?"	Yes. Backup, restore, scratch-restore drills, and re-apply workflows are part of the current contract.
"Is there a validated HA pattern?"	Yes. A customer-operated active-passive reference pattern is validated through the typed failover-drill contract, runbook, inventory example, and repeatable evidence workflow.
"Is automatic failover supported?"	No. ACP does not automate PostgreSQL replication, promotion, fencing, split-brain prevention, or customer-owned DNS/load-balancer/VIP cutover.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HA And Failover Topology

Current Supported Topology

What "High Availability" And "Failover" Mean Here

Failure Domains In The Current Supported Topology

Backup, RPO, And RTO Truth

Customer-Owned Availability Responsibilities

Validated Active-Passive Reference Pattern: Manual Failover With PostgreSQL Replication

Decision Guide

Related Documents

FilesExpand file tree

HA_FAILOVER_TOPOLOGY.md

Latest commit

History

HA_FAILOVER_TOPOLOGY.md

File metadata and controls

HA And Failover Topology

Current Supported Topology

What "High Availability" And "Failover" Mean Here

Failure Domains In The Current Supported Topology

Backup, RPO, And RTO Truth

Customer-Owned Availability Responsibilities

Validated Active-Passive Reference Pattern: Manual Failover With PostgreSQL Replication

Decision Guide

Related Documents