docs: expand cloud-storage section with Overview, Liquid-Ceph, and Observability

senolcolak · senolcolak · commit 9201f7be511d · 2026-06-09T14:47:47.000+02:00
Enrich the cloud-storage sidebar with:

- **Overview** (index.md): rewritten with full architecture description,
  component table, storage interface summary (RBD/CephFS/RGW), data flow
  diagram, and HA explanation. Replaces the previous thin component list.

- **Liquid-Ceph** (liquid-ceph.md): new page introducing the dynamic
  storage allocation component with a placeholder for upcoming detail.

- **Observability/** (new subfolder): dedicated subsection covering the
  monitoring stack for cloud storage.
  - index.md: overview of key metric categories and alerting strategy
  - prometheus.md: exporter sources, retention, and alert rule catalogue
  - perses.md: dashboard inventory and dashboard-as-code approach

Sidebar will now render:
  Cloud Storage > Overview, Ceph, Rook, Prysm, Chorus, Arbiter,
  Liquid-Ceph, Observability > (Overview, Prometheus, Perses)
diff --git a/docs/architecture/cloud-storage/index.md b/docs/architecture/cloud-storage/index.md
@@ -1,15 +1,74 @@
 ---
-title: Cloud Storage
+title: Overview
 ---
 
-# Cloud Storage
+# Cloud Storage Overview
 
-CobaltCore's cloud storage layer is built on [Ceph](./ceph.md), an open-source, distributed storage system that delivers object, block, and file storage in a single unified platform. The surrounding components manage its lifecycle, replication, observability, and high-availability configuration.
+CobaltCore's cloud storage layer is built on [Ceph](./ceph.md), a distributed storage system that delivers object, block, and file storage in a single unified platform. The surrounding components handle lifecycle automation, data replication, high-availability quorum, observability, and liquid storage allocation — each with a focused responsibility.
 
-| Component | Role |
-|-----------|------|
-| [Ceph](./ceph.md) | Distributed storage engine |
-| [Rook](./rook.md) | Kubernetes operator that automates Ceph deployment and management |
-| [Chorus](./chorus.md) | Data replication across object storage systems |
-| [Arbiter](./arbiter.md) | External arbiter monitors for Ceph quorum in stretched clusters |
-| [Prysm](./prysm.md) | Observability CLI for Ceph clusters and RGW deployments |
+## Architecture
+
+The storage stack is organized into three layers, with [Observability](./observability/) covering all of them:
+
+**Foundation** — Ceph provides the core distributed storage engine. All other components either operate it, extend it, or observe it.
+
+**Operations** — [Rook](./rook.md) runs as a Kubernetes operator and manages the full lifecycle of Ceph daemons (monitors, managers, OSDs, MDS, RGW) as containerized workloads. [Arbiter](./arbiter.md) extends quorum into stretched cluster topologies by deploying external Ceph monitors that Rook does not manage directly.
+
+**Data Services** — [Chorus](./chorus.md) provides zero-downtime data replication and migration between object storage systems (S3 and Swift). [Liquid-Ceph](./liquid-ceph.md) enables dynamic, on-demand storage allocation across the cluster. [Prysm](./prysm.md) delivers a CLI-based observability layer over Ceph clusters and RGW deployments.
+
+## Components
+
+| Component | Layer | Role |
+|-----------|-------|------|
+| [Ceph](./ceph.md) | Foundation | Distributed storage engine — block (RBD), file (CephFS), object (RGW) |
+| [Rook](./rook.md) | Operations | Kubernetes operator for Ceph lifecycle management |
+| [Arbiter](./arbiter.md) | Operations | External Ceph monitors for quorum in stretched clusters |
+| [Chorus](./chorus.md) | Data Services | Zero-downtime object storage replication and migration |
+| [Liquid-Ceph](./liquid-ceph.md) | Data Services | Dynamic storage allocation across the Ceph cluster |
+| [Prysm](./prysm.md) | Data Services | Observability CLI for Ceph clusters and RGW |
+| [Observability](./observability/) | Observability | Metrics, dashboards, and alerting for the storage stack |
+
+## Storage Interfaces
+
+Ceph exposes three storage interfaces that CobaltCore services consume:
+
+- **RBD (RADOS Block Device)** — thin-provisioned, resizable block volumes used by virtual machines and databases. Striped across OSDs for parallel I/O and backed by RADOS snapshots and replication.
+- **CephFS** — POSIX-compliant distributed filesystem. Metadata is managed by a dedicated MDS cluster; data is striped across OSDs. Supports snapshots, quotas, and multiple active MDS daemons for horizontal metadata scaling.
+- **RGW (RADOS Gateway)** — S3 and Swift-compatible object storage gateway. Supports multi-tenancy, versioning, lifecycle policies, server-side encryption, and multi-site active-active replication.
+
+## Data Flow
+
+```
+Applications / VMs
+        │
+┌───────┴────────────────────┐
+│  RBD  │  CephFS  │  RGW    │  ← Ceph interfaces
+└───────┴────────────────────┘
+        │
+    RADOS (Reliable Autonomic Distributed Object Store)
+        │
+   OSDs across cluster nodes
+        │
+   ┌────┴─────┐
+   │  Rook    │  ← manages daemon lifecycle via Kubernetes CRDs
+   └──────────┘
+        │
+   ┌────┴──────┐   ┌─────────┐   ┌────────────┐
+   │  Arbiter  │   │  Chorus │   │ Liquid-Ceph│
+   └───────────┘   └─────────┘   └────────────┘
+   (quorum)        (replication)  (allocation)
+        │
+   ┌────┴──────┐
+   │   Prysm   │  ← observability CLI
+   └───────────┘
+```
+
+## High Availability
+
+Ceph achieves HA through monitor quorum (typically 3 or 5 monitors), OSD replication or erasure coding, and MDS standby daemons. In stretched deployments that span two sites, [Arbiter](./arbiter.md) deploys an additional monitor at a third, tiebreaker site so that a majority of monitors stays reachable, and quorum is maintained, even if one full site goes offline.
+
+## See Also
+
+- [Observability](./observability/) — Prometheus metrics and Perses dashboards for the storage stack
+- [Ceph upstream architecture docs](https://docs.ceph.com/en/latest/architecture/)
+- [Rook documentation](https://rook.io/docs/rook/latest-release/Getting-Started/intro/)
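
A minimal sketch of the quorum arithmetic behind the High Availability section of `index.md`, assuming quorum means a strict majority of all configured monitors. The site names and monitor counts are hypothetical and only illustrate why a two-site layout needs a tiebreaker monitor.

```python
def quorum_size(total_monitors: int) -> int:
    """Smallest strict majority of the monitor set."""
    return total_monitors // 2 + 1


def has_quorum(monitors_per_site: dict[str, int], failed_site: str) -> bool:
    """Return True if the surviving monitors still form a strict majority."""
    total = sum(monitors_per_site.values())
    surviving = total - monitors_per_site[failed_site]
    return surviving >= quorum_size(total)


# Hypothetical layouts: names and counts are illustrative, not CobaltCore defaults.
two_sites = {"site-a": 2, "site-b": 2}
with_tiebreaker = {"site-a": 2, "site-b": 2, "tiebreaker": 1}

print(has_quorum(two_sites, "site-a"))        # False: 2 of 4 survive, majority is 3
print(has_quorum(with_tiebreaker, "site-a"))  # True: 3 of 5 survive, majority is 3
```

With an even split across two sites, losing either site leaves exactly half of the monitors, which is not a majority. The single extra monitor at the third site is what tips the surviving set over the threshold.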
diff --git a/docs/architecture/cloud-storage/liquid-ceph.md b/docs/architecture/cloud-storage/liquid-ceph.md
@@ -0,0 +1,16 @@
+---
+title: Liquid-Ceph
+---
+
+# Liquid-Ceph
+
+Liquid-Ceph enables dynamic, on-demand storage allocation across the CobaltCore Ceph cluster. It abstracts the complexity of pool and quota management, allowing workloads to claim storage capacity fluidly without manual pre-provisioning steps.
+
+::: info
+Detailed documentation for Liquid-Ceph is in progress. This page will be updated as the component matures.
+:::
+
+## See Also
+
+- [Ceph](./ceph.md) — the underlying distributed storage engine
+- [Rook](./rook.md) — Kubernetes operator managing Ceph lifecycle
diff --git a/docs/architecture/cloud-storage/observability/index.md b/docs/architecture/cloud-storage/observability/index.md
@@ -0,0 +1,35 @@
+---
+title: Overview
+---
+
+# Observability Overview
+
+CobaltCore monitors the cloud storage stack through a combination of Prometheus-based metrics collection and Perses dashboards. Together they provide real-time visibility into Ceph cluster health, OSD performance, RGW throughput, and storage capacity trends.
+
+## Stack
+
+| Component | Role |
+|-----------|------|
+| [Prometheus](./prometheus.md) | Scrapes and stores time-series metrics from Ceph, Rook, and RGW exporters |
+| [Perses](./perses.md) | Dashboard platform for visualizing storage metrics |
+
+## Key Metrics
+
+The following signal categories are covered by the observability stack:
+
+- **Cluster health** — overall Ceph health status, OSD up/in counts, monitor quorum state
+- **Capacity** — raw and usable capacity, per-pool usage, growth rate projections
+- **Performance** — OSD read/write latency, IOPS, throughput per interface (RBD, CephFS, RGW)
+- **RGW** — request rates, error rates, bandwidth per bucket and user
+- **Replication** — Chorus replication lag, sync success/failure rates
+- **Availability** — Arbiter monitor reachability, MDS active/standby state
+
+## Alerting
+
+Alerts are defined as Prometheus rules and surfaced through the CobaltCore alerting pipeline. Critical alert conditions include OSD near-full (usage above 85%), a degraded cluster state, loss of monitor quorum, and spikes in the RGW error rate.
+
+## See Also
+
+- [Prometheus](./prometheus.md)
+- [Perses](./perses.md)
+- [Prysm](../prysm.md) — CLI-based observability for Ceph and RGW
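
The "growth rate projections" listed under Capacity in `observability/index.md` can be illustrated with a simple linear trend. This is a sketch under stated assumptions, not the method CobaltCore uses: it assumes one usage sample per day, expressed as a fraction of total capacity, and fits an ordinary least-squares line. The function name and the sample data are hypothetical.

```python
from typing import Optional, Sequence


def days_until_threshold(
    daily_used_fraction: Sequence[float], threshold: float = 0.85
) -> Optional[float]:
    """Estimate days until usage reaches the threshold, or None if not growing.

    Fits a least-squares slope over the samples (one per day) and
    extrapolates from the most recent sample.
    """
    n = len(daily_used_fraction)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")

    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_fraction) / n
    covariance = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used_fraction)
    )
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance  # growth in used fraction per day

    latest = daily_used_fraction[-1]
    if latest >= threshold:
        return 0.0
    if slope <= 0:
        return None  # usage is flat or shrinking
    return (threshold - latest) / slope


# Hypothetical samples: usage grows by one percentage point per day.
samples = [0.60, 0.61, 0.62, 0.63]
print(days_until_threshold(samples))  # roughly 22 days until 85%
```

A linear fit is the simplest possible model. Real capacity growth is often bursty, so a projection like this is best read as a rough early warning rather than a forecast.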
diff --git a/docs/architecture/cloud-storage/observability/perses.md b/docs/architecture/cloud-storage/observability/perses.md
@@ -0,0 +1,30 @@
+---
+title: Perses
+---
+
+# Perses
+
+Perses is the dashboard platform used in CobaltCore to visualize cloud storage metrics collected by [Prometheus](./prometheus.md). It provides pre-built dashboards for Ceph cluster health, OSD performance, RGW traffic, and capacity planning.
+
+## Dashboards
+
+| Dashboard | Purpose |
+|-----------|---------|
+| Ceph Cluster Overview | Health status, OSD counts, monitor quorum, capacity summary |
+| OSD Performance | Per-OSD read/write latency, IOPS, throughput |
+| Pool Usage | Capacity and object counts per Ceph pool |
+| RGW Traffic | Request rate, error rate, bandwidth per bucket and user |
+| Replication Status | Chorus sync lag and success/failure rates |
+
+## Dashboard-as-Code
+
+Dashboards are managed as code using the Perses CUE SDK and deployed via CI. This ensures dashboards are version-controlled alongside the rest of the CobaltCore configuration.
+
+::: info
+Dashboard definitions and deployment configuration are in progress.
+:::
+
+## See Also
+
+- [Prometheus](./prometheus.md) — metrics source for all dashboards
+- [Observability Overview](./index.md)
diff --git a/docs/architecture/cloud-storage/observability/prometheus.md b/docs/architecture/cloud-storage/observability/prometheus.md
@@ -0,0 +1,37 @@
+---
+title: Prometheus
+---
+
+# Prometheus
+
+Prometheus collects and stores time-series metrics from the CobaltCore cloud storage stack. It scrapes exporters provided by Ceph, Rook, and the RADOS Gateway, making storage metrics available for alerting and dashboard queries.
+
+## Exporters
+
+| Exporter | Source | Metrics |
+|----------|--------|---------|
+| `ceph-exporter` | Ceph daemons | OSD stats, pool usage, cluster health, latency histograms |
+| `rook-ceph-mgr` | Rook Ceph manager | Operator status, daemon lifecycle events |
+| `radosgw-exporter` | RGW | Request rates, error rates, per-user and per-bucket bandwidth |
+
+## Retention and Storage
+
+Metrics are retained according to the cluster-wide Prometheus retention policy. Long-term storage is handled by the remote-write pipeline configured in the CobaltCore monitoring stack.
+
+## Alert Rules
+
+Storage-specific alert rules are maintained alongside the other CobaltCore alerting rules. Key rules include:
+
+- `CephHealthWarning` / `CephHealthError` — cluster health degradation
+- `CephOSDNearFull` — OSD usage exceeding 85%
+- `CephMonQuorumLost` — loss of monitor quorum
+- `RGWHighErrorRate` — elevated 5xx rate on the gateway
+
+::: info
+Detailed rule definitions and Prometheus configuration are in progress.
+:::
+
+## See Also
+
+- [Perses](./perses.md) — dashboard platform consuming these metrics
+- [Observability Overview](./index.md)
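
The conditions behind two of the alert rules named in `prometheus.md` can be sketched in plain Python. This is illustrative only: the real rules would be Prometheus rule definitions, which the page marks as in progress. The 85% ratio comes from the `CephOSDNearFull` description; the function names and the sample values are hypothetical.

```python
NEAR_FULL_RATIO = 0.85  # threshold stated for CephOSDNearFull


def osds_near_full(
    used_bytes: dict[str, int],
    capacity_bytes: dict[str, int],
    ratio: float = NEAR_FULL_RATIO,
) -> list[str]:
    """Return the sorted IDs of OSDs whose usage exceeds the given ratio."""
    return sorted(
        osd
        for osd, used in used_bytes.items()
        if capacity_bytes[osd] > 0 and used / capacity_bytes[osd] > ratio
    )


def error_rate_5xx(responses_5xx: int, responses_total: int) -> float:
    """Fraction of gateway responses that were 5xx; 0.0 when there is no traffic."""
    if responses_total == 0:
        return 0.0
    return responses_5xx / responses_total


# Hypothetical sample values for three OSDs with equal capacity.
used = {"osd.0": 900, "osd.1": 500, "osd.2": 851}
capacity = {"osd.0": 1000, "osd.1": 1000, "osd.2": 1000}

print(osds_near_full(used, capacity))  # ['osd.0', 'osd.2']
print(error_rate_5xx(5, 200))          # 0.025
```

The page does not state a threshold for `RGWHighErrorRate`, so the sketch only computes the rate and leaves the cut-off to the actual rule definition.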