cobaltcore-dev · senolcolak · Jun 8, 2026 · Jun 9, 2026
diff --git a/docs/architecture/arbiter.md → docs/architecture/cloud-storage/arbiter.md b/docs/architecture/arbiter.md → docs/architecture/cloud-storage/arbiter.md
diff --git a/docs/architecture/ceph.md → docs/architecture/cloud-storage/ceph.md b/docs/architecture/ceph.md → docs/architecture/cloud-storage/ceph.md
diff --git a/docs/architecture/chorus.md → docs/architecture/cloud-storage/chorus.md b/docs/architecture/chorus.md → docs/architecture/cloud-storage/chorus.md
diff --git a/docs/architecture/cloud-storage/index.md b/docs/architecture/cloud-storage/index.md
@@ -0,0 +1,74 @@
+---
+title: Overview
+---
+
+# Cloud Storage Overview
+
+CobaltCore's cloud storage layer is built on [Ceph](./ceph.md), a distributed storage system that delivers object, block, and file storage in a single unified platform. The surrounding components handle lifecycle automation, data replication, high-availability quorum, observability, and dynamic storage allocation, each with a focused responsibility.
+
+## Architecture
+
+The storage stack is organized into three layers, with [Observability & Audit](./observability/) cutting across all of them:
+
+**Foundation** — Ceph provides the core distributed storage engine. All other components either operate it, extend it, or observe it.
+
+**Operations** — [Rook](./rook.md) runs as a Kubernetes operator and manages the full lifecycle of Ceph daemons (monitors, managers, OSDs, MDS, RGW) as containerized workloads. [Arbiter](./arbiter.md) extends quorum into stretched cluster topologies by deploying external Ceph monitors that Rook does not manage directly.
+
+**Data Services** — [Chorus](./chorus.md) provides zero-downtime data replication and migration between object storage systems (S3 and Swift). [Liquid-Ceph](./liquid-ceph.md) enables dynamic, on-demand storage allocation across the cluster.
+
+## Components
+
+| Component | Layer | Role |
+|-----------|-------|------|
+| [Ceph](./ceph.md) | Foundation | Distributed storage engine — block (RBD), file (CephFS), object (RGW) |
+| [Rook](./rook.md) | Operations | Kubernetes operator for Ceph lifecycle management |
+| [Arbiter](./arbiter.md) | Operations | External Ceph monitors for quorum in stretched clusters |
+| [Chorus](./chorus.md) | Data Services | Zero-downtime object storage replication and migration |
+| [Liquid-Ceph](./liquid-ceph.md) | Data Services | Dynamic storage allocation across the Ceph cluster |
+| [Observability & Audit](./observability/) | Observability | Metrics, dashboards, alerting, and audit — Prometheus, Perses, Prysm |
+
+## Storage Interfaces
+
+Ceph exposes three storage interfaces that CobaltCore services consume:
+
+- **RBD (RADOS Block Device)** — thin-provisioned, resizable block volumes used by virtual machines and databases. Striped across OSDs for parallel I/O and backed by RADOS snapshots and replication.
+- **CephFS** — POSIX-compliant distributed filesystem. Metadata is managed by a dedicated MDS cluster; data is striped across OSDs. Supports snapshots, quotas, and multiple active MDS daemons for horizontal metadata scaling.
+- **RGW (RADOS Gateway)** — S3 and Swift-compatible object storage gateway. Supports multi-tenancy, versioning, lifecycle policies, server-side encryption, and multi-site active-active replication.
+
+## Data Flow
+
+```text
+Applications / VMs
+        │
+┌───────┴────────────────────┐
+│  RBD  │  CephFS  │   RGW   │  ← Ceph interfaces
+└───────┴────────────────────┘
+        │
+    RADOS (Reliable Autonomic Distributed Object Store)
+        │
+   OSDs across cluster nodes
+        │
+   ┌────┴─────┐
+   │  Rook    │  ← manages daemon lifecycle via Kubernetes CRDs
+   └──────────┘
+        │
+   ┌────┴──────┐   ┌─────────┐   ┌────────────┐
+   │  Arbiter  │   │  Chorus │   │ Liquid-Ceph│
+   └───────────┘   └─────────┘   └────────────┘
+   (quorum)        (replication)  (allocation)
+        │
+   ┌────┴──────────────────────────┐
+   │  Observability & Audit        │
+   │  Prometheus · Perses · Prysm  │
+   └───────────────────────────────┘
+```
+
+## High Availability
+
+Ceph achieves high availability through monitor quorum (typically 3 or 5 monitors), OSD replication or erasure coding, and MDS standby daemons. In stretched deployments that span two sites, [Arbiter](./arbiter.md) deploys an external tiebreaker monitor at a third site so that a majority of monitors, and therefore quorum, survives even if one full site goes offline.
+
+## See Also
+
+- [Observability & Audit](./observability/) — Prometheus metrics, Perses dashboards, and Prysm CLI for the storage stack
+- [Ceph upstream architecture docs](https://docs.ceph.com/en/latest/architecture/)
+- [Rook documentation](https://rook.io/docs/rook/latest-release/Getting-Started/intro/)
diff --git a/docs/architecture/cloud-storage/liquid-ceph.md b/docs/architecture/cloud-storage/liquid-ceph.md
@@ -0,0 +1,16 @@
+---
+title: Liquid-Ceph
+---
+
+# Liquid-Ceph
+
+Liquid-Ceph enables dynamic, on-demand storage allocation across the CobaltCore Ceph cluster. It abstracts the complexity of pool and quota management, allowing workloads to claim storage capacity fluidly without manual pre-provisioning steps.
+
+::: info
+Detailed documentation for Liquid-Ceph is in progress. This page will be updated as the component matures.
+:::
+
+## See Also
+
+- [Ceph](./ceph.md) — the underlying distributed storage engine
+- [Rook](./rook.md) — Kubernetes operator managing Ceph lifecycle
diff --git a/docs/architecture/cloud-storage/observability/index.md b/docs/architecture/cloud-storage/observability/index.md
@@ -0,0 +1,37 @@
+---
+title: Observability & Audit
+---
+
+# Observability & Audit Overview
+
+CobaltCore monitors the cloud storage stack through a combination of Prometheus-based metrics collection, Perses dashboards, and the Prysm observability CLI. Together they provide real-time visibility into Ceph cluster health, OSD performance, RGW throughput, storage capacity trends, and audit compliance.
+
+## Stack
+
+| Component | Role |
+|-----------|------|
+| [Prometheus](./prometheus.md) | Scrapes and stores time-series metrics from Ceph, Rook, and RGW exporters |
+| [Perses](./perses.md) | Dashboard platform for visualizing storage metrics |
+| [Prysm](./prysm.md) | CLI-based observability tool for Ceph clusters and RGW — real-time monitoring, SMART disk health, log compliance |
+
+## Key Metrics
+
+The observability stack covers the following signal categories:
+
+- **Cluster health** — overall Ceph health status, OSD up/in counts, monitor quorum state
+- **Capacity** — raw and usable capacity, per-pool usage, growth rate projections
+- **Performance** — OSD read/write latency, IOPS, throughput per interface (RBD, CephFS, RGW)
+- **RGW** — request rates, error rates, bandwidth per bucket and user
+- **Replication** — Chorus replication lag, sync success/failure rates
+- **Availability** — Arbiter monitor reachability, MDS active/standby state
+- **Audit** — log compliance analysis and access audit via Prysm consumers
+
+## Alerting
+
+Alerts are defined as Prometheus rules and surfaced through the CobaltCore alerting pipeline. Critical conditions include OSD usage above 85% (near-full), a degraded cluster state, loss of monitor quorum, and spikes in the RGW error rate.
+
+## See Also
+
+- [Prometheus](./prometheus.md)
+- [Perses](./perses.md)
+- [Prysm](./prysm.md)
diff --git a/docs/architecture/cloud-storage/observability/perses.md b/docs/architecture/cloud-storage/observability/perses.md
@@ -0,0 +1,30 @@
+---
+title: Perses
+---
+
+# Perses
+
+Perses is the dashboard platform used in CobaltCore to visualize cloud storage metrics collected by [Prometheus](./prometheus.md). It provides pre-built dashboards for Ceph cluster health, OSD performance, RGW traffic, and capacity planning.
+
+## Dashboards
+
+| Dashboard | Purpose |
+|-----------|---------|
+| Ceph Cluster Overview | Health status, OSD counts, monitor quorum, capacity summary |
+| OSD Performance | Per-OSD read/write latency, IOPS, throughput |
+| Pool Usage | Capacity and object counts per Ceph pool |
+| RGW Traffic | Request rate, error rate, bandwidth per bucket and user |
+| Replication Status | Chorus sync lag and success/failure rates |
+
+## Dashboard-as-Code
+
+Dashboards are managed as code using the Perses CUE SDK and deployed via CI. This ensures dashboards are version-controlled alongside the rest of the CobaltCore configuration.
+
+::: info
+Dashboard definitions and deployment configuration are in progress.
+:::
+
+## See Also
+
+- [Prometheus](./prometheus.md) — metrics source for all dashboards
+- [Observability Overview](./index.md)
diff --git a/docs/architecture/cloud-storage/observability/prometheus.md b/docs/architecture/cloud-storage/observability/prometheus.md
@@ -0,0 +1,37 @@
+---
+title: Prometheus
+---
+
+# Prometheus
+
+Prometheus collects and stores time-series metrics from the CobaltCore cloud storage stack. It scrapes exporters provided by Ceph, Rook, and the RADOS Gateway, making storage metrics available for alerting and dashboard queries.
+
+## Exporters
+
+| Exporter | Source | Metrics |
+|----------|--------|---------|
+| `ceph-exporter` | Ceph daemons | OSD stats, pool usage, cluster health, latency histograms |
+| `rook-ceph-mgr` | Rook Ceph manager | Operator status, daemon lifecycle events |
+| `radosgw-exporter` | RGW | Request rates, error rates, per-user and per-bucket bandwidth |
+
+## Retention and Storage
+
+Metrics are retained according to the cluster-wide Prometheus retention policy. Long-term storage is handled by the remote-write pipeline configured in the CobaltCore monitoring stack.
+
+## Alert Rules
+
+Storage-specific alert rules are maintained alongside the other CobaltCore alerting rules. Key rules include:
+
+- `CephHealthWarning` / `CephHealthError` — cluster health degradation
+- `CephOSDNearFull` — OSD usage exceeding 85%
+- `CephMonQuorumLost` — loss of monitor quorum
+- `RGWHighErrorRate` — elevated 5xx rate on the gateway
+
+::: info
+Detailed rule definitions and Prometheus configuration are in progress.
+:::
+
+## See Also
+
+- [Perses](./perses.md) — dashboard platform consuming these metrics
+- [Observability Overview](./index.md)
diff --git a/docs/architecture/prysm.md → ...ture/cloud-storage/observability/prysm.md b/docs/architecture/prysm.md → ...ture/cloud-storage/observability/prysm.md
@@ -5,7 +5,7 @@ title: Prysm
 # Prysm 
 
 Prysm is a comprehensive observability CLI tool developed by CobaltCore for
-monitoring [Ceph](./ceph.md) storage clusters and RADOS Gateway (RGW)
+monitoring [Ceph](../ceph.md) storage clusters and RADOS Gateway (RGW)
 deployments. Prysm provides a multi-layered architecture designed to deliver
 real-time monitoring, data collection, and analysis across Ceph environments.
 

diff --git a/docs/architecture/rook.md → docs/architecture/cloud-storage/rook.md b/docs/architecture/rook.md → docs/architecture/cloud-storage/rook.md
@@ -28,7 +28,7 @@ mechanisms.
 Rook continuously monitors cluster health and automatically responds to
 failures by restarting failed daemons, replacing unhealthy OSDs, and
 maintaining desired state as defined in the cluster specifications. It
-integrates with [Kubernetes](./cluster.md) monitoring and logging systems,
+integrates with [Kubernetes](../cluster.md) monitoring and logging systems,
 providing visibility into storage operations alongside application workloads.
 
 ## See Also 

diff --git a/docs/architecture/index.md b/docs/architecture/index.md
@@ -13,4 +13,4 @@ CobaltCore is built on top of OpenStack and IronCore, leveraging their capabilit
 - **Greenhouse**: The monitoring and management tool that provides insights into the health and performance of the CobaltCore environment.
 - [**HA Service**](./cluster#ha-service): The high availability service that ensures critical workloads remain operational even in the event of failures.
 - [**Cortex**](./cortex): Smart initial placement and scheduling service for compute, storage, and network in cloud-native cloud environments.
-- [**Ceph**](./ceph): An all-in-one storage system that provides object, block, and file storage and delivers extraordinary scalability. 
+- [**Cloud Storage**](./cloud-storage/): Ceph-based distributed storage stack including Rook, Chorus, Arbiter, and Prysm for lifecycle management, replication, quorum, and observability.