Commit 9201f7b

docs: expand cloud-storage section with Overview, Liquid-Ceph, and Observability
Enrich the cloud-storage sidebar with:

- **Overview** (index.md): rewritten with full architecture description, component table, storage interface summary (RBD/CephFS/RGW), data flow diagram, and HA explanation. Replaces the previous thin component list.
- **Liquid-Ceph** (liquid-ceph.md): new page introducing the dynamic storage allocation component with a placeholder for upcoming detail.
- **Observability/** (new subfolder): dedicated subsection covering the monitoring stack for cloud storage.
  - index.md: overview of key metric categories and alerting strategy
  - prometheus.md: exporter sources, retention, and alert rule catalogue
  - perses.md: dashboard inventory and dashboard-as-code approach

Sidebar will now render:
Cloud Storage > Overview, Ceph, Rook, Prysm, Chorus, Arbiter, Liquid-Ceph, Observability > (Overview, Prometheus, Perses)
1 parent 9b53667 commit 9201f7b

5 files changed

Lines changed: 187 additions & 10 deletions

Lines changed: 69 additions & 10 deletions
@@ -1,15 +1,74 @@
11
---
2-
title: Cloud Storage
2+
title: Overview
33
---
44

5-
# Cloud Storage
5+
# Cloud Storage Overview
66

7-
CobaltCore's cloud storage layer is built on [Ceph](./ceph.md), an open-source, distributed storage system that delivers object, block, and file storage in a single unified platform. The surrounding components manage its lifecycle, replication, observability, and high-availability configuration.
7+
CobaltCore's cloud storage layer is built on [Ceph](./ceph.md), a distributed storage system that delivers object, block, and file storage in a single unified platform. The surrounding components handle lifecycle automation, data replication, high-availability quorum, observability, and dynamic storage allocation, each with a focused responsibility.
88

9-
| Component | Role |
10-
|-----------|------|
11-
| [Ceph](./ceph.md) | Distributed storage engine |
12-
| [Rook](./rook.md) | Kubernetes operator that automates Ceph deployment and management |
13-
| [Chorus](./chorus.md) | Data replication across object storage systems |
14-
| [Arbiter](./arbiter.md) | External arbiter monitors for Ceph quorum in stretched clusters |
15-
| [Prysm](./prysm.md) | Observability CLI for Ceph clusters and RGW deployments |
9+
## Architecture
10+
11+
The storage stack is organized into three layers:
12+
13+
**Foundation:** Ceph provides the core distributed storage engine. All other components either operate it, extend it, or observe it.
14+
15+
**Operations:** [Rook](./rook.md) runs as a Kubernetes operator and manages the full lifecycle of Ceph daemons (monitors, managers, OSDs, MDS, RGW) as containerized workloads. [Arbiter](./arbiter.md) extends quorum into stretched cluster topologies by deploying external Ceph monitors that Rook does not manage directly.
16+
17+
**Data Services:** [Chorus](./chorus.md) provides zero-downtime data replication and migration between object storage systems (S3 and Swift). [Liquid-Ceph](./liquid-ceph.md) enables dynamic, on-demand storage allocation across the cluster. [Prysm](./prysm.md) delivers a CLI-based observability layer over Ceph clusters and RGW deployments.
18+
19+
## Components
20+
21+
| Component | Layer | Role |
22+
|-----------|-------|------|
23+
| [Ceph](./ceph.md) | Foundation | Distributed storage engine — block (RBD), file (CephFS), object (RGW) |
24+
| [Rook](./rook.md) | Operations | Kubernetes operator for Ceph lifecycle management |
25+
| [Arbiter](./arbiter.md) | Operations | External Ceph monitors for quorum in stretched clusters |
26+
| [Chorus](./chorus.md) | Data Services | Zero-downtime object storage replication and migration |
27+
| [Liquid-Ceph](./liquid-ceph.md) | Data Services | Dynamic storage allocation across the Ceph cluster |
28+
| [Prysm](./prysm.md) | Data Services | Observability CLI for Ceph clusters and RGW |
29+
| [Observability](./observability/) | Observability | Metrics, dashboards, and alerting for the storage stack |
30+
31+
## Storage Interfaces
32+
33+
Ceph exposes three storage interfaces that CobaltCore services consume:
34+
35+
- **RBD (RADOS Block Device)** — thin-provisioned, resizable block volumes used by virtual machines and databases. Striped across OSDs for parallel I/O and backed by RADOS snapshots and replication.
36+
- **CephFS** — POSIX-compliant distributed filesystem. Metadata is managed by a dedicated MDS cluster; data is striped across OSDs. Supports snapshots, quotas, and multiple active MDS daemons for horizontal metadata scaling.
37+
- **RGW (RADOS Gateway)** — S3 and Swift-compatible object storage gateway. Supports multi-tenancy, versioning, lifecycle policies, server-side encryption, and multi-site active-active replication.
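
To make the RBD striping idea concrete, the following is a minimal sketch of how a byte offset in a block image maps onto a backing object. It assumes the simplest layout (one stripe per object) and a 4 MiB object size, which is Ceph's default for RBD images; the `locate` helper is a hypothetical name used only for illustration and is not part of any Ceph or CobaltCore API.

```python
from typing import Tuple

# Assumption: 4 MiB objects, the Ceph default for RBD images.
OBJECT_SIZE = 4 * 1024 * 1024


def locate(offset: int, object_size: int = OBJECT_SIZE) -> Tuple[int, int]:
    """Map an image byte offset to (object index, offset inside that object)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


# A write at the 10 MiB mark falls into the third object (index 2).
print(locate(10 * 1024 * 1024))
```

Because consecutive objects are placed on different OSDs, large sequential reads and writes touch several OSDs in parallel.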
38+
39+
## Data Flow
40+
41+
```
42+
Applications / VMs
43+
44+
┌───────┴────────────────────┐
45+
│ RBD │ CephFS │ RGW │ ← Ceph interfaces
46+
└───────┴────────────────────┘
47+
48+
RADOS (Reliable Autonomic Distributed Object Store)
49+
50+
OSDs across cluster nodes
51+
52+
┌────┴─────┐
53+
│ Rook │ ← manages daemon lifecycle via Kubernetes CRDs
54+
└──────────┘
55+
56+
┌────┴──────┐ ┌─────────┐ ┌────────────┐
57+
│ Arbiter │ │ Chorus │ │ Liquid-Ceph│
58+
└───────────┘ └─────────┘ └────────────┘
59+
(quorum) (replication) (allocation)
60+
61+
┌────┴──────┐
62+
│ Prysm │ ← observability CLI
63+
└───────────┘
64+
```
65+
66+
## High Availability
67+
68+
Ceph achieves HA through monitor quorum (typically 3 or 5 monitors), OSD replication or erasure coding, and MDS standby daemons. In stretched deployments that span two sites, [Arbiter](./arbiter.md) deploys a third monitor at a tiebreaker site so that quorum is maintained even if one full site goes offline.
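
The role of the tiebreaker can be illustrated with the majority rule that monitor quorum relies on. This is a sketch under stated assumptions: the `has_quorum` helper and the one-monitor-per-site layout are hypothetical and chosen only to mirror the description above; real deployments may run more monitors per site.

```python
def has_quorum(total_monitors: int, reachable_monitors: int) -> bool:
    """Return True when the reachable monitors form a strict majority."""
    return reachable_monitors > total_monitors // 2


# Hypothetical two-site layout, one monitor per site, one site offline.
without_tiebreaker = has_quorum(total_monitors=2, reachable_monitors=1)

# Same layout plus a third monitor at a tiebreaker site, one site offline.
with_tiebreaker = has_quorum(total_monitors=3, reachable_monitors=2)

print(without_tiebreaker, with_tiebreaker)
```

With only two monitors, losing either site leaves exactly half of them reachable, which is not a majority; the third monitor is what lets the surviving site keep quorum.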
69+
70+
## See Also
71+
72+
- [Observability](./observability/) — Prometheus metrics and Perses dashboards for the storage stack
73+
- [Ceph upstream architecture docs](https://docs.ceph.com/en/latest/architecture/)
74+
- [Rook documentation](https://rook.io/docs/rook/latest-release/Getting-Started/intro/)
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
---
2+
title: Liquid-Ceph
3+
---
4+
5+
# Liquid-Ceph
6+
7+
Liquid-Ceph enables dynamic, on-demand storage allocation across the CobaltCore Ceph cluster. It abstracts the complexity of pool and quota management, allowing workloads to claim storage capacity fluidly without manual pre-provisioning steps.
8+
9+
::: info
10+
Detailed documentation for Liquid-Ceph is in progress. This page will be updated as the component matures.
11+
:::
12+
13+
## See Also
14+
15+
- [Ceph](./ceph.md) — the underlying distributed storage engine
16+
- [Rook](./rook.md) — Kubernetes operator managing Ceph lifecycle
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
---
2+
title: Overview
3+
---
4+
5+
# Observability Overview
6+
7+
CobaltCore monitors the cloud storage stack through a combination of Prometheus-based metrics collection and Perses dashboards. Together they provide real-time visibility into Ceph cluster health, OSD performance, RGW throughput, and storage capacity trends.
8+
9+
## Stack
10+
11+
| Component | Role |
12+
|-----------|------|
13+
| [Prometheus](./prometheus.md) | Scrapes and stores time-series metrics from Ceph, Rook, and RGW exporters |
14+
| [Perses](./perses.md) | Dashboard platform for visualizing storage metrics |
15+
16+
## Key Metrics
17+
18+
The following signal categories are covered by the observability stack:
19+
20+
- **Cluster health** — overall Ceph health status, OSD up/in counts, monitor quorum state
21+
- **Capacity** — raw and usable capacity, per-pool usage, growth rate projections
22+
- **Performance** — OSD read/write latency, IOPS, throughput per interface (RBD, CephFS, RGW)
23+
- **RGW** — request rates, error rates, bandwidth per bucket and user
24+
- **Replication** — Chorus replication lag, sync success/failure rates
25+
- **Availability** — Arbiter monitor reachability, MDS active/standby state
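
As an illustration of the capacity signals listed above, the sketch below derives usable capacity from raw capacity and projects when usage would cross a threshold. It is not the CobaltCore implementation: the function names are hypothetical, growth is assumed to be linear, and the 0.85 default simply reuses the near-full figure quoted in this documentation.

```python
from typing import Optional


def usable_capacity_replicated(raw_bytes: float, replicas: int) -> float:
    """Usable capacity when every object is stored `replicas` times."""
    return raw_bytes / replicas


def usable_capacity_erasure_coded(raw_bytes: float, k: int, m: int) -> float:
    """Usable capacity for a k+m erasure-coded pool (k data, m coding chunks)."""
    return raw_bytes * k / (k + m)


def days_until_threshold(used: float, capacity: float,
                         growth_per_day: float,
                         threshold: float = 0.85) -> Optional[float]:
    """Linear projection of the days until usage reaches threshold * capacity.

    Returns 0.0 when the threshold has already been reached, and None when
    usage is below the threshold but flat or shrinking.
    """
    limit = threshold * capacity
    if used >= limit:
        return 0.0
    if growth_per_day <= 0:
        return None
    return (limit - used) / growth_per_day


# Hypothetical pool: 100 TiB usable, 60 TiB used, growing 0.5 TiB per day.
print(days_until_threshold(used=60.0, capacity=100.0, growth_per_day=0.5))
```

A linear fit is the simplest possible model; a real dashboard would more likely derive the growth rate from a time-series query over recent history.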
26+
27+
## Alerting
28+
29+
Alerts are defined as Prometheus rules and surfaced through the CobaltCore alerting pipeline. Critical thresholds include OSD near-full (85%), cluster degraded state, monitor quorum loss, and RGW error rate spikes.
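
The near-full check amounts to a per-OSD ratio comparison. In practice it is expressed as a Prometheus rule rather than application code; the Python sketch below only illustrates the logic, and the `near_full_osds` helper, the OSD ids, and the byte counts are all hypothetical.

```python
from typing import Dict, List, Tuple

NEAR_FULL_RATIO = 0.85  # the 85% threshold quoted in the text


def near_full_osds(usage: Dict[str, Tuple[int, int]],
                   ratio: float = NEAR_FULL_RATIO) -> List[str]:
    """Return the ids of OSDs whose used/total ratio exceeds `ratio`.

    `usage` maps an OSD id to a (used_bytes, total_bytes) pair.
    OSDs reporting a total of zero are skipped.
    """
    return sorted(
        osd_id
        for osd_id, (used, total) in usage.items()
        if total > 0 and used / total > ratio
    )


# Hypothetical sample: only osd.1 is above the threshold.
sample = {"osd.0": (50, 100), "osd.1": (90, 100), "osd.2": (84, 100)}
print(near_full_osds(sample))
```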
30+
31+
## See Also
32+
33+
- [Prometheus](./prometheus.md)
34+
- [Perses](./perses.md)
35+
- [Prysm](../prysm.md) — CLI-based observability for Ceph and RGW
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
---
2+
title: Perses
3+
---
4+
5+
# Perses
6+
7+
Perses is the dashboard platform used in CobaltCore to visualize cloud storage metrics collected by [Prometheus](./prometheus.md). It provides pre-built dashboards for Ceph cluster health, OSD performance, RGW traffic, and capacity planning.
8+
9+
## Dashboards
10+
11+
| Dashboard | Purpose |
12+
|-----------|---------|
13+
| Ceph Cluster Overview | Health status, OSD counts, monitor quorum, capacity summary |
14+
| OSD Performance | Per-OSD read/write latency, IOPS, throughput |
15+
| Pool Usage | Capacity and object counts per Ceph pool |
16+
| RGW Traffic | Request rate, error rate, bandwidth per bucket and user |
17+
| Replication Status | Chorus sync lag and success/failure rates |
18+
19+
## Dashboard-as-Code
20+
21+
Dashboards are managed as code using the Perses CUE SDK and deployed via CI. This ensures dashboards are version-controlled alongside the rest of the CobaltCore configuration.
22+
23+
::: info
24+
Dashboard definitions and deployment configuration are in progress.
25+
:::
26+
27+
## See Also
28+
29+
- [Prometheus](./prometheus.md) — metrics source for all dashboards
30+
- [Observability Overview](./index.md)
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
---
2+
title: Prometheus
3+
---
4+
5+
# Prometheus
6+
7+
Prometheus collects and stores time-series metrics from the CobaltCore cloud storage stack. It scrapes exporters provided by Ceph, Rook, and the RADOS Gateway, making storage metrics available for alerting and dashboard queries.
8+
9+
## Exporters
10+
11+
| Exporter | Source | Metrics |
12+
|----------|--------|---------|
13+
| `ceph-exporter` | Ceph daemons | OSD stats, pool usage, cluster health, latency histograms |
14+
| `rook-ceph-mgr` | Rook Ceph manager | Operator status, daemon lifecycle events |
15+
| `radosgw-exporter` | RGW | Request rates, error rates, per-user and per-bucket bandwidth |
16+
17+
## Retention and Storage
18+
19+
Metrics are retained according to the cluster-wide Prometheus retention policy. Long-term storage is handled by the remote-write pipeline configured in the CobaltCore monitoring stack.
20+
21+
## Alert Rules
22+
23+
Storage-specific alert rules are maintained alongside the other CobaltCore alerting rules. Key rules include:
24+
25+
- `CephHealthWarning` / `CephHealthError` — cluster health degradation
26+
- `CephOSDNearFull` — OSD usage exceeding 85%
27+
- `CephMonQuorumLost` — loss of monitor quorum
28+
- `RGWHighErrorRate` — elevated 5xx rate on the gateway
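
To show what an "elevated 5xx rate" condition computes, here is a small sketch of the underlying ratio. The rule definitions themselves are not yet documented, so everything here is an assumption: the `server_error_rate` helper, the sample status counts, and in particular the 5% threshold are hypothetical values chosen for illustration.

```python
from typing import Dict

# Assumption: illustrative threshold only; the real rule's value is not documented.
HYPOTHETICAL_ERROR_THRESHOLD = 0.05


def server_error_rate(status_counts: Dict[int, int]) -> float:
    """Fraction of requests that returned an HTTP 5xx status code."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(count for status, count in status_counts.items()
                 if 500 <= status <= 599)
    return errors / total


# Hypothetical request counts per status code over one evaluation window.
window = {200: 900, 404: 20, 500: 50, 503: 30}
rate = server_error_rate(window)
print(rate > HYPOTHETICAL_ERROR_THRESHOLD)
```

Client errors such as 404 count toward the total but not toward the error numerator, so a burst of missing-key lookups does not trip a rule aimed at gateway failures.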
29+
30+
::: info
31+
Detailed rule definitions and Prometheus configuration are in progress.
32+
:::
33+
34+
## See Also
35+
36+
- [Perses](./perses.md) — dashboard platform consuming these metrics
37+
- [Observability Overview](./index.md)
