diff --git a/docs/architecture/arbiter.md b/docs/architecture/cloud-storage/arbiter.md
similarity index 100%
rename from docs/architecture/arbiter.md
rename to docs/architecture/cloud-storage/arbiter.md
diff --git a/docs/architecture/ceph.md b/docs/architecture/cloud-storage/ceph.md
similarity index 100%
rename from docs/architecture/ceph.md
rename to docs/architecture/cloud-storage/ceph.md
diff --git a/docs/architecture/chorus.md b/docs/architecture/cloud-storage/chorus.md
similarity index 100%
rename from docs/architecture/chorus.md
rename to docs/architecture/cloud-storage/chorus.md
diff --git a/docs/architecture/cloud-storage/index.md b/docs/architecture/cloud-storage/index.md
new file mode 100644
index 0000000..896c7c3
--- /dev/null
+++ b/docs/architecture/cloud-storage/index.md
@@ -0,0 +1,74 @@
+---
+title: Cloud Storage
+---
+
+# Cloud Storage
+
+CobaltCore's cloud storage layer is built on [Ceph](./ceph.md), a distributed storage system that delivers object, block, and file storage in a single unified platform. The surrounding components handle lifecycle automation, data replication, high-availability quorum, observability, and liquid storage allocation — each with a focused responsibility.
+
+## Architecture
+
+The storage stack is organized into three layers:
+
+**Foundation** — Ceph provides the core distributed storage engine. All other components either operate it, extend it, or observe it.
+
+**Operations** — [Rook](./rook.md) runs as a Kubernetes operator and manages the full lifecycle of Ceph daemons (monitors, managers, OSDs, MDS, RGW) as containerized workloads. [Arbiter](./arbiter.md) extends quorum into stretched cluster topologies by deploying external Ceph monitors that Rook does not manage directly.
+
+**Data Services** — [Chorus](./chorus.md) provides zero-downtime data replication and migration between object storage systems (S3 and Swift). [Liquid-Ceph](./liquid-ceph.md) enables dynamic, on-demand storage allocation across the cluster.
+
+## Components
+
+| Component | Layer | Role |
+|-----------|-------|------|
+| [Ceph](./ceph.md) | Foundation | Distributed storage engine — block (RBD), file (CephFS), object (RGW) |
+| [Rook](./rook.md) | Operations | Kubernetes operator for Ceph lifecycle management |
+| [Arbiter](./arbiter.md) | Operations | External Ceph monitors for quorum in stretched clusters |
+| [Chorus](./chorus.md) | Data Services | Zero-downtime object storage replication and migration |
+| [Liquid-Ceph](./liquid-ceph.md) | Data Services | Dynamic storage allocation across the Ceph cluster |
+| [Observability & Audit](./observability/) | Observability | Metrics, dashboards, alerting, and audit — Prometheus, Perses, Prysm |
+
+## Storage Interfaces
+
+Ceph exposes three storage interfaces that CobaltCore services consume:
+
+- **RBD (RADOS Block Device)** — thin-provisioned, resizable block volumes used by virtual machines and databases. Striped across OSDs for parallel I/O and backed by RADOS snapshots and replication.
+- **CephFS** — POSIX-compliant distributed filesystem. Metadata is managed by a dedicated MDS cluster; data is striped across OSDs. Supports snapshots, quotas, and multiple active MDS daemons for horizontal metadata scaling.
+- **RGW (RADOS Gateway)** — S3 and Swift-compatible object storage gateway. Supports multi-tenancy, versioning, lifecycle policies, server-side encryption, and multi-site active-active replication.
+
+## Data Flow
+
+```text
+Applications / VMs
+        │
+┌───────┴────────────────────┐
+│  RBD  │  CephFS  │   RGW   │  ← Ceph interfaces
+└───────┴────────────────────┘
+        │
+   RADOS (Reliable Autonomic Distributed Object Store)
+        │
+   OSDs across cluster nodes
+        │
+   ┌────┴─────┐
+   │   Rook   │  ← manages daemon lifecycle via Kubernetes CRDs
+   └──────────┘
+        │
+   ┌────┴──────┐  ┌─────────┐  ┌────────────┐
+   │  Arbiter  │  │ Chorus  │  │ Liquid-Ceph│
+   └───────────┘  └─────────┘  └────────────┘
+     (quorum)    (replication)  (allocation)
+        │
+   ┌────┴──────────────────────────┐
+   │     Observability & Audit     │
+   │  Prometheus · Perses · Prysm  │
+   └───────────────────────────────┘
+```
+
+## High Availability
+
+Ceph achieves HA through monitor quorum (typically 3 or 5 monitors), OSD replication or erasure coding, and MDS standby daemons. In stretched deployments that span two sites, [Arbiter](./arbiter.md) deploys a third monitor at a tiebreaker site so that quorum is maintained even if one full site goes offline.
+
+## See Also
+
+- [Observability & Audit](./observability/) — Prometheus metrics, Perses dashboards, and Prysm CLI for the storage stack
+- [Ceph upstream architecture docs](https://docs.ceph.com/en/latest/architecture/)
+- [Rook documentation](https://rook.io/docs/rook/latest-release/Getting-Started/intro/)
diff --git a/docs/architecture/cloud-storage/liquid-ceph.md b/docs/architecture/cloud-storage/liquid-ceph.md
new file mode 100644
index 0000000..7b1943f
--- /dev/null
+++ b/docs/architecture/cloud-storage/liquid-ceph.md
@@ -0,0 +1,16 @@
+---
+title: Liquid-Ceph
+---
+
+# Liquid-Ceph
+
+Liquid-Ceph enables dynamic, on-demand storage allocation across the CobaltCore Ceph cluster. It abstracts the complexity of pool and quota management, allowing workloads to claim storage capacity fluidly without manual pre-provisioning steps.
+
+::: info
+Detailed documentation for Liquid-Ceph is in progress. This page will be updated as the component matures.
+:::
+
+## See Also
+
+- [Ceph](./ceph.md) — the underlying distributed storage engine
+- [Rook](./rook.md) — Kubernetes operator managing Ceph lifecycle
diff --git a/docs/architecture/cloud-storage/observability/index.md b/docs/architecture/cloud-storage/observability/index.md
new file mode 100644
index 0000000..b01cb1a
--- /dev/null
+++ b/docs/architecture/cloud-storage/observability/index.md
@@ -0,0 +1,37 @@
+---
+title: Observability & Audit
+---
+
+# Observability & Audit Overview
+
+CobaltCore monitors the cloud storage stack through a combination of Prometheus-based metrics collection, Perses dashboards, and the Prysm observability CLI. Together they provide real-time visibility into Ceph cluster health, OSD performance, RGW throughput, storage capacity trends, and audit compliance.
+
+## Stack
+
+| Component | Role |
+|-----------|------|
+| [Prometheus](./prometheus.md) | Scrapes and stores time-series metrics from Ceph, Rook, and RGW exporters |
+| [Perses](./perses.md) | Dashboard platform for visualizing storage metrics (alert rules are defined as Prometheus rules) |
+| [Prysm](./prysm.md) | CLI-based observability tool for Ceph clusters and RGW — real-time monitoring, SMART disk health, log compliance |
+
+## Key Metrics
+
+The following signal categories are covered by the observability stack:
+
+- **Cluster health** — overall Ceph health status, OSD up/in counts, monitor quorum state
+- **Capacity** — raw and usable capacity, per-pool usage, growth rate projections
+- **Performance** — OSD read/write latency, IOPS, throughput per interface (RBD, CephFS, RGW)
+- **RGW** — request rates, error rates, bandwidth per bucket and user
+- **Replication** — Chorus replication lag, sync success/failure rates
+- **Availability** — Arbiter monitor reachability, MDS active/standby state
+- **Audit** — log compliance analysis and access audit via Prysm consumers
+
+## Alerting
+
+Alerts are defined as Prometheus rules and surfaced through the CobaltCore alerting pipeline. Critical thresholds include OSD near-full (85%), cluster degraded state, monitor quorum loss, and RGW error rate spikes.
+
+## See Also
+
+- [Prometheus](./prometheus.md)
+- [Perses](./perses.md)
+- [Prysm](./prysm.md)
diff --git a/docs/architecture/cloud-storage/observability/perses.md b/docs/architecture/cloud-storage/observability/perses.md
new file mode 100644
index 0000000..36d1f5c
--- /dev/null
+++ b/docs/architecture/cloud-storage/observability/perses.md
@@ -0,0 +1,30 @@
+---
+title: Perses
+---
+
+# Perses
+
+Perses is the dashboard platform used in CobaltCore to visualize cloud storage metrics collected by [Prometheus](./prometheus.md). It provides pre-built dashboards for Ceph cluster health, OSD performance, RGW traffic, and capacity planning.
+
+## Dashboards
+
+| Dashboard | Purpose |
+|-----------|---------|
+| Ceph Cluster Overview | Health status, OSD counts, monitor quorum, capacity summary |
+| OSD Performance | Per-OSD read/write latency, IOPS, throughput |
+| Pool Usage | Capacity and object counts per Ceph pool |
+| RGW Traffic | Request rate, error rate, bandwidth per bucket and user |
+| Replication Status | Chorus sync lag and success/failure rates |
+
+## Dashboard-as-Code
+
+Dashboards are managed as code using the Perses CUE SDK and deployed via CI. This ensures dashboards are version-controlled alongside the rest of the CobaltCore configuration.
+
+::: info
+Dashboard definitions and deployment configuration are in progress.
+:::
+
+## See Also
+
+- [Prometheus](./prometheus.md) — metrics source for all dashboards
+- [Observability Overview](./index.md)
diff --git a/docs/architecture/cloud-storage/observability/prometheus.md b/docs/architecture/cloud-storage/observability/prometheus.md
new file mode 100644
index 0000000..c85bb54
--- /dev/null
+++ b/docs/architecture/cloud-storage/observability/prometheus.md
@@ -0,0 +1,37 @@
+---
+title: Prometheus
+---
+
+# Prometheus
+
+Prometheus collects and stores time-series metrics from the CobaltCore cloud storage stack. It scrapes exporters provided by Ceph, Rook, and the RADOS Gateway, making storage metrics available for alerting and dashboard queries.
+
+## Exporters
+
+| Exporter | Source | Metrics |
+|----------|--------|---------|
+| `ceph-exporter` | Ceph daemons | OSD stats, pool usage, cluster health, latency histograms |
+| `rook-ceph-mgr` | Rook Ceph manager | Operator status, daemon lifecycle events |
+| `radosgw-exporter` | RGW | Request rates, error rates, per-user and per-bucket bandwidth |
+
+## Retention and Storage
+
+Metrics are retained according to the cluster-wide Prometheus retention policy. Long-term storage is handled by the remote-write pipeline configured in the CobaltCore monitoring stack.
+
+## Alert Rules
+
+Storage-specific alert rules are maintained alongside the other CobaltCore alerting rules. Key rules include:
+
+- `CephHealthWarning` / `CephHealthError` — cluster health degradation
+- `CephOSDNearFull` — OSD usage exceeding 85%
+- `CephMonQuorumLost` — loss of monitor quorum
+- `RGWHighErrorRate` — elevated 5xx rate on the gateway
+
+::: info
+Detailed rule definitions and Prometheus configuration are in progress.
+:::
+
+## See Also
+
+- [Perses](./perses.md) — dashboard platform consuming these metrics
+- [Observability Overview](./index.md)
diff --git a/docs/architecture/prysm.md b/docs/architecture/cloud-storage/observability/prysm.md
similarity index 96%
rename from docs/architecture/prysm.md
rename to docs/architecture/cloud-storage/observability/prysm.md
index 2b9553b..47b02b6 100644
--- a/docs/architecture/prysm.md
+++ b/docs/architecture/cloud-storage/observability/prysm.md
@@ -5,7 +5,7 @@ title: Prysm
 # Prysm
 
 Prysm is a comprehensive observability CLI tool developed by CobaltCore for
-monitoring [Ceph](./ceph.md) storage clusters and RADOS Gateway (RGW)
+monitoring [Ceph](../ceph.md) storage clusters and RADOS Gateway (RGW)
 deployments. Prysm provides a multi-layered architecture designed to deliver
 real-time monitoring, data collection, and analysis across Ceph environments.
 
diff --git a/docs/architecture/rook.md b/docs/architecture/cloud-storage/rook.md
similarity index 95%
rename from docs/architecture/rook.md
rename to docs/architecture/cloud-storage/rook.md
index 3466087..b803e14 100644
--- a/docs/architecture/rook.md
+++ b/docs/architecture/cloud-storage/rook.md
@@ -28,7 +28,7 @@ mechanisms.
 Rook continuously monitors cluster health and automatically responds to
 failures by restarting failed daemons, replacing unhealthy OSDs, and
 maintaining desired state as defined in the cluster specifications. It
-integrates with [Kubernetes](./cluster.md) monitoring and logging systems,
+integrates with [Kubernetes](../cluster.md) monitoring and logging systems,
 providing visibility into storage operations alongside application workloads.
 
 ## See Also
diff --git a/docs/architecture/index.md b/docs/architecture/index.md
index 5d44602..1c01177 100644
--- a/docs/architecture/index.md
+++ b/docs/architecture/index.md
@@ -13,4 +13,4 @@ CobaltCore is built on top of OpenStack and IronCore, leveraging their capabilit
 - **Greenhouse**: The monitoring and management tool that provides insights into the health and performance of the CobaltCore environment.
 - [**HA Service**](./cluster#ha-service): The high availability service that ensures critical workloads remain operational even in the event of failures.
 - [**Cortex**](./cortex): Smart initial placement and scheduling service for compute, storage, and network in cloud-native cloud environments.
-- [**Ceph**](./ceph): An all-in-one storage system that provides object, block, and file storage and delivers extraordinary scalability.
+- [**Cloud Storage**](./cloud-storage/): Ceph-based distributed storage stack including Rook, Chorus, Arbiter, and Prysm for lifecycle management, replication, quorum, and observability.
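
A note on the quorum role mentioned for Arbiter in this change: Ceph monitors form quorum by strict majority, which is why a stretched two-site cluster benefits from a tiebreaker monitor at a separate site. The Python sketch below only illustrates that arithmetic. The monitor names, site names, and monitor counts are hypothetical and chosen for illustration; they are not taken from CobaltCore's actual deployment, and `has_quorum` is not part of any Ceph or Arbiter API.

```python
def has_quorum(monitors: dict[str, str], failed_sites: set[str]) -> bool:
    """Return True if the surviving monitors form a strict majority.

    `monitors` maps a (hypothetical) monitor name to the site hosting it.
    """
    total = len(monitors)
    alive = sum(1 for site in monitors.values() if site not in failed_sites)
    return alive > total // 2


# Hypothetical layout: two monitors per data site, no tiebreaker.
two_sites = {
    "mon-a": "site-1", "mon-b": "site-1",
    "mon-c": "site-2", "mon-d": "site-2",
}

# Same layout plus one tiebreaker monitor at a third site.
with_tiebreaker = {**two_sites, "mon-e": "tiebreaker"}

print(has_quorum(two_sites, {"site-1"}))        # False: 2 of 4 is not a majority
print(has_quorum(with_tiebreaker, {"site-1"}))  # True: 3 of 5 is a majority
```

With an even split across two sites, losing either site leaves exactly half of the monitors, which is not a majority. Adding one monitor at a third site makes the total odd, so the surviving data site plus the tiebreaker can still form quorum.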