File renamed without changes.
File renamed without changes.
74 changes: 74 additions & 0 deletions docs/architecture/cloud-storage/index.md
@@ -0,0 +1,74 @@
---
title: Overview
---

# Cloud Storage Overview

CobaltCore's cloud storage layer is built on [Ceph](./ceph.md), a distributed storage system that delivers object, block, and file storage in a single unified platform. The surrounding components handle lifecycle automation, data replication, high-availability quorum, observability, and dynamic storage allocation, each with a focused responsibility.

## Architecture

The storage stack is organized into three layers, with the [Observability & Audit](./observability/) stack monitoring all of them:

**Foundation** — Ceph provides the core distributed storage engine. All other components either operate it, extend it, or observe it.

**Operations** — [Rook](./rook.md) runs as a Kubernetes operator and manages the full lifecycle of Ceph daemons (monitors, managers, OSDs, MDS, RGW) as containerized workloads. [Arbiter](./arbiter.md) extends quorum into stretched cluster topologies by deploying external Ceph monitors that Rook does not manage directly.

**Data Services** — [Chorus](./chorus.md) provides zero-downtime data replication and migration between object storage systems (S3 and Swift). [Liquid-Ceph](./liquid-ceph.md) enables dynamic, on-demand storage allocation across the cluster.

## Components

| Component | Layer | Role |
|-----------|-------|------|
| [Ceph](./ceph.md) | Foundation | Distributed storage engine — block (RBD), file (CephFS), object (RGW) |
| [Rook](./rook.md) | Operations | Kubernetes operator for Ceph lifecycle management |
| [Arbiter](./arbiter.md) | Operations | External Ceph monitors for quorum in stretched clusters |
| [Chorus](./chorus.md) | Data Services | Zero-downtime object storage replication and migration |
| [Liquid-Ceph](./liquid-ceph.md) | Data Services | Dynamic storage allocation across the Ceph cluster |
| [Observability & Audit](./observability/) | Observability | Metrics, dashboards, alerting, and audit — Prometheus, Perses, Prysm |

## Storage Interfaces

Ceph exposes three storage interfaces that CobaltCore services consume:

- **RBD (RADOS Block Device)** — thin-provisioned, resizable block volumes used by virtual machines and databases. Striped across OSDs for parallel I/O and backed by RADOS snapshots and replication.
- **CephFS** — POSIX-compliant distributed filesystem. Metadata is managed by a dedicated MDS cluster; data is striped across OSDs. Supports snapshots, quotas, and multiple active MDS daemons for horizontal metadata scaling.
- **RGW (RADOS Gateway)** — S3 and Swift-compatible object storage gateway. Supports multi-tenancy, versioning, lifecycle policies, server-side encryption, and multi-site active-active replication.
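
To make the RBD striping idea concrete, the sketch below maps a byte range of a block image onto the fixed-size objects that back it. It assumes the simplest layout (4 MiB objects, which is the usual RBD default, with no extra stripe unit or stripe count configured). The function name `object_extents` is hypothetical and exists only for illustration; placing those objects onto OSDs is handled by Ceph itself and is not modeled here.

```python
from typing import List, Tuple

# Assumption: 4 MiB objects, the usual RBD default object size.
OBJECT_SIZE = 4 * 1024 * 1024


def object_extents(offset: int, length: int,
                   object_size: int = OBJECT_SIZE) -> List[Tuple[int, int, int]]:
    """Split an image byte range into (object_index, offset_in_object, length) pieces.

    Hypothetical helper: it only illustrates how a contiguous range is cut at
    object boundaries so that different objects can be served in parallel.
    """
    if offset < 0 or length < 0 or object_size <= 0:
        raise ValueError("offset and length must be >= 0, object_size must be > 0")

    extents = []
    end = offset + length
    while offset < end:
        index = offset // object_size          # which backing object holds this byte
        within = offset % object_size          # position inside that object
        chunk = min(object_size - within, end - offset)
        extents.append((index, within, chunk))
        offset += chunk
    return extents
```

A 2 MiB write starting at the 3 MiB mark crosses one object boundary, so it is split into two pieces that land in different objects.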

## Data Flow

```text
Applications / VMs
┌───────┴────────────────────┐
│  RBD  │  CephFS  │   RGW   │  ← Ceph interfaces
└───────┴────────────────────┘
RADOS (Reliable Autonomic Distributed Object Store)
OSDs across cluster nodes
┌────┴─────┐
│   Rook   │  ← manages daemon lifecycle via Kubernetes CRDs
└──────────┘
┌────┴──────┐  ┌─────────┐  ┌────────────┐
│  Arbiter  │  │ Chorus  │  │ Liquid-Ceph│
└───────────┘  └─────────┘  └────────────┘
  (quorum)    (replication)  (allocation)
┌────┴──────────────────────────┐
│     Observability & Audit     │
│  Prometheus · Perses · Prysm  │
└───────────────────────────────┘
```

## High Availability

Ceph achieves HA through monitor quorum (typically 3 or 5 monitors), OSD replication or erasure coding, and MDS standby daemons. In stretched deployments that span two sites, [Arbiter](./arbiter.md) deploys a third monitor at a tiebreaker site so that quorum is maintained even if one full site goes offline.
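
The quorum argument can be checked with a little arithmetic: monitors form a quorum only while a strict majority of them can reach each other. The sketch below is a simplified model, not Ceph's election logic. The site names and the helper functions `has_quorum` and `survives_site_loss` are hypothetical.

```python
from typing import Dict


def has_quorum(total_monitors: int, reachable: int) -> bool:
    """A quorum needs a strict majority of all configured monitors."""
    return reachable > total_monitors // 2


def survives_site_loss(monitors_per_site: Dict[str, int], failed_site: str) -> bool:
    """Return True if quorum holds after every monitor at failed_site goes offline."""
    total = sum(monitors_per_site.values())
    reachable = total - monitors_per_site[failed_site]
    return has_quorum(total, reachable)


# Two sites only: losing either one leaves 1 of 2 monitors, which is not a majority.
two_sites = {"site-a": 1, "site-b": 1}

# Adding a monitor at a tiebreaker site: losing one data site leaves 2 of 3.
with_tiebreaker = {"site-a": 1, "site-b": 1, "tiebreaker": 1}
```

With `two_sites`, `survives_site_loss(two_sites, "site-a")` is false, while the same failure with `with_tiebreaker` keeps quorum. The same holds for larger symmetric layouts such as two monitors per data site plus one tiebreaker.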

## See Also

- [Observability & Audit](./observability/) — Prometheus metrics, Perses dashboards, and Prysm CLI for the storage stack
- [Ceph upstream architecture docs](https://docs.ceph.com/en/latest/architecture/)
- [Rook documentation](https://rook.io/docs/rook/latest-release/Getting-Started/intro/)
16 changes: 16 additions & 0 deletions docs/architecture/cloud-storage/liquid-ceph.md
@@ -0,0 +1,16 @@
---
title: Liquid-Ceph
---

# Liquid-Ceph

Liquid-Ceph enables dynamic, on-demand storage allocation across the CobaltCore Ceph cluster. It abstracts pool and quota management so that workloads can claim capacity as they need it, without manual pre-provisioning steps.

::: info
Detailed documentation for Liquid-Ceph is in progress. This page will be updated as the component matures.
:::

## See Also

- [Ceph](./ceph.md) — the underlying distributed storage engine
- [Rook](./rook.md) — Kubernetes operator managing Ceph lifecycle
37 changes: 37 additions & 0 deletions docs/architecture/cloud-storage/observability/index.md
@@ -0,0 +1,37 @@
---
title: Observability & Audit
---

# Observability & Audit Overview

CobaltCore monitors the cloud storage stack through a combination of Prometheus-based metrics collection, Perses dashboards, and the Prysm observability CLI. Together they provide real-time visibility into Ceph cluster health, OSD performance, RGW throughput, storage capacity trends, and audit compliance.

## Stack

| Component | Role |
|-----------|------|
| [Prometheus](./prometheus.md) | Scrapes and stores time-series metrics from Ceph, Rook, and RGW exporters |
| [Perses](./perses.md) | Dashboard platform for visualizing storage metrics |
| [Prysm](./prysm.md) | CLI-based observability tool for Ceph clusters and RGW — real-time monitoring, SMART disk health, log compliance |

## Key Metrics

The following signal categories are covered by the observability stack:

- **Cluster health** — overall Ceph health status, OSD up/in counts, monitor quorum state
- **Capacity** — raw and usable capacity, per-pool usage, growth rate projections
- **Performance** — OSD read/write latency, IOPS, throughput per interface (RBD, CephFS, RGW)
- **RGW** — request rates, error rates, bandwidth per bucket and user
- **Replication** — Chorus replication lag, sync success/failure rates
- **Availability** — Arbiter monitor reachability, MDS active/standby state
- **Audit** — log compliance analysis and access audit via Prysm consumers
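
As a rough illustration of the capacity signals, the sketch below derives usable capacity from raw capacity and estimates how long a pool has before it reaches a fill threshold. It assumes linear growth and ignores overhead such as metadata and rebalancing headroom. All function names are hypothetical, and the 0.85 default only mirrors the near-full threshold mentioned on this page.

```python
from typing import Optional


def usable_replicated(raw_bytes: float, replicas: int = 3) -> float:
    """Usable capacity of a replicated pool: every byte is stored `replicas` times."""
    return raw_bytes / replicas


def usable_erasure_coded(raw_bytes: float, k: int, m: int) -> float:
    """Usable capacity of a k+m erasure-coded pool: k data chunks per k+m stored."""
    return raw_bytes * k / (k + m)


def days_until_threshold(used: float, capacity: float, growth_per_day: float,
                         threshold: float = 0.85) -> Optional[float]:
    """Linear projection of days until used/capacity reaches the threshold.

    Returns None when usage is flat or shrinking, and 0.0 when the threshold
    has already been reached.
    """
    if growth_per_day <= 0:
        return None
    limit = capacity * threshold
    if used >= limit:
        return 0.0
    return (limit - used) / growth_per_day
```

For example, a pool at 50 TiB of 100 TiB that grows by 5 TiB per day reaches the 85% mark in about 7 days under this model.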

## Alerting

Alerts are defined as Prometheus rules and surfaced through the CobaltCore alerting pipeline. Critical thresholds include OSD near-full (85%), cluster degraded state, monitor quorum loss, and RGW error rate spikes.
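
The near-full threshold is a ratio check per OSD. The following sketch shows the comparison in isolation. It is not the Prometheus rule itself, and the function name and input shape (OSD name mapped to used and total bytes) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# 85% near-full threshold, as stated in the alerting description.
NEAR_FULL_RATIO = 0.85


def near_full_osds(usage: Dict[str, Tuple[int, int]],
                   threshold: float = NEAR_FULL_RATIO) -> List[str]:
    """Return the OSDs whose used/total ratio exceeds the threshold, sorted by name.

    `usage` maps an OSD name to (used_bytes, total_bytes). OSDs that report a
    total of zero are skipped instead of raising a division error.
    """
    flagged = []
    for osd, (used, total) in usage.items():
        if total > 0 and used / total > threshold:
            flagged.append(osd)
    return sorted(flagged)
```

Given `{"osd.0": (62, 100), "osd.1": (91, 100), "osd.2": (86, 100)}`, only `osd.1` and `osd.2` are flagged.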

## See Also

- [Prometheus](./prometheus.md)
- [Perses](./perses.md)
- [Prysm](./prysm.md)
30 changes: 30 additions & 0 deletions docs/architecture/cloud-storage/observability/perses.md
@@ -0,0 +1,30 @@
---
title: Perses
---

# Perses

Perses is the dashboard platform used in CobaltCore to visualize cloud storage metrics collected by [Prometheus](./prometheus.md). It provides pre-built dashboards for Ceph cluster health, OSD performance, RGW traffic, and capacity planning.

## Dashboards

| Dashboard | Purpose |
|-----------|---------|
| Ceph Cluster Overview | Health status, OSD counts, monitor quorum, capacity summary |
| OSD Performance | Per-OSD read/write latency, IOPS, throughput |
| Pool Usage | Capacity and object counts per Ceph pool |
| RGW Traffic | Request rate, error rate, bandwidth per bucket and user |
| Replication Status | Chorus sync lag and success/failure rates |

## Dashboard-as-Code

Dashboards are managed as code using the Perses CUE SDK and deployed via CI. This ensures dashboards are version-controlled alongside the rest of the CobaltCore configuration.

::: info
Dashboard definitions and deployment configuration are in progress.
:::

## See Also

- [Prometheus](./prometheus.md) — metrics source for all dashboards
- [Observability Overview](./index.md)
37 changes: 37 additions & 0 deletions docs/architecture/cloud-storage/observability/prometheus.md
@@ -0,0 +1,37 @@
---
title: Prometheus
---

# Prometheus

Prometheus collects and stores time-series metrics from the CobaltCore cloud storage stack. It scrapes exporters provided by Ceph, Rook, and the RADOS Gateway, making storage metrics available for alerting and dashboard queries.

## Exporters

| Exporter | Source | Metrics |
|----------|--------|---------|
| `ceph-exporter` | Ceph daemons | OSD stats, pool usage, cluster health, latency histograms |
| `rook-ceph-mgr` | Rook Ceph manager | Operator status, daemon lifecycle events |
| `radosgw-exporter` | RGW | Request rates, error rates, per-user and per-bucket bandwidth |

## Retention and Storage

Metrics are retained according to the cluster-wide Prometheus retention policy. Long-term storage is handled by the remote-write pipeline configured in the CobaltCore monitoring stack.

## Alert Rules

Storage-specific alert rules are maintained alongside the other CobaltCore alerting rules. Key rules include:

- `CephHealthWarning` / `CephHealthError` — cluster health degradation
- `CephOSDNearFull` — OSD usage exceeding 85%
- `CephMonQuorumLost` — loss of monitor quorum
- `RGWHighErrorRate` — elevated 5xx rate on the gateway
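
To show what an "elevated 5xx rate" means numerically, the sketch below computes the share of server errors from per-status request counts. The 5% threshold and the function names are hypothetical, since the actual rule definition is not part of this page.

```python
from typing import Dict

# Hypothetical threshold for illustration only; the real rule may differ.
ERROR_RATE_THRESHOLD = 0.05


def server_error_rate(status_counts: Dict[int, int]) -> float:
    """Fraction of requests answered with a 5xx status code.

    `status_counts` maps an HTTP status code to its request count over
    some window. An empty window yields 0.0.
    """
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(count for status, count in status_counts.items()
                 if 500 <= status <= 599)
    return errors / total


def is_error_rate_elevated(status_counts: Dict[int, int],
                           threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """True when the 5xx share is strictly above the threshold."""
    return server_error_rate(status_counts) > threshold
```

Client errors such as 404 count toward the total but not toward the error share, so a burst of missing-key lookups does not trip this check on its own.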

::: info
Detailed rule definitions and Prometheus configuration are in progress.
:::

## See Also

- [Perses](./perses.md) — dashboard platform consuming these metrics
- [Observability Overview](./index.md)
@@ -5,7 +5,7 @@ title: Prysm
# Prysm

Prysm is a comprehensive observability CLI tool developed by CobaltCore for
-monitoring [Ceph](./ceph.md) storage clusters and RADOS Gateway (RGW)
+monitoring [Ceph](../ceph.md) storage clusters and RADOS Gateway (RGW)
deployments. Prysm provides a multi-layered architecture designed to deliver
real-time monitoring, data collection, and analysis across Ceph environments.

@@ -28,7 +28,7 @@ mechanisms.
Rook continuously monitors cluster health and automatically responds to
failures by restarting failed daemons, replacing unhealthy OSDs, and
maintaining desired state as defined in the cluster specifications. It
-integrates with [Kubernetes](./cluster.md) monitoring and logging systems,
+integrates with [Kubernetes](../cluster.md) monitoring and logging systems,
providing visibility into storage operations alongside application workloads.

## See Also
2 changes: 1 addition & 1 deletion docs/architecture/index.md
@@ -13,4 +13,4 @@ CobaltCore is built on top of OpenStack and IronCore, leveraging their capabilit
- **Greenhouse**: The monitoring and management tool that provides insights into the health and performance of the CobaltCore environment.
- [**HA Service**](./cluster#ha-service): The high availability service that ensures critical workloads remain operational even in the event of failures.
- [**Cortex**](./cortex): Smart initial placement and scheduling service for compute, storage, and network in cloud-native cloud environments.
-- [**Ceph**](./ceph): An all-in-one storage system that provides object, block, and file storage and delivers extraordinary scalability.
+- [**Cloud Storage**](./cloud-storage/): Ceph-based distributed storage stack including Rook, Chorus, Arbiter, and Prysm for lifecycle management, replication, quorum, and observability.