Commit 4021e59

doc: add architecture information to ceph.md
Add information about the architecture of a Ceph cluster to ceph.md.

Signed-off-by: Zac Dover <zac.dover@proton.me>
1 parent f6d3249 commit 4021e59

1 file changed

Lines changed: 131 additions & 0 deletions

docs/architecture/ceph.md

@@ -13,6 +13,137 @@ Storage Cluster accommodates large numbers of nodes, which communicate with
each other to replicate and redistribute data dynamically.

## Architecture

### The Ceph Storage Cluster

At its core, Ceph provides a storage cluster designed to scale out without a
central bottleneck. The cluster is based on RADOS (Reliable Autonomic
Distributed Object Store), a distributed storage service that uses the
intelligence in each node to secure data and provide it to clients. A Ceph
Storage Cluster consists of four daemon types: Ceph Monitors, which maintain
the master copy of the cluster map; Ceph OSD Daemons, which check their own
state and that of other OSDs; Ceph Managers, which serve as endpoints for
monitoring and orchestration; and Ceph Metadata Servers (MDS), which manage
file metadata when CephFS provides file services.

Storage cluster clients and Ceph OSD Daemons use the CRUSH (Controlled
Replication Under Scalable Hashing) algorithm to compute data location
information, which avoids the bottleneck of a central lookup table. This
algorithmic approach underpins Ceph's high-level features, including a native
interface to the storage cluster via librados and the service interfaces built
on top of it.

### Data Storage and Organization

The Ceph Storage Cluster receives data from clients through one of several
interfaces (Ceph Block Device, Ceph Object Storage, CephFS, or a custom
implementation built on librados) and stores it as RADOS objects. Each object
resides on an Object Storage Device (OSD), and Ceph OSD Daemons control the
read, write, and replication operations. The default BlueStore backend stores
objects in a monolithic, database-like fashion within a flat namespace, meaning
that objects are not organized into a hierarchy of directories. Each object has
an identifier, binary data, and metadata consisting of name/value pairs. Ceph
clients determine the semantics of the object data.
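
The object model described above can be pictured with a small sketch. The `RadosObject` class and the `flat_namespace` dictionary are hypothetical names used only for illustration; they are not part of librados or BlueStore.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RadosObject:
    """Hypothetical model of a RADOS object; not a librados type."""

    object_id: str                                   # identifier within the pool
    data: bytes = b""                                # opaque binary payload
    metadata: Dict[str, bytes] = field(default_factory=dict)  # name/value pairs


# A flat namespace is a single-level mapping from object ID to object.
flat_namespace: Dict[str, RadosObject] = {}

obj = RadosObject(
    object_id="reports/2024/q1.csv",  # the slashes are just characters in the ID
    data=b"a,b\n1,2\n",
    metadata={"content-type": b"text/csv"},
)
flat_namespace[obj.object_id] = obj
```

In this sketch, `"reports"` is not a key in the namespace, because no directory exists. Any hierarchy is a convention that the client applies to object names.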

### Eliminating Centralization

Traditional architectures rely on a centralized component (a gateway, broker,
or API) that acts as a single point of entry. That component becomes both a
single point of failure and a limit on performance. Ceph eliminates it, so
clients interact directly with Ceph OSDs. OSDs create object replicas on other
nodes to ensure data safety and high availability, and a cluster of monitors
keeps the monitor service itself highly available. The CRUSH algorithm replaces
centralized lookup tables: it distributes the work of data placement across all
OSD daemons and the clients that communicate with them, and it uses intelligent
data replication to provide the resiliency that hyper-scale storage requires.

### Cluster Map and High Availability

To function properly, Ceph clients and OSDs need current information about the
cluster topology. This information is stored in the Cluster Map, which is in
fact a collection of five maps: the Monitor Map (cluster fsid plus the
position, name, address, and port of each monitor), the OSD Map (cluster fsid,
list of pools, replica sizes, PG numbers, and the status of each OSD), the PG
Map (PG versions, timestamps, and placement group details), the CRUSH Map
(storage devices, failure domain hierarchy, and the rules for traversing that
hierarchy), and the MDS Map (MDS map epoch, metadata storage pool, and metadata
server information). Each map keeps a history of changes to its operational
state. Ceph Monitors maintain the master copy of the Cluster Map, which
includes the cluster members, their states, changes, and the overall health of
the cluster.

Ceph uses a cluster of monitors for reliability and fault tolerance. To
establish consensus about the state of the cluster, Ceph employs the Paxos
algorithm, which requires a majority of monitors to agree (one in a
single-monitor cluster, two in a three-monitor cluster, three in a five-monitor
cluster, and so forth). Requiring a majority means that a minority of monitors
that fall behind because of latency or faults cannot change what the cluster
accepts as its current state.
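
The majority rule above is simple arithmetic, sketched here. The function names `quorum_size` and `tolerated_failures` are made up for this example and do not correspond to any Ceph API.

```python
def quorum_size(monitor_count: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitor_count < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitor_count // 2 + 1


def tolerated_failures(monitor_count: int) -> int:
    """How many monitors can be lost while a majority is still reachable."""
    return monitor_count - quorum_size(monitor_count)


# Quorum sizes for the cluster sizes mentioned in the text, plus one even size.
sizes = {n: quorum_size(n) for n in (1, 3, 4, 5)}
```

Under this rule a four-monitor cluster needs three monitors for a majority, so it tolerates one failure, the same as a three-monitor cluster. This is why odd monitor counts are the usual choice.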

### Authentication and Security

The cephx authentication system authenticates users and daemons and protects
against man-in-the-middle attacks. It does not address encryption in transport
or encryption at rest. cephx uses shared secret keys, and the protocol lets
each party prove to the other that it holds the key without revealing it, which
provides mutual authentication. The protocol operates in a manner similar to
Kerberos, but unlike Kerberos, each monitor can authenticate users and
distribute keys, so there is no single point of failure. The system issues a
session key encrypted with the user's permanent secret key, and the client uses
that session key to request services. The monitors then provide a ticket that
authenticates the client against the OSDs that handle data. Monitors and OSDs
share a secret, so the ticket can be used with any OSD or metadata server in
the cluster. Tickets expire, so an attacker cannot keep using a credential
that was obtained surreptitiously. This protects against forged and altered
messages as long as the user's secret key is not divulged before it expires.
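
The idea of proving possession of a shared secret without revealing it can be illustrated with a generic HMAC challenge-response exchange. This is only a sketch of the general technique using Python's standard library. It is not the cephx wire protocol, it does not model session keys or tickets, and the function names and role labels are assumptions made for this example.

```python
import hashlib
import hmac
import secrets


def prove(secret: bytes, role: bytes, challenge: bytes) -> bytes:
    """Return an HMAC over the challenge, bound to the prover's role."""
    return hmac.new(secret, role + b"|" + challenge, hashlib.sha256).digest()


def verify(secret: bytes, role: bytes, challenge: bytes, proof: bytes) -> bool:
    """Check a proof in constant time."""
    return hmac.compare_digest(prove(secret, role, challenge), proof)


# Both parties hold the same secret; it never travels over the wire.
shared_secret = secrets.token_bytes(32)

# The monitor challenges the client with a random nonce.
monitor_challenge = secrets.token_bytes(16)
client_proof = prove(shared_secret, b"client", monitor_challenge)
client_ok = verify(shared_secret, b"client", monitor_challenge, client_proof)

# The client challenges the monitor in turn, which makes the check mutual.
client_challenge = secrets.token_bytes(16)
monitor_proof = prove(shared_secret, b"monitor", client_challenge)
monitor_ok = verify(shared_secret, b"monitor", client_challenge, monitor_proof)

# A party without the secret cannot produce a valid proof.
forged_proof = prove(b"not the real secret", b"client", monitor_challenge)
impostor_ok = verify(shared_secret, b"client", monitor_challenge, forged_proof)
```

Binding each proof to a role label is a design choice of this sketch: it stops one side's proof from being replayed as the other side's answer.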

### Smart Daemons and Hyperscale

Ceph's architecture makes both OSD Daemons and clients cluster-aware. Many
storage clusters instead route requests through a centralized interface, which
forces a double dispatch and becomes a bottleneck at petabyte-to-exabyte scale.
Each Ceph OSD Daemon knows about the other OSDs in the cluster, so it can
interact directly with other OSDs and with the monitors. Clients, being
cluster-aware as well, can interact directly with OSDs. Because clients,
monitors, and OSD daemons interact with one another directly, OSD daemons can
draw on the aggregate CPU and RAM resources of the cluster.

This distributed intelligence provides several benefits. OSDs service clients
directly, which improves performance and avoids the connection limits of a
centralized interface. OSDs report their membership and status (up or down),
and neighboring OSDs detect and report an OSD that has failed. Data scrubbing
maintains consistency by comparing object metadata across replicas, and deep
scrubbing compares the data bit-for-bit against checksums to find bad sectors
on drives. Replication is a collaboration between clients and OSDs: clients use
CRUSH to determine object locations, mapping objects to pools and placement
groups, and then write to the primary OSD, which replicates the object to the
secondary OSDs.
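
The replication flow described above (the client writes to the primary OSD, and the primary copies the object to the secondaries) can be sketched in a few lines. The `Osd` class and its methods are hypothetical and only model the order of operations; real OSDs communicate over the network and handle failures, which this sketch ignores.

```python
from typing import Dict, List


class Osd:
    """Hypothetical in-memory stand-in for an OSD daemon."""

    def __init__(self, osd_id: int) -> None:
        self.osd_id = osd_id
        self.store: Dict[str, bytes] = {}

    def write_replica(self, name: str, data: bytes) -> bool:
        """Store a copy sent by the primary and acknowledge it."""
        self.store[name] = data
        return True

    def write_primary(self, name: str, data: bytes, secondaries: List["Osd"]) -> bool:
        """Store the object, replicate it, and ack only when every copy is stored."""
        self.store[name] = data
        acks = [osd.write_replica(name, data) for osd in secondaries]
        return all(acks)


# Assume CRUSH has already selected these three OSDs, with the first as primary.
osds = [Osd(i) for i in range(3)]
primary, *secondaries = osds

# The client talks only to the primary; the primary handles replication.
acked = primary.write_primary("obj-1", b"payload", secondaries)
```

The client sends the data once, and the work of making copies is done by the OSDs rather than by the client or a central gateway.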

### Dynamic Cluster Management

Pools are logical partitions for storing objects. Clients retrieve the cluster
map from a monitor and write RADOS objects to pools. To store an object, CRUSH
maps the RADOS object to a placement group (PG) and dynamically maps each PG to
OSDs. This layer of indirection between clients and OSDs lets the cluster grow,
shrink, and redistribute data when the topology changes, and it is what allows
Ceph to rebalance dynamically when new OSDs come online.
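
The benefit of mapping PGs to OSDs by computation can be shown with a small experiment. The sketch below uses rendezvous (highest-random-weight) hashing as a simple stand-in for CRUSH. It is not the CRUSH algorithm and ignores failure domains and weights; the function names and the PG and OSD counts are assumptions chosen for illustration.

```python
import hashlib
from typing import Iterable, List


def score(pg_id: int, osd_id: int) -> int:
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg_id}:{osd_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(pg_id: int, osd_ids: Iterable[int], replicas: int = 3) -> List[int]:
    """Pick the highest-scoring OSDs for a PG; the first one acts as primary."""
    ranked = sorted(osd_ids, key=lambda osd: score(pg_id, osd), reverse=True)
    return ranked[:replicas]


PG_COUNT = 64
NEW_OSD = 6

# Placement with six OSDs, then again after a seventh OSD comes online.
before = {pg: place(pg, range(6)) for pg in range(PG_COUNT)}
after = {pg: place(pg, range(7)) for pg in range(PG_COUNT)}

# Only PGs that gained the new OSD changed; all other PGs stay where they were.
moved = [pg for pg in range(PG_COUNT) if before[pg] != after[pg]]
```

Objects never need to be re-hashed in this scheme: the object-to-PG mapping is untouched, and only the PG-to-OSD mapping changes for the PGs listed in `moved`.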

Clients compute object locations instead of querying for them. The only inputs
required are the object ID and the pool name. Ceph hashes the object ID, takes
the hash modulo the number of PGs in the pool to get a PG ID, looks up the pool
ID from the pool name, and prepends the pool ID to the PG ID. This computation
is faster than a query session. CRUSH then lets the client compute where the
object is expected to be stored and contact the primary OSD to store or
retrieve it.
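
The steps above can be sketched as a short function. This is a minimal illustration, not Ceph's implementation: `zlib.crc32` stands in for Ceph's own hash function, the plain `%` operator stands in for Ceph's modulo handling, and the pool name, pool ID, PG count, and the `pool.pg` output format are assumptions made for the example.

```python
import zlib
from typing import Dict


def pg_for_object(
    pool_name: str,
    object_id: str,
    pool_ids: Dict[str, int],
    pg_counts: Dict[str, int],
) -> str:
    """Compute a PG identifier from a pool name and an object ID."""
    pool_id = pool_ids[pool_name]                  # look up the pool ID by name
    pg_num = pg_counts[pool_name]                  # number of PGs in that pool
    object_hash = zlib.crc32(object_id.encode())   # stand-in hash of the object ID
    pg = object_hash % pg_num                      # hash modulo the PG count
    return f"{pool_id}.{pg:x}"                     # prepend the pool ID


# Hypothetical cluster state that a client would read from the cluster map.
pool_ids = {"liverpool": 4}
pg_counts = {"liverpool": 58}

pg_id = pg_for_object("liverpool", "john", pool_ids, pg_counts)
```

Every client that holds the same cluster map computes the same PG for the same object, so no lookup service has to be consulted.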

### Client Interfaces

Ceph provides three types of clients. Ceph Block Device (RBD) offers resizable,
thin-provisioned, snapshot-capable block devices that are striped across the
cluster for high performance. Ceph Object Storage (RGW) provides RESTful APIs
compatible with Amazon S3 and OpenStack Swift. CephFS provides a
POSIX-compliant filesystem that can be mounted with the kernel client or
through FUSE. Applications can also access storage directly through librados,
which provides direct, parallel access to the cluster and supports pool
operations, snapshots, copy-on-write cloning, object read/write operations,
extended attributes, key/value pairs, and object classes.

In summary, Ceph scales by removing the central components that limit
traditional storage systems. It relies on algorithmic data placement,
autonomous daemon operations, and direct interaction between clients and
storage to maintain reliability and performance as the cluster grows.

## See Also

The architecture of the Ceph cluster is explained in [the Architecture
chapter of the upstream Ceph
documentation](https://docs.ceph.com/en/latest/architecture/)
