@@ -135,6 +135,134 @@ capable of addressing diverse organizational storage requirements through a
135 135 single infrastructure platform. This convergence of capabilities, combined with
136 136 proven integration with major virtualization and cloud platforms, establishes
137 137 Ceph block devices as a viable solution for modern data center storage needs.
138+ ### The Ceph Storage Cluster
139+
140+ At its core, Ceph provides a horizontally scalable storage cluster based on
141+ RADOS (Reliable Autonomic Distributed Object Store), a distributed storage
142+ service that uses the intelligence in each node to secure data and provide it
143+ to clients. A Ceph Storage Cluster consists of four daemon types: Ceph
144+ Monitors, which maintain the master copy of the cluster map; Ceph OSD Daemons,
145+ which check their own state and that of other OSDs; Ceph Managers, serving as
146+ endpoints for monitoring and orchestration; and Ceph Metadata Servers (MDS),
147+ which manage file metadata when CephFS provides file services.
148+
149+ Storage cluster clients and Ceph OSD Daemons use the CRUSH (Controlled
150+ Replication Under Scalable Hashing) algorithm to compute data
151+ location information, avoiding bottlenecks from central lookup tables. This
152+ algorithmic approach enables Ceph's high-level features, including a native
153+ interface to the storage cluster via librados and numerous service interfaces
154+ built atop it.
155+
156+ ### Data Storage and Organization
157+
158+ The Ceph Storage Cluster receives data from clients through various
159+ interfaces—Ceph Block Device, Ceph Object Storage, CephFS, or custom
160+ implementations using librados—and stores it as RADOS objects. Each object
161+ resides on an Object Storage Device (OSD), with Ceph OSD Daemons controlling
162+ read, write, and replication operations. The default BlueStore backend stores
163+ objects in a monolithic, database-like fashion within a flat namespace, meaning
164+ objects lack hierarchical directory structures. Each object has an identifier,
165+ binary data, and name/value pair metadata, with clients determining object data
166+ semantics.
167+
168+ ### Eliminating Centralization
169+
170+ Traditional architectures rely on centralized components—gateways, brokers, or
171+ APIs—that act as single points of entry, creating failure points and
172+ performance limits. Ceph eliminates these centralized components, enabling
173+ clients to interact directly with Ceph OSDs. OSDs create object replicas on
174+ other nodes to ensure data safety and high availability, and a cluster of
175+ monitors keeps cluster state available if a monitor fails. The CRUSH algorithm
176+ replaces centralized lookup tables, distributing placement work across all OSD
177+ daemons and communicating clients and using intelligent data replication to
178+ ensure resiliency suitable for hyper-scale storage.
179+
180+ ### Cluster Map and High Availability
181+
182+ For proper functioning, Ceph clients and OSDs require current cluster topology
183+ information stored in the Cluster Map, actually a collection of five maps: the
184+ Monitor Map (containing cluster fsid, monitor positions, names, addresses, and
185+ ports), the OSD Map (containing cluster fsid, pool lists, replica sizes, PG
186+ numbers, and OSD statuses), the PG Map (containing PG versions, timestamps, and
187+ placement group details), the CRUSH Map (containing storage devices, failure
188+ domain hierarchy, and traversal rules), and the MDS Map (containing MDS map
189+ epoch, metadata storage pool, and metadata server information). Each map
190+ maintains operational state change history, with Ceph Monitors maintaining
191+ master copies including cluster members, states, changes, and overall health.
192+
193+ Ceph uses monitor clusters for reliability and fault tolerance. To establish
194+ consensus about cluster state, Ceph employs the Paxos algorithm, requiring a
195+ majority of monitors to agree (one in single-monitor clusters, two in
196+ three-monitor clusters, three in five-monitor clusters, and so forth). This
197+ prevents issues when monitors fall behind due to latency or faults.
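
The majority rule described above is plain arithmetic, and a minimal sketch makes the pattern explicit. The function names `quorum_size` and `tolerated_failures` are illustrative and are not part of any Ceph API.

```python
def quorum_size(num_monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if num_monitors < 1:
        raise ValueError("a cluster needs at least one monitor")
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors: int) -> int:
    """How many monitors can be unavailable while a majority still exists."""
    return num_monitors - quorum_size(num_monitors)


for n in (1, 3, 5):
    print(f"{n} monitors: quorum of {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule an even monitor count tolerates no more failures than the odd count just below it (four monitors still tolerate only one failure), which is one reason odd-sized monitor clusters are commonly chosen.
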
198+
199+ ### Authentication and Security
200+
201+ The cephx authentication system authenticates users and daemons while
202+ protecting against man-in-the-middle attacks, though it doesn't address
203+ transport encryption or encryption at rest. Using shared secret keys, cephx
204+ enables mutual authentication without revealing keys. Unlike Kerberos, each
205+ monitor can authenticate users and distribute keys, eliminating single points
206+ of failure. The system issues session keys encrypted with users' permanent
207+ secret keys, which clients use to request services. Monitors provide tickets
208+ authenticating clients against OSDs handling data, with monitors and OSDs
209+ sharing secrets enabling ticket use across any cluster OSD or metadata server.
210+ Tickets expire, so an attacker cannot reuse a ticket or session key obtained
211+ earlier. This protects against message forgery and alteration as long as a
212+ user's secret key is not divulged before it expires.
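
The idea of an expiring ticket that any holder of a shared secret can verify can be sketched with the Python standard library. This is not the cephx wire protocol: the `Ticket` fields, the HMAC-SHA256 construction, and the function names are assumptions chosen for illustration, and the sketch leaves out session-key encryption entirely.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    client: str        # entity the ticket was issued to
    expires_at: float  # absolute expiry time, in seconds
    signature: bytes   # MAC over the fields above, keyed by the shared secret


def _sign(secret: bytes, client: str, expires_at: float) -> bytes:
    payload = f"{client}|{expires_at!r}".encode()
    return hmac.new(secret, payload, hashlib.sha256).digest()


def issue_ticket(secret: bytes, client: str, ttl: float, now: float) -> Ticket:
    """Issuer side (the monitor's role): bind a client name to an expiry time."""
    expires_at = now + ttl
    return Ticket(client, expires_at, _sign(secret, client, expires_at))


def verify_ticket(secret: bytes, ticket: Ticket, now: float) -> bool:
    """Verifier side (the OSD's or MDS's role): check integrity, then expiry."""
    expected = _sign(secret, ticket.client, ticket.expires_at)
    untampered = hmac.compare_digest(expected, ticket.signature)
    return untampered and now < ticket.expires_at
```

Because the verifier shares the secret with the issuer, it can validate a ticket without contacting the issuer again, and a ticket whose expiry has been altered or has passed is rejected.
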
213+
214+ ### Smart Daemons and Hyperscale
215+
216+ Ceph's architecture makes OSD Daemons and clients cluster-aware, unlike
217+ centralized storage clusters requiring double dispatches that bottleneck at
218+ petabyte-to-exabyte scale. Each Ceph OSD Daemon knows other OSDs in the
219+ cluster, enabling direct interaction with other OSDs and monitors. This
220+ awareness allows clients to interact directly with OSDs, and because monitors
221+ and OSD daemons interact directly, OSDs leverage aggregate cluster CPU and RAM
222+ resources.
223+
224+ This distributed intelligence provides several benefits: OSDs service clients
225+ directly, improving performance by avoiding centralized interface connection
226+ limits; OSDs report membership and status (up or down), with neighboring OSDs
227+ detecting and reporting failures; data scrubbing maintains consistency by
228+ comparing object metadata across replicas, with deeper scrubbing comparing data
229+ bit-for-bit against checksums to find bad drive sectors; and replication
230+ involves client-OSD collaboration, with clients using CRUSH to determine object
231+ locations, mapping objects to pools and placement groups, then writing to
232+ primary OSDs that replicate to secondary OSDs.
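
The difference between light and deep scrubbing can be illustrated with a toy model. Everything here is an assumption made for the sketch: replicas are modeled as a dict from OSD name to bytes, object size stands in for object metadata, and SHA-256 stands in for whatever checksum the storage backend keeps.

```python
import hashlib
from collections import Counter


def light_scrub(replicas: dict[str, bytes]) -> list[str]:
    """Flag replicas whose size (a stand-in for metadata) differs from the majority."""
    sizes = {osd: len(data) for osd, data in replicas.items()}
    majority_size, _ = Counter(sizes.values()).most_common(1)[0]
    return sorted(osd for osd, size in sizes.items() if size != majority_size)


def deep_scrub(replicas: dict[str, bytes], expected_sha256: str) -> list[str]:
    """Flag replicas whose full contents do not match the recorded checksum."""
    return sorted(
        osd
        for osd, data in replicas.items()
        if hashlib.sha256(data).hexdigest() != expected_sha256
    )


good = b"hello ceph"
replicas = {"osd.0": good, "osd.1": good, "osd.2": b"hellO ceph"}  # one flipped byte
print(light_scrub(replicas))
print(deep_scrub(replicas, hashlib.sha256(good).hexdigest()))
```

In this toy data the corrupted replica has the same size as the others, so only the deep pass that reads every byte can flag it, which mirrors why the cheaper metadata comparison alone is not enough to find bad sectors.
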
233+
234+ ### Dynamic Cluster Management
235+
236+ Pools are logical partitions for storing objects, with clients retrieving
237+ cluster maps from monitors and writing RADOS objects to pools. CRUSH
238+ dynamically maps placement groups (PGs) to OSDs, with clients storing objects
239+ by having CRUSH map each RADOS object to a PG. This abstraction layer between
240+ OSDs and clients enables adaptive cluster growth, shrinkage, and data
241+ redistribution when topology changes. The indirection allows dynamic
242+ rebalancing when new OSDs come online.
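
To see why computed placement allows incremental rebalancing, the sketch below maps PGs to OSDs with rendezvous (highest-random-weight) hashing. This is a deliberately simpler stand-in and is not the CRUSH algorithm: it ignores replicas, weights, and failure domains. The OSD names and the PG count of 128 are made up for the example.

```python
import hashlib


def score(pg_id: int, osd: str) -> int:
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(pg_id: int, osds: list[str]) -> str:
    """Pick the OSD with the highest score for this PG."""
    return max(osds, key=lambda osd: score(pg_id, osd))


def moved_pgs(pg_count: int, before: list[str], after: list[str]) -> dict[int, tuple[str, str]]:
    """Return {pg: (old_osd, new_osd)} for every PG whose placement changed."""
    moves = {}
    for pg in range(pg_count):
        old, new = place(pg, before), place(pg, after)
        if old != new:
            moves[pg] = (old, new)
    return moves


before = ["osd.0", "osd.1", "osd.2"]
after = before + ["osd.3"]
moves = moved_pgs(128, before, after)
print(f"{len(moves)} of 128 PGs moved, all to osd.3")
```

Every participant that knows the OSD list computes the same placement independently, and adding an OSD relocates only the PGs that the new OSD wins, rather than reshuffling all of them.
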
243+
244+ Clients compute object locations rather than querying for them, requiring only
245+ the object ID and pool name. Ceph hashes the object ID, takes the hash modulo
246+ the pool's number of PGs to get a PG ID, looks up the pool ID from the pool
247+ name, and prepends the pool ID to the PG ID. This computation is faster than a
248+ query session, and CRUSH then lets clients compute expected object locations
249+ and contact primary OSDs for storage or retrieval.
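
The steps in the paragraph above can be sketched as follows. The hash function is an assumption: `zlib.crc32` is used only because it is deterministic and in the standard library, and it is not the hash Ceph uses. The pool table, the pool and object names, and the `pool.pg` hexadecimal formatting are likewise illustrative.

```python
import zlib

# Hypothetical pool table; in a real cluster this information comes from the OSD map.
POOLS = {"liverpool": {"id": 4, "pg_num": 128}}


def pg_id_for(object_id: str, pool_name: str) -> str:
    """Compute a PG identifier from an object ID and a pool name."""
    pool = POOLS[pool_name]
    object_hash = zlib.crc32(object_id.encode())  # step 1: hash the object ID
    pg = object_hash % pool["pg_num"]             # step 2: hash modulo number of PGs
    return f"{pool['id']}.{pg:x}"                 # steps 3-4: prepend the pool ID


print(pg_id_for("john", "liverpool"))
```

The function needs only the object ID, the pool name, and the pool table the client already holds, so no lookup round trip is involved before the client contacts an OSD.
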
250+
251+ ### Client Interfaces
252+
253+ Ceph provides three client types: Ceph Block Device (RBD) offers resizable,
254+ thin-provisioned, snapshottable block devices striped across clusters for high
255+ performance; Ceph Object Storage (RGW) provides RESTful APIs compatible with
256+ Amazon S3 and OpenStack Swift; and CephFS provides POSIX-compliant filesystems
257+ mountable via the kernel client or FUSE. Modern applications access storage through
258+ librados, which provides direct parallel cluster access supporting pool
259+ operations, snapshots, copy-on-write cloning, object read/write operations,
260+ extended attributes, key/value pairs, and object classes.
261+
262+ The architecture demonstrates how Ceph's distributed, intelligent design
263+ eliminates traditional storage limitations, enabling massive scalability while
264+ maintaining reliability and performance through algorithmic data placement,
265+ autonomous daemon operations, and direct client-storage interactions.
138 266
139 267 ## See Also
140 268 The architecture of the Ceph cluster is explained in [ the Architecture