@@ -13,6 +13,137 @@ Storage Cluster accommodates large numbers of nodes, which communicate with
13 13 each other to replicate and redistribute data dynamically.
14 14
15 15 ## Architecture
16+
17+ ### The Ceph Storage Cluster
18+
19+ At its core, Ceph provides an infinitely scalable storage cluster based on
20+ RADOS (Reliable Autonomic Distributed Object Store), a distributed storage
21+ service that uses the intelligence in each node to secure data and provide it
22+ to clients. A Ceph Storage Cluster consists of four daemon types: Ceph
23+ Monitors, which maintain the master copy of the cluster map; Ceph OSD Daemons,
24+ which check their own state and that of other OSDs; Ceph Managers, which serve as
25+ endpoints for monitoring and orchestration; and Ceph Metadata Servers (MDS),
26+ which manage file metadata when CephFS provides file services.
27+
28+ Storage cluster clients and Ceph OSD Daemons use the CRUSH (Controlled
29+ Replication Under Scalable Hashing) algorithm to compute where data is
30+ stored, avoiding the bottleneck of a central lookup table. This
31+ algorithmic approach enables Ceph's high-level features, including a native
32+ interface to the storage cluster via librados and numerous service interfaces
33+ built atop it.
34+
35+ ### Data Storage and Organization
36+
37+ The Ceph Storage Cluster receives data from clients through various
38+ interfaces—Ceph Block Device, Ceph Object Storage, CephFS, or custom
39+ implementations using librados—and stores it as RADOS objects. Each object
40+ resides on an Object Storage Device (OSD), with Ceph OSD Daemons controlling
41+ read, write, and replication operations. The default BlueStore backend stores
42+ objects in a monolithic, database-like fashion within a flat namespace, meaning
43+ objects lack hierarchical directory structures. Each object has an identifier,
44+ binary data, and metadata in the form of name/value pairs; the semantics of the
45+ data are left entirely to the client.
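
The object model described above can be pictured with a small sketch. This is only an illustration of the identifier/data/metadata triple in a flat namespace; the `RadosObject` and `FlatNamespace` names are hypothetical and are not part of librados or any Ceph API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RadosObject:
    """Illustrative stand-in for a RADOS object (hypothetical type)."""
    object_id: str                      # unique identifier within the pool
    data: bytes = b""                   # opaque binary payload
    metadata: Dict[str, bytes] = field(default_factory=dict)  # name/value pairs


class FlatNamespace:
    """A flat namespace: one dictionary, no directories."""

    def __init__(self) -> None:
        self._objects: Dict[str, RadosObject] = {}

    def put(self, obj: RadosObject) -> None:
        self._objects[obj.object_id] = obj

    def get(self, object_id: str) -> RadosObject:
        return self._objects[object_id]

    def __contains__(self, object_id: str) -> bool:
        return object_id in self._objects


namespace = FlatNamespace()
# The slashes are just characters in the identifier, not directory separators.
namespace.put(RadosObject("photos/2024/cat.jpg", b"\x89PNG", {"owner": b"alice"}))
```

Because the namespace is flat, `"photos"` is not an entry of its own; only the full identifier resolves to an object, and interpreting the bytes is left to the client.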
46+
47+ ### Eliminating Centralization
48+
49+ Traditional architectures rely on a centralized component (a gateway, broker,
50+ or API) that acts as a single point of entry, which creates a single point of
51+ failure and caps performance. Ceph eliminates this centralized component,
52+ enabling clients to interact directly with Ceph OSDs. OSDs create object
53+ replicas on other nodes to ensure data safety and high availability, and Ceph
54+ runs a cluster of monitors so that the monitor service is itself highly
55+ available. The CRUSH algorithm replaces centralized lookup tables and spreads
56+ placement work across all OSD daemons and the clients that talk to them, while
57+ replication provides the resiliency that hyper-scale storage requires.
58+
59+ ### Cluster Map and High Availability
60+
61+ To function properly, Ceph clients and OSDs need current information about the
62+ cluster topology. That information is held in the Cluster Map, which is actually
63+ a collection of five maps: the Monitor Map (cluster fsid; position, name,
64+ address, and port of each monitor), the OSD Map (cluster fsid, pool list,
65+ replica sizes, PG numbers, and OSD statuses), the PG Map (PG version,
66+ timestamp, and details of each placement group), the CRUSH Map (storage
67+ devices, failure domain hierarchy, and traversal rules), and the MDS Map (MDS
68+ map epoch, metadata storage pool, and metadata servers). Each map
69+ keeps a history of its operational state changes, and Ceph Monitors maintain the
70+ master copy, including cluster members, states, changes, and overall health.
71+
72+ Ceph uses monitor clusters for reliability and fault tolerance. To establish
73+ consensus about cluster state, Ceph employs the Paxos algorithm, requiring a
74+ majority of monitors to agree (one in single-monitor clusters, two in
75+ three-monitor clusters, three in five-monitor clusters, and so forth). A majority
76+ keeps the agreed state consistent when a monitor lags due to latency or faults.
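
The majority rule can be written down as simple arithmetic. The sketch below only computes how many monitors must agree; the function names are illustrative and it does not implement Paxos itself.

```python
def quorum_size(num_monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if num_monitors < 1:
        raise ValueError("a cluster needs at least one monitor")
    return num_monitors // 2 + 1


def has_quorum(num_monitors: int, reachable: int) -> bool:
    """True when enough monitors are reachable to agree on cluster state."""
    return reachable >= quorum_size(num_monitors)


def tolerated_failures(num_monitors: int) -> int:
    """How many monitors can be lost while a majority still exists."""
    return num_monitors - quorum_size(num_monitors)
```

One consequence of the arithmetic is that an even monitor count adds no failure tolerance: under this rule, four monitors tolerate one failure, the same as three.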
77+
78+ ### Authentication and Security
79+
80+ The cephx authentication system authenticates users and daemons and protects
81+ against man-in-the-middle attacks; it does not address transport encryption or
82+ encryption at rest. Cephx uses shared secret keys, so the client and the monitor
83+ can prove their identity to each other without revealing the key. It operates
84+ much like Kerberos, except that every monitor can authenticate users and
85+ distribute keys, so there is no single point of failure. The monitor issues a
86+ session key encrypted with the user's permanent secret key, and the client uses
87+ it to request services. The monitor then provides a ticket that authenticates the
88+ client to the OSDs that handle data; monitors and OSDs share a secret, so the
89+ ticket is valid at any OSD or metadata server in the cluster. Tickets expire, so
90+ an attacker cannot reuse one obtained surreptitiously. This protects against
91+ message forgery and alteration, provided the secret key is not divulged before it expires.
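
The idea of an expiring ticket that any OSD can check with a shared secret can be illustrated in a few lines. This is a loose sketch and not the cephx wire protocol: it assumes an HMAC-SHA256 signature as a stand-in for cephx's actual cryptography, and `issue_ticket`, `verify_ticket`, and the ticket layout are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional


def _sign(service_secret: bytes, client: str, expires: float) -> str:
    payload = f"{client}|{expires}".encode()
    return hmac.new(service_secret, payload, hashlib.sha256).hexdigest()


def issue_ticket(service_secret: bytes, client: str, ttl: float,
                 now: Optional[float] = None) -> dict:
    """Monitor side: sign the client name and an expiry time."""
    now = time.time() if now is None else now
    expires = now + ttl
    return {"client": client, "expires": expires,
            "sig": _sign(service_secret, client, expires)}


def verify_ticket(service_secret: bytes, ticket: dict,
                  now: Optional[float] = None) -> bool:
    """OSD side: accept only an unaltered, unexpired ticket."""
    now = time.time() if now is None else now
    expected = _sign(service_secret, ticket["client"], ticket["expires"])
    untampered = hmac.compare_digest(expected, ticket["sig"])
    return untampered and now < ticket["expires"]
```

In this sketch, any party holding `service_secret` can validate a ticket without contacting the issuer, which mirrors the way a ticket is usable across OSDs, and a ticket presented after its expiry time or with an altered field is rejected.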
92+
93+ ### Smart Daemons and Hyperscale
94+
95+ Many storage clusters use a centralized interface that tracks which nodes clients
96+ may access and serves requests through a double dispatch, which becomes a
97+ bottleneck at petabyte-to-exabyte scale. Ceph instead makes OSD Daemons and
98+ clients cluster-aware. Each Ceph OSD Daemon knows the other OSDs in the cluster
99+ and can interact with them and with monitors directly, and clients can interact
100+ directly with OSDs. Because clients, monitors, and OSD daemons all interact
101+ with one another, OSDs can draw on the aggregate CPU and RAM of the cluster's nodes.
102+
103+ This distributed intelligence provides several benefits. OSDs service clients
104+ directly, which improves performance by avoiding the connection limits of a
105+ centralized interface. OSDs report their membership and status (up or down), and
106+ neighboring OSDs detect and report failures. Scrubbing maintains consistency by
107+ comparing object metadata across replicas, and deep scrubbing compares data
108+ bit-for-bit against checksums to find bad drive sectors. Replication is a
109+ collaboration between clients and OSDs: a client uses CRUSH to determine an object's
110+ location, maps the object to a pool and placement group, and writes to the
111+ primary OSD, which replicates the object to the secondary OSDs.
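
The difference between the two scrubbing levels can be sketched as follows. This is a simplified illustration, not Ceph's scrub implementation: it assumes object size is the only metadata compared and uses `zlib.crc32` as a stand-in checksum; the function names and the replica dictionary are hypothetical.

```python
import zlib
from typing import Dict, List


def light_scrub(replicas: Dict[str, bytes]) -> bool:
    """Metadata-only check: here, do all replicas report the same size?"""
    sizes = {len(data) for data in replicas.values()}
    return len(sizes) <= 1


def deep_scrub(replicas: Dict[str, bytes], expected_checksum: int) -> List[str]:
    """Read every replica in full and list the OSDs whose data fails the checksum."""
    return sorted(osd for osd, data in replicas.items()
                  if zlib.crc32(data) != expected_checksum)


replicas = {
    "osd.0": b"hello",
    "osd.1": b"hello",
    "osd.2": b"hellp",   # same length, one corrupted byte
}
stored_checksum = zlib.crc32(b"hello")
```

With these inputs the metadata comparison passes because all three replicas have the same size, while the deep scrub singles out `osd.2`, which is why the more expensive bit-for-bit pass is needed to catch bad sectors.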
112+
113+ ### Dynamic Cluster Management
114+
115+ Pools are logical partitions for storing objects. A client retrieves the cluster
116+ map from a monitor and writes RADOS objects to a pool. When a client stores an
117+ object, CRUSH maps the object to a placement group (PG), and CRUSH maps each
118+ PG to OSDs dynamically. This layer of indirection between clients and OSDs
119+ lets the cluster grow, shrink, and redistribute data as the topology
120+ changes; for example, Ceph can rebalance dynamically when new OSDs come
121+ online.
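
To see why mapping PGs to OSDs by computation keeps rebalancing cheap, consider the sketch below. It is not CRUSH: it assumes rendezvous (highest-random-weight) hashing as a much simpler stand-in and ignores failure domains and weights. All names are hypothetical.

```python
import hashlib
from typing import Dict, List


def _score(pg_id: str, osd: str) -> int:
    """Deterministic pseudo-random score for a (PG, OSD) pair."""
    digest = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place_pg(pg_id: str, osds: List[str], replicas: int = 3) -> List[str]:
    """Pick the highest-scoring OSDs; the first one acts as the primary."""
    ranked = sorted(osds, key=lambda osd: _score(pg_id, osd), reverse=True)
    return ranked[:replicas]


osds = [f"osd.{i}" for i in range(6)]
pgs = [f"1.{i:x}" for i in range(64)]

before: Dict[str, List[str]] = {pg: place_pg(pg, osds) for pg in pgs}
after: Dict[str, List[str]] = {pg: place_pg(pg, osds + ["osd.6"]) for pg in pgs}

# PGs whose OSD set changed when osd.6 joined the cluster.
moved = [pg for pg in pgs if before[pg] != after[pg]]
```

In this stand-in scheme, a PG's placement changes only if the new OSD scores into its top three, so every entry in `moved` gains `osd.6` and nothing is shuffled between the existing OSDs. Objects never need to be re-mapped individually because they follow their PG.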
122+
123+ Clients compute object locations rather than querying for them; the only inputs
124+ are the object ID and the pool name. Ceph hashes the object ID, takes the hash
125+ modulo the number of PGs to get a PG ID, looks up the pool ID from the pool
126+ name, and prepends the pool ID to the PG ID. Computing the location is faster
127+ than querying for it over a session, and with CRUSH the client can work out where
128+ the object should be and contact the primary OSD to store or retrieve it.
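
The four steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: `zlib.crc32` stands in for Ceph's own object hash, a plain modulo stands in for Ceph's PG folding, and the `pools` dictionary and `compute_pg_id` name are hypothetical.

```python
import zlib
from typing import Dict


def compute_pg_id(object_id: str, pool_name: str,
                  pools: Dict[str, int], pg_num: int) -> str:
    """Map an object to a placement group using only local information."""
    object_hash = zlib.crc32(object_id.encode())   # 1. hash the object ID
    pg = object_hash % pg_num                      # 2. hash modulo number of PGs
    pool_id = pools[pool_name]                     # 3. pool name -> pool ID
    return f"{pool_id}.{pg:x}"                     # 4. prepend pool ID (PG in hex)


# Hypothetical pool table; in a real cluster this comes from the OSD map.
pools = {"liverpool": 4}
pg_id = compute_pg_id("john", "liverpool", pools, pg_num=128)
```

Every client that holds the same pool table and PG count derives the same `pg_id` for `"john"`, so no lookup service is involved. Mapping that PG to its OSDs is then CRUSH's job.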
129+
130+ ### Client Interfaces
131+
132+ Ceph provides three types of clients. Ceph Block Device (RBD) offers resizable,
133+ thin-provisioned, snapshottable block devices that are striped across the
134+ cluster for high performance. Ceph Object Storage (RGW) provides RESTful APIs
135+ compatible with Amazon S3 and OpenStack Swift. CephFS provides a POSIX-compliant
136+ filesystem that can be mounted as a kernel object or in user space (FUSE).
137+ Applications can also use librados directly for parallel access to the cluster;
138+ it supports pool operations, snapshots, copy-on-write cloning, object read/write
139+ operations, extended attributes, key/value pairs, and object classes.
140+
141+ Taken together, these design choices remove the central bottlenecks of
142+ traditional storage: algorithmic data placement, autonomous daemon operation,
143+ and direct client-to-OSD communication let Ceph scale out while remaining
144+ reliable and performant.
145+
146+ ## See Also
16 147 The architecture of the Ceph cluster is explained in [the Architecture
17 148 chapter of the upstream Ceph
18 149 documentation](https://docs.ceph.com/en/latest/architecture/).