@@ -14,134 +14,127 @@ each other to replicate and redistribute data dynamically.
1414
1515## Architecture
1616
17- ### The Ceph Storage Cluster
18-
19- At its core, Ceph provides an infinitely scalable storage cluster based on
20- RADOS (Reliable Autonomic Distributed Object Store), a distributed storage
21- service that uses the intelligence in each node to secure data and provide it
22- to clients. A Ceph Storage Cluster consists of four daemon types: Ceph
23- Monitors, which maintain the master copy of the cluster map; Ceph OSD Daemons,
24- which check their own state and that of other OSDs; Ceph Managers, serving as
25- endpoints for monitoring and orchestration; and Ceph Metadata Servers (MDS),
26- which manage file metadata when CephFS provides file services.
27-
28- Storage cluster clients and Ceph OSD Daemons use the CRUSH (Controlled
29- Scalable Decentralized Placement of Replicated Data) algorithm to compute data
30- location information, avoiding bottlenecks from central lookup tables. This
31- algorithmic approach enables Ceph's high-level features, including a native
32- interface to the storage cluster via librados and numerous service interfaces
33- built atop it.
34-
35- ### Data Storage and Organization
36-
37- The Ceph Storage Cluster receives data from clients through various
38- interfaces—Ceph Block Device, Ceph Object Storage, CephFS, or custom
39- implementations using librados—and stores it as RADOS objects. Each object
40- resides on an Object Storage Device (OSD), with Ceph OSD Daemons controlling
41- read, write, and replication operations. The default BlueStore backend stores
42- objects in a monolithic, database-like fashion within a flat namespace, meaning
43- objects lack hierarchical directory structures. Each object has an identifier,
44- binary data, and name/value pair metadata, with clients determining object data
45- semantics.
46-
47- ### Eliminating Centralization
48-
49- Traditional architectures rely on centralized components—gateways, brokers, or
50- APIs—that act as single points of entry, creating failure points and
51- performance limits. Ceph eliminates these centralized components, enabling
52- clients to interact directly with Ceph OSDs. OSDs create object replicas on
53- other nodes to ensure data safety and high availability, while monitor clusters
54- ensure high availability. The CRUSH algorithm replaces centralized lookup
55- tables, providing better data management by distributing work across all OSD
56- daemons and communicating clients, using intelligent data replication to ensure
57- resiliency suitable for hyper-scale storage.
58-
59- ### Cluster Map and High Availability
60-
61- For proper functioning, Ceph clients and OSDs require current cluster topology
62- information stored in the Cluster Map, actually a collection of five maps: the
63- Monitor Map (containing cluster fsid, monitor positions, names, addresses, and
64- ports), the OSD Map (containing cluster fsid, pool lists, replica sizes, PG
65- numbers, and OSD statuses), the PG Map (containing PG versions, timestamps, and
66- placement group details), the CRUSH Map (containing storage devices, failure
67- domain hierarchy, and traversal rules), and the MDS Map (containing MDS map
68- epoch, metadata storage pool, and metadata server information). Each map
69- maintains operational state change history, with Ceph Monitors maintaining
70- master copies including cluster members, states, changes, and overall health.
71-
72- Ceph uses monitor clusters for reliability and fault tolerance. To establish
73- consensus about cluster state, Ceph employs the Paxos algorithm, requiring a
74- majority of monitors to agree (one in single-monitor clusters, two in
75- three-monitor clusters, three in five-monitor clusters, and so forth). This
76- prevents issues when monitors fall behind due to latency or faults.
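
The majority rule described above reduces to simple arithmetic. The following is a minimal sketch, not Ceph's Paxos implementation; the function names `quorum_size` and `tolerated_failures` are hypothetical and chosen only for illustration.

```python
def quorum_size(monitor_count: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitor_count < 1:
        raise ValueError("a cluster needs at least one monitor")
    return monitor_count // 2 + 1


def tolerated_failures(monitor_count: int) -> int:
    """How many monitors can be lost while a majority can still agree."""
    return monitor_count - quorum_size(monitor_count)


if __name__ == "__main__":
    for count in (1, 3, 5):
        print(count, quorum_size(count), tolerated_failures(count))
```

Under this rule an even monitor count adds no extra failure tolerance: four monitors still tolerate only one failure, the same as three.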
77-
78- ### Authentication and Security
79-
80- The cephx authentication system authenticates users and daemons while
81- protecting against man-in-the-middle attacks, though it doesn't address
82- transport encryption or encryption at rest. Using shared secret keys, cephx
83- enables mutual authentication without revealing keys. Like Kerberos, each
84- monitor can authenticate users and distribute keys, eliminating single points
85- of failure. The system issues session keys encrypted with users' permanent
86- secret keys, which clients use to request services. Monitors provide tickets
87- authenticating clients against OSDs handling data, with monitors and OSDs
88- sharing secrets enabling ticket use across any cluster OSD or metadata server.
89- Tickets expire to prevent attackers from using obtained credentials, protecting
90- against message forgery and alteration as long as secret keys remain secure
91- before expiration.
92-
93- ### Smart Daemons and Hyperscale
94-
95- Ceph's architecture makes OSD Daemons and clients cluster-aware, unlike
96- centralized storage clusters requiring double dispatches that bottleneck at
97- petabyte-to-exabyte scale. Each Ceph OSD Daemon knows other OSDs in the
98- cluster, enabling direct interaction with other OSDs and monitors. This
99- awareness allows clients to interact directly with OSDs, and because monitors
100- and OSD daemons interact directly, OSDs leverage aggregate cluster CPU and RAM
101- resources.
102-
103- This distributed intelligence provides several benefits: OSDs service clients
104- directly, improving performance by avoiding centralized interface connection
105- limits; OSDs report membership and status (up or down), with neighboring OSDs
106- detecting and reporting failures; data scrubbing maintains consistency by
107- comparing object metadata across replicas, with deeper scrubbing comparing data
108- bit-for-bit against checksums to find bad drive sectors; and replication
109- involves client-OSD collaboration, with clients using CRUSH to determine object
110- locations, mapping objects to pools and placement groups, then writing to
111- primary OSDs that replicate to secondary OSDs.
112-
113- ### Dynamic Cluster Management
114-
115- Pools are logical partitions for storing objects, with clients retrieving
116- cluster maps from monitors and writing RADOS objects to pools. CRUSH
117- dynamically maps placement groups (PGs) to OSDs, with clients storing objects
118- by having CRUSH map each RADOS object to a PG. This abstraction layer between
119- OSDs and clients enables adaptive cluster growth, shrinkage, and data
120- redistribution when topology changes. The indirection allows dynamic
121- rebalancing when new OSDs come online.
122-
123- Clients compute object locations rather than querying, requiring only object ID
124- and pool name. Ceph hashes object IDs, calculates hash modulo PG numbers,
125- retrieves pool IDs from pool names, and prepends pool IDs to PG IDs. This
126- computation proves faster than query sessions, with CRUSH enabling clients to
127- compute expected object locations and contact primary OSDs for storage or
128- retrieval.
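
The four steps in the paragraph above can be sketched as follows. This is an illustration under stated assumptions: `zlib.crc32` stands in for Ceph's own hash function, the plain modulo follows the description in the text, and the `pools` table, the pool name, and the `pool.pg` output format (with the PG number in hexadecimal) are hypothetical stand-ins for data a real client would read from the cluster map.

```python
import zlib
from typing import Dict, Tuple


def placement_group_id(object_id: str, pool_name: str,
                       pools: Dict[str, Tuple[int, int]]) -> str:
    """Compute a PG identifier from an object ID and a pool name.

    `pools` maps pool name -> (pool_id, pg_num); it is an assumed stand-in
    for the pool information carried in the cluster map.
    """
    pool_id, pg_num = pools[pool_name]                  # pool name -> pool ID
    object_hash = zlib.crc32(object_id.encode("utf-8"))  # stand-in hash
    pg_number = object_hash % pg_num                    # hash modulo PG count
    return f"{pool_id}.{pg_number:x}"                   # prepend the pool ID


if __name__ == "__main__":
    example_pools = {"liverpool": (4, 128)}
    print(placement_group_id("john", "liverpool", example_pools))
```

Because every input is already known to the client, the result is the same on any client that holds the same pool information, and no lookup service is involved.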
129-
130- ### Client Interfaces
131-
132- Ceph provides three client types: Ceph Block Device (RBD) offers resizable,
133- thin-provisioned, snapshottable block devices striped across clusters for high
134- performance; Ceph Object Storage (RGW) provides RESTful APIs compatible with
135- Amazon S3 and OpenStack Swift; and CephFS provides POSIX-compliant filesystems
136- mountable as kernel objects or FUSE. Modern applications access storage through
137- librados, which provides direct parallel cluster access supporting pool
138- operations, snapshots, copy-on-write cloning, object read/write operations,
139- extended attributes, key/value pairs, and object classes.
140-
141- The architecture demonstrates how Ceph's distributed, intelligent design
142- eliminates traditional storage limitations, enabling massive scalability while
143- maintaining reliability and performance through algorithmic data placement,
144- autonomous daemon operations, and direct client-storage interactions.
17+ ## Ceph Block Device Summary (RBD)
18+
19+ ### Overview of RBD
20+
21+ A block is a sequence of bytes, often 512 bytes in size. Block-based storage
22+ interfaces are a mature and common way to store data on media such as hard
23+ disk drives (HDDs), solid-state drives (SSDs), compact discs (CDs), floppy
24+ disks, and magnetic tape. Because block device interfaces are so widely
25+ supported, they are a natural fit for mass data storage, including storage
26+ backed by Ceph.
27+
28+ ### Core Features
29+
30+ Ceph block devices are designed with three fundamental characteristics:
31+ thin-provisioning, resizability, and data striping across multiple Object
32+ Storage Daemons (OSDs). These devices leverage the full capabilities of RADOS
33+ (Reliable Autonomic Distributed Object Store), including snapshotting,
34+ replication, and strong consistency guarantees. Ceph block storage clients
35+ establish communication with Ceph clusters through two primary methods: kernel
36+ modules or the librbd library.
37+
38+ The two access methods differ in how they cache data. The kernel module can
39+ use the Linux page cache to improve performance. Applications that use the
40+ librbd library run in user space and cannot take advantage of the page
41+ cache, so Ceph provides its own RBD (RADOS Block Device) caching mechanism
42+ for them.
43+
44+ ### Performance and Scalability
45+
46+ Ceph's block devices are engineered to deliver high performance combined with
47+ vast scalability. They can be consumed directly through kernel modules or by
48+ virtualization environments. The architecture supports Kernel-based Virtual
49+ Machine (KVM) environments such as QEMU, enabling efficient virtualized storage
50+ operations.
51+
52+ Cloud-based computing platforms have embraced Ceph block devices as a storage
53+ backend solution. Major cloud computing systems including OpenStack, OpenNebula,
54+ and CloudStack integrate with Ceph block devices through their reliance on
55+ libvirt and QEMU technologies. This integration allows these cloud platforms to
56+ leverage Ceph's distributed storage capabilities for their virtual machine
57+ storage requirements.
58+
59+ ### Unified Storage Cluster
60+
61+ One of Ceph's significant architectural advantages is its ability to support
62+ multiple storage interfaces simultaneously within a single cluster. The same
63+ Ceph cluster can concurrently operate the Ceph RADOS Gateway for object
64+ storage, the Ceph File System (CephFS) for file-based storage, and Ceph block
65+ devices for block-based storage. This unified approach eliminates the need for
66+ separate storage infrastructure for different storage paradigms, simplifying
67+ management and reducing operational overhead.
68+
69+ This multi-interface capability allows organizations to deploy a single storage
70+ solution that addresses diverse storage requirements, from traditional block
71+ storage for databases and virtual machines to object storage for unstructured
72+ data and file storage for shared filesystems. The convergence of these storage
73+ types within one cluster provides operational efficiency and cost-effectiveness
74+ while maintaining the performance and reliability characteristics required for
75+ enterprise deployments.
76+
77+ ### Technical Implementation
78+
79+ The thin-provisioning feature of Ceph block devices means that storage space is
80+ allocated only as data is written, rather than pre-allocating the entire volume
81+ capacity upfront. This approach optimizes storage utilization by avoiding waste
82+ from unused pre-allocated space and allows for oversubscription strategies
83+ where the sum of provisioned capacity can exceed physical capacity, based on
84+ actual usage patterns.
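
To make allocate-on-write concrete, here is a toy in-memory model. It is a sketch only, not how librbd or the OSDs store data: the `ThinImage` class, its 4 MiB default object size, and the `oversubscription_ratio` helper are assumptions introduced for illustration.

```python
from typing import Dict, Iterable


class ThinImage:
    """Toy thin-provisioned image: backing objects exist only once written."""

    def __init__(self, provisioned_bytes: int,
                 object_size: int = 4 * 1024 * 1024) -> None:
        self.provisioned_bytes = provisioned_bytes
        self.object_size = object_size
        self.objects: Dict[int, bytearray] = {}  # object number -> contents

    def write(self, offset: int, data: bytes) -> None:
        if offset < 0 or offset + len(data) > self.provisioned_bytes:
            raise ValueError("write outside the provisioned size")
        pos = 0
        while pos < len(data):
            object_no, object_off = divmod(offset + pos, self.object_size)
            chunk = min(len(data) - pos, self.object_size - object_off)
            # Allocate the backing object lazily, on first write.
            backing = self.objects.setdefault(
                object_no, bytearray(self.object_size))
            backing[object_off:object_off + chunk] = data[pos:pos + chunk]
            pos += chunk

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        pos = 0
        while pos < length:
            object_no, object_off = divmod(offset + pos, self.object_size)
            chunk = min(length - pos, self.object_size - object_off)
            backing = self.objects.get(object_no)
            # Regions that were never written read back as zeros.
            out += (backing[object_off:object_off + chunk]
                    if backing is not None else bytes(chunk))
            pos += chunk
        return bytes(out)

    def allocated_bytes(self) -> int:
        return len(self.objects) * self.object_size


def oversubscription_ratio(provisioned_sizes: Iterable[int],
                           physical_bytes: int) -> float:
    """Sum of provisioned capacity divided by physical capacity."""
    return sum(provisioned_sizes) / physical_bytes
```

In this model a 10 GiB image that has received one small write occupies a single backing object. A ratio above 1.0 from `oversubscription_ratio` means the images could not all be filled at once, which is the trade-off oversubscription accepts.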
85+
86+ Ceph block devices are resizable: administrators can expand or shrink an image
87+ as application requirements change, without migrating data to a new volume or
88+ scheduling an extended downtime window. Expanding is the low-risk direction.
89+ Shrinking discards any data beyond the new size, so a filesystem stored on the
90+ image has to be shrunk first.
91+
92+ Data striping distributes an image's data across the cluster's storage nodes.
93+ The image is broken into smaller chunks that are stored on multiple OSDs, so
94+ reads and writes can proceed in parallel across many devices, which increases
95+ aggregate throughput. Striping by itself does not protect data: availability
96+ comes from the replication built into RADOS. Placement of each chunk is
97+ computed by the CRUSH (Controlled Replication Under Scalable Hashing)
98+ algorithm, which chooses OSDs based on the cluster topology and the configured
99+ placement policies.
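
The chunking step can be illustrated with a small offset calculation. The sketch below assumes a layout described by three parameters (stripe unit, stripe count, and object size) and maps a byte offset in the image to an object number and an offset inside that object. The `StripeLayout` and `locate` names are hypothetical, and the code does not cover CRUSH placement of the resulting objects.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StripeLayout:
    stripe_unit: int   # bytes written to one object before moving to the next
    stripe_count: int  # number of objects written in round-robin fashion
    object_size: int   # maximum bytes per object, a multiple of stripe_unit

    def __post_init__(self) -> None:
        if self.stripe_unit <= 0 or self.stripe_count <= 0:
            raise ValueError("stripe_unit and stripe_count must be positive")
        if self.object_size <= 0 or self.object_size % self.stripe_unit != 0:
            raise ValueError("object_size must be a multiple of stripe_unit")


def locate(layout: StripeLayout, offset: int) -> Tuple[int, int]:
    """Return (object number, offset within that object) for an image offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    stripes_per_object = layout.object_size // layout.stripe_unit
    block_no, block_off = divmod(offset, layout.stripe_unit)
    stripe_no, stripe_pos = divmod(block_no, layout.stripe_count)
    object_set_no, stripe_in_object = divmod(stripe_no, stripes_per_object)
    object_no = object_set_no * layout.stripe_count + stripe_pos
    return object_no, stripe_in_object * layout.stripe_unit + block_off
```

With a stripe count of 1 and a stripe unit equal to the object size, the mapping degenerates to `offset // object_size`, so consecutive fixed-size ranges of the image land in consecutive objects. Larger stripe counts interleave small writes across several objects at once.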
100+
101+ ### RADOS Integration
102+
103+ The integration with RADOS provides Ceph block devices with enterprise-grade
104+ features. A snapshot is a read-only, point-in-time copy of a block device that
105+ supports backup, testing, and recovery workflows. Snapshots are space-efficient
106+ because they store only changed data rather than a full copy. Creating one is
107+ quick, although pausing I/O first helps ensure the captured state is consistent.
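
The idea that a snapshot stores only changed data can be shown with a toy copy-on-write model. This is a sketch of the general technique, not Ceph's snapshot implementation; the `SnapshottableImage` class and its block-level granularity are assumptions made for illustration.

```python
from typing import Dict, Optional


class SnapshottableImage:
    """Toy block image whose snapshots preserve only overwritten blocks."""

    def __init__(self) -> None:
        self.head: Dict[int, bytes] = {}  # current contents, block -> data
        # Per snapshot: original contents of blocks overwritten since then.
        self.snapshots: Dict[str, Dict[int, Optional[bytes]]] = {}

    def create_snapshot(self, name: str) -> None:
        if name in self.snapshots:
            raise ValueError(f"snapshot {name!r} already exists")
        self.snapshots[name] = {}  # nothing is copied at creation time

    def write_block(self, block_no: int, data: bytes) -> None:
        # Before overwriting, save the old contents for every snapshot that
        # has not already preserved this block.
        for preserved in self.snapshots.values():
            if block_no not in preserved:
                preserved[block_no] = self.head.get(block_no)
        self.head[block_no] = data

    def read_block(self, block_no: int,
                   snapshot: Optional[str] = None) -> Optional[bytes]:
        if snapshot is None:
            return self.head.get(block_no)
        preserved = self.snapshots[snapshot]
        if block_no in preserved:
            return preserved[block_no]
        # Unchanged since the snapshot, so the current data is still valid.
        return self.head.get(block_no)
```

Creating a snapshot here costs nothing up front. Space is consumed only when a block is later overwritten, which is why the cost of a snapshot grows with the amount of change rather than with the size of the image.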
108+
109+ Replication ensures data durability by maintaining multiple copies of data
110+ across different cluster nodes. The replication factor is configurable per
111+ pool, allowing organizations to balance storage efficiency against data
112+ protection requirements. Strong consistency means that all replicas reflect
113+ the same data state, so clients do not read stale data from a lagging copy,
114+ even during failure conditions.
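
The trade-off between storage efficiency and protection is a matter of arithmetic. The helpers below are a rough sketch with hypothetical names; they ignore real-world overheads such as metadata, reserved free space, and uneven data distribution.

```python
def usable_capacity(raw_bytes: int, replica_count: int) -> int:
    """Approximate usable capacity when every object is stored replica_count times."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return raw_bytes // replica_count


def copies_that_can_be_lost(replica_count: int) -> int:
    """Number of copies of an object that can be lost while one still survives."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return replica_count - 1


if __name__ == "__main__":
    raw = 300 * 10**12  # 300 TB of raw disk, an arbitrary example figure
    for replicas in (2, 3):
        print(replicas, usable_capacity(raw, replicas),
              copies_that_can_be_lost(replicas))
```

Moving from two replicas to three raises the number of survivable copy losses from one to two, at the cost of reducing usable capacity from one half to one third of the raw total.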
115+
116+ The communication architecture between block storage clients and Ceph clusters
117+ through kernel modules or librbd provides flexibility in deployment scenarios.
118+ Kernel module integration enables direct access from operating systems, while
119+ librbd allows applications to interact with Ceph block devices programmatically,
120+ supporting a wide range of use cases from bare-metal servers to containerized
121+ applications.
122+
123+ ### Conclusion
124+
125+ Ceph block devices represent a sophisticated implementation of block storage
126+ that combines the traditional simplicity of block-based interfaces with modern
127+ distributed storage capabilities. The thin-provisioned, resizable architecture
128+ with data striping across multiple OSDs provides a foundation for scalable,
129+ high-performance storage. Integration with RADOS brings enterprise features
130+ including snapshotting, replication, and strong consistency, while support for
131+ both kernel modules and librbd ensures broad compatibility across deployment
132+ scenarios. The ability to run block devices alongside object and file storage
133+ within a unified cluster positions Ceph as a comprehensive storage solution
134+ capable of addressing diverse organizational storage requirements through a
135+ single infrastructure platform. This convergence of capabilities, combined with
136+ proven integration with major virtualization and cloud platforms, establishes
137+ Ceph block devices as a viable solution for modern data center storage needs.
145138
146139## See Also
147140The architecture of the Ceph cluster is explained in [ the Architecture