Commit 102a6ff

Merge pull request #13 from zdover23/docs-2026-03-26-architecture-rbd
Add "RBD" section to architecture.md
2 parents 8663f22 + 3042e7f commit 102a6ff

1 file changed

Lines changed: 121 additions & 128 deletions

docs/architecture/ceph.md

@@ -14,134 +14,127 @@ each other to replicate and redistribute data dynamically.
1414

1515
## Architecture
1616

17-
### The Ceph Storage Cluster
18-
19-
At its core, Ceph provides an infinitely scalable storage cluster based on
20-
RADOS (Reliable Autonomic Distributed Object Store), a distributed storage
21-
service that uses the intelligence in each node to secure data and provide it
22-
to clients. A Ceph Storage Cluster consists of four daemon types: Ceph
23-
Monitors, which maintain the master copy of the cluster map; Ceph OSD Daemons,
24-
which check their own state and that of other OSDs; Ceph Managers, serving as
25-
endpoints for monitoring and orchestration; and Ceph Metadata Servers (MDS),
26-
which manage file metadata when CephFS provides file services.
27-
28-
Storage cluster clients and Ceph OSD Daemons use the CRUSH (Controlled
29-
Replication Under Scalable Hashing) algorithm to compute data
30-
location information, avoiding bottlenecks from central lookup tables. This
31-
algorithmic approach enables Ceph's high-level features, including a native
32-
interface to the storage cluster via librados and numerous service interfaces
33-
built atop it.
34-
35-
### Data Storage and Organization
36-
37-
The Ceph Storage Cluster receives data from clients through various
38-
interfaces—Ceph Block Device, Ceph Object Storage, CephFS, or custom
39-
implementations using librados—and stores it as RADOS objects. Each object
40-
resides on an Object Storage Device (OSD), with Ceph OSD Daemons controlling
41-
read, write, and replication operations. The default BlueStore backend stores
42-
objects in a monolithic, database-like fashion within a flat namespace, meaning
43-
objects lack hierarchical directory structures. Each object has an identifier,
44-
binary data, and name/value pair metadata, with clients determining object data
45-
semantics.
46-
47-
### Eliminating Centralization
48-
49-
Traditional architectures rely on centralized components—gateways, brokers, or
50-
APIs—that act as single points of entry, creating failure points and
51-
performance limits. Ceph eliminates these centralized components, enabling
52-
clients to interact directly with Ceph OSDs. OSDs create object replicas on
53-
other nodes to ensure data safety and high availability, while monitor clusters
54-
ensure high availability. The CRUSH algorithm replaces centralized lookup
55-
tables, providing better data management by distributing work across all OSD
56-
daemons and communicating clients, using intelligent data replication to ensure
57-
resiliency suitable for hyper-scale storage.
58-
59-
### Cluster Map and High Availability
60-
61-
For proper functioning, Ceph clients and OSDs require current cluster topology
62-
information stored in the Cluster Map, actually a collection of five maps: the
63-
Monitor Map (containing cluster fsid, monitor positions, names, addresses, and
64-
ports), the OSD Map (containing cluster fsid, pool lists, replica sizes, PG
65-
numbers, and OSD statuses), the PG Map (containing PG versions, timestamps, and
66-
placement group details), the CRUSH Map (containing storage devices, failure
67-
domain hierarchy, and traversal rules), and the MDS Map (containing MDS map
68-
epoch, metadata storage pool, and metadata server information). Each map
69-
maintains operational state change history, with Ceph Monitors maintaining
70-
master copies including cluster members, states, changes, and overall health.
71-
72-
Ceph uses monitor clusters for reliability and fault tolerance. To establish
73-
consensus about cluster state, Ceph employs the Paxos algorithm, requiring a
74-
majority of monitors to agree (one in single-monitor clusters, two in
75-
three-monitor clusters, three in five-monitor clusters, and so forth). This
76-
prevents issues when monitors fall behind due to latency or faults.
77-
78-
### Authentication and Security
79-
80-
The cephx authentication system authenticates users and daemons while
81-
protecting against man-in-the-middle attacks, though it doesn't address
82-
transport encryption or encryption at rest. Using shared secret keys, cephx
83-
enables mutual authentication without revealing keys. Like Kerberos, each
84-
monitor can authenticate users and distribute keys, eliminating single points
85-
of failure. The system issues session keys encrypted with users' permanent
86-
secret keys, which clients use to request services. Monitors provide tickets
87-
authenticating clients against OSDs handling data, with monitors and OSDs
88-
sharing secrets enabling ticket use across any cluster OSD or metadata server.
89-
Tickets expire to prevent attackers from using obtained credentials, protecting
90-
against message forgery and alteration as long as secret keys remain secure
91-
before expiration.
92-
93-
### Smart Daemons and Hyperscale
94-
95-
Ceph's architecture makes OSD Daemons and clients cluster-aware, unlike
96-
centralized storage clusters requiring double dispatches that bottleneck at
97-
petabyte-to-exabyte scale. Each Ceph OSD Daemon knows other OSDs in the
98-
cluster, enabling direct interaction with other OSDs and monitors. This
99-
awareness allows clients to interact directly with OSDs, and because monitors
100-
and OSD daemons interact directly, OSDs leverage aggregate cluster CPU and RAM
101-
resources.
102-
103-
This distributed intelligence provides several benefits: OSDs service clients
104-
directly, improving performance by avoiding centralized interface connection
105-
limits; OSDs report membership and status (up or down), with neighboring OSDs
106-
detecting and reporting failures; data scrubbing maintains consistency by
107-
comparing object metadata across replicas, with deeper scrubbing comparing data
108-
bit-for-bit against checksums to find bad drive sectors; and replication
109-
involves client-OSD collaboration, with clients using CRUSH to determine object
110-
locations, mapping objects to pools and placement groups, then writing to
111-
primary OSDs that replicate to secondary OSDs.
112-
113-
### Dynamic Cluster Management
114-
115-
Pools are logical partitions for storing objects, with clients retrieving
116-
cluster maps from monitors and writing RADOS objects to pools. CRUSH
117-
dynamically maps placement groups (PGs) to OSDs, with clients storing objects
118-
by having CRUSH map each RADOS object to a PG. This abstraction layer between
119-
OSDs and clients enables adaptive cluster growth, shrinkage, and data
120-
redistribution when topology changes. The indirection allows dynamic
121-
rebalancing when new OSDs come online.
122-
123-
Clients compute object locations rather than querying, requiring only object ID
124-
and pool name. Ceph hashes object IDs, calculates hash modulo PG numbers,
125-
retrieves pool IDs from pool names, and prepends pool IDs to PG IDs. This
126-
computation proves faster than query sessions, with CRUSH enabling clients to
127-
compute expected object locations and contact primary OSDs for storage or
128-
retrieval.
129-
130-
### Client Interfaces
131-
132-
Ceph provides three client types: Ceph Block Device (RBD) offers resizable,
133-
thin-provisioned, snapshottable block devices striped across clusters for high
134-
performance; Ceph Object Storage (RGW) provides RESTful APIs compatible with
135-
Amazon S3 and OpenStack Swift; and CephFS provides POSIX-compliant filesystems
136-
mountable as kernel objects or FUSE. Modern applications access storage through
137-
librados, which provides direct parallel cluster access supporting pool
138-
operations, snapshots, copy-on-write cloning, object read/write operations,
139-
extended attributes, key/value pairs, and object classes.
140-
141-
The architecture demonstrates how Ceph's distributed, intelligent design
142-
eliminates traditional storage limitations, enabling massive scalability while
143-
maintaining reliability and performance through algorithmic data placement,
144-
autonomous daemon operations, and direct client-storage interactions.
17+
## Ceph Block Device Summary (RBD)
18+
19+
### Overview of RBD
20+
21+
A block is a sequence of bytes, often 512 bytes in size. Block-based storage
22+
interfaces represent a mature and common method for storing data on various
23+
media types including hard disk drives (HDDs), solid-state drives (SSDs),
24+
compact discs (CDs), floppy disks, and magnetic tape. The widespread adoption
25+
of block device interfaces makes them a natural fit for mass data storage
26+
systems, including Ceph.
27+
28+
### Core Features
29+
30+
Ceph block devices are designed with three fundamental characteristics:
31+
thin-provisioning, resizability, and data striping across multiple Object
32+
Storage Daemons (OSDs). These devices leverage the full capabilities of RADOS
33+
(Reliable Autonomic Distributed Object Store), including snapshotting,
34+
replication, and strong consistency guarantees. Ceph block storage clients
35+
establish communication with Ceph clusters through two primary methods: kernel
36+
modules or the librbd library.
37+
38+
An important distinction exists between these two communication methods
39+
regarding caching behavior. Kernel modules can use the Linux page cache to
40+
improve performance. For applications that rely on the librbd library, which
41+
cannot use the page cache, Ceph provides its own RBD (RADOS Block Device)
42+
caching mechanism.
43+
44+
### Performance and Scalability
45+
46+
Ceph's block devices are engineered to deliver high performance combined with
47+
vast scalability capabilities. This performance extends to various deployment
48+
scenarios, including direct integration with kernel modules and virtualization
49+
environments. The architecture supports Kernel-based Virtual Machine (KVM)
hypervisors such as QEMU,
50+
enabling efficient virtualized storage operations.
51+
52+
Cloud-based computing platforms have embraced Ceph block devices as a storage
53+
backend solution. Major cloud computing systems including OpenStack, OpenNebula,
54+
and CloudStack integrate with Ceph block devices through their reliance on
55+
libvirt and QEMU technologies. This integration allows these cloud platforms to
56+
leverage Ceph's distributed storage capabilities for their virtual machine
57+
storage requirements.
58+
59+
### Unified Storage Cluster
60+
61+
One of Ceph's significant architectural advantages is its ability to support
62+
multiple storage interfaces simultaneously within a single cluster. The same
63+
Ceph cluster can concurrently operate the Ceph RADOS Gateway for object
64+
storage, the Ceph File System (CephFS) for file-based storage, and Ceph block
65+
devices for block-based storage. This unified approach eliminates the need for
66+
separate storage infrastructure for different storage paradigms, simplifying
67+
management and reducing operational overhead.
68+
69+
This multi-interface capability allows organizations to deploy a single storage
70+
solution that addresses diverse storage requirements, from traditional block
71+
storage for databases and virtual machines to object storage for unstructured
72+
data and file storage for shared filesystems. The convergence of these storage
73+
types within one cluster provides operational efficiency and cost-effectiveness
74+
while maintaining the performance and reliability characteristics required for
75+
enterprise deployments.
76+
77+
### Technical Implementation
78+
79+
The thin-provisioning feature of Ceph block devices means that storage space is
80+
allocated only as data is written, rather than pre-allocating the entire volume
81+
capacity upfront. This approach optimizes storage utilization by avoiding waste
82+
from unused pre-allocated space and allows for oversubscription strategies
83+
where the sum of provisioned capacity can exceed physical capacity, based on
84+
actual usage patterns.
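
The allocate-on-write behaviour described above can be illustrated with a toy model. This is a minimal sketch and not Ceph's implementation: the `ThinVolume` class, its `chunk_bytes` parameter, and the `oversubscription_ratio` helper are hypothetical names introduced only for illustration.

```python
class ThinVolume:
    """Toy thin-provisioned volume: chunks are allocated on first write.

    Hypothetical model for illustration; not how RBD stores data.
    """

    def __init__(self, provisioned_bytes, chunk_bytes=4 * 1024 * 1024):
        self.provisioned_bytes = provisioned_bytes
        self.chunk_bytes = chunk_bytes
        self.chunks = {}  # chunk index -> bytearray, created lazily

    def _check(self, offset, length):
        if offset < 0 or length < 0 or offset + length > self.provisioned_bytes:
            raise ValueError("range outside provisioned size")

    def write(self, offset, data):
        self._check(offset, len(data))
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, self.chunk_bytes)
            n = min(self.chunk_bytes - start, len(data) - pos)
            # Space is consumed only here, when a chunk is first touched.
            chunk = self.chunks.setdefault(index, bytearray(self.chunk_bytes))
            chunk[start:start + n] = data[pos:pos + n]
            pos += n

    def read(self, offset, length):
        self._check(offset, length)
        out = bytearray()
        pos = 0
        while pos < length:
            index, start = divmod(offset + pos, self.chunk_bytes)
            n = min(self.chunk_bytes - start, length - pos)
            chunk = self.chunks.get(index)
            # Unwritten regions read back as zeros without being allocated.
            out += chunk[start:start + n] if chunk is not None else bytes(n)
            pos += n
        return bytes(out)

    def allocated_bytes(self):
        return len(self.chunks) * self.chunk_bytes


def oversubscription_ratio(volumes, physical_bytes):
    """Sum of provisioned sizes divided by physical capacity."""
    return sum(v.provisioned_bytes for v in volumes) / physical_bytes
```

A ratio above 1.0 means more capacity has been promised than physically exists, which is safe only while actual allocation stays below the physical limit.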
85+
86+
The resizable nature of Ceph block devices provides operational flexibility,
87+
allowing administrators to expand or contract volume sizes based on changing
88+
application requirements without disrupting service availability. This dynamic
89+
sizing capability supports evolving storage needs without requiring complex
90+
migration procedures or extended downtime windows.
91+
92+
Data striping across multiple OSDs distributes data blocks across the cluster's
93+
storage nodes. This distribution increases aggregate throughput by allowing
94+
parallel I/O operations across multiple devices, while data availability comes
95+
from the replication mechanisms built into RADOS rather than from striping
96+
itself. The striping process breaks data into smaller chunks that are
97+
distributed according to the cluster's CRUSH (Controlled Replication Under
98+
Scalable Hashing) algorithm, which determines placement
99+
based on cluster topology and configured policies.
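
To make the chunking step concrete, the following sketch splits a byte range of an image into per-object pieces. It assumes the simplest layout, in which the image is cut into consecutive fixed-size objects; the 4 MiB object size is an assumed value here, and the `map_extent` function is a hypothetical helper. The placement of each object onto OSDs by CRUSH is not modelled.

```python
def map_extent(offset, length, object_size=4 * 1024 * 1024):
    """Split an image byte range into (object_index, offset_in_object, length).

    Assumes consecutive fixed-size objects; illustrative only.
    """
    if offset < 0 or length < 0 or object_size <= 0:
        raise ValueError("offset and length must be >= 0, object_size > 0")
    pieces = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, object_size)
        # A piece never crosses an object boundary.
        n = min(object_size - within, end - offset)
        pieces.append((index, within, n))
        offset += n
    return pieces
```

Because each piece targets a different object, a large request can be issued to several OSDs in parallel, which is the source of the throughput gain described above.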
100+
101+
### RADOS Integration
102+
103+
The integration with RADOS provides Ceph block devices with enterprise-grade
104+
features. Snapshotting capability enables point-in-time copies of block devices,
105+
supporting backup operations, testing scenarios, and recovery procedures.
106+
Snapshots are space-efficient, storing only changed data rather than full
107+
copies, and are quick to create; quiesce I/O first for a consistent snapshot.
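
The idea that a snapshot stores only changed data can be sketched with a small copy-on-write model. This is an illustration under stated assumptions, not the RADOS snapshot mechanism: the `SnapshottableImage` class and its methods are hypothetical, and blocks are addressed by a simple integer index.

```python
class SnapshottableImage:
    """Toy copy-on-write image: a snapshot keeps only blocks changed later."""

    def __init__(self):
        self.head = {}       # block index -> current bytes
        self.snapshots = {}  # snapshot name -> {block index: preserved bytes}

    def create_snap(self, name):
        if name in self.snapshots:
            raise ValueError("snapshot already exists")
        # Nothing is copied at creation time, so this is constant-time.
        self.snapshots[name] = {}

    def write(self, index, data):
        # Before overwriting, preserve the old block for every snapshot
        # that has not yet saved its own version of this block.
        for preserved in self.snapshots.values():
            if index not in preserved:
                preserved[index] = self.head.get(index)
        self.head[index] = data

    def read(self, index, snap=None):
        if snap is None:
            return self.head.get(index)
        preserved = self.snapshots[snap]
        if index in preserved:
            return preserved[index]
        # Block unchanged since the snapshot: share the current copy.
        return self.head.get(index)
```

In this model the space used by a snapshot grows with the number of blocks overwritten after it was taken, not with the size of the image.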
108+
109+
Replication ensures data durability by maintaining multiple copies of data
110+
across different cluster nodes. The replication factor is configurable,
111+
allowing organizations to balance storage efficiency against data protection
112+
requirements. Strong consistency guarantees ensure that all replicas reflect the
113+
same data state, preventing split-brain scenarios and ensuring data integrity
114+
even during failure conditions.
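
The trade-off between storage efficiency and data protection is simple arithmetic. The sketch below uses hypothetical helper names and ignores metadata and free-space overhead, so real usable capacity will be somewhat lower.

```python
def usable_capacity(raw_bytes, replicas):
    """Approximate usable bytes when every object is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_bytes // replicas


def storage_efficiency(replicas):
    """Fraction of raw capacity that holds unique data."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 / replicas
```

For example, moving from two replicas to three lowers efficiency from one half to one third of raw capacity in exchange for an extra copy of every object.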
115+
116+
The communication architecture between block storage clients and Ceph clusters
117+
through kernel modules or librbd provides flexibility in deployment scenarios.
118+
Kernel module integration enables direct access from operating systems, while
119+
librbd allows applications to interact with Ceph block devices programmatically,
120+
supporting a wide range of use cases from bare-metal servers to containerized
121+
applications.
122+
123+
### Conclusion
124+
125+
Ceph block devices represent a sophisticated implementation of block storage
126+
that combines the traditional simplicity of block-based interfaces with modern
127+
distributed storage capabilities. The thin-provisioned, resizable architecture
128+
with data striping across multiple OSDs provides a foundation for scalable,
129+
high-performance storage. Integration with RADOS brings enterprise features
130+
including snapshotting, replication, and strong consistency, while support for
131+
both kernel modules and librbd ensures broad compatibility across deployment
132+
scenarios. The ability to run block devices alongside object and file storage
133+
within a unified cluster positions Ceph as a comprehensive storage solution
134+
capable of addressing diverse organizational storage requirements through a
135+
single infrastructure platform. This convergence of capabilities, combined with
136+
proven integration with major virtualization and cloud platforms, establishes
137+
Ceph block devices as a viable solution for modern data center storage needs.
145138

146139
## See Also
147140
The architecture of the Ceph cluster is explained in [the Architecture
