Merge pull request #13 from zdover23/docs-2026-03-26-architecture-rbd

senolcolak · web-flow · commit 102a6ff3c3b1 · 2026-03-26T18:13:27.000+01:00
Add "RBD" section to architecture.md
diff --git a/docs/architecture/ceph.md b/docs/architecture/ceph.md
@@ -14,134 +14,127 @@ each other to replicate and redistribute data dynamically.
 
 ## Architecture
 
-### The Ceph Storage Cluster
-
-At its core, Ceph provides an infinitely scalable storage cluster based on
-RADOS (Reliable Autonomic Distributed Object Store), a distributed storage
-service that uses the intelligence in each node to secure data and provide it
-to clients. A Ceph Storage Cluster consists of four daemon types: Ceph
-Monitors, which maintain the master copy of the cluster map; Ceph OSD Daemons,
-which check their own state and that of other OSDs; Ceph Managers, serving as
-endpoints for monitoring and orchestration; and Ceph Metadata Servers (MDS),
-which manage file metadata when CephFS provides file services.
-
-Storage cluster clients and Ceph OSD Daemons use the CRUSH (Controlled
-Scalable Decentralized Placement of Replicated Data) algorithm to compute data
-location information, avoiding bottlenecks from central lookup tables. This
-algorithmic approach enables Ceph's high-level features, including a native
-interface to the storage cluster via librados and numerous service interfaces
-built atop it.
-
-### Data Storage and Organization
-
-The Ceph Storage Cluster receives data from clients through various
-interfaces—Ceph Block Device, Ceph Object Storage, CephFS, or custom
-implementations using librados—and stores it as RADOS objects. Each object
-resides on an Object Storage Device (OSD), with Ceph OSD Daemons controlling
-read, write, and replication operations. The default BlueStore backend stores
-objects in a monolithic, database-like fashion within a flat namespace, meaning
-objects lack hierarchical directory structures. Each object has an identifier,
-binary data, and name/value pair metadata, with clients determining object data
-semantics.
-
-### Eliminating Centralization
-
-Traditional architectures rely on centralized components—gateways, brokers, or
-APIs—that act as single points of entry, creating failure points and
-performance limits. Ceph eliminates these centralized components, enabling
-clients to interact directly with Ceph OSDs. OSDs create object replicas on
-other nodes to ensure data safety and high availability, while monitor clusters
-ensure high availability. The CRUSH algorithm replaces centralized lookup
-tables, providing better data management by distributing work across all OSD
-daemons and communicating clients, using intelligent data replication to ensure
-resiliency suitable for hyper-scale storage.
-
-### Cluster Map and High Availability
-
-For proper functioning, Ceph clients and OSDs require current cluster topology
-information stored in the Cluster Map, actually a collection of five maps: the
-Monitor Map (containing cluster fsid, monitor positions, names, addresses, and
-ports), the OSD Map (containing cluster fsid, pool lists, replica sizes, PG
-numbers, and OSD statuses), the PG Map (containing PG versions, timestamps, and
-placement group details), the CRUSH Map (containing storage devices, failure
-domain hierarchy, and traversal rules), and the MDS Map (containing MDS map
-epoch, metadata storage pool, and metadata server information). Each map
-maintains operational state change history, with Ceph Monitors maintaining
-master copies including cluster members, states, changes, and overall health.
-
-Ceph uses monitor clusters for reliability and fault tolerance. To establish
-consensus about cluster state, Ceph employs the Paxos algorithm, requiring a
-majority of monitors to agree (one in single-monitor clusters, two in
-three-monitor clusters, three in five-monitor clusters, and so forth). This
-prevents issues when monitors fall behind due to latency or faults.
-
-### Authentication and Security
-
-The cephx authentication system authenticates users and daemons while
-protecting against man-in-the-middle attacks, though it doesn't address
-transport encryption or encryption at rest. Using shared secret keys, cephx
-enables mutual authentication without revealing keys. Like Kerberos, each
-monitor can authenticate users and distribute keys, eliminating single points
-of failure. The system issues session keys encrypted with users' permanent
-secret keys, which clients use to request services. Monitors provide tickets
-authenticating clients against OSDs handling data, with monitors and OSDs
-sharing secrets enabling ticket use across any cluster OSD or metadata server.
-Tickets expire to prevent attackers from using obtained credentials, protecting
-against message forgery and alteration as long as secret keys remain secure
-before expiration.
-
-### Smart Daemons and Hyperscale
-
-Ceph's architecture makes OSD Daemons and clients cluster-aware, unlike
-centralized storage clusters requiring double dispatches that bottleneck at
-petabyte-to-exabyte scale. Each Ceph OSD Daemon knows other OSDs in the
-cluster, enabling direct interaction with other OSDs and monitors. This
-awareness allows clients to interact directly with OSDs, and because monitors
-and OSD daemons interact directly, OSDs leverage aggregate cluster CPU and RAM
-resources.
-
-This distributed intelligence provides several benefits: OSDs service clients
-directly, improving performance by avoiding centralized interface connection
-limits; OSDs report membership and status (up or down), with neighboring OSDs
-detecting and reporting failures; data scrubbing maintains consistency by
-comparing object metadata across replicas, with deeper scrubbing comparing data
-bit-for-bit against checksums to find bad drive sectors; and replication
-involves client-OSD collaboration, with clients using CRUSH to determine object
-locations, mapping objects to pools and placement groups, then writing to
-primary OSDs that replicate to secondary OSDs.
-
-### Dynamic Cluster Management
-
-Pools are logical partitions for storing objects, with clients retrieving
-cluster maps from monitors and writing RADOS objects to pools. CRUSH
-dynamically maps placement groups (PGs) to OSDs, with clients storing objects
-by having CRUSH map each RADOS object to a PG. This abstraction layer between
-OSDs and clients enables adaptive cluster growth, shrinkage, and data
-redistribution when topology changes. The indirection allows dynamic
-rebalancing when new OSDs come online.
-
-Clients compute object locations rather than querying, requiring only object ID
-and pool name. Ceph hashes object IDs, calculates hash modulo PG numbers,
-retrieves pool IDs from pool names, and prepends pool IDs to PG IDs. This
-computation proves faster than query sessions, with CRUSH enabling clients to
-compute expected object locations and contact primary OSDs for storage or
-retrieval.
-
-### Client Interfaces
-
-Ceph provides three client types: Ceph Block Device (RBD) offers resizable,
-thin-provisioned, snapshottable block devices striped across clusters for high
-performance; Ceph Object Storage (RGW) provides RESTful APIs compatible with
-Amazon S3 and OpenStack Swift; and CephFS provides POSIX-compliant filesystems
-mountable as kernel objects or FUSE. Modern applications access storage through
-librados, which provides direct parallel cluster access supporting pool
-operations, snapshots, copy-on-write cloning, object read/write operations,
-extended attributes, key/value pairs, and object classes.
-
-The architecture demonstrates how Ceph's distributed, intelligent design
-eliminates traditional storage limitations, enabling massive scalability while
-maintaining reliability and performance through algorithmic data placement,
-autonomous daemon operations, and direct client-storage interactions.
+## Ceph Block Device Summary (RBD)
+
+### Overview of RBD
+
+A block is a sequence of bytes, often 512 bytes in size. Block-based storage
+interfaces represent a mature and common method for storing data on various
+media types including hard disk drives (HDDs), solid-state drives (SSDs),
+compact discs (CDs), floppy disks, and magnetic tape. Because block device
+interfaces are so widespread, they are a natural fit for mass data storage
+systems, including Ceph.
+
+### Core Features
+
+Ceph block devices are designed with three fundamental characteristics:
+thin-provisioning, resizability, and data striping across multiple Object
+Storage Daemons (OSDs). These devices leverage the full capabilities of RADOS
+(Reliable Autonomic Distributed Object Store), including snapshotting,
+replication, and strong consistency guarantees. Ceph block storage clients
+establish communication with Ceph clusters through two primary methods: kernel
+modules or the librbd library.
+
+The two methods differ in how they cache data. The kernel module can use the
+Linux page cache to improve performance. Applications that use librbd rely
+instead on Ceph's own RBD (RADOS Block Device) caching.
+
+### Performance and Scalability
+
+Ceph block devices deliver high performance and scale well, both when accessed
+directly through kernel modules and in virtualization environments. They
+support Kernel-based Virtual Machines (KVMs) such as QEMU, which allows
+virtual machines to use them as storage.
+
+Cloud-based computing platforms have embraced Ceph block devices as a storage
+backend solution. Major cloud computing systems including OpenStack, OpenNebula,
+and CloudStack integrate with Ceph block devices through their reliance on
+libvirt and QEMU technologies. This integration allows these cloud platforms to
+leverage Ceph's distributed storage capabilities for their virtual machine
+storage requirements.
+
+### Unified Storage Cluster
+
+One of Ceph's significant architectural advantages is its ability to support
+multiple storage interfaces simultaneously within a single cluster. The same
+Ceph cluster can concurrently operate the Ceph RADOS Gateway for object
+storage, the Ceph File System (CephFS) for file-based storage, and Ceph block
+devices for block-based storage. This unified approach eliminates the need for
+separate storage infrastructure for different storage paradigms, simplifying
+management and reducing operational overhead.
+
+This multi-interface capability allows organizations to deploy a single storage
+solution that addresses diverse storage requirements, from traditional block
+storage for databases and virtual machines to object storage for unstructured
+data and file storage for shared filesystems. The convergence of these storage
+types within one cluster provides operational efficiency and cost-effectiveness
+while maintaining the performance and reliability characteristics required for
+enterprise deployments.
+
+### Technical Implementation
+
+The thin-provisioning feature of Ceph block devices means that storage space is
+allocated only as data is written, rather than pre-allocating the entire volume
+capacity upfront. This approach optimizes storage utilization by avoiding waste
+from unused pre-allocated space and allows for oversubscription strategies
+where the sum of provisioned capacity can exceed physical capacity, based on
+actual usage patterns.
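+
+The oversubscription idea above can be illustrated with a little arithmetic.
+This is a minimal sketch, not part of Ceph: the function name and all volume
+sizes are hypothetical, and "physical capacity" here means usable capacity
+after replication overhead.
+
+```python
+def oversubscription_ratio(provisioned_gib, physical_gib):
+    """Return total provisioned capacity divided by usable physical capacity."""
+    if physical_gib <= 0:
+        raise ValueError("physical capacity must be positive")
+    return sum(provisioned_gib) / physical_gib
+
+
+# Hypothetical thin-provisioned volumes (sizes in GiB).
+provisioned = [100, 200, 200]   # sizes promised to clients
+written = [10, 40, 25]          # data actually written so far
+physical = 250                  # usable capacity of the cluster
+
+ratio = oversubscription_ratio(provisioned, physical)
+used_fraction = sum(written) / physical
+print(f"oversubscription ratio: {ratio:.1f}")
+print(f"physical capacity in use: {used_fraction:.0%}")
+```
+
+A ratio above 1 is safe only while the data actually written stays below the
+physical capacity, so the used fraction is the value to monitor.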
+
+Ceph block devices are resizable: administrators can grow or shrink an image as
+application requirements change, without migrating data to a new volume.
+Shrinking an image discards any data beyond the new size, and a file system on
+the image must be resized separately.
+
+Data striping spreads the data of a block device across multiple OSDs. An image
+is split into smaller chunks that are stored as RADOS objects, which increases
+aggregate throughput because I/O can proceed in parallel on multiple devices.
+Availability comes from RADOS replication of those objects rather than from
+striping itself. The CRUSH (Controlled Replication Under Scalable Hashing)
+algorithm determines where each object is placed, based on cluster topology
+and configured placement rules.
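+
+The way a byte range of a block device falls onto separate chunks can be
+sketched as follows. This is a simplified illustration rather than the librbd
+implementation: it assumes a fixed 4 MiB object size and no custom stripe
+unit or stripe count, and the function name is hypothetical.
+
+```python
+OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size of 4 MiB
+
+
+def objects_for_range(offset, length, object_size=OBJECT_SIZE):
+    """Split a byte range into (object_index, offset_in_object, byte_count)."""
+    pieces = []
+    end = offset + length
+    while offset < end:
+        index = offset // object_size        # which object holds this byte
+        within = offset % object_size        # position inside that object
+        chunk = min(object_size - within, end - offset)
+        pieces.append((index, within, chunk))
+        offset += chunk
+    return pieces
+
+
+# A 4 MiB write starting at 6 MiB touches the second and third objects.
+MIB = 1024 * 1024
+pieces = objects_for_range(offset=6 * MIB, length=4 * MIB)
+for index, within, chunk in pieces:
+    print(f"object {index}: offset {within}, {chunk} bytes")
+```
+
+Each object is then placed independently, which is why a large sequential
+request can be served by several OSDs at once.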
+
+### RADOS Integration
+
+Because they are built on RADOS, Ceph block devices inherit its snapshotting,
+replication, and consistency features. A snapshot is a read-only,
+point-in-time copy of a block device that supports backup, testing, and
+recovery. Snapshots are space-efficient: they are quick to create and consume
+additional space only as the image changes afterward, rather than storing a
+full copy.
+
+Replication ensures data durability by maintaining multiple copies of data
+across different cluster nodes. The replication factor is configurable per
+pool, so organizations can balance storage efficiency against data protection.
+Strong consistency means that a write is acknowledged only after the replicas
+have recorded it, so clients do not read stale data from a replica that missed
+an update.
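+
+The trade-off between storage efficiency and data protection can be made
+concrete with a rough calculation. This sketch assumes simple replication
+with no other overhead; the function name and the capacity figure are
+hypothetical.
+
+```python
+def usable_capacity(raw_tib, replicas):
+    """Return the usable capacity when every object is stored `replicas` times."""
+    if replicas < 1:
+        raise ValueError("replication factor must be at least 1")
+    return raw_tib / replicas
+
+
+raw = 300  # hypothetical raw capacity in TiB
+for replicas in (2, 3):
+    usable = usable_capacity(raw, replicas)
+    print(f"{replicas} replicas: {usable:.0f} TiB usable of {raw} TiB raw")
+```
+
+A higher replication factor tolerates more simultaneous failures but leaves
+a smaller share of the raw capacity for data.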
+
+The communication architecture between block storage clients and Ceph clusters
+through kernel modules or librbd provides flexibility in deployment scenarios.
+Kernel module integration enables direct access from operating systems, while
+librbd allows applications to interact with Ceph block devices programmatically,
+supporting a wide range of use cases from bare-metal servers to containerized
+applications.
+
+### Conclusion
+
+Ceph block devices combine the familiar block interface with distributed
+storage. They are thin-provisioned, resizable, and striped across multiple
+OSDs, and they inherit snapshotting, replication, and strong consistency from
+RADOS. Clients can reach them through kernel modules or librbd, and platforms
+such as OpenStack, OpenNebula, and CloudStack integrate with them through
+libvirt and QEMU. Because block devices can run alongside object and file
+storage in the same cluster, one Ceph deployment can serve all three storage
+types.
 
 ## See Also
 The architecture of the Ceph cluster is explained in [the Architecture