|
1 | 1 | --- |
2 | | -draft: true |
3 | 2 | sidebar_label: Disk Layout |
4 | 3 | --- |
5 | 4 |
|
6 | 5 | # Datanode Disk Layout |
7 | 6 |
|
8 | | -**TODO:** File a subtask under [HDDS-9862](https://issues.apache.org/jira/browse/HDDS-9862) and complete this page or section. |
| 7 | +Ozone Datanodes manage physical disks (volumes) and organize data into containers. This page describes the directory structure and on-disk files used to store container data and metadata. For how Schema v3 consolidates metadata into one RocksDB per volume, see [Datanode Container Schema v3](./rocksdb-schema). |
| 8 | + |
| 9 | +## Overview |
| 10 | + |
| 11 | +Each Datanode volume (configured via `hdds.datanode.dir`) follows a fixed layout for scalability and performance. The layout has evolved to support very large container counts per Datanode; **Schema V3** merges container metadata into a shared database per volume instead of one RocksDB per container. |
| 12 | + |
| 13 | +## Volume level layout |
| 14 | + |
| 15 | +At the root of each configured HDDS directory, the structure is: |
| 16 | + |
| 17 | +```text |
| 18 | +storage_dir/ |
| 19 | +├── hdds/ |
| 20 | +│ ├── VERSION |
| 21 | +│ └── <<clusterUuid>>/ |
| 22 | +│ ├── current/ |
| 23 | +│ │ ├── container0/ |
| 24 | +│ │ ├── container1/ |
| 25 | +│ │ └── ... |
| 26 | +│ ├── <<storageID>>/ |
| 27 | +│ │ └── container.db/ (Schema V3 only) |
| 28 | +│ └── tmp/ |
| 29 | +│ └── deleted-containers/ |
| 30 | +``` |
| 31 | + |
| 32 | +### Key components |
| 33 | + |
| 34 | +- **`hdds/VERSION`**: Properties file with volume-level metadata, including: |
| 35 | + - `storageID`: Unique identifier for this volume. |
| 36 | + - `clusterID`: ID of the Ozone cluster this volume belongs to. |
| 37 | + - `datanodeUuid`: ID of the Datanode. |
| 38 | + - `ctime`: Creation time of the volume. |
| 39 | + - `layOutVersion`: Software layout version at volume creation. |
| 40 | + |
| 41 | +- **`<<clusterUuid>>/`**: Directory named after the cluster UUID. A volume belongs to exactly one cluster, so there is one such directory per volume.
| 42 | + |
| 43 | +- **`current/`**: Active container data. Containers are grouped under subdirectories (`container0`, `container1`, …) so a single directory does not hold hundreds of thousands of entries. |
| 44 | + |
| 45 | +- **`container.db/` (Schema V3)**: When Schema V3 is enabled, a shared RocksDB stores metadata for **all** containers on this volume. Compared with per-container databases, this reduces the number of open file descriptors and makes metadata operations cheaper.
| 46 | + |
| 47 | +- **`tmp/deleted-containers/`**: Staging area for atomic container deletion. When a container is removed, its directory is moved here before being deleted from disk. |
| 48 | + |
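Because `hdds/VERSION` is a properties file, its fields can be read with a few lines of code. The following is a minimal sketch, not the Datanode's own parser: `parse_version_file` is a hypothetical helper, the sample values are made up, and only simple `key=value` lines are handled (the full Java properties syntax also allows escapes and other separators).

```python
def parse_version_file(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Properties files treat lines starting with '#' or '!' as comments.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without a separator
            props[key.strip()] = value.strip()
    return props


# Illustrative content only; real IDs and timestamps will differ.
SAMPLE_VERSION = """\
#Example VERSION file (values are placeholders)
storageID=DS-example
clusterID=CID-example
datanodeUuid=dn-example
ctime=1700000000000
layOutVersion=1
"""

props = parse_version_file(SAMPLE_VERSION)
print(props["storageID"], props["layOutVersion"])
```

Reading the `storageID` this way is enough to locate the Schema V3 database directory for the volume, since that directory is named after it.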
| 49 | +## Container level layout |
| 50 | + |
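| 51 | +Each container lives under one of the `container<N>` trees. The subdirectory index `N` is `(containerId >> 9) & 0xFF`, so each run of 512 consecutive container IDs shares a subdirectory and at most 256 subdirectories exist.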
| 52 | + |
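The index formula above can be checked with a short sketch. The function names `container_subdir` and `container_path` are hypothetical, and the cluster directory name is a placeholder; only the bit arithmetic and the directory nesting come from this page.

```python
from pathlib import PurePosixPath


def container_subdir(container_id: int) -> str:
    """Return the container<N> directory name for a container ID."""
    # Drop the low 9 bits (512 IDs per bucket), keep the next 8 bits (256 buckets).
    index = (container_id >> 9) & 0xFF
    return f"container{index}"


def container_path(cluster_dir: str, container_id: int) -> PurePosixPath:
    """Build the relative path of a container directory under hdds/."""
    return (
        PurePosixPath("hdds")
        / cluster_dir
        / "current"
        / container_subdir(container_id)
        / str(container_id)
    )


# IDs 0..511 map to container0, 512..1023 to container1, and so on.
print(container_subdir(511))
print(container_subdir(512))
# The index wraps around after 256 buckets (512 * 256 = 131072).
print(container_subdir(131072))
print(container_path("CID-example", 1000))
```

The wrap-around means a subdirectory can hold more than 512 containers on a Datanode that has seen a wide range of IDs; the scheme bounds the number of subdirectories, not the number of entries in each.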
| 53 | +### Directory structure |
| 54 | + |
| 55 | +```text |
| 56 | +current/container<<N>>/<<containerID>>/ |
| 57 | +├── metadata/ |
| 58 | +│ ├── <<containerID>>.container |
| 59 | +│ ├── <<containerID>>.db/ (Schema V1/V2 only) |
| 60 | +│ └── <<containerID>>.tree (checksum info) |
| 61 | +└── chunks/ |
| 62 | + ├── <<blockID_1>>.chunk |
| 63 | + ├── <<blockID_2>>.chunk |
| 64 | + └── ... |
| 65 | +``` |
| 66 | + |
| 67 | +### Metadata directory (`metadata/`) |
| 68 | + |
| 69 | +- **`.container` file**: YAML with the main container metadata, including: |
| 70 | + - `containerID`: Unique ID of the container. |
| 71 | + - `state`: Current state (`OPEN`, `CLOSED`, `QUASI_CLOSED`, and so on). |
| 72 | + - `schemaVersion`: Schema version (1, 2, or 3). |
| 73 | + - `chunksPath`: Path to the chunks directory. |
| 74 | + - `metadataPath`: Path to the metadata directory. |
| 75 | + - `maxSize`: Configured maximum size of the container. |
| 76 | + |
| 77 | +- **`.db` directory (Schema V1/V2)**: Per-container RocksDB. In V1 and V2, each container has its own database for block metadata. |
| 78 | + - **Schema V1**: Data in the default column family. |
| 79 | + - **Schema V2**: Metadata, block data, and delete transactions use separate column families. |
| 80 | + - In **Schema V3**, this directory is not used; metadata lives in the volume-level `container.db`. |
| 81 | + |
| 82 | +- **`.tree` file**: Merkle tree of checksums over the container's data, used to verify integrity.
| 83 | + |
| 84 | +### Chunks directory (`chunks/`) |
| 85 | + |
| 86 | +User data files, typically named `<<blockLocalID>>.chunk` (with extra suffixes when a block has multiple chunks). These files hold the raw bytes written by clients.
| 87 | + |
| 88 | +## Container metadata (RocksDB) internals |
| 89 | + |
| 90 | +Whether RocksDB is per-container (V1/V2) or shared per volume (V3), it stores mappings and counters such as: |
| 91 | + |
| 92 | +- **Block data**: Maps `blockLocalID` to `BlockData` (protobuf: chunk list and offsets). |
| 93 | +- **Metadata counters** (examples): |
| 94 | + - `#BLOCKCOUNT`: Block count in the container. |
| 95 | + - `#BYTESUSED`: Bytes used by blocks. |
| 96 | + - `#BCSID`: Block commit sequence ID. |
| 97 | + - `#delTX`: Latest delete transaction ID. |
| 98 | + - `#pendingDeleteBlockCount`: Blocks marked for deletion but not yet removed. |
| 99 | + - `#pendingDeleteBlockBytes`: Bytes of blocks pending deletion. |
| 100 | + |
| 101 | +## Schema V3 optimization |
| 102 | + |
| 103 | +Schema V3 is the current layout: one RocksDB instance per disk (per volume) instead of one small instance per container, which can mean thousands per Datanode.
| 104 | + |
| 105 | +### Layout differences |
| 106 | + |
| 107 | +- **Location**: Metadata moves from |
| 108 | + `hdds/<<clusterUuid>>/current/container<<N>>/<<containerID>>/metadata/<<containerID>>.db` |
| 109 | + to |
| 110 | + `hdds/<<clusterUuid>>/<<storageID>>/container.db`. |
| 111 | + |
| 112 | +- **Key prefixing**: Keys in the shared database are prefixed with `containerID` so containers stay isolated in one keyspace. |
| 113 | + |
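To see why a container prefix keeps containers isolated in one keyspace, consider the sketch below. It assumes a fixed-width, 8-byte big-endian prefix, which is an illustrative encoding and not necessarily the exact byte layout Ozone uses; `prefixed_key` and `keys_for_container` are hypothetical names, and a sorted Python list stands in for RocksDB's ordered keyspace.

```python
import bisect
import struct


def prefixed_key(container_id: int, local_key: bytes) -> bytes:
    """Prepend a fixed-width container prefix to a per-container key."""
    # Big-endian keeps byte order aligned with numeric order of the IDs.
    return struct.pack(">Q", container_id) + local_key


def keys_for_container(sorted_keys: list[bytes], container_id: int) -> list[bytes]:
    """Return the contiguous slice of keys belonging to one container."""
    # Assumes container_id + 1 still fits in an unsigned 64-bit value.
    lower = struct.pack(">Q", container_id)
    upper = struct.pack(">Q", container_id + 1)
    start = bisect.bisect_left(sorted_keys, lower)
    end = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[start:end]


# A sorted list models the shared, ordered keyspace.
store = sorted(
    [
        prefixed_key(2, b"#BLOCKCOUNT"),
        prefixed_key(1, b"#BLOCKCOUNT"),
        prefixed_key(256, b"#BYTESUSED"),
        prefixed_key(2, b"#BYTESUSED"),
    ]
)

for key in keys_for_container(store, 2):
    print(key[8:].decode())
```

With a fixed-width prefix, all keys of one container form a single contiguous range, so reading or deleting one container's metadata is a bounded range operation that never touches another container's keys.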
| 114 | +### Benefits |
| 115 | + |
| 116 | +- **Lower resource use**: Fewer file descriptors and less RocksDB cache overhead. |
| 117 | +- **Better operations**: More efficient compaction and faster startup (fewer databases to open and scan). |
| 118 | + |
| 119 | +## Deletion process |
| 120 | + |
| 121 | +Container deletion is staged so it stays consistent across crashes: |
| 122 | + |
| 123 | +1. The container is marked `DELETED` in its `.container` file. |
| 124 | +2. It is removed from the Datanode in-memory container set. |
| 125 | +3. For Schema V3, its metadata is removed from the shared RocksDB. |
| 126 | +4. The container directory is moved under `hdds/<<clusterUuid>>/tmp/deleted-containers/`. |
| 127 | +5. A background task deletes the files under `tmp/deleted-containers/`.
| 128 | + |
| 129 | +This avoids partial on-disk states and lets a restart finish cleanup of anything left in `tmp/deleted-containers/`. |
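The move-then-delete part of this sequence (steps 4 and 5) can be sketched as follows. This is an illustration of the pattern, not Ozone's implementation: `stage_for_deletion`, `purge_staging`, and `demo` are hypothetical names, steps 1 to 3 are not modeled, and the rename is only atomic when source and staging directory are on the same filesystem.

```python
import shutil
import tempfile
from pathlib import Path


def stage_for_deletion(container_dir: Path, staging_dir: Path) -> Path:
    """Move a container directory into the staging area with one rename."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    target = staging_dir / container_dir.name
    # A rename within one filesystem either fully happens or does not happen.
    container_dir.rename(target)
    return target


def purge_staging(staging_dir: Path) -> int:
    """Delete everything left in the staging area; safe to rerun after a crash."""
    if not staging_dir.is_dir():
        return 0
    removed = 0
    for entry in staging_dir.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


def demo() -> tuple[bool, bool, int]:
    """Run the two steps against a throwaway directory tree."""
    with tempfile.TemporaryDirectory() as root:
        base = Path(root)
        container = base / "current" / "container0" / "42"
        (container / "chunks").mkdir(parents=True)
        (container / "chunks" / "1.chunk").write_bytes(b"data")
        staging = base / "tmp" / "deleted-containers"

        staged = stage_for_deletion(container, staging)
        gone_from_current = not container.exists()
        present_in_staging = staged.is_dir()
        removed = purge_staging(staging)
        return gone_from_current, present_in_staging, removed


print(demo())
```

Because the slow recursive delete happens only inside the staging area, a crash at any point leaves the container either fully in `current/` or fully in the staging directory, and rerunning the purge on startup finishes the job.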