Commit de36c4b

HDDS-15049. [Docs] System internals: Datanode disk layout. (apache#398)
1 parent 44a4bb9 commit de36c4b

2 files changed

Lines changed: 125 additions & 2 deletions

cspell.yaml

Lines changed: 2 additions & 0 deletions
@@ -135,6 +135,8 @@ words:
 - spnego
 - keytab
 - Keytab
+- keyspace
+- Merkle
 - RocksDB
 - LDB
 - memtable
Lines changed: 123 additions & 2 deletions
@@ -1,8 +1,129 @@
 ---
-draft: true
 sidebar_label: Disk Layout
 ---
 
 # Datanode Disk Layout
 
-**TODO:** File a subtask under [HDDS-9862](https://issues.apache.org/jira/browse/HDDS-9862) and complete this page or section.
+Ozone Datanodes manage physical disks (volumes) and organize data into containers. This page describes the directory structure and on-disk files used to store container data and metadata. For how Schema v3 consolidates metadata into one RocksDB per volume, see [Datanode Container Schema v3](./rocksdb-schema).
+
+## Overview
+
+Each Datanode volume (configured via `hdds.datanode.dir`) follows a fixed layout for scalability and performance. The layout has evolved to support very large container counts per Datanode; **Schema V3** merges container metadata into a shared database per volume instead of one RocksDB per container.
+
+## Volume level layout
+
+At the root of each configured HDDS directory, the structure is:
+
+```text
+storage_dir/
+├── hdds/
+│   ├── VERSION
+│   └── <<clusterUuid>>/
+│       ├── current/
+│       │   ├── container0/
+│       │   ├── container1/
+│       │   └── ...
+│       ├── <<storageID>>/
+│       │   └── container.db/   (Schema V3 only)
+│       └── tmp/
+│           └── deleted-containers/
+```
+
+### Key components
+
+- **`hdds/VERSION`**: Properties file with volume-level metadata, including:
+  - `storageID`: Unique identifier for this volume.
+  - `clusterID`: ID of the Ozone cluster this volume belongs to.
+  - `datanodeUuid`: ID of the Datanode.
+  - `ctime`: Creation time of the volume.
+  - `layOutVersion`: Software layout version at volume creation.
+
+- **`<<clusterUuid>>/`**: Directory named after the cluster UUID. Each volume belongs to exactly one cluster.
+
+- **`current/`**: Active container data. Containers are grouped under subdirectories (`container0`, `container1`, …) so a single directory does not hold hundreds of thousands of entries.
+
+- **`container.db/` (Schema V3)**: When Schema V3 is enabled, a shared RocksDB stores metadata for **all** containers on this volume. That reduces open file descriptors and improves metadata operations compared with per-container databases.
+
+- **`tmp/deleted-containers/`**: Staging area for atomic container deletion. When a container is removed, its directory is moved here before being deleted from disk.
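
To make the `hdds/VERSION` description above concrete, here is a minimal Python sketch that reads a properties-style file into a dictionary. The parser is deliberately simplified (it does not handle Java properties escapes or line continuations), and every value in `SAMPLE_VERSION` is made up for illustration. Only the key names come from the list above.

```python
def parse_version_file(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Hypothetical file content: all values below are invented placeholders.
SAMPLE_VERSION = """\
#Example VERSION file
storageID=DS-00000000-0000-0000-0000-000000000001
clusterID=CID-00000000-0000-0000-0000-000000000002
datanodeUuid=00000000-0000-0000-0000-000000000003
ctime=1700000000000
layOutVersion=1
"""

props = parse_version_file(SAMPLE_VERSION)
for name in ("storageID", "clusterID", "datanodeUuid", "ctime", "layOutVersion"):
    print(f"{name} = {props[name]}")
```

A tool that inspects a volume offline could use a parser like this to find the `storageID`, which is also the name of the directory that holds `container.db` under Schema V3.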
+
+## Container level layout
+
+Each container lives under one of the `container<N>` trees. The subdirectory index `N` is `(containerId >> 9) & 0xFF`.
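
The index formula above can be sketched in a few lines of Python. The function name `container_subdir` is hypothetical, and the `container` directory prefix is taken from the tree shown on this page rather than from the Ozone source.

```python
def container_subdir(container_id: int) -> str:
    """Return the grouping subdirectory name for a container ID."""
    # Drop the low 9 bits (512 IDs per group), then keep 8 bits (256 groups).
    index = (container_id >> 9) & 0xFF
    return f"container{index}"


for cid in (0, 511, 512, 1023, 130560, 131072):
    print(cid, "->", container_subdir(cid))
```

With this formula, 512 consecutive container IDs share one subdirectory, and the index wraps around after 256 subdirectories: ID `131072` maps back to `container0`.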
+
+### Directory structure
+
+```text
+current/container<<N>>/<<containerID>>/
+├── metadata/
+│   ├── <<containerID>>.container
+│   ├── <<containerID>>.db/     (Schema V1/V2 only)
+│   └── <<containerID>>.tree    (checksum info)
+└── chunks/
+    ├── <<blockID_1>>.chunk
+    ├── <<blockID_2>>.chunk
+    └── ...
+```
+
+### Metadata directory (`metadata/`)
+
+- **`.container` file**: YAML with the main container metadata, including:
+  - `containerID`: Unique ID of the container.
+  - `state`: Current state (`OPEN`, `CLOSED`, `QUASI_CLOSED`, and so on).
+  - `schemaVersion`: Schema version (1, 2, or 3).
+  - `chunksPath`: Path to the chunks directory.
+  - `metadataPath`: Path to the metadata directory.
+  - `maxSize`: Configured maximum size of the container.
+
+- **`.db` directory (Schema V1/V2)**: Per-container RocksDB. In V1 and V2, each container has its own database for block metadata.
+  - **Schema V1**: Data in the default column family.
+  - **Schema V2**: Metadata, block data, and delete transactions use separate column families.
+  - In **Schema V3**, this directory is not used; metadata lives in the volume-level `container.db`.
+
+- **`.tree` file**: Merkle tree checksums for the container’s data (integrity).
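
As an illustration of where block metadata lives for each schema version, the following Python sketch builds the RocksDB path from the layout described on this page. The function name, its parameters, and the sample IDs are hypothetical. The path segments are copied from the directory trees above, not from the Ozone source.

```python
from pathlib import PurePosixPath


def metadata_db_path(volume_root: str, cluster_uuid: str, storage_id: str,
                     container_id: int, schema_version: int) -> PurePosixPath:
    """Return the RocksDB location for a container under the documented layout."""
    cluster_dir = PurePosixPath(volume_root) / "hdds" / cluster_uuid
    if schema_version == 3:
        # Schema V3: one shared database per volume.
        return cluster_dir / storage_id / "container.db"
    if schema_version in (1, 2):
        # Schema V1/V2: one database inside each container's metadata directory.
        subdir = f"container{(container_id >> 9) & 0xFF}"
        return (cluster_dir / "current" / subdir / str(container_id)
                / "metadata" / f"{container_id}.db")
    raise ValueError(f"unknown schema version: {schema_version}")


print(metadata_db_path("/data/disk1", "CID-x", "DS-y", 1000, 2))
print(metadata_db_path("/data/disk1", "CID-x", "DS-y", 1000, 3))
```

Note that the V3 path does not depend on `container_id` at all, which is what lets every container on the volume share a single database.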
+
+### Chunks directory (`chunks/`)
+
+User data files, typically named `<<blockLocalID>>.chunk` (with extra suffixes when a block has multiple chunks). These are raw chunk files for client writes.
+
+## Container metadata (RocksDB) internals
+
+Whether RocksDB is per-container (V1/V2) or shared per volume (V3), it stores mappings and counters such as:
+
+- **Block data**: Maps `blockLocalID` to `BlockData` (protobuf: chunk list and offsets).
+- **Metadata counters** (examples):
+  - `#BLOCKCOUNT`: Block count in the container.
+  - `#BYTESUSED`: Bytes used by blocks.
+  - `#BCSID`: Block commit sequence ID.
+  - `#delTX`: Latest delete transaction ID.
+  - `#pendingDeleteBlockCount`: Blocks marked for deletion but not yet removed.
+  - `#pendingDeleteBlockBytes`: Bytes of blocks pending deletion.
+
+## Schema V3 optimization
+
+Schema V3 keeps one RocksDB instance per volume (disk) instead of one per container, which on a large Datanode would mean thousands of small instances.
+
+### Layout differences
+
+- **Location**: Metadata moves from
+  `hdds/<<clusterUuid>>/current/container<<N>>/<<containerID>>/metadata/<<containerID>>.db`
+  to
+  `hdds/<<clusterUuid>>/<<storageID>>/container.db`.
+
+- **Key prefixing**: Keys in the shared database are prefixed with `containerID` so containers stay isolated in one keyspace.
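
The key-prefixing idea can be sketched with a plain dictionary standing in for the shared database. The encoding here (an 8-byte big-endian container ID followed by the UTF-8 key) is an assumption chosen for illustration. The actual byte layout Ozone uses may differ, and `prefixed_key` and `keys_for_container` are hypothetical helper names.

```python
import struct


def prefixed_key(container_id: int, key: str) -> bytes:
    """Build a shared-database key: container ID prefix, then the original key."""
    # Assumed encoding: fixed-width 8-byte big-endian ID, so prefixes never overlap.
    return struct.pack(">q", container_id) + key.encode("utf-8")


def keys_for_container(db: dict[bytes, bytes], container_id: int) -> list[bytes]:
    """Return, in sorted order, only the keys that belong to one container."""
    prefix = struct.pack(">q", container_id)
    return sorted(k for k in db if k.startswith(prefix))


# A dict stands in for the shared RocksDB; the values are placeholders.
shared_db = {
    prefixed_key(1, "#BLOCKCOUNT"): b"2",
    prefixed_key(1, "#BYTESUSED"): b"4096",
    prefixed_key(2, "#BLOCKCOUNT"): b"7",
}

container_1_keys = keys_for_container(shared_db, 1)
print(len(container_1_keys), "keys belong to container 1")
```

Because the prefix has a fixed width, the same counter name (`#BLOCKCOUNT`) can exist once per container without collisions, and a prefix scan returns exactly one container's entries.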
+
+### Benefits
+
+- **Lower resource use**: Fewer file descriptors and less RocksDB cache overhead.
+- **Better operations**: More efficient compaction and faster startup (fewer databases to open and scan).
+
+## Deletion process
+
+Container deletion is staged so it stays consistent across crashes:
+
+1. The container is marked `DELETED` in its `.container` file.
+2. It is removed from the Datanode in-memory container set.
+3. For Schema V3, its metadata is removed from the shared RocksDB.
+4. The container directory is moved under `hdds/<<clusterUuid>>/tmp/deleted-containers/`.
+5. A background task deletes the directories staged under `tmp/deleted-containers/`.
+
+This avoids partial on-disk states and lets a restart finish cleanup of anything left in `tmp/deleted-containers/`.
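
Steps 4 and 5 above follow a common move-then-delete pattern, sketched here in Python against a temporary directory. This is not Ozone's implementation: the function names are hypothetical, steps 1 to 3 are not modeled, and the rename is only atomic when source and target are on the same filesystem, as they are within one volume.

```python
import shutil
import tempfile
from pathlib import Path


def stage_container_for_deletion(container_dir: Path, deleted_dir: Path) -> Path:
    """Move a container directory into the staging area with a single rename."""
    deleted_dir.mkdir(parents=True, exist_ok=True)
    target = deleted_dir / container_dir.name
    container_dir.rename(target)  # one rename: the directory is never half-moved
    return target


def purge_staged_containers(deleted_dir: Path) -> int:
    """Delete every staged directory; safe to run again after a crash."""
    removed = 0
    if deleted_dir.is_dir():
        for entry in deleted_dir.iterdir():
            shutil.rmtree(entry)
            removed += 1
    return removed


with tempfile.TemporaryDirectory() as root:
    # Placeholder names; a real volume uses the cluster UUID and container ID.
    cluster_dir = Path(root) / "hdds" / "cluster-uuid"
    container = cluster_dir / "current" / "container0" / "1"
    (container / "chunks").mkdir(parents=True)
    (container / "chunks" / "1.chunk").write_bytes(b"data")
    deleted = cluster_dir / "tmp" / "deleted-containers"

    staged = stage_container_for_deletion(container, deleted)
    moved_ok = staged.is_dir() and not container.exists()
    purged = purge_staged_containers(deleted)
    empty_after = not any(deleted.iterdir())
```

If the process stops between the rename and the purge, the container is already gone from `current/`, and running `purge_staged_containers` again on restart finishes the cleanup.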
