MVP. Storage policy support for Ozone #9807
Conversation
Thanks for the patch @greenwich. If this is something you are working on, it would be great to have a bit more info on the context, use-case and goals of this PR. Also if you have any reference JIRA for this with the relevant info, it'd be great.
    HOT(StorageType.SSD, StorageType.DISK),
    WARM(StorageType.DISK, null),
    COLD(StorageType.ARCHIVE, null);
what is StorageType.ARCHIVE in this context? if disk = HDD, what do we use for slower storage type?
Yeah, good that you pointed it out; it's not needed here. I guess ARCHIVE comes from the ancient HDFS code. In our team, we use the following storage types: DISK, SSD, NVME.
From my perspective, it should be:
- HOT -> NVME
- WARM -> SSD
- COLD -> DISK
Technically, NVMe drives are SSDs, but they are much faster, with different throughput and performance profiles, and we want a separate tier for each. So, within our team, we would need to define separate storage types for them.
I didn't want to change the policies at this point, but we should. What's your thought?
Also, as a user, I would appreciate the ability to define and configure my own storage policies and storage types, too. We missed it in HDFS, but it might be useful because we use multiple SSD types with different sizes, performance, etc. I would set them to different individual storage types with specific storage policies.
     */
    public enum OzoneStoragePolicy {

        HOT(StorageType.SSD, StorageType.DISK),
how would we call e2e NVMe solution?
Those things definitely need refinement - I responded to your comment above.
Please note it's a Draft MR.
Hi @greenwich, I'm not sure all the design/requirements for this feature have been completed to the point where we are ready to add code. Right now it looks like we should continue discussion in #6989 or open a new PR. I have pinged the contributors on that change for the best way forward.

Thanks, everyone, for having a look! I am very sorry, but this MR isn't intended to be public or in the Open state. My bad - I'm moving it to Draft. I explained my motivation and urgency here: #6989 (comment) cc @errose28
This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

Thank you for your contribution. This PR is being closed due to inactivity. Please contact a maintainer if you would like to reopen it.
What changes were proposed in this pull request?
This PR adds storage tiering (MVP-1) to Apache Ozone, enabling bucket-level storage policies that direct new writes to specific storage media (SSD or DISK). It implements the full write path end-to-end across OM, SCM, and DN.
Apache Ozone currently has no mechanism for directing data placement based on storage media type. Although DataNodes already report per-volume storage types (SSD, DISK, ARCHIVE) to SCM via heartbeats, this information is never used for placement decisions. All writes land on whichever pipeline SCM happens to pick, regardless of the underlying storage hardware. This means operators with mixed-media clusters cannot separate hot (latency-sensitive) data from cold (throughput-oriented) data across different storage tiers.
This PR introduces storage tiering — bucket-level storage policies that direct new writes to the correct storage media — implemented end-to-end across the write path in OM, SCM, and DN. A design document is included at `hadoop-hdds/docs/content/design/storage-policy.md`.

Policy Model

A new `OzoneStoragePolicy` enum maps semantic intent to physical `StorageType`. The default policy is WARM (DISK), matching current behavior. A `StoragePolicyProto` enum is added to `OmClientProtocol.proto` with `STORAGE_POLICY_UNSET = 0` so that old data and old clients are unaffected — unset fields resolve to the server default.
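A minimal sketch of what such a policy enum might look like. The constant-to-type mapping follows the snippet reviewed above; the accessor names and the nested `StorageType` stand-in are illustrative, not the exact code in this PR:

```java
// Sketch only, not the actual patch code. Models each policy as a
// primary storage type plus an optional fallback used when no
// matching pipeline exists.
public enum OzoneStoragePolicy {
  HOT(StorageType.SSD, StorageType.DISK),   // HOT may fall back to DISK
  WARM(StorageType.DISK, null),             // no fallback defined
  COLD(StorageType.ARCHIVE, null);          // no fallback defined

  /** Simplified stand-in for Hadoop's StorageType (assumption). */
  public enum StorageType { SSD, DISK, ARCHIVE }

  private final StorageType primary;
  private final StorageType fallback;  // null means "no fallback"

  OzoneStoragePolicy(StorageType primary, StorageType fallback) {
    this.primary = primary;
    this.fallback = fallback;
  }

  public StorageType getPrimary()  { return primary; }
  public StorageType getFallback() { return fallback; }
}
```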
How a Write Works with Storage Tiering
Client: `ozone sh bucket create --storage-policy HOT o3://om/vol/bucket`

On key write:

    OM resolves HOT → SSD; SCM looks for pipelines whose nodes
    have SSD volumes (using PipelineStorageTypeFilter)
    ├─ Found → allocate block on that pipeline
    └─ Not found → fall back to DISK, log warning
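The first step of this flow — resolving the effective policy — could be sketched as below. The class, method, and enum names here are simplified stand-ins for the ones described later in this PR, not the exact signatures:

```java
import java.util.Optional;

/** Illustrative sketch of write-time policy resolution: the bucket's
 *  policy wins; a bucket without one resolves to the server default. */
public class PolicyResolver {

  public enum Policy { HOT, WARM, COLD }

  /** bucketPolicy may be null when the bucket was created without one. */
  public static Policy resolveEffectivePolicy(Policy bucketPolicy,
                                              Policy serverDefault) {
    return Optional.ofNullable(bucketPolicy).orElse(serverDefault);
  }
}
```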
Changes by Layer
Protobuf — `StoragePolicyProto` enum added. `optional storagePolicy` fields added to `BucketInfo` (field 23) and `BucketArgs` (field 13). `optional storageType` added to `AllocateScmBlockRequestProto` and `CreateContainerRequestProto`. All fields are optional for backward compatibility.

OM — bucket metadata —
`OmBucketInfo` and `OmBucketArgs` carry a nullable `OzoneStoragePolicy` field. `OMBucketCreateRequest` persists the policy on bucket creation. `OMBucketSetPropertyRequest` handles policy updates. `OzoneManager.getDefaultStoragePolicy()` provides the server-side default (configurable via `ozone.default.storage.policy`).

OM — write-time resolution —
`OMKeyRequest.resolveEffectiveStoragePolicy()` resolves the effective policy at write time using the chain: bucket policy → server default. The resolved `StorageType` is passed to `allocateBlock()`. This method is called from `OMKeyCreateRequest`, `OMFileCreateRequest`, and `OMAllocateBlockRequest`.

SCM — pipeline filtering — A new
`PipelineStorageTypeFilter` utility filters pipelines using a set-based approach: it builds a `Set<UUID>` of all healthy nodes that have the requested `StorageType`, then filters pipelines by checking whether all member nodes are in that set. At scale (2000 pipelines, 200 nodes), this takes ~0.5 ms per allocation vs. ~3–5 ms for a naive per-pipeline approach. Both `WritableECContainerProvider` and `WritableRatisContainerProvider` apply this filter.

SCM — proactive pipeline creation — On a 32-node cluster (16 SSD-only, 16 DISK-only) with EC 3+2, the probability that a randomly formed 5-node pipeline is all-SSD is only ~2.2%. Without proactive creation, HOT writes would almost always fall back to DISK. When `ozone.scm.pipeline.creation.storage-type-aware.enabled=true`, `BackgroundPipelineCreator` iterates over `StorageType` values and creates per-type pipelines, using `SCMCommonPlacementPolicy` to select only nodes with the matching storage type. On heterogeneous clusters (every DN has both SSD and DISK), this config is unnecessary since all nodes qualify for both types.
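The set-based pipeline filtering described above can be sketched as follows. The node and pipeline types here are simplified stand-ins, not the actual Ozone classes, and the method shape is illustrative:

```java
import java.util.List;
import java.util.Set;
import java.util.UUID;
import java.util.stream.Collectors;

/** Illustrative sketch of set-based pipeline filtering: build the set of
 *  node IDs with the requested storage type once, then keep only pipelines
 *  whose members are all in that set. Roughly O(nodes + pipeline members)
 *  instead of per-pipeline node lookups. */
public class PipelineFilterSketch {

  public record Node(UUID id, Set<String> storageTypes) {}
  public record Pipeline(List<UUID> members) {}

  public static List<Pipeline> filter(List<Pipeline> pipelines,
                                      List<Node> healthyNodes,
                                      String requestedType) {
    // Pass 1: collect every healthy node that has the requested type.
    Set<UUID> qualified = healthyNodes.stream()
        .filter(n -> n.storageTypes().contains(requestedType))
        .map(Node::id)
        .collect(Collectors.toSet());
    // Pass 2: keep pipelines whose members are all qualified.
    return pipelines.stream()
        .filter(p -> qualified.containsAll(p.members()))
        .collect(Collectors.toList());
  }
}
```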
SCM — fallback — `BlockManagerImpl.allocateBlock()` wraps the container allocation in a try-catch. If no pipeline matches the primary `StorageType` and the policy defines a fallback (HOT: SSD → DISK), it retries with the fallback type and emits a WARN log for monitoring. If no fallback is defined (WARM, COLD) or the fallback also fails, the allocation fails as it does today.

DN — volume selection — `KeyValueContainer.create()` filters the candidate `HddsVolume` list by the requested `StorageType` before passing it to `VolumeChoosingPolicy`. The `VolumeChoosingPolicy` interface itself is unchanged — filtering happens upstream.
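The fallback path above can be sketched like this. The allocator function, exception type, and string-typed storage types are hypothetical simplifications standing in for the real allocation call:

```java
import java.util.function.Function;

/** Illustrative sketch of allocate-with-fallback: try the primary
 *  storage type; on failure, retry once with the fallback type if the
 *  policy defines one, otherwise rethrow (allocation fails as today). */
public class FallbackSketch {

  public static class NoPipelineException extends RuntimeException {
    public NoPipelineException(String type) { super("no pipeline for " + type); }
  }

  public static String allocate(String primary, String fallback,
                                Function<String, String> allocator) {
    try {
      return allocator.apply(primary);
    } catch (NoPipelineException e) {
      if (fallback == null) {
        throw e;  // WARM/COLD: no fallback defined
      }
      // Real code would emit a WARN log here for monitoring.
      return allocator.apply(fallback);
    }
  }
}
```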
CLI — `ozone sh bucket create --storage-policy HOT|WARM` and `ozone sh bucket update --storage-policy HOT|WARM` are added. `ozone sh bucket info` automatically displays the policy via JSON serialization (no code change needed).
Scope and Limitations
This PR is scoped to OBJECT_STORE buckets with EC replication. FSO and Ratis buckets are not affected — they continue using default placement. Future work (prefix-level policies, a Mover tool for migrating existing data, on-demand pipeline creation, S3 `x-amz-storage-class` integration) is described in the design document.

Configuration

| Configuration key | Default |
| --- | --- |
| `ozone.scm.pipeline.creation.storage-type-aware.enabled` | `false` |
| `ozone.default.storage.policy` | `WARM` |
All protobuf fields are `optional` with `UNSET = 0` defaults. Old clients ignore new fields. Existing data is unaffected — keys without a policy resolve to WARM (DISK), matching current behavior. No DB migration is required.
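As a sketch of the compatibility mechanism, the additions might look like the fragment below. The field names and numbers follow this description; the specific enum value numbers beyond `STORAGE_POLICY_UNSET = 0` are assumptions, and the actual `.proto` in the patch is authoritative:

```proto
// Sketch only — based on the fields described in this PR.
enum StoragePolicyProto {
  STORAGE_POLICY_UNSET = 0;  // old data/clients: resolves to server default
  HOT = 1;                   // value numbers here are illustrative
  WARM = 2;
  COLD = 3;
}

message BucketInfo {
  // ... existing fields 1-22 ...
  optional StoragePolicyProto storagePolicy = 23;
}

message BucketArgs {
  // ... existing fields 1-12 ...
  optional StoragePolicyProto storagePolicy = 13;
}
```

Because the fields are `optional` and the zero value means "unset", a serialized bucket written by an old client simply omits the field and the server applies its default.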
What is the link to the Apache JIRA
Please create an issue in ASF JIRA before opening a pull request, and you need to set the title of the pull
request which starts with the corresponding JIRA issue number. (e.g. HDDS-XXXX. Fix a typo in YYY.)
(Please replace this section with the link to the Apache JIRA)
How was this patch tested?
Unit tests, integration testing, and system testing using the company environment.