Skip to content

HDDS-14225. [WIP] Upgrade RocksDB from 7.7.3 to 10.10.1#9813

Draft
smengcl wants to merge 28 commits intoapache:masterfrom
smengcl:rocksdb-v10-upgrade
Draft

HDDS-14225. [WIP] Upgrade RocksDB from 7.7.3 to 10.10.1#9813
smengcl wants to merge 28 commits intoapache:masterfrom
smengcl:rocksdb-v10-upgrade

Conversation

@smengcl
Copy link
Copy Markdown
Contributor

@smengcl smengcl commented Feb 24, 2026

Generated-by: GPT-5.3-Codex

What changes were proposed in this pull request?

Bump RocksDB version from 7.7.3 to 10.4.2 10.10.1. While maintaining compatibility and not breaking anything.

  1. Since RocksDB 9, BlockBasedTableOptions.format_version=6 is the default. Files written in format_version 6 cannot be read by RocksDB < 8.6.0. This PR explicitly sets the default to version 5 to be compatible with 7.7.3 in the case where Ozone gets downgraded before it gets finalized after an upgrade. ref: https://github.com/facebook/rocksdb/releases/tag/v9.0.0
  2. DO NOT MERGE until v10.9.0 or higher RocksDB JNI is available. Potential forward compatibility bug < v10.9.0: https://github.com/facebook/rocksdb/releases/tag/v10.9.1

Fix a bug where compaction with range deletion can persist kTypeMaxValid in MANIFEST as file metadata. kTypeMaxValid is not supposed to be persisted and can change as new value types are introduced. This can cause a forward compatibility issue where older versions of RocksDB don't recognize kTypeMaxValid from newer versions. A new placeholder value type kTypeTruncatedRangeDeletionSentinel is also introduced to replace kTypeMaxValid when reading existing SST files' metadata from MANIFEST. This allows us to strengthen some checks to avoid using kTypeMaxValid in the future.

  1. Currently in this PR I am using a rocksdbjni 10.10.1 fatjar I built myself and pushed to maven central snapshot. We need to properly publish it (for example, under Apache Ozone account) before we can merge this.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14225

How was this patch tested?

  • Most RocksDB related tests are fixed.
  • Pending full CI.

@smengcl smengcl changed the title HDDS-14225. Upgrade RocksDB from 7.7.3 to 10.4.2 HDDS-14225. [DO NOT MERGE] Upgrade RocksDB from 7.7.3 to 10.4.2 Feb 24, 2026
Copy link
Copy Markdown
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @smengcl for working on this.

We should detect OS at runtime, not build time. Single build (on any OS) should create binaries for all supported operating systems.

  1. dependencyManagement in root POM should include an entry for rocksdbjni artifact with each supported classifier, as well as the non-classified artifact:

      <dependency>
        <groupId>org.rocksdb</groupId>
        <artifactId>rocksdbjni</artifactId>
        <version>${rocksdb.version}</version>
      </dependency>
      <dependency>
        <groupId>org.rocksdb</groupId>
        <artifactId>rocksdbjni</artifactId>
        <version>${rocksdb.version}</version>
        <classifier>linux64</classifier>
      </dependency>
      <dependency>
        <groupId>org.rocksdb</groupId>
        <artifactId>rocksdbjni</artifactId>
        <version>${rocksdb.version}</version>
        <classifier>osx</classifier>
      </dependency>
      ...
  2. hdds-rocks-native should unpack all of these

  3. other modules should continue depending on platform-independent rocksdbjni (no classifier)

Thus changes in most pom.xml files are not needed, nor is rocksdbjni.classifier.

This is a workaround for RocksDB 10.4.2 thin-jar packaging. Ensure classifier JNI artifacts are present at runtime/tests and keep dependency analysis stable.

Not needed if mvnrepo rocksdbjni provides a fat jar containing native libs for all supported platforms.
…or change

RocksDB 9.10.0 changed DB::KeyMayExist behavior semantics to follow its function comment. Ozone snapshot code paths were treating keyMayExist/getIfExist misses as definitive, which could misclassify existing keys in snapshot-related flows under RocksDB >= 9.10.0 and cause test failures.

Treat keyMayExist/getIfExist as hints and fall back to point reads before deciding not-found:
- RDBTable: verify with get()/get(ByteBuffer) on inconclusive keyMayExist
- TypedTable: in codec-buffer isExist path, fallback to full get before returning false
- KeyManagerImpl and ReclaimableRenameEntryFilter: use getSkipCache fallback for snapshot rename lookups

RocksDB changelog for reference:
https://github.com/facebook/rocksdb/releases/tag/v9.10.0

Behavior Changes
DB::KeyMayExist() now follows its function comment, which means value parameter can be null, and it will be set only if value_found is passed in.
RocksDB 9.10+ may consume ByteBuffer state during keyMayExist; treat it as a hint and always use duplicated key buffers for keyMayExist/get fallback paths in RDBTable/RocksDatabase.

Add a regression unit test for ByteBuffer fallback behavior.

Also make replicas-test.sh restore the whole container.db directory (not just one file) to avoid stale RocksDB WAL/MANIFEST artifacts when recovering from backup.
@smengcl smengcl force-pushed the rocksdb-v10-upgrade branch from 431f2b2 to c9b83ef Compare March 10, 2026 05:46
@smengcl smengcl force-pushed the rocksdb-v10-upgrade branch from c9b83ef to 7270ac0 Compare March 10, 2026 18:22
@smengcl smengcl changed the title HDDS-14225. [DO NOT MERGE] Upgrade RocksDB from 7.7.3 to 10.4.2 HDDS-14225. [DO NOT MERGE] Upgrade RocksDB from 7.7.3 to 10.10.1 Mar 10, 2026
@smengcl
Copy link
Copy Markdown
Contributor Author

smengcl commented Mar 12, 2026

Good news. rocksdb team just published 10.5.1 jars: https://repo1.maven.org/maven2/org/rocksdb/rocksdbjni/10.5.1/

But we need at least 10.9.0 to avoid the forward compatibility bug mentioned in the description. Need to wait a bit more

@smengcl smengcl force-pushed the rocksdb-v10-upgrade branch from fcef039 to ed1d278 Compare March 30, 2026 22:26
Comment thread pom.xml Outdated
Comment thread pom.xml Outdated
@@ -83,6 +83,11 @@ private final boolean isRenameEntryReclaimable(Table.KeyValue<String, String> re
if (previousTable != null) {
String prevDbKey = renameEntry.getValue();
WithObjectID withObjectID = previousTable.getIfExist(prevDbKey);
if (withObjectID == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a change of behavior in rocksdb?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But adapatations have already been made in RDBTable. This chunk can be removed I think. Let me double check.

@Override
public byte[] getIfExist(byte[] key) throws RocksDatabaseException {
rdbMetrics.incNumDBKeyGetIfExistChecks();
final Supplier<byte[]> value = db.keyMayExist(family, key);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from commit history:
24dd415

Fix snapshot existence checks after RocksDB 9.10.0 keyMayExist behavior change
RocksDB 9.10.0 changed DB::KeyMayExist behavior semantics to follow its function comment. Ozone snapshot code paths were treating keyMayExist/getIfExist misses as definitive, which could misclassify existing keys in snapshot-related flows under RocksDB >= 9.10.0 and cause test failures.

Treat keyMayExist/getIfExist as hints and fall back to point reads before deciding not-found:

  • RDBTable: verify with get()/get(ByteBuffer) on inconclusive keyMayExist
  • TypedTable: in codec-buffer isExist path, fallback to full get before returning false
  • KeyManagerImpl and ReclaimableRenameEntryFilter: use getSkipCache fallback for snapshot rename lookups

RocksDB changelog for reference:
https://github.com/facebook/rocksdb/releases/tag/v9.10.0

Behavior Changes
DB::KeyMayExist() now follows its function comment, which means value parameter can be null, and it will be set only if value_found is passed in.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smengcl can you put the above keyMayExist() behavior change in the PR description.
The goal is to ensure the Ozone API does not change semantics after RocksDB update.

// Try the lightweight "existence only" check first.
try (CodecBuffer outValue = CodecBuffer.getEmptyBuffer()) {
return getFromTableIfExist(inKey, outValue) != null;
if (getFromTableIfExist(inKey, outValue) != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this Ozone API be definitive?

// keyMayExist/getIfExist can still be false-negative in some RocksDB
// configurations (observed with snapshot/checkpoint DBs), so confirm with
// a regular point lookup before returning false.
return getFromTable(key) != null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really necessary? This means an additional lookup.

@@ -141,15 +141,15 @@ public byte[] getSkipCache(byte[] bytes) throws RocksDatabaseException {
@Override
public byte[] getIfExist(byte[] key) throws RocksDatabaseException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this API provide definitive result?
if not, the javadoc needs to update

public class ManagedBloomFilter extends BloomFilter {
private final UncheckedAutoCloseable leakTracker = track(this);

@Override
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks unnecessary to me

return (previousSnapshotKM) -> table.apply(previousSnapshotKM).get(
renamedKey != null ? renamedKey : currentKeyPath);
Table<String, String> renamedTable = metadataManager.getSnapshotRenamedTable();
String renamedKey = renamedTable.getIfExist(renameKey);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if getIfExist() is definitive, we wouldn't need to make this change?

@jojochuang
Copy link
Copy Markdown
Contributor

Hi @smengcl thanks a lot for the PR.
My main question is around getIfExist(). If it returns definitive results, a lot of change wouldn't be needed, and it can save a lot of db lookup potentially.

@smengcl smengcl changed the title HDDS-14225. [DO NOT MERGE] Upgrade RocksDB from 7.7.3 to 10.10.1 HDDS-14225. [WIP] Upgrade RocksDB from 7.7.3 to 10.10.1 Apr 13, 2026
Copy link
Copy Markdown
Contributor

@ptlrs ptlrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @smengcl. I have a few questions.

Comment on lines +110 to 112
// keyMayExist can be a false-negative in some RocksDB setups (for example,
// snapshot/checkpoint DBs). Verify via point-get to preserve correctness.
final boolean exists = get(key) != null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to understand more about this comment or how we can reach this stage.

The keyMayExist documentation says that it will never return a false-negative.

Are we trying to say here that a key may not exist in the snapshot/checkpoint DB but may exist in the active DB? In that case the get(key) is still running on the same db which would be the snapshot DB instead of the active DB and again not find the key.

If the key definitely does not exist in the database, then this method
returns false, otherwise it returns true if the key might exist.
That is to say that this method is probabilistic and may return false
positives, but never a false negative.

private final UncheckedAutoCloseable leakTracker = track(this);
private final AtomicReference<Logger> loggerRef = new AtomicReference<>();

@Override
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to remove this? The RocksDB JNI still has a setLogger method available.

Comment on lines +122 to +124
public void deleteFile(
ColumnFamilyHandle columnFamilyHandle,
LiveFileMetaData fileToBeDeleted) throws RocksDatabaseException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deleteFile was deprecated in facebook/rocksdb#13284 and removed in facebook/rocksdb#13322.

As a result it should no longer be part of ManagedRocksDB.java either.
Any callers of the method should likely be migrated to use the existing deleteFilesInRanges method.

Comment on lines +384 to 387
+ FilePrefetchBuffer prefetch_buffer(ReadaheadParams(),
+ !fopts.use_mmap_reads /* enable */,
+ false /* track_min_offset */);
+ if (s.ok()) {
Copy link
Copy Markdown
Contributor

@ptlrs ptlrs Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know where the raw_sst_file_reader header and implementation files are located?

This comment mentions that the implementation was originally copied from sst_file_dumper but that file doesn't seem to show any signature changes for prefetch_buffer in 10.10.1.

Comment thread pom.xml Outdated
@smengcl smengcl force-pushed the rocksdb-v10-upgrade branch from 99ec6f0 to 9b6f43f Compare April 20, 2026 06:01
@ptlrs
Copy link
Copy Markdown
Contributor

ptlrs commented Apr 20, 2026

Since RocksDB 9, BlockBasedTableOptions.format_version=6 is the default. Files written in format_version 6 cannot be read by RocksDB < 8.6.0. This PR explicitly sets the default to version 5 to be compatible with 7.7.3 in the case where Ozone gets downgraded before it gets finalized after an upgrade. ref: https://github.com/facebook/rocksdb/releases/tag/v9.0.0

Can we explore what it would take to support the newer file formats?
What are the challenges with the following and other potential solutions:

  • For a fresh installation, we could use the newer format by default.
  • For an upgrade scenario, we could consider re-ingesting the SST files.

@smengcl
Copy link
Copy Markdown
Contributor Author

smengcl commented Apr 20, 2026

Since RocksDB 9, BlockBasedTableOptions.format_version=6 is the default. Files written in format_version 6 cannot be read by RocksDB < 8.6.0. This PR explicitly sets the default to version 5 to be compatible with 7.7.3 in the case where Ozone gets downgraded before it gets finalized after an upgrade. ref: facebook/rocksdb@v9.0.0 (release)

Can we explore what it would take to support the newer file formats? What are the challenges with the following and other potential solutions:

  • For a fresh installation, we could use the newer format by default.
  • For an upgrade scenario, we could consider re-ingesting the SST files.

@ptlrs We surely can add a post finalization step to reingest the SST files, but there is no official tool iirc to do such in-place reingestion so basically it is up to us to write the logic to read from old SSTs and write to new ones.

Or, just allow new SST files to be written in format_version=6 post finalization. Old SSTs should get compacted into new format_version=6 SSTs eventually.

I'd prefer the latter. Anyway, there needs to be another task for upgrade handling. The scope of this one is to upgrade JNI but keep everything compatible.

Copy link
Copy Markdown
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error: Failed to execute goal com.googlecode.maven-download-plugin:download-maven-plugin:1.9.0:wget (rocksdb source download) on project hdds-rocks-native: Download failed with code 404: Not Found -> [Help 1]
Error:
Error: To see the full stack trace of the errors, re-run Maven with the -e switch.
Error: Re-run Maven using the -X switch to enable full debug logging.
Error:
Error: For more information about the errors and possible solutions, please read the following articles:
Error: [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
Error:
Error: After correcting the problems, you can resume the build with the command
Error: mvn -rf :hdds-rocks-native
Error: Process completed with exit code 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants