HDDS-14225. [WIP] Upgrade RocksDB from 7.7.3 to 10.10.1#9813
HDDS-14225. [WIP] Upgrade RocksDB from 7.7.3 to 10.10.1#9813smengcl wants to merge 28 commits intoapache:masterfrom
Conversation
Generated-by: GPT-5.3-Codex
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @smengcl for working on this.
We should detect OS at runtime, not build time. Single build (on any OS) should create binaries for all supported operating systems.
-
dependencyManagementin root POM should include an entry forrocksdbjniartifact with each supported classifier, as well as the non-classified artifact:<dependency> <groupId>org.rocksdb</groupId> <artifactId>rocksdbjni</artifactId> <version>${rocksdb.version}</version> </dependency> <dependency> <groupId>org.rocksdb</groupId> <artifactId>rocksdbjni</artifactId> <version>${rocksdb.version}</version> <classifier>linux64</classifier> </dependency> <dependency> <groupId>org.rocksdb</groupId> <artifactId>rocksdbjni</artifactId> <version>${rocksdb.version}</version> <classifier>osx</classifier> </dependency> ...
-
hdds-rocks-nativeshould unpack all of these -
other modules should continue depending on platform-independent
rocksdbjni(no classifier)
Thus changes in most pom.xml files are not needed, nor is rocksdbjni.classifier.
This is a workaround for RocksDB 10.4.2 thin-jar packaging. Ensure classifier JNI artifacts are present at runtime/tests and keep dependency analysis stable. Not needed if mvnrepo rocksdbjni provides a fat jar containing native libs for all supported platforms.
…or change RocksDB 9.10.0 changed DB::KeyMayExist behavior semantics to follow its function comment. Ozone snapshot code paths were treating keyMayExist/getIfExist misses as definitive, which could misclassify existing keys in snapshot-related flows under RocksDB >= 9.10.0 and cause test failures. Treat keyMayExist/getIfExist as hints and fall back to point reads before deciding not-found: - RDBTable: verify with get()/get(ByteBuffer) on inconclusive keyMayExist - TypedTable: in codec-buffer isExist path, fallback to full get before returning false - KeyManagerImpl and ReclaimableRenameEntryFilter: use getSkipCache fallback for snapshot rename lookups RocksDB changelog for reference: https://github.com/facebook/rocksdb/releases/tag/v9.10.0 Behavior Changes DB::KeyMayExist() now follows its function comment, which means value parameter can be null, and it will be set only if value_found is passed in.
RocksDB 9.10+ may consume ByteBuffer state during keyMayExist; treat it as a hint and always use duplicated key buffers for keyMayExist/get fallback paths in RDBTable/RocksDatabase. Add a regression unit test for ByteBuffer fallback behavior. Also make replicas-test.sh restore the whole container.db directory (not just one file) to avoid stale RocksDB WAL/MANIFEST artifacts when recovering from backup.
431f2b2 to
c9b83ef
Compare
c9b83ef to
7270ac0
Compare
Replace restored schema-v3 container.db directory
|
Good news. rocksdb team just published 10.5.1 jars: https://repo1.maven.org/maven2/org/rocksdb/rocksdbjni/10.5.1/ But we need at least 10.9.0 to avoid the forward compatibility bug mentioned in the description. Need to wait a bit more |
fcef039 to
ed1d278
Compare
| @@ -83,6 +83,11 @@ private final boolean isRenameEntryReclaimable(Table.KeyValue<String, String> re | |||
| if (previousTable != null) { | |||
| String prevDbKey = renameEntry.getValue(); | |||
| WithObjectID withObjectID = previousTable.getIfExist(prevDbKey); | |||
| if (withObjectID == null) { | |||
There was a problem hiding this comment.
is this a change of behavior in rocksdb?
There was a problem hiding this comment.
Yes. But adapatations have already been made in RDBTable. This chunk can be removed I think. Let me double check.
| @Override | ||
| public byte[] getIfExist(byte[] key) throws RocksDatabaseException { | ||
| rdbMetrics.incNumDBKeyGetIfExistChecks(); | ||
| final Supplier<byte[]> value = db.keyMayExist(family, key); |
There was a problem hiding this comment.
from commit history:
24dd415
Fix snapshot existence checks after RocksDB 9.10.0 keyMayExist behavior change
RocksDB 9.10.0 changed DB::KeyMayExist behavior semantics to follow its function comment. Ozone snapshot code paths were treating keyMayExist/getIfExist misses as definitive, which could misclassify existing keys in snapshot-related flows under RocksDB >= 9.10.0 and cause test failures.
Treat keyMayExist/getIfExist as hints and fall back to point reads before deciding not-found:
- RDBTable: verify with get()/get(ByteBuffer) on inconclusive keyMayExist
- TypedTable: in codec-buffer isExist path, fallback to full get before returning false
- KeyManagerImpl and ReclaimableRenameEntryFilter: use getSkipCache fallback for snapshot rename lookups
RocksDB changelog for reference:
https://github.com/facebook/rocksdb/releases/tag/v9.10.0
Behavior Changes
DB::KeyMayExist() now follows its function comment, which means value parameter can be null, and it will be set only if value_found is passed in.
There was a problem hiding this comment.
@smengcl can you put the above keyMayExist() behavior change in the PR description.
The goal is to ensure the Ozone API does not change semantics after RocksDB update.
| // Try the lightweight "existence only" check first. | ||
| try (CodecBuffer outValue = CodecBuffer.getEmptyBuffer()) { | ||
| return getFromTableIfExist(inKey, outValue) != null; | ||
| if (getFromTableIfExist(inKey, outValue) != null) { |
There was a problem hiding this comment.
shouldn't this Ozone API be definitive?
| // keyMayExist/getIfExist can still be false-negative in some RocksDB | ||
| // configurations (observed with snapshot/checkpoint DBs), so confirm with | ||
| // a regular point lookup before returning false. | ||
| return getFromTable(key) != null; |
There was a problem hiding this comment.
Is this really necessary? This means an additional lookup.
| @@ -141,15 +141,15 @@ public byte[] getSkipCache(byte[] bytes) throws RocksDatabaseException { | |||
| @Override | |||
| public byte[] getIfExist(byte[] key) throws RocksDatabaseException { | |||
There was a problem hiding this comment.
does this API provide definitive result?
if not, the javadoc needs to update
| public class ManagedBloomFilter extends BloomFilter { | ||
| private final UncheckedAutoCloseable leakTracker = track(this); | ||
|
|
||
| @Override |
There was a problem hiding this comment.
Looks unnecessary to me
| return (previousSnapshotKM) -> table.apply(previousSnapshotKM).get( | ||
| renamedKey != null ? renamedKey : currentKeyPath); | ||
| Table<String, String> renamedTable = metadataManager.getSnapshotRenamedTable(); | ||
| String renamedKey = renamedTable.getIfExist(renameKey); |
There was a problem hiding this comment.
if getIfExist() is definitive, we wouldn't need to make this change?
|
Hi @smengcl thanks a lot for the PR. |
| // keyMayExist can be a false-negative in some RocksDB setups (for example, | ||
| // snapshot/checkpoint DBs). Verify via point-get to preserve correctness. | ||
| final boolean exists = get(key) != null; |
There was a problem hiding this comment.
I am trying to understand more about this comment or how we can reach this stage.
The keyMayExist documentation says that it will never return a false-negative.
Are we trying to say here that a key may not exist in the snapshot/checkpoint DB but may exist in the active DB? In that case the get(key) is still running on the same db which would be the snapshot DB instead of the active DB and again not find the key.
If the key definitely does not exist in the database, then this method
returns false, otherwise it returns true if the key might exist.
That is to say that this method is probabilistic and may return false
positives, but never a false negative.
| private final UncheckedAutoCloseable leakTracker = track(this); | ||
| private final AtomicReference<Logger> loggerRef = new AtomicReference<>(); | ||
|
|
||
| @Override |
There was a problem hiding this comment.
Do we need to remove this? The RocksDB JNI still has a setLogger method available.
| public void deleteFile( | ||
| ColumnFamilyHandle columnFamilyHandle, | ||
| LiveFileMetaData fileToBeDeleted) throws RocksDatabaseException { |
There was a problem hiding this comment.
The deleteFile was deprecated in facebook/rocksdb#13284 and removed in facebook/rocksdb#13322.
As a result it should no longer be part of ManagedRocksDB.java either.
Any callers of the method should likely be migrated to use the existing deleteFilesInRanges method.
| + FilePrefetchBuffer prefetch_buffer(ReadaheadParams(), | ||
| + !fopts.use_mmap_reads /* enable */, | ||
| + false /* track_min_offset */); | ||
| + if (s.ok()) { |
There was a problem hiding this comment.
Do we know where the raw_sst_file_reader header and implementation files are located?
This comment mentions that the implementation was originally copied from sst_file_dumper but that file doesn't seem to show any signature changes for prefetch_buffer in 10.10.1.
99ec6f0 to
9b6f43f
Compare
Can we explore what it would take to support the newer file formats?
|
@ptlrs We surely can add a post finalization step to reingest the SST files, but there is no official tool iirc to do such in-place reingestion so basically it is up to us to write the logic to read from old SSTs and write to new ones. Or, just allow new SST files to be written in format_version=6 post finalization. Old SSTs should get compacted into new format_version=6 SSTs eventually. I'd prefer the latter. Anyway, there needs to be another task for upgrade handling. The scope of this one is to upgrade JNI but keep everything compatible. |
jojochuang
left a comment
There was a problem hiding this comment.
Error: Failed to execute goal com.googlecode.maven-download-plugin:download-maven-plugin:1.9.0:wget (rocksdb source download) on project hdds-rocks-native: Download failed with code 404: Not Found -> [Help 1]
Error:
Error: To see the full stack trace of the errors, re-run Maven with the -e switch.
Error: Re-run Maven using the -X switch to enable full debug logging.
Error:
Error: For more information about the errors and possible solutions, please read the following articles:
Error: [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
Error:
Error: After correcting the problems, you can resume the build with the command
Error: mvn -rf :hdds-rocks-native
Error: Process completed with exit code 1.
Generated-by: GPT-5.3-Codex
What changes were proposed in this pull request?
Bump RocksDB version from 7.7.3 to
10.4.210.10.1. While maintaining compatibility and not breaking anything.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14225
How was this patch tested?