
[opt](memory) Replace TabletMeta object map with SoA CompactTabletMetaStore #62086

Open
dataroaring wants to merge 3 commits into apache:master from dataroaring:compact-tablet-meta


dataroaring commented Apr 3, 2026

What problem does this PR solve?

Memory optimization for FE metadata storage at scale.

In Doris FE, TabletInvertedIndex stores one TabletMeta Java object per tablet in a Long2ObjectOpenHashMap<TabletMeta>. At 10M tablets, the 16-byte Java object header alone adds up to ~160 MB (16 bytes × 10M) of pure overhead. Combined with the hash map's per-entry reference overhead, about 270 MB per 10M tablets is wasted.

How is the problem solved?

Replace per-object TabletMeta storage with a Structure-of-Arrays (SoA) layout via a new CompactTabletMetaStore class:

  • Before: Long2ObjectOpenHashMap<TabletMeta> — one TabletMeta object per tablet (~96 bytes/tablet including object header, field padding, and map entry overhead)
  • After: CompactTabletMetaStore — 7 parallel primitive arrays indexed by slot (long[] dbIds/tableIds/partitionIds/indexIds, int[] oldSchemaHashes/newSchemaHashes, byte[] storageMediumOrdinals) with a Long2IntOpenHashMap for tabletId→slot mapping (~69 bytes/tablet, ~28% reduction)
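The arithmetic behind these figures can be checked with a short sketch. The class and method names are hypothetical, the per-tablet byte counts are the approximate values quoted above rather than measurements, and 1 MB is assumed to mean 10^6 bytes.

```java
/** Hypothetical helper that reproduces the PR's back-of-envelope numbers. */
final class TabletMemoryEstimate {
    /** Total bytes spent on object headers alone. */
    static long headerOverheadBytes(long tablets, int headerBytes) {
        return tablets * headerBytes;
    }

    /** Bytes saved when the per-tablet cost drops from beforeBytes to afterBytes. */
    static long savedBytes(long tablets, int beforeBytes, int afterBytes) {
        return tablets * (beforeBytes - afterBytes);
    }

    /** Relative reduction in percent, rounded to the nearest integer. */
    static long reductionPercent(int beforeBytes, int afterBytes) {
        return Math.round(100.0 * (beforeBytes - afterBytes) / beforeBytes);
    }
}
```

For 10M tablets this gives 160,000,000 bytes of headers, 270,000,000 bytes saved when going from ~96 to ~69 bytes per tablet, and a reduction of about 28%.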

Key design choices:

  1. Free list embedded in dbIds[] for O(1) slot reuse after deletion
  2. On-demand TabletMeta construction — external callers are unchanged; TabletMeta objects are only created when needed via getTabletMeta()
  3. Direct field accessors (getStorageMedium(), setStorageMedium()) avoid object allocation for hot paths like storage medium checks and mutations
  4. Thread safety delegated to existing StampedLock in TabletInvertedIndex
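A minimal sketch of the layout described above, under stated assumptions: the class `SoaStoreSketch` and its method names are hypothetical, only three of the parallel arrays are shown, a boxed `HashMap` stands in for fastutil's `Long2IntOpenHashMap` so the example has no dependencies, and locking is left to the caller as in design choice 4. It is not the PR's actual `CompactTabletMetaStore`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified SoA store with a free list embedded in dbIds[]. */
final class SoaStoreSketch {
    private static final int NO_SLOT = -1;

    // tabletId -> slot. A boxed HashMap is used only to keep the sketch self-contained.
    private final Map<Long, Integer> slotOf = new HashMap<>();

    // Parallel arrays: index i holds the fields of the tablet stored in slot i.
    private long[] dbIds;
    private long[] tableIds;
    private byte[] mediumOrdinals;

    private int nextUnused = 0;     // first slot that has never been handed out
    private int freeHead = NO_SLOT; // head of the free list threaded through dbIds[]

    SoaStoreSketch(int initialCapacity) {
        int capacity = Math.max(1, initialCapacity);
        dbIds = new long[capacity];
        tableIds = new long[capacity];
        mediumOrdinals = new byte[capacity];
    }

    /** Stores the fields for tabletId; a duplicate add is ignored. */
    void add(long tabletId, long dbId, long tableId, byte mediumOrdinal) {
        if (slotOf.containsKey(tabletId)) {
            return;
        }
        int slot;
        if (freeHead != NO_SLOT) {
            slot = freeHead;
            freeHead = (int) dbIds[slot]; // a freed slot stores the next free slot
        } else {
            if (nextUnused == dbIds.length) {
                grow();
            }
            slot = nextUnused++;
        }
        dbIds[slot] = dbId;
        tableIds[slot] = tableId;
        mediumOrdinals[slot] = mediumOrdinal;
        slotOf.put(tabletId, slot);
    }

    /** Frees the slot in O(1) by pushing it onto the embedded free list. */
    boolean remove(long tabletId) {
        Integer slot = slotOf.remove(tabletId);
        if (slot == null) {
            return false;
        }
        dbIds[slot] = freeHead;
        freeHead = slot;
        return true;
    }

    /** Returns the slot for tabletId, or -1 if absent. */
    int slotFor(long tabletId) {
        Integer slot = slotOf.get(tabletId);
        return slot == null ? NO_SLOT : slot;
    }

    /** Returns the dbId for tabletId, or -1 if absent (sentinel chosen for this sketch). */
    long getDbId(long tabletId) {
        Integer slot = slotOf.get(tabletId);
        return slot == null ? -1L : dbIds[slot];
    }

    /** Doubles every parallel array so they stay the same length. */
    private void grow() {
        int newCapacity = dbIds.length * 2;
        dbIds = Arrays.copyOf(dbIds, newCapacity);
        tableIds = Arrays.copyOf(tableIds, newCapacity);
        mediumOrdinals = Arrays.copyOf(mediumOrdinals, newCapacity);
    }
}
```

Reusing `dbIds[]` for the free-list links means a deleted slot costs no extra memory: the array cell that would otherwise be dead holds the index of the next free slot.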

What are the changes?

| File | Change |
|---|---|
| CompactTabletMetaStore.java | NEW: SoA store with parallel arrays, free list, and backward-compatible TabletMeta construction |
| TabletInvertedIndex.java | Replace Long2ObjectOpenHashMap<TabletMeta> tabletMetaMap field with CompactTabletMetaStore tabletMetaStore; update 5 methods |
| LocalTabletInvertedIndex.java | Replace 10 tabletMetaMap references with tabletMetaStore calls |
| CloudTabletInvertedIndex.java | Replace 4 tabletMetaMap references with tabletMetaStore calls |
| CompactTabletMetaStoreTest.java | NEW: 12 unit tests covering add/get/remove, free list reuse, array growth, all storage media |

All 25+ external callers of TabletInvertedIndex are unchanged — full backward compatibility.

Test plan

  • CompactTabletMetaStoreTest — covers add/get/remove, duplicate adds, storageMedium mutation, free list reuse, array growth, toMap(), clear(), all storage medium values
  • Existing TabletInvertedIndexTest and ClusterLoadStatisticsTest pass
  • FE compile and unit tests

🤖 Generated with Claude Code

…aStore

Replace Long2ObjectOpenHashMap<TabletMeta> with CompactTabletMetaStore that
stores fields in parallel primitive arrays (Structure-of-Arrays layout).
This eliminates the 16-byte Java object header per tablet entry, reducing
per-tablet overhead from ~96 bytes to ~69 bytes (~28% savings, ~270 MB at
10M tablets).

CompactTabletMetaStore uses:
- Long2IntOpenHashMap for tabletId -> slot mapping
- 7 parallel arrays (dbIds, tableIds, partitionIds, indexIds,
  oldSchemaHashes, newSchemaHashes, storageMediumOrdinals)
- Free list embedded in dbIds[] for slot reuse after deletion
- On-demand TabletMeta construction for backward compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 3, 2026 04:05
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a more memory-efficient representation of per-tablet metadata in FE by replacing the tabletId -> TabletMeta object map with a Structure-of-Arrays (SoA) CompactTabletMetaStore, and migrates TabletInvertedIndex implementations to use it.

Changes:

  • Added CompactTabletMetaStore (SoA + Long2IntOpenHashMap + free-list reuse) and a dedicated unit test suite.
  • Migrated TabletInvertedIndex to store tablet metadata in tabletMetaStore and keep API compatibility by constructing TabletMeta on demand.
  • Updated LocalTabletInvertedIndex and CloudTabletInvertedIndex to use tabletMetaStore instead of tabletMetaMap.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/catalog/CompactTabletMetaStore.java | New SoA-backed metadata store with slot reuse and on-demand TabletMeta construction. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/TabletInvertedIndex.java | Switches core inverted index metadata storage to CompactTabletMetaStore; UT map view via toMap(). |
| fe/fe-core/src/main/java/org/apache/doris/catalog/LocalTabletInvertedIndex.java | Replaces metadata map access/mutation with tabletMetaStore equivalents. |
| fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudTabletInvertedIndex.java | Replaces metadata existence checks and deletion with tabletMetaStore. |
| fe/fe-core/src/test/java/org/apache/doris/catalog/CompactTabletMetaStoreTest.java | Adds unit tests for add/get/remove, reuse, growth, clear, and medium updates. |


Comment on lines 100 to 106
  public TabletMeta getTabletMeta(long tabletId) {
      long stamp = readLock();
      try {
-         return tabletMetaMap.get(tabletId);
+         return tabletMetaStore.getTabletMeta(tabletId);
      } finally {
          readUnlock(stamp);
      }
Copilot AI Apr 3, 2026

getTabletMeta() now constructs a new TabletMeta object on every call via tabletMetaStore.getTabletMeta(). TabletInvertedIndex.getTabletMeta() is used broadly across FE (rebalancers, report handler, proc nodes, etc.), so this can introduce significant allocation/GC overhead and may offset some of the memory win.

Consider adding allocation-free accessors on TabletInvertedIndex/CompactTabletMetaStore (e.g., getDbId/getTableId/getPartitionId/getIndexId/getOldSchemaHash/getStorageMedium) and migrating hot call sites to those, or providing a cached/flyweight TabletMeta for read-only access.
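The difference between on-demand construction and an allocation-free accessor can be illustrated with a small sketch. All names here (`AccessorSketch`, `Meta`, `getMediumOrdinal`, `countWithMedium`) are hypothetical, `Meta` is a two-field stand-in for TabletMeta, and slots are addressed directly so the tabletId-to-slot map can be left out.

```java
/** Hypothetical sketch contrasting on-demand construction with direct field access. */
final class AccessorSketch {
    /** Stand-in for TabletMeta, reduced to two fields. */
    static final class Meta {
        final long dbId;
        final byte mediumOrdinal;

        Meta(long dbId, byte mediumOrdinal) {
            this.dbId = dbId;
            this.mediumOrdinal = mediumOrdinal;
        }
    }

    private final long[] dbIds;
    private final byte[] mediumOrdinals;

    AccessorSketch(long[] dbIds, byte[] mediumOrdinals) {
        this.dbIds = dbIds;
        this.mediumOrdinals = mediumOrdinals;
    }

    /** Allocates a fresh Meta on every call. */
    Meta getMeta(int slot) {
        return new Meta(dbIds[slot], mediumOrdinals[slot]);
    }

    /** Reads one field straight from its array; nothing is allocated. */
    byte getMediumOrdinal(int slot) {
        return mediumOrdinals[slot];
    }

    /** Counts slots with the given medium using only the accessor. */
    int countWithMedium(byte mediumOrdinal) {
        int count = 0;
        for (int slot = 0; slot < mediumOrdinals.length; slot++) {
            if (getMediumOrdinal(slot) == mediumOrdinal) {
                count++;
            }
        }
        return count;
    }
}
```

A loop over millions of slots that calls `getMeta` creates one short-lived object per iteration, whereas `countWithMedium` touches only the byte array.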

Copilot uses AI. Check for mistakes.
Comment on lines 129 to 131
if (tabletMetaStore.containsKey(tabletId)) {
return;
}
Copilot AI Apr 3, 2026

addTablet() checks tabletMetaStore.containsKey(tabletId) and then calls tabletMetaStore.add(tabletId, tabletMeta), but CompactTabletMetaStore.add() also checks for duplicates. This does two hash lookups for every add.

Consider removing the outer containsKey() check and rely on tabletMetaStore.add() to no-op on duplicates, or change CompactTabletMetaStore.add() to return a boolean and avoid duplicate lookups.

Suggested change
- if (tabletMetaStore.containsKey(tabletId)) {
-     return;
- }
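One way to fold the duplicate check into the insert, as this comment proposes, is sketched below. The class `SingleLookupAddSketch` is hypothetical, it keeps a single field array for brevity, and a boxed `HashMap` again stands in for the fastutil map; the PR's actual `add()` signature is not shown in this conversation beyond the fact that it was later changed to return a boolean.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: add() reports duplicates so callers need no containsKey(). */
final class SingleLookupAddSketch {
    private final Map<Long, Integer> slotOf = new HashMap<>();
    private long[] dbIds = new long[4];
    private int size = 0;

    /** Returns true if inserted, false if tabletId was already present. */
    boolean add(long tabletId, long dbId) {
        // putIfAbsent does the duplicate check and the insert in one map operation.
        if (slotOf.putIfAbsent(tabletId, size) != null) {
            return false;
        }
        if (size == dbIds.length) {
            dbIds = Arrays.copyOf(dbIds, size * 2);
        }
        dbIds[size++] = dbId;
        return true;
    }

    /** Returns the stored dbId, or -1 if absent (sentinel chosen for this sketch). */
    long getDbId(long tabletId) {
        Integer slot = slotOf.get(tabletId);
        return slot == null ? -1L : dbIds[slot];
    }

    int size() {
        return size;
    }
}
```

A caller can then write `if (!store.add(tabletId, dbId)) { return; }` and skip the separate existence check.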

Comment on lines +51 to +55
private long[] indexIds;
private int[] oldSchemaHashes;
private int[] newSchemaHashes;
private byte[] storageMediumOrdinals;

Copilot AI Apr 3, 2026

newSchemaHashes is allocated and grown alongside the other arrays, but it is never read or exposed, and TabletMeta currently has no getter for newSchemaHash. This adds per-tablet memory overhead and undermines the compaction goal.

Either remove newSchemaHashes entirely for now, or fully wire it through (capture the value on add and expose a getter / include it when constructing TabletMeta).

  if (replicaMetaWithBackend != null) {
      for (long tabletId : replicaMetaWithBackend.keySet()) {
-         if (tabletMetaMap.get(tabletId).getStorageMedium() == TStorageMedium.HDD) {
+         if (tabletMetaStore.getTabletMeta(tabletId).getStorageMedium() == TStorageMedium.HDD) {
Copilot AI Apr 3, 2026

In getReplicaNumByBeIdAndStorageMedium(), tabletMetaStore.getTabletMeta(tabletId) is called inside the loop solely to read storageMedium, which allocates a new TabletMeta per tablet. This is a hot path (iterating all tablets on a backend) and can add significant GC pressure.

Use tabletMetaStore.getStorageMedium(tabletId) instead (as is already done in getTabletSizeByBackendIdAndStorageMedium) to avoid per-iteration object allocation.

Suggested change
- if (tabletMetaStore.getTabletMeta(tabletId).getStorageMedium() == TStorageMedium.HDD) {
+ if (tabletMetaStore.getStorageMedium(tabletId) == TStorageMedium.HDD) {

…locations

- Remove unused newSchemaHashes array (-4 bytes/tablet)
- Change add() to return boolean, eliminating double hash lookup in addTablet()
- Use getStorageMedium() directly in getReplicaNumByBeIdAndStorageMedium()
  to avoid unnecessary TabletMeta object allocation in hot loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dataroaring
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 96.40% (107/111) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 63.96% (71/111) 🎉
Increment coverage report
Complete coverage report

… hot loop

Replace tabletMetaStore.getTabletMeta(tabletId) with individual field
accessors (getDbId, getTableId, etc.) in buildPartitionInfoBySkew() to
eliminate per-tablet object allocation when iterating over millions of
tablets. This reduces GC pressure during partition balance computation.

Also fixes a latent bug where Preconditions.checkNotNull(tabletMeta)
ran after tabletMeta was already dereferenced, so a null value would
throw an NPE before the check could ever fire.
Replaced with a NOT_EXIST_VALUE check on the first accessor call.

Generated by ThinkOps
@dataroaring
Contributor Author

run buildall

@dataroaring
Contributor Author

Addressed the review feedback in commit b50ad34:

Comment #1 (getTabletMeta() GC concern): Fixed the main hot-path usage. buildPartitionInfoBySkew() now uses individual field accessors (getDbId, getTableId, getPartitionId, getIndexId, getStorageMedium) instead of constructing a TabletMeta object per tablet. This loop iterates over all tablets (potentially millions), so avoiding per-iteration object allocation significantly reduces GC pressure. The remaining getTabletMeta() call at line 229 (tablet report handling) passes the full object to downstream methods and runs once per BE, so it is appropriate to keep.

Comment #2 (redundant containsKey in addTablet): Already addressed in a prior revision — addTablet() now relies on tabletMetaStore.add() returning false for duplicates.

Comment #3 (unused newSchemaHashes): Already addressed in a prior revision — the newSchemaHashes array was removed.

Comment #4 (getReplicaNumByBeIdAndStorageMedium): Already addressed in a prior revision — the method already uses tabletMetaStore.getStorageMedium(tabletId) directly.

Latent bug fix: Also fixed a latent bug where Preconditions.checkNotNull(tabletMeta) ran after tabletMeta was already dereferenced (would NPE before reaching the check). Replaced with a NOT_EXIST_VALUE check on the first accessor call.

— ThinkOps 🤖
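The ordering problem described in the latent bug fix above can be shown in isolation. This is a sketch under assumptions: `CheckOrderSketch` and its methods are hypothetical, `Objects.requireNonNull` stands in for Guava's `Preconditions.checkNotNull`, and the value `-1` for `NOT_EXIST_VALUE` is a guess, since the PR's constant is not shown in this conversation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch of a null check placed too late, and a sentinel check placed first. */
final class CheckOrderSketch {
    static final long NOT_EXIST_VALUE = -1L; // assumed value, for illustration only

    private final Map<Long, Long> dbIdOf = new HashMap<>();

    void put(long tabletId, long dbId) {
        dbIdOf.put(tabletId, dbId);
    }

    /** Buggy pattern: the dereference throws before the check is reached. */
    static String describeUnchecked(Object meta) {
        String text = meta.toString();  // NullPointerException here when meta is null
        Objects.requireNonNull(meta);   // too late to give a useful failure
        return text;
    }

    /** Accessor that reports a missing tablet with a sentinel instead of null. */
    long getDbId(long tabletId) {
        return dbIdOf.getOrDefault(tabletId, NOT_EXIST_VALUE);
    }

    /** Fixed pattern: validate the first accessor result before reading anything else. */
    long requireDbId(long tabletId) {
        long dbId = getDbId(tabletId);
        if (dbId == NOT_EXIST_VALUE) {
            throw new IllegalStateException("unknown tablet " + tabletId);
        }
        return dbId;
    }
}
```

With primitive accessors there is no object to be null, so the existence test has to happen on the first value read, before any further field is used.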

@dataroaring
Contributor Author

run buildall

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 58.20% (71/122) 🎉
Increment coverage report
Complete coverage report


@dataroaring
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 92.62% (113/122) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 58.20% (71/122) 🎉
Increment coverage report
Complete coverage report



3 participants