Skip to content

Performance 11900115518: use atom key packed in version map cache#3116

Open
alexowens90 wants to merge 1 commit into
masterfrom
performance/11900115518/use-AtomKeyPacked-in-version-map-cache
Open

Performance 11900115518: use atom key packed in version map cache#3116
alexowens90 wants to merge 1 commit into
masterfrom
performance/11900115518/use-AtomKeyPacked-in-version-map-cache

Conversation

@alexowens90
Copy link
Copy Markdown
Collaborator

Reference Issues/PRs

11900115518

What does this implement or fix?

Uses AtomKeyPacked instead of AtomKey in the two collections inside VersionMapEntry in order to reduce the memory footprint of the version map cache. Also uses ankerl::unordered_dense::map over std::unordered_map for the same reason. Full AtomKeys are only materialised when necessary.

@alexowens90 alexowens90 self-assigned this May 15, 2026
@alexowens90 alexowens90 added patch Small change, should increase patch version performance labels May 15, 2026
@alexowens90 alexowens90 marked this pull request as ready for review May 18, 2026 14:47
entry->validate();

version_store::TombstoneVersionResult res{true, entry->head_->id()};
version_store::TombstoneVersionResult res{true, entry->stream_id_};
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a pre-existing bug that dereferenced std::nullopt when trying to delete a symbol that had never existed (i.e. had no vref key), it started segfaulting with the other changes.

Comment thread cpp/arcticdb/entity/atom_key.hpp Outdated
Comment thread cpp/arcticdb/version/version_map.hpp
Comment thread cpp/arcticdb/version/version_map_entry.hpp Outdated
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 18, 2026

ArcticDB Code Review Summary

Correctness

  • AtomKeyPacked::operator> (atom_key.hpp:276) and AtomKeyImpl::operator> (atom_key.hpp:97) now defined as r < l, consistent with operator<. Two keys differing only in content_hash/key_type are now correctly equivalent (neither greater nor less than the other) for sorting purposes. Fixed in earlier commit.
  • Log message in check_ref_key (version_map.hpp:1000-1005) now has a matching format placeholder and includes ref_entry.keys_[0] as the fourth argument. Fixed in earlier commit.

Performance

  • VersionMapEntry(const StreamId&) (version_map_entry.hpp:230) now takes by const reference, avoiding the extra string copy on every construction. Fixed in earlier commit.

Testing

  • Earlier commit (37e0c12) adds object_version_store.version_store.clear() between iterations in test_special_chars (test_basic_version_store.py:106), isolating each special-character case so a write failure for one char does not pollute the next iteration.
  • Commit 33dc3c2 simplifies AddAndCompact (test_symbol_list.cpp:976-981) to pass std::nullopt as previous_key rather than threading the prior loop iteration key. Each iteration writes a distinct symbol so the previous loop key referred to a different symbol - passing it as the previous version was incorrect. The change is a correct simplification.

Latest delta (7baa1f3..1873c55, after rebase onto current master)

  • Only two formatting cleanups in version_map_entry.hpp: a misplaced } on the same line as another } is split onto its own line (line 348), and a util::check(...) call is reformatted with one argument per line (line 445). No behavioural change.
  • The other commits in the range are merges/cherry-picks from master (CMake fix, POW function, binary search optimisation, compact_data work, etc.) and are unrelated to this PR diff against master.
  • No new issues introduced.

PR Title and Description

  • Title is awkward (Performance 11900115518: Reduce version map cache memory footprint) - leading with the Monday ticket number is uninformative. Something like Reduce version map cache memory footprint by using AtomKeyPacked reads better.

@alexowens90 alexowens90 force-pushed the performance/11900115518/use-AtomKeyPacked-in-version-map-cache branch from bc83aa8 to e07974f Compare May 18, 2026 16:14
@poodlewars
Copy link
Copy Markdown
Collaborator

Excellent point Claude

Title is awkward (Performance 11900115518: use atom key packed in version map cache) - leading with the Monday ticket number is uninformative. Something like Use AtomKeyPacked in version map cache reads better.

@alexowens90 alexowens90 force-pushed the performance/11900115518/use-AtomKeyPacked-in-version-map-cache branch from e07974f to 56cc7a5 Compare May 19, 2026 08:21
}

friend bool operator>(const AtomKeyImpl& l, const AtomKeyImpl& r) { return !(l < r) && (l != r); }
friend bool operator>(const AtomKeyImpl& l, const AtomKeyImpl& r) { return r < l; }
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that this now breaks std::weak_ordering which requires exactly one of a < b, a == b and a > b to be true. And we now have only std::partial_ordering

E.g. if two atom keys are the same but have a different type all of the above would be false.

To be honest the previous version was also broken because a > b && b > a could have been true.

In any case I think if not worried about performance we should include all fields in the std::tie comparison. I think it would make reasoning about this class easier because it would give us std::strong_ordering

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh, comparing keys of different type is pretty meaningless, but so is having index_start_ and index_end_ before creation_ts_ in the std::tie, so I'm fine with making things simple.

Comment thread cpp/arcticdb/version/version_map.hpp Outdated
#include <arcticdb/version/version_utils.hpp>
#include <arcticdb/util/lock_table.hpp>

#include "storage/storage_utils.hpp"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: include with <>

if (head_)
id_to_version_id[head_->id()].push_back(head_->version_id());
for (const auto& k : keys_)
id_to_version_id[k.id()].push_back(k.version_id());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we used to check that all keys have the same stream_id.

We should move these assertions to methods like unsift_key or try_set_tombstone_all which get AtomKey arguments.

@alexowens90 alexowens90 force-pushed the performance/11900115518/use-AtomKeyPacked-in-version-map-cache branch from 6567da1 to de3849c Compare May 19, 2026 11:13
object_version_store.write(sym, df)
vitem = object_version_store.read(sym)
assert_equal(vitem.data, df)
object_version_store.version_store.clear()
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Azurite silently maps \ characters to / in this test. It was passing previously as we did not validate that keys on version/version ref segments had the same stream id as the version map entry. In practice, because / is before \ in the character list, this meant that the write call for the \ symbol actually loaded the version map entry for the / symbol, but so did the read, so everything appeared to work correctly.
The change is safe for real users because one of the following is true:

  1. This is a bug in Azurite, real Azure blob storage fully supports \ characters in blob paths (the docs don't mention a limitation like this, although a 7 year old stack overflow answer does), and so this is only a problem in the testing environment.
  2. Real Azure behaves the same way as Azurite. In this case, if a user has two symbols a/b and b\a, then it is possible they were previously accidentally reading the wrong one, and so raising an exception is better behaviour. If this is the case, we should disallow \ characters in symbol names on Azure.

auto key = atom_key_builder().build(symbol, KeyType::TABLE_INDEX);
version_map_->write_version(store_, key, previous_key);
previous_key = key;
version_map_->write_version(store_, key, std::nullopt);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was wrong previously, it was setting the previous key to one with a different stream id to key.
Before this change stream ids were not checked to be identical in the version map entry, so it worked anyway.
With this change, it only failed on the sanitizer runs because they are slower than regular test runs, so the version map cache expired before the call to get_symbol_set below, which then raised in read_segment_with_keys.

@alexowens90 alexowens90 force-pushed the performance/11900115518/use-AtomKeyPacked-in-version-map-cache branch from 7ec9cb7 to d5724da Compare May 28, 2026 14:25
@alexowens90 alexowens90 force-pushed the performance/11900115518/use-AtomKeyPacked-in-version-map-cache branch from 7baa1f3 to 1873c55 Compare May 29, 2026 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

patch Small change, should increase patch version performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants