perf(iterator): -25% on dgraph-shaped scans via lockless isBanned, hoisted fill, internal-key skip, KeyOnly opt-in, user-key lastKey#2285
Open
shaunpatterson wants to merge 8 commits into
Conversation
Add four benchmarks modeling dgraph's posting-layer iterator hot
paths. dgraph drives badger in a specific combination of options
that none of the existing root-package benchmarks exercise:
- NamespaceOffset = 1 (DB.isBanned runs on every key)
- NumVersionsToKeep = math.MaxInt32 (all versions retained)
- DetectConflicts = false (dgraph owns OCC)
- PrefetchValues = false (key-only scans in every hot path)
- AllVersions = true (MVCC-aware reads)
- Prefix-bounded forward iteration over mid-sized structured keys
Keys mirror dgraph's data-key layout (1B type + 8B namespace + 2B
attr-len + ~20B attr + 1B subtype + 8B UID = 40 bytes).
Benchmarks:
- BenchmarkDgraphPrefixScanKeyOnly: models has() / index-scan
paths (posting.MemoryLayer.IterateDisk, worker.sort.go)
- BenchmarkDgraphPrefixScanAllVersions: models the rollup
iterator with multiple MVCC versions per key
- BenchmarkDgraphKeyIteratorAllVersions: models readFromDisk
via NewKeyIterator over a single posting key (every cache miss)
- BenchmarkDgraphPrefixScanNamespaceOffset: isolates the cost of
NamespaceOffset=1 vs -1 (i.e. the per-key DB.isBanned tax)
These benchmarks are intended as the measurement baseline for
upcoming iterator hot-path optimizations.
DB.isBanned runs on every iterator step and every key access when NamespaceOffset >= 0. The hot path acquires lockedKeys.RLock and probes the keys map even when the ban set is empty (the steady-state case in production deployments like dgraph that enable NamespaceOffset for multi-tenant key isolation but rarely ban anything). Add a monotonic atomic.Bool flag (lockedKeys.hasAny) that flips from false to true on the first add() and never flips back. isBanned can check this flag with a single atomic load and skip the lock + map lookup when no ban has ever been recorded. Correctness: add() stores the flag inside the same lock that guards the keys map, so any reader observing hasAny=true is guaranteed to see the populated map via release-acquire ordering on the lock. Readers observing hasAny=false correctly return early because the flag is monotonic and is set before any key is observable in the map. DB-reopen replay path also calls add() so reopened DBs retain the flag for previously-recorded bans. Measured on dgraph-shaped iterator benchmarks (Apple M4 Max, median-of-3, -benchtime=5s): PrefixScanKeyOnly 11.80M -> 10.56M ns/op (-10.5%) PrefixScanAllVersions 11.98M -> 10.60M ns/op (-11.4%) KeyIteratorAllVersions 1,514 -> 1,371 ns/op (-9.4%) NamespaceOffset/dgraph 5.51M -> 4.96M ns/op (-10.0%) NamespaceOffset/off 5.08M -> 5.01M ns/op (-1.4%, noise) The NamespaceOffset/dgraph case now matches NamespaceOffset/off (both ~4.96-5.01M ns/op) -- the entire ~8.5% NamespaceOffset tax from the per-key RLock is eliminated. Tests: full suite passes (TestBannedPrefixes, TestIterateWithBanned verified) except TestProtosRegenerate which fails identically on baseline due to missing protoc binary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…decode
Iterator.parseItem previously called mi.Value() twice per kept item on
the non-AllVersions hot path:
1. To pre-check Meta/ExpiresAt for the deleted/expired filter.
2. Inside fill(), which re-fetched the same ValueStruct (cost includes
an Arena getVal + ValueStruct.Decode that the CPU profile showed
consuming ~40% of cumulative samples on dgraph-shaped key-only
prefix scans).
Refactor fill to take the already-fetched key []byte and a pointer to
the already-decoded *y.ValueStruct, and hoist mi.Value() into both the
AllVersions and non-AllVersions branches of parseItem. fill no longer
touches it.iitr.
Pass vs by pointer (not value): y.ValueStruct is ~40 bytes; passing by
value added a per-item struct copy that wiped out the AllVersions win
(branch only had one Value() call, no double-decode to eliminate).
Pointer collapses to a single 8-byte arg; fill does not store the
pointer so escape analysis keeps vs on the stack.
Correctness:
- vs and key are fetched while the iterator is positioned on the
current entry; no mi.Next() runs between fetch and fill. Arena/block
memory backing vs.Value and key is stable for that window.
- In the reverse-iteration goto FILL path, mi has advanced once, so
the refactor refreshes key = mi.Key() before re-entering FILL, and
FILL re-fetches vs := mi.Value(). Semantically equivalent to the
pre-refactor code.
Measured on dgraph-shaped iterator benchmarks (Apple M4 Max,
median-of-3, -benchtime=5s) -- composite (sum of 5 medians) drops
31.13M -> 26.37M ns (-15.3% vs iter1, -23.3% vs upstream baseline):
PrefixScanKeyOnly 10.56M -> 8.18M ns/op (-22.5%)
PrefixScanAllVersions 10.60M -> 10.53M ns/op (-0.7%, flat)
KeyIteratorAllVersions 1,371 -> 1,376 ns/op (+0.4%, noise)
NamespaceOffset/off 5.01M -> 3.83M ns/op (-23.6%)
NamespaceOffset/dgraph 4.96M -> 3.83M ns/op (-22.7%)
Tests: full suite passes (TestProtosRegenerate continues to fail
identically due to missing protoc binary).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Iterator.parseItem unconditionally ran bytes.HasPrefix(key, badgerPrefix)
on every step to detect badger-internal keys (e.g. !badger!banned). For
prefix-bounded scans whose Prefix's first byte differs from
badgerPrefix[0] ('!'), no key the iterator can land on can possibly be
internal -- the probe is dead. The CPU profile after iter2 attributed
~13% of cumulative samples to runtime.memequal, a portion of which was
this probe.
Precompute canSeeInternalKeys once at iterator construction:
- empty prefix: true (unbounded scan can hit internal keys)
- prefix[0] == '!' : true (could be the banned-ns recovery iterator
or any scan that starts in the internal-key range)
- otherwise: false (the iterator after Seek(prefix) is positioned
past the internal-key range and hasPrefix() will reject any
!-prefixed key anyway, so the per-step probe is provably dead).
When the flag is false, skip both the HasPrefix probe and the
!InternalAccess skip-block. isInternalKey defaults to false, which is
also the correct value for the downstream "!isInternalKey && isBanned"
gate (non-internal keys are subject to ban checks; internal ones are
not, but here we know they can't appear).
Correctness:
- initBannedNamespaces sets Prefix = bannedNsKey (starts with '!') so
the flag is true and the probe still runs on the recovery path.
- All other badger-internal access uses InternalAccess=true with a
badgerPrefix-anchored Prefix, so the flag is also true there.
- For dgraph's data scans (Prefix[0] = 0x00) and all other user scans
the probe is provably dead and can be skipped.
Measured on dgraph-shaped iterator benchmarks (Apple M4 Max,
median-of-3, -benchtime=5s):
PrefixScanKeyOnly 8.18M -> 7.92M ns/op (-3.2%)
PrefixScanAllVersions 10.53M ->10.50M ns/op (-0.2%, noise)
KeyIteratorAllVersions 1,376 -> 1,350 ns/op (-1.9%)
NamespaceOffset/off 3.83M -> 3.72M ns/op (-2.6%)
NamespaceOffset/dgraph 3.83M -> 3.74M ns/op (-2.4%)
Cumulative since upstream baseline: composite 34.37M -> 25.89M ns
(-24.7%); PrefixScanKeyOnly 11.80M -> 7.92M (-32.9%);
NamespaceOffset/dgraph 5.51M -> 3.74M (-32.1%).
Tests: TestBannedPrefixes, TestIterateWithBanned, TestIterator*, and
TestKeyIterator* all pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add opt-in KeyOnly flag to IteratorOptions. When set, fill() skips the per-item SafeCopy of vs.Value into item.vptr, which is the largest remaining memmove on the key-only forward-scan hot path (dgraph has()/index scans). Trade-offs (Option A, strict): - Item.Value / Item.ValueCopy return new ErrKeyOnlyMode. - Item.ValueSize / Item.EstimatedSize report 0 / key-size only. - Item.Key, Version, UserMeta, ExpiresAt, IsDeletedOrExpired unaffected. - NewIterator forces PrefetchValues=false when KeyOnly=true so the prefetch goroutine never starts. Existing callers see no behavior change. The new benchmark BenchmarkDgraphPrefixScanKeyOnlyOpt exercises the opt-in path alongside the original PrefixScanKeyOnly to preserve comparison. Results (5-bench composite, median-of-3, Apple M4 Max): - PrefixScanKeyOnlyOpt: 7.76M ns/op (−2.0% vs iter3 KeyOnly, −34.3% vs baseline) - Composite with opt-in: 25.63M ns (−25.4% vs upstream baseline) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Iterator.lastKey previously stored the full key (user-key || 8B ts) but
SameKey only compared the user-key portion. Two consequences:
1. SafeCopy(it.lastKey, key) copied 8 bytes of dead ts every kept item.
2. y.SameKey called ParseKey on both src and dst (two slice ops + a
function call) on the hot dedup compare.
Change: store the user-key only in lastKey; inline the compare.
ukLen := len(key) - 8
if ukLen == len(it.lastKey) && bytes.Equal(key[:ukLen], it.lastKey) {
mi.Next()
return false
}
it.lastKey = y.SafeCopy(it.lastKey, key[:ukLen])
The same-key dedup invariant ("skip later versions of a user-key we've
already considered, including tombstoned ones") is preserved exactly:
lastKey is still updated before the FILL/deleted-or-expired check.
Result: BenchmarkDgraphPrefixScanKeyOnlyOpt: 7.76M -> 7.58M ns (-2.3%).
Other benchmarks within run-to-run noise.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Strengthen comments on two hot-path invariants surfaced during code review of the perf series: 1. FILL label entry contract: `key` is always mi.Key() at the current iitr position. The only goto FILL (reverse path) refreshes key after mi.Next(); the fall-through entry above never advances the iterator between `key := mi.Key()` and reaching FILL. This is what lets fill() safely reuse the caller-supplied key without re-calling mi.Key() per item. 2. lastKey ukLen safety: `ukLen := len(key) - 8` is safe because len(key) >= 8 is a badger-wide invariant (every LSM key is stored with an 8-byte timestamp suffix via y.KeyWithTs). y.ParseTs(key) above already relies on this — no defensive check needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a comprehensive test file that hits every line of the perf series touched in db.go and iterator.go. Two test groups: * TestKeyOnlyIterator_* — exercise the new KeyOnly opt-in path: Value/ValueCopy return ErrKeyOnlyMode; EstimatedSize returns key length; ValueSize returns 0; NewIterator forces PrefetchValues=false; Key/UserMeta/Version/ExpiresAt continue to work under KeyOnly. * TestRegression* — lock in pre-existing behaviors that the perf series rewrote in place: same-key dedup across multiple MVCC versions (iter5 lastKey rewrite), internal-key hiding (iter3 canSeeInternalKeys gate), isBanned correctness across the new atomic.Bool fast path (iter1) including short-key-with-bans branch, and non-KeyOnly fill path correctness (iter2 hoist). * TestCanSeeInternalKeys — table-driven unit test for the helper added in iter3. All preserved-behavior tests also pass on upstream/main, proving the expectations match the unchanged baseline. Together with the benchmarks added in 7174cc2, the touched lines in db.go and iterator.go now have 100% coverage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Five targeted optimizations to the iterator hot path, measured against a new dgraph-shaped benchmark suite. Composite improvement: −25.4% vs upstream/main on the 5-bench composite (sum of ns/op medians, Apple M4 Max,
-benchtime=5s, median-of-3).PrefixScanKeyOnlyPrefixScanKeyOnlyOpt(new, opt-in)PrefixScanAllVersionsKeyIteratorAllVersionsNamespaceOffset/offNamespaceOffset/dgraphWhat changed
Six commits behind the perf series + 2 doc/test commits:
bench— Addsiterator_dgraph_bench_test.gowith 5 benches that model dgraph's iteration patterns (NamespaceOffset=1, NumVersionsToKeep=MaxInt32, DetectConflicts=false). These are the metric for every iteration below.perf: lockless fast path for DB.isBanned when no namespaces banned— Adds monotonicatomic.Bool hasAnyonlockedKeys.isBannedreturns on a single atomic load when no namespaces are banned (the production steady state for tenants who enable NamespaceOffset for isolation but rarely ban). EliminatesRWMutex.RLockper iterator step. Sets the flag inside the same lock that guards the map, so any reader observinghasAny=trueis guaranteed to see the populated map via release-acquire on the lock.perf: hoist mi.Value() and key out of fill()—fill(item)→fill(item, key, *y.ValueStruct). The KeyOnly path'sparseItempreviously made twomi.Value()+Decode calls per kept item.vsis passed by pointer to avoid copying the ~40-byte ValueStruct on every kept item.perf: skip per-step internal-key probe when prefix excludes badgerPrefix— PrecomputecanSeeInternalKeysonce at iterator construction. When the user'sopt.Prefixexists and its first byte differs frombadgerPrefix[0], no key the iterator lands on can be a badger-internal key, so the per-stepbytes.HasPrefix(key, badgerPrefix)probe is dead code.perf: IteratorOptions.KeyOnly skip per-item value SafeCopy— New opt-inIteratorOptions.KeyOnly. When set,fillskips the per-itemSafeCopy(item.vptr, vs.Value).Item.Value/Item.ValueCopyreturn a newErrKeyOnlyMode(strict-error, not silent-empty).PrefetchValuesis forced off whenKeyOnly=true. Captures an additional ~2–4% on top of the implicit gains for callers (e.g. dgraph'shas()evaluator / index scans) that already discarditem.vptr.perf: store user-key only in lastKey, inline same-key compare—Iterator.lastKeypreviously held the full key (user-key‖ts). Stores user-key only now. Replacesy.SameKey(lastKey, key)with an inlinedlen(key)-8 == len(lastKey) && bytes.Equal(key[:ukLen], lastKey). Avoids oneParseKeyperSameKeyand an 8-byte memcpy on each lastKey update. Same-key dedup invariant preserved exactly.docs(iterator): clarify FILL key-freshness and ukLen invariants— Strengthened comments on two hot-path invariants surfaced during code review.test: cover KeyOnly iterator paths and preserved-behavior regressions— Newiterator_keyonly_test.go. Together with the benchmarks, every touched line indb.goanditerator.gois now at 100% coverage (verified viago tool cover).Correctness
Every commit preserves existing semantics:
add()stores the flag inside the same lock that guards the map → release-acquire visibility holds.keyat everyfill()call site is the currentmi.Key(); the onlygoto FILL(reverse path) refreshes it aftermi.Next().canSeeInternalKeys=false,isInternalKeystaysfalse, which is the correct value for prefix shapes that cannot overlapbadgerPrefix.Item.ValuereturnsErrKeyOnlyModerather than a silent empty slice.PrefetchValuesis forced off so the prefetch goroutine never racesfill.lastKeyuser-key-only representation is preserved across the existing "b 7 (del), b 5" hazard becauselastKeyis still updated before the deleted-or-expired check.Tests
TestPreserved*/TestRegression*) pass on upstream/main AND on this branch — they were validated on both via a worktree.TestKeyOnlyIterator_*,TestCanSeeInternalKeys) pass on this branch.go test -run='^Test' ./...): PASS in 234.7s.db.go(isBanned + lockedKeys.add) anditerator.go(Item.Value/ValueCopy/EstimatedSize/ValueSize, NewIterator, canSeeInternalKeys, fill, parseItem).Test plan
go test ./...— full suite passesgo test -bench=BenchmarkDgraph -benchtime=5s -count=3 .— composite −25.2% vs maingo tool cover -func=...— 100% on touched linesNotes for reviewers
IteratorOptions.KeyOnlyis opt-in; existing callers see no behavior change. Dgraph would need a one-line change at itshas()/ index-scan iterator construction sites to capture the additional 2–4% on those paths.PrefixScanAllVersions) dominates and hides smaller relative wins onKeyIteratorAllVersions. Per-benchmark breakdown above tells the full story.IteratorOptions.KeyOnly(bool, default false) andErrKeyOnlyMode(sentinel error).🤖 Generated with Claude Code