iterv2: introduce a new iterator stack (off by default) by RaduBerinde · Pull Request #5851 · cockroachdb/pebble

RaduBerinde · 2026-03-20T17:11:57Z

We introduce a new iterator abstraction along with a new merging
iterator, level iterator, interleaving iterator stack. This
abstraction allows presenting the next span boundary at any point,
which allows the merging iterator to (conceptually) divide the key
space into "slabs". The slab approach will make the addition of new
rangedel-type keys much easier. The boundary also naturally allows
"stopping points" in the level iterator without a special contract
with the merging iterator.

The new stack can be used by setting iterv2.Enabled. The stack
functions correctly (it passed a few hours of the metamorphic test).
It is still a prototype in that it does not yet have many of the
optimizations the existing stack has (like try-seek-using-next). The
integration of the v2 stack also needs some optimization in terms of
allocations.

Most of the code inside the iterators was written by me. Claude helped
with comments and with limited bits and pieces but it was generating
much more complicated code for the core logic. In all fairness, I also
had to go back and rewrite most of it a few times.

Suggestion on what to review:

iterv2.Iter semantics;
iterv2.TestIter implementation. While this is only used for
testing, it serves as a formal codification of the semantics (with
a fairly simple implementation). All the new iterators are tested
against the TestIter, so it's less important to review the
actual iterator implementations;
iterv2.InterleavingIter struct and Init() comments;
levelIterV2 struct comment;
mergingIterV2 struct comment;
slabState interface (BuildForward).

cockroach-teamcity · 2026-03-20T17:12:06Z

This change is

RaduBerinde · 2026-03-20T17:15:05Z

CC @jbowens - in case you are interested; this continues and is inspired by your work in this area over the last few years.

sumeerbhola

@sumeerbhola made 5 comments.
Reviewable status: 0 of 33 files reviewed, 5 unresolved discussions (waiting on annrpom and RaduBerinde).

internal/iterv2/iter.go line 43 at r3 (raw file):

// # Boundary keys
//
// When the iterator is about to leave the current Span, it returns a synthetic

And the Span() for this key is still the same span, yes?

internal/iterv2/iter.go line 51 at r3 (raw file):

// same user key. When a point key coincides with a span boundary (e.g. point
// key "c" at the start of span [c, g)), the boundary key for the previous span
// is returned first, followed by the point key (now in the new span).

This example is the forward iteration case.
Worth clarifying that.

And adding an example for the reverse case too.

Also, are we side-stepping the issue of how InternalKeyKindSpanBoundary compares with the rangedel or the rangekey key kinds?
We currently have these boundary/sentinel/fake/synthetic keys that use I think the rangedel and other rangekey kind. I suppose we no longer need to expose them. So we won't have a situation where both a rangedel end is exposed as a point and so is this InternalKeyKindSpanBoundary and they are in the wrong order.

internal/iterv2/iter.go line 54 at r3 (raw file):

//
// At a boundary key, Span() still returns the Span being exited; it is updated
// to the adjacent Span on the subsequent Next or Prev call.

This answers my question about the Span().

internal/iterv2/iter.go line 100 at r3 (raw file):

// nil, the caller would not discover any span keys (like RANGEDEL) in this
// region.
//

I'm a bit lost here. Could use some more examples.
For example with the same SeekPrefixGE(a) and the only keys are b keys and a RANGEDEL [b, ...) we won't return anything, yes?

Or is this saying that it will not return the b keys but will go all the way to the b boundary, and stop before showing the RANGEDEL [b, ...)?

The nuanced example doesn't have any point keys that don't match the prefix that are skipped over until the boundary is reached.

internal/iterv2/iter.go line 165 at r3 (raw file):

	//
	// Boundary can be nil when the iteration range is unbounded in the
	// corresponding direction: nil for BoundaryEnd means no upper bound, nil for

Is this unbounded case for dealing with memtable and batch iterators? I am assuming any iterator on a file will always expose a non-nil Boundary.

sumeerbhola

just flushing comments as I read. This is neat!

@sumeerbhola made 1 comment.
Reviewable status: 0 of 33 files reviewed, 5 unresolved discussions (waiting on annrpom and RaduBerinde).

sumeerbhola

@sumeerbhola made 5 comments.
Reviewable status: 0 of 33 files reviewed, 10 unresolved discussions (waiting on annrpom and RaduBerinde).

internal/iterv2/test_iter.go line 70 at r3 (raw file):

//   - points: point keys to include.
//   - spans: span definitions. Note that a span with Start=End can be used to
//     induce a spurious boundary inside a gap between spans.

I didn't understand this part about spurious boundary.

Are these spans already fragmented to be non-overlapping?

internal/iterv2/test_iter.go line 71 at r3 (raw file):

//   - spans: span definitions. Note that a span with Start=End can be used to
//     induce a spurious boundary inside a gap between spans.
//   - startKey: the overall start key of the iterator's range (can be nil).

what is an "iterator range"? Unlike bounds, is this immutable?
Why is the range not simply a function of points and spans?

Oh, I see now that the bounds can't be changed, which is then more confusing since we shouldn't need two things to express the same idea of bounds. What am I missing?

Hmm, maybe the startKey, endKey are simulating file bounds.

internal/iterv2/test_iter.go line 160 at r3 (raw file):

	t.boundaries = t.boundaries[:0]
	if lower != nil {
		t.boundaries = append(t.boundaries, lower)

The comment just above says effectiveLower, effectiveUpper.

internal/iterv2/test_iter.go line 189 at r3 (raw file):

	t.spanKeys = make([][]keyspan.Key, numRegions)
	for i := range numRegions {
		if t.boundaries[i] == nil || t.boundaries[i+1] == nil {

does this only happen when there are no spans and we added nil to both sides?

internal/iterv2/test_iter.go line 194 at r3 (raw file):

		for _, s := range filteredSpans {
			if keyCmp(s.Start, t.boundaries[i]) <= 0 && keyCmp(s.End, t.boundaries[i+1]) >= 0 {
				t.spanKeys[i] = s.Keys

So the filteredSpans don't overlap? Otherwise we'd need to append here, yes?

RaduBerinde

@RaduBerinde made 8 comments.
Reviewable status: 0 of 33 files reviewed, 10 unresolved discussions (waiting on annrpom and sumeerbhola).

internal/iterv2/iter.go line 43 at r3 (raw file):

Previously, sumeerbhola wrote…

And the Span() for this key is still the same span, yes?

Yes.

internal/iterv2/iter.go line 51 at r3 (raw file):

Previously, sumeerbhola wrote…

This example is the forward iteration case.
Worth clarifying that.

And adding an example for the reverse case too.

Also, are we side-stepping the issue of how InternalKeyKindSpanBoundary compares with the rangedel or the rangekey key kinds?
We currently have these boundary/sentinel/fake/synthetic keys that use I think the rangedel and other rangekey kind. I suppose we no longer need to expose them. So we won't have a situation where both a rangedel end is exposed as a point and so is this InternalKeyKindSpanBoundary and they are in the wrong order.

Yes, we no longer have any of those keys in iterv2 :) Only real point keys and this BOUNDARY key. We still interleave range key stuff but that happens on top of mergingIterV2 which is just an InternalIterator and not an iterv2.Iter and never returns Boundary keys.

internal/iterv2/iter.go line 100 at r3 (raw file):

Previously, sumeerbhola wrote…

I'm a bit lost here. Could use some more examples.
For example with the same SeekPrefixGE(a) and the only keys are b keys and a RANGEDEL [b, ...) we won't return anything, yes?

Or is this saying that it will not return the b keys but will go all the way to the b boundary, and stop before showing the RANGEDEL [b, ...)?

The nuanced example doesn't have any point keys that don't match the prefix that are skipped over until the boundary is reached.

The latter - we will return a b#BOUNDARY (and nothing else).

internal/iterv2/iter.go line 165 at r3 (raw file):

Previously, sumeerbhola wrote…

Is this unbounded case for dealing with memtable and batch iterators? I am assuming any iterator on a file will always expose a non-nil Boundary.

Yes, and also for the level iterator.

internal/iterv2/test_iter.go line 70 at r3 (raw file):

Previously, sumeerbhola wrote…

I didn't understand this part about spurious boundary.

Are these spans already fragmented to be non-overlapping?

Yes, they are non-overlapping. The iterv2 contract allows spurious boundaries where there are no span keys, and we want the TestIter to present those.

internal/iterv2/test_iter.go line 71 at r3 (raw file):

Previously, sumeerbhola wrote…

what is an "iterator range"? Unlike bounds, is this immutable?
Why is the range not simply a function of points and spans?

Oh, I see now that the bounds can't be changed, which is then more confusing since we shouldn't need two things to express the same idea of bounds. What am I missing?

Hmm, maybe the startKey, endKey are simulating file bounds.

There is an extended comment somewhere explaining the 'static' vs 'dynamic' bounds. Basically the static one exist as an optimization - they don't need to be enforfed, and they allow use of First/Last.

internal/iterv2/test_iter.go line 189 at r3 (raw file):

Previously, sumeerbhola wrote…

does this only happen when there are no spans and we added nil to both sides?

It will happen at either end if we added a nil

internal/iterv2/test_iter.go line 194 at r3 (raw file):

Previously, sumeerbhola wrote…

So the filteredSpans don't overlap? Otherwise we'd need to append here, yes?

Correct, I should add an assertion here.

sumeerbhola

@sumeerbhola reviewed 2 files, made 20 comments, and resolved 1 discussion.
Reviewable status: 2 of 33 files reviewed, 23 unresolved discussions (waiting on annrpom and RaduBerinde).

internal/iterv2/iter.go line 165 at r3 (raw file):

Previously, RaduBerinde wrote…

Yes, and also for the level iterator.

Once the level iterator is past the last file in the direction of iteration? Hmm, that doesn't sound right, since it would simply be exhausted. Say we are forward iterating, could you provide an example of when the level iterator would produce a nil boundary?

Also, if the boundary is nil when iterating, say in the forward direction, it can't become non-nil if we keep iterating in that direction, yes? This seems obvious, but I just want to confirm.

internal/iterv2/test_iter.go line 70 at r3 (raw file):

Previously, RaduBerinde wrote…

Yes, they are non-overlapping. The iterv2 contract allows spurious boundaries where there are no span keys, and we want the TestIter to present those.

That is a Span with empty Span.Keys, yes? In that the iter needs to present spans with no gaps, yes? I was imagining TestIter would generate those when it sorts and sees a gap. And I still don't understand how that relates to Start==End. I was assuming such spans were disallowed, in that the iter would only expose non-empty spans.

internal/iterv2/test_iter.go line 71 at r3 (raw file):

Previously, RaduBerinde wrote…

There is an extended comment somewhere explaining the 'static' vs 'dynamic' bounds. Basically the static one exist as an optimization - they don't need to be enforfed, and they allow use of First/Last.

Ack. Looks like these are file bounds in practice.

And I missed that the lower and upper bounds can be changed, by calling init again.

internal/iterv2/test_iter.go line 160 at r3 (raw file):

Previously, sumeerbhola wrote…

The comment just above says effectiveLower, effectiveUpper.

Ignore. lower is nil if it is not effective, in which case the effective bound is t.startKey.

Can this be written more simply as:

if effectiveLower != nil {
  t.boundaries = append(t.boundaries, effectiveLower)
}

internal/iterv2/test_iter.go line 189 at r3 (raw file):

Previously, RaduBerinde wrote…

It will happen at either end if we added a nil

Ah yes, the disjunction.

And by definition such spans have no keyspan.Keys. A code comment would be useful.

internal/iterv2/test_iter.go line 194 at r3 (raw file):

Previously, RaduBerinde wrote…

Correct, I should add an assertion here.

We could assert in NewTestIter?

internal/iterv2/iter.go line 32 at r4 (raw file):

//
// Example: given an Iter with bounds [a, z), point keys {a, b, c, d, e, h, i},
// and a range key span [c, g), the key space is partitioned into three Spans:

nit: rangedel span

internal/iterv2/iter.go line 58 at r4 (raw file):

// When the iterator has a lower/upper bound, there will be a boundary key
// emitted for that bound. If there is no bound, there is no boundary key
// emitted before exhaustion.

Does this happen even when the iterator is exhausted before hitting that bound?
For example during forward iteration if the last point was g, and last rangedel was [h, i), and the upper bound is k, will this emit a span [i, k) with no point key, and empty Span.Keys?

I suspect the answer is no, based on the following code in the test iter, where for the above example, assuming endKey is < k, lower is already nil.

	t.boundaries = t.boundaries[:0]
	if lower != nil {
		t.boundaries = append(t.boundaries, lower)

I was confused by this special case, but then realized it is not a special case. The spans need to be clipped to the upper and lower bound if it is effective, in which case when we reach the start of the first span or the end of the last span, that lower or upper bound will naturally be emitted as a boundary key. Is my understanding correct?

An example with an effective upper bound below would be helpful.

internal/iterv2/iter.go line 128 at r4 (raw file):

	// positioned, the span can be empty (len(Keys)==0) and always contains the
	// last key returned by the iterator (other than synthetic boundary keys,
	// which sit at the span edge).

consider refining this span edge comment to say that the span doesn't contain the synthetic boundary key in the forward direction since it is exclusive and does contain it in the reverse direction.

internal/iterv2/iter.go line 131 at r4 (raw file):

	//
	// When the iterator is not positioned (before any seeking operation, after
	// SetBounds, or after exhaustion) the Span has zero-value fields

I think an iterator is considered positioned after exhaustion e.g. iterator doing forward iteration and exhausted is expected to behave deterministically when Prev is called. It of course makes sense that the *Span would be zero when exhausted.

I think this comment is implying that the iter will always return a non-nil Span. This seems reinforced by the "stash" comment below. Stating this non-nil invariant explicitly would be good.

internal/iterv2/iter.go line 163 at r4 (raw file):

	//   - BoundaryEnd: forward iteration; Boundary is the exclusive end.
	//   - BoundaryStart: backward iteration; Boundary is the inclusive start.
	//   - BoundaryNone: unpositioned or exhausted.

this comment does distinguish between unposiitioned and exhausted, so the earlier must have been a typo.

internal/iterv2/test_iter.go line 28 at r4 (raw file):

type testEntry struct {
	kv      base.InternalKV
	spanIdx int  // index into boundaries/spanKeys

It is a bit worrying that this index applies to both boundaries and spanKeys but the length of spanKeys is 1 less than boundaries. An invariant comment on the range of this spanIdx would be helpful.

internal/iterv2/test_iter.go line 29 at r4 (raw file):

	kv      base.InternalKV
	spanIdx int  // index into boundaries/spanKeys
	skipFwd bool // skip during forward operations (First, SeekGE, Next)

These are exactly the boundary keys representing the start of the span, and skipBwd are the boundary keys representing the end of the span, yes?

internal/iterv2/test_iter.go line 126 at r4 (raw file):

	}

	// Filter points to [lower, upper).

// NB: points are already in [startKey, endKey), which is why we are not filtering to [effectiveLower, effectiveUpper).

internal/iterv2/test_iter.go line 138 at r4 (raw file):

	}

	// Filter spans overlapping [lower, upper), clip Start/End.

// NB: spans are already in [startKey, endKey), which is why we are not filtering to [effectiveLower, effectiveUpper).

internal/iterv2/test_iter.go line 178 at r4 (raw file):

	})

	if lower == nil && t.startKey == nil {

if effectiveLower == nil

or

t.nilLower

?

internal/iterv2/test_iter.go line 208 at r4 (raw file):

	// Point entries.
	for _, p := range filteredPoints {
		spanIdx := sort.Search(numRegions-1, func(i int) bool {

why numRegions-1 and not numRegions. A point can be in the last region, yes?
I suspect this works because it will return numRegions-1 when the point is in the last region, since that represents the point not being found. But it would be better to use numRegions as a parameter and assert that spanIdx is < numRegions.

internal/iterv2/test_iter.go line 266 at r4 (raw file):

			t.currentSpan.Boundary = nil
		} else {
			t.currentSpan.Boundary = t.boundaries[e.spanIdx+1]

why do we need the if-clause given we added nil to the end of boundaries when t.nilUpper. Isn't that automatically handled?

Same for the backward case below.

internal/iterv2/test_iter.go line 290 at r4 (raw file):

	t.filter = nil
	t.dir = +1
	for t.idx = 0; t.idx < len(t.entries); t.idx++ {

nit: after setting t.idx = 0, we could delegate to a findForward helper method. That method would work here, seekGEInternal and Next.

Similar helper method for reverse iteration.

internal/iterv2/test_iter.go line 349 at r4 (raw file):

	prefix = slices.Clone(prefix)

	// Find the first boundary that has a different prefix.

Say the call is SeekPrefixGE("b", "d@6", ...), which is permitted.
And there is a span [c, e). The end boundary will be e, yes? And we are still at the [c, e) span, yes?
That is, it doesn't matter that d@6 is already past the prefix b.

RaduBerinde

@RaduBerinde made 8 comments.
Reviewable status: 2 of 33 files reviewed, 23 unresolved discussions (waiting on annrpom and sumeerbhola).

internal/iterv2/iter.go line 100 at r3 (raw file):

Previously, RaduBerinde wrote…

The latter - we will return a b#BOUNDARY (and nothing else).

Improved the example.

internal/iterv2/iter.go line 165 at r3 (raw file):

Previously, sumeerbhola wrote…

Once the level iterator is past the last file in the direction of iteration? Hmm, that doesn't sound right, since it would simply be exhausted. Say we are forward iterating, could you provide an example of when the level iterator would produce a nil boundary?

Also, if the boundary is nil when iterating, say in the forward direction, it can't become non-nil if we keep iterating in that direction, yes? This seems obvious, but I just want to confirm.

I didn't want the merging iterator to have to deal with level iterators that (conceptually) start and end at different places in the keyspace (it would add unnecessary cases to think about, e.g. think about when seeking at a key that overlaps some levels but is outside others).

So every merging iterator level covers the entire [-∞, ∞) keyspace. The first key returned by the level iterator is a boundary key at the first file boundary (with a Span that has that boundary and no span keys). If L1 has a single file at [a, d) and L2 has the last file at [c, e), the L1 level would emit a boundary key at d.

I don't think we would ever "see" a nil boundary in a non-exhausted Level iterator - because it has the property that the first and last (half open) spans have no other keys inside. We could conceptually be in one of those places if we SeekGE before the first file, or if we SeekLT after the last file (but in either case, we would be exposing the other boundary, not the nil one).

internal/iterv2/iter.go line 58 at r4 (raw file):

Previously, sumeerbhola wrote…

Does this happen even when the iterator is exhausted before hitting that bound?
For example during forward iteration if the last point was g, and last rangedel was [h, i), and the upper bound is k, will this emit a span [i, k) with no point key, and empty Span.Keys?

I suspect the answer is no, based on the following code in the test iter, where for the above example, assuming endKey is < k, lower is already nil.
	t.boundaries = t.boundaries[:0]
	if lower != nil {
		t.boundaries = append(t.boundaries, lower)
I was confused by this special case, but then realized it is not a special case. The spans need to be clipped to the upper and lower bound if it is effective, in which case when we reach the start of the first span or the end of the last span, that lower or upper bound will naturally be emitted as a boundary key. Is my understanding correct?

An example with an effective upper bound below would be helpful.

Yes, it will emit a boundary key at 'i', then a boundary key at 'k' which shows a span to k with no keys.

It is illegal to set bounds that are outside [startKey, endKey) (see the checkBounds in InterleavingIter). The dynamic bounds can narrow the "natural/static" bounds, never extend them.

internal/iterv2/test_iter.go line 70 at r3 (raw file):

Previously, sumeerbhola wrote…

That is a Span with empty Span.Keys, yes? In that the iter needs to present spans with no gaps, yes? I was imagining TestIter would generate those when it sorts and sees a gap. And I still don't understand how that relates to Start==End. I was assuming such spans were disallowed, in that the iter would only expose non-empty spans.

The iterator does generate spans with no keys to fill the gaps. The iterator will never expose a zero length span.

This special case is just a way to tell the TestIter to add some spurious boundaries.

For example, if we have spans [a,b):{something} and [x,y):{something}, the iterator will expose a iterv2.Span with no keys between [b,x). But suppose we want it instead to expose two spans [b,d) and [d,x), to simulate a level file boundary. To get TestIter to do that, we pass an empty [d,d) "span".

I could add a separate "spuriousBoundaries" instead if that makes more sense.

internal/iterv2/test_iter.go line 160 at r3 (raw file):

Previously, sumeerbhola wrote…

The comment just above says effectiveLower, effectiveUpper.

Fixed.

internal/iterv2/test_iter.go line 194 at r3 (raw file):

Previously, RaduBerinde wrote…

Correct, I should add an assertion here.

Added.

internal/iterv2/test_iter.go line 29 at r4 (raw file):

Previously, sumeerbhola wrote…

These are exactly the boundary keys representing the start of the span, and skipBwd are the boundary keys representing the end of the span, yes?

Yes.

internal/iterv2/test_iter.go line 349 at r4 (raw file):

Previously, sumeerbhola wrote…

Say the call is SeekPrefixGE("b", "d@6", ...), which is permitted.
And there is a span [c, e). The end boundary will be e, yes? And we are still at the [c, e) span, yes?
That is, it doesn't matter that d@6 is already past the prefix b.

It is not permitted in iterv2. OpCheckIter asserts that prefix is the prefix of the key.

Inside the merging iterator, we no longer have to "move up" all levels in some way (which was the reason for calling SeekPrefixGE with a non-matching prefix); we are now "parking" levels when under a tombstone and leaving them alone.

Note that this means we won't be able to do the "RelativeSeek" thing in the same way, but I have a different (cleaner) idea for lazy iteration.

sumeerbhola

@sumeerbhola made 19 comments.
Reviewable status: 2 of 33 files reviewed, 40 unresolved discussions (waiting on annrpom and RaduBerinde).

internal/iterv2/iter.go line 58 at r4 (raw file):

Previously, RaduBerinde wrote…

Yes, it will emit a boundary key at 'i', then a boundary key at 'k' which shows a span to k with no keys.

It is illegal to set bounds that are outside [startKey, endKey) (see the checkBounds in InterleavingIter). The dynamic bounds can narrow the "natural/static" bounds, never extend them.

I see. In my example, I was imagining the dynamic upper bound is k, the the endKey is i, so that can't happen. And id the endKey is m, then the dynamic upper bound k will narrow, and then it makes sense that the [i, k) span is emitted.

internal/iterv2/test_iter.go line 70 at r3 (raw file):

I could add a separate "spuriousBoundaries" instead if that makes more sense.

Yes, that may be simpler to explain than an empty span.

internal/iterv2/interleaving_iter.go line 39 at r4 (raw file):

	// lower and upper are optional dynamic bounds that restrict the iteration
	// range within [startKey, endKey). Unlike startKey/endKey, these are
	// enforced via key comparisons. When set, First/Last panic (the caller

we should panic for this case in TestIter too, if we plan to use it for correctness validation.

internal/iterv2/interleaving_iter.go line 64 at r4 (raw file):

	// Forward (dir=+1): the LOWER boundary of the current gap. Equals
	// effectiveLower() or the End of the preceding fragment span. The gap
	// is [gapStart, span.Start) or [gapStart, upper) when span == nil.

should this say

or [gapStart, effectiveUpper) when span == nil.

internal/iterv2/interleaving_iter.go line 68 at r4 (raw file):

	// Backward (dir=-1): the UPPER boundary of the current gap. Equals
	// effectiveUpper() or the Start of the next fragment span forward.
	// The gap is [span.End, gapStart) or [lower, gapStart) when span == nil.

similarly, should this say

or [effectiveLower, gapStart) ...

internal/iterv2/interleaving_iter.go line 74 at r4 (raw file):

	// dir is +1 for forward, -1 for backward (and 0 for unpositioned).
	dir int8
	// inSpan indicates whether the current position is inside span.

So gapStart is irrelevant when inSpan, yes?

internal/iterv2/interleaving_iter.go line 79 at r4 (raw file):

	atBoundary  bool
	currentSpan Span
	kv          base.InternalKV // reused for boundary keys

So pointer to kv is returned for boundary keys, and pointKV is returned when not on a boundary key?

But for the span, we always return currentSpan which may contain a shallow copy of the []keyspan.Key slice from span and a shallow copy of either Span.{Start,End} for the Boundary?

internal/iterv2/interleaving_iter.go line 259 at r4 (raw file):

		return
	}
	s, err := i.spanIter.SeekLT(pos)

I was initially puzzled by the SeekLT and then later realized it is because we need to populate the gap, specifically we need gapStart.

internal/iterv2/interleaving_iter.go line 273 at r4 (raw file):

	}
	if s != nil && i.leLower(s.End) {
		s = nil

why is the FragmentIterator returning a span that is completely preceding the lower bound? Do we not set the bounds on the FragmentIterator?

Never mind. It looks like there is no bounds support for FragmentIterator. Is doing the comparisons here cheaper (fewer comparisons) than having bounds support in FragmentIterator?

internal/iterv2/interleaving_iter.go line 275 at r4 (raw file):

		s = nil
	}
	if s != nil {

// INVARIANT: pos >= s.End
// It is possible that pos lies in a gap between this span and the next span (which we have not yet stepped to). So tentatively set the gapStart to the end of the span. We will revise this decision below.

internal/iterv2/interleaving_iter.go line 279 at r4 (raw file):

		i.gapStart = i.gapStartBuf
	} else {
		i.gapStart = i.effectiveLower()

A similar comment about this being the tentative start of the gap, though we are not yet sure there is a gap.

internal/iterv2/interleaving_iter.go line 280 at r4 (raw file):

	} else {
		i.gapStart = i.effectiveLower()
	}

// INVARIANT: i.gapStart <= pos

Assuming the caller is following the requirement and not seeking below the lower bound.
If there is no lower bound, can it seek before the startKey -- we don't panic for that case? Should we explicitly disallow that since there is no reason for the level iterator to seek before the start of a file?

I think iterV2 is a good opportunity to tighten requirements and sprinkle a lot of invariants.Enabled panics.

internal/iterv2/interleaving_iter.go line 283 at r4 (raw file):

	i.nextSpan(i.spanIter.Next())
	// In span if the span starts at or before the seek position or at or
	// before gapStart (a zero-length gap means we're directly in the span).

// INVARIANT: If i.span != nil, i.span.Start >= pos

So if i.span.Start >= pos and i.gapStart <= pos, then i.gapStart <= i.span.Start. Which implies we shouldn't need the (i.gapStart != nil && i.cmp.Compare(i.span.Start, i.gapStart) <= 0)) clause.
It should be sufficient to do

i.span != nil && (i.cmp.Compare(i.span.Start, pos) == 0

internal/iterv2/interleaving_iter.go line 308 at r4 (raw file):

// When inSpan is true, pos is inside span. If pos is in a gap, gapStart is the
// upper boundary of that gap.
func (i *InterleavingIter) positionSpanIterBackward(pos []byte) {

TODO(sumeer): read this

internal/iterv2/interleaving_iter.go line 382 at r4 (raw file):

func (i *InterleavingIter) SeekGE(key []byte, flags base.SeekGEFlags) *base.InternalKV {
	// TODO(radu): support TrySeekUsingNext.
	flags = flags.DisableTrySeekUsingNext()

is seekGEFlagRelativeSeek irrelevant for iterv2?

internal/iterv2/interleaving_iter.go line 402 at r4 (raw file):

		}
		if i.geEffectiveUpper(key) {
			i.invalidateSpan()

I am puzzled that the FragmentIterator is not positioned. Isn't it legal to seek past the upper bound and then do Prev? If yes, don't we need to position the FragmentIterator so it can handle that Prev?

Maybe I am mistaken. I looked at levelIter.internalNext on master and if it is exhausted in the backward direction and one does Next it seeks (and vice versa).

internal/iterv2/interleaving_iter.go line 409 at r4 (raw file):

	return i.resolveForward()
}

TODO(sumeer): resume here

internal/iterv2/interleaving_iter.go line 572 at r4 (raw file):

			// TODO(radu): make SeekPrefixGE strict in InternalIterator.
			i.pointKV = nil
		}

I suppose this path doesn't need to step the FragmentIterator since if i.pointKV is not in the currentSpan we will first emit the boundary key.

internal/iterv2/interleaving_iter.go line 615 at r4 (raw file):

	}
	if i.inSpan {
		// Was in fragment span, now entering gap after it.

nit: now potentially entering gap ...

RaduBerinde

@RaduBerinde made 6 comments.
Reviewable status: 2 of 33 files reviewed, 40 unresolved discussions (waiting on annrpom and sumeerbhola).

internal/iterv2/interleaving_iter.go line 79 at r4 (raw file):

Previously, sumeerbhola wrote…

So pointer to kv is returned for boundary keys, and pointKV is returned when not on a boundary key?

But for the span, we always return currentSpan which may contain a shallow copy of the []keyspan.Key slice from span and a shallow copy of either Span.{Start,End} for the Boundary?

Correct. We require that Span() a stable pointer so that we don't have to keep calling the (virtual) method.

internal/iterv2/interleaving_iter.go line 273 at r4 (raw file):

Previously, sumeerbhola wrote…

why is the FragmentIterator returning a span that is completely preceding the lower bound? Do we not set the bounds on the FragmentIterator?

Never mind. It looks like there is no bounds support for FragmentIterator. Is doing the comparisons here cheaper (fewer comparisons) than having bounds support in FragmentIterator?

Probably not.. But we would need to rework all FragmentIterator implementations to support bounds, won't tackle that here.

internal/iterv2/interleaving_iter.go line 280 at r4 (raw file):

Previously, sumeerbhola wrote…

// INVARIANT: i.gapStart <= pos

Assuming the caller is following the requirement and not seeking below the lower bound.

If there is no lower bound, can it seek before the startKey -- we don't panic for that case? Should we explicitly disallow that since there is no reason for the level iterator to seek before the start of a file?

I think iterV2 is a good opportunity to tighten requirements and sprinkle a lot of invariants.Enabled panics.

I think we currently tolerate seek before startKey, but you're right, we shouldn't need to. I'll look at tightening that up.

internal/iterv2/interleaving_iter.go line 283 at r4 (raw file):

Previously, sumeerbhola wrote…

// INVARIANT: If i.span != nil, i.span.Start >= pos

So if i.span.Start >= pos and i.gapStart <= pos, then i.gapStart <= i.span.Start. Which implies we shouldn't need the (i.gapStart != nil && i.cmp.Compare(i.span.Start, i.gapStart) <= 0)) clause.
It should be sufficient to do
i.span != nil && (i.cmp.Compare(i.span.Start, pos) == 0 

I think you're right.

internal/iterv2/interleaving_iter.go line 382 at r4 (raw file):

Previously, sumeerbhola wrote…

is seekGEFlagRelativeSeek irrelevant for iterv2?

Correct, we don't use it (and don't plan to).

internal/iterv2/interleaving_iter.go line 402 at r4 (raw file):

Previously, sumeerbhola wrote…

I am puzzled that the FragmentIterator is not positioned. Isn't it legal to seek past the upper bound and then do Prev? If yes, don't we need to position the FragmentIterator so it can handle that Prev?

Maybe I am mistaken. I looked at levelIter.internalNext on master and if it is exhausted in the backward direction and one does Next it seeks (and vice versa).

If we do Prev() after a SeekGE(), that is handled by switchToBackward which will SeekLT from the upper limit in this case.

sumeerbhola

@sumeerbhola reviewed 1 file, made 9 comments, and resolved 2 discussions.
Reviewable status: 3 of 33 files reviewed, 47 unresolved discussions (waiting on annrpom and RaduBerinde).

internal/iterv2/interleaving_iter.go line 539 at r4 (raw file):

	i.gapStart = i.upper
	if i.spanIter != nil {
		i.prevSpan(i.spanIter.Last())

Given spanIter doesn't know about the upper bound shouldn't we be seeking it if i.upper is set?

internal/iterv2/interleaving_iter.go line 332 at r4 (raw file):

		s = nil
	}
	if s != nil {

// INVARIANT: pos <= s.Start
// It is possible that there is a gap between this span and the prev span (which we have not yet stepped to). So tentatively set the gapStart to the start of the span. We will revise this decision below.

internal/iterv2/interleaving_iter.go line 782 at r4 (raw file):

	if i.pointKV != nil {
		kv = i.pointIter.NextPrefix(succKey)
	}

why is this calling seekGEHelper?
I was expecting this to be:

		i.pointKV = i.pointIter.NextPrefix()
		if i.pointKV == nil {
			if err := i.pointIter.Error(); err != nil {
				i.setError(err)
				return nil
			}
		}
		return i.resolveForward()

In that all that is happening is that the pointIter is stepping more.

internal/iterv2/interleaving_iter.go line 501 at r4 (raw file):

	}

	i.gapStart = i.lower

should this be effectiveLower since that is where First() starts from.

internal/iterv2/interleaving_iter.go line 340 at r4 (raw file):

	i.prevSpan(i.spanIter.Prev())
	// In span if the span ends at the seek position or at or after gapStart
	// (a zero-length gap means we're directly in the span).

// INVARIANT: If i.span != nil, i.span.End <= pos (since otherwise it would have been returned by the i.spanIter.SeekGE above).

I think similarly it should be sufficient to do i.span != nil && (i.cmp.Compare(i.span.End, pos) == 0

internal/iterv2/interleaving_iter.go line 424 at r4 (raw file):

	kv := i.pointIter.SeekPrefixGE(prefix, key, flags)
	i.checkPoint(kv)
	// TODO(radu): make SeekPrefixGE strict in InternalIterator.

I wonder if it was lazy to save comparisons in each leaf-level iterator.
If something doesn't rise to the top of the heap it doesn't matter if it does not have the prefix.
Though if it returns nil, it won't be in the heap, which seems better.

internal/iterv2/interleaving_iter.go line 537 at r4 (raw file):

	}

	i.gapStart = i.upper

should this be effectiveUpper since that is where Last() starts from.

internal/iterv2/interleaving_iter.go line 777 at r4 (raw file):

func (i *InterleavingIter) NextPrefix(succKey []byte) *base.InternalKV {
	if invariants.Enabled && i.atBoundary {
		panic(errors.AssertionFailedf("NextPrefix called when at boundary key"))

we should also panic if the iterator is in prefix iteration mode, or we were reverse iterating, or pointKV is nil. I think all of those violate the NextPrefix contract stated in the comment in the interface.

internal/iterv2/interleaving_iter.go line 503 at r4 (raw file):

	i.gapStart = i.lower
	if i.spanIter != nil {
		i.nextSpan(i.spanIter.First())

Given spanIter doesn't know about the lower bound shouldn't we be seeking it if i.lower is set?

RaduBerinde

@RaduBerinde reviewed 1 file, made 25 comments, and resolved 3 discussions.
Reviewable status: 3 of 33 files reviewed, 44 unresolved discussions (waiting on annrpom and sumeerbhola).

internal/iterv2/interleaving_iter.go line 39 at r4 (raw file):

Previously, sumeerbhola wrote…

we should panic for this case in TestIter too, if we plan to use it for correctness validation.

We panic for this case in OpCheckIter. That is the one that ensures correct API usage. I plan to wrap TestIter in OpCheckIter in more tests.

internal/iterv2/interleaving_iter.go line 74 at r4 (raw file):

Previously, sumeerbhola wrote…

So gapStart is irrelevant when inSpan, yes?

I removed gapStart altogether and simplified the code. It was necessary as a field before I came up with the "half span" (iterv2.Span) idea.

internal/iterv2/interleaving_iter.go line 259 at r4 (raw file):

Previously, sumeerbhola wrote…

I was initially puzzled by the SeekLT and then later realized it is because we need to populate the gap, specifically we need gapStart.

Exactly. But now that we don't need gapStart anymore, I switched to SeekGE.

internal/iterv2/interleaving_iter.go line 279 at r4 (raw file):

Previously, sumeerbhola wrote…

A similar comment about this being the tentative start of the gap, though we are not yet sure there is a gap.

I simplified all this stuff.

internal/iterv2/interleaving_iter.go line 280 at r4 (raw file):

Previously, RaduBerinde wrote…

I think we currently tolerate seek before startKey, but you're right, we shouldn't need to. I'll look at tightening that up.

I tried to disallow seeking before startKey but I don't see much benefit and there are some cases where we do this in the level iterator. For example, when we SeekGE, we'd need different cases depending if the file is overlapping or not. I don't think it's worth it. The semantics of these seeks are pretty clear: seeking at a key below startKey is equivalent to seeking at startKey.

The difference between this and seeking below a lower bound is that underlying spans or points below the lower bound might exist, whereas we know they don't exist below startKey.

internal/iterv2/interleaving_iter.go line 283 at r4 (raw file):

Previously, RaduBerinde wrote…

I think you're right.

Updated, it's much simpler now.

internal/iterv2/interleaving_iter.go line 424 at r4 (raw file):

Previously, sumeerbhola wrote…

I wonder if it was lazy to save comparisons in each leaf-level iterator.
If something doesn't rise to the top of the heap it doesn't matter if it does not have the prefix.
Though if it returns nil, it won't be in the heap, which seems better.

Probably - at the time. Now the columnar format allows us to know if the prefix still matches without key comparisons (in the hot path at least). I'm working on a separate change to use that.

internal/iterv2/interleaving_iter.go line 501 at r4 (raw file):

Previously, sumeerbhola wrote…

should this be effectiveLower since that is where First() starts from.

Removed.

internal/iterv2/interleaving_iter.go line 503 at r4 (raw file):

Previously, sumeerbhola wrote…

Given spanIter doesn't know about the lower bound shouldn't we be seeking it if i.lower is set?

i.lower is nil (we assert above)

internal/iterv2/interleaving_iter.go line 537 at r4 (raw file):

Previously, sumeerbhola wrote…

should this be effectiveUpper since that is where Last() starts from.

Removed.

internal/iterv2/interleaving_iter.go line 539 at r4 (raw file):

Previously, sumeerbhola wrote…

Given spanIter doesn't know about the upper bound shouldn't we be seeking it if i.upper is set?

i.upper is nil (we assert above)

internal/iterv2/interleaving_iter.go line 572 at r4 (raw file):

Previously, sumeerbhola wrote…

I suppose this path doesn't need to step the FragmentIterator since if i.pointKV is not in the currentSpan we will first emit the boundary key.

Correct. Added a comment.

internal/iterv2/interleaving_iter.go line 777 at r4 (raw file):

Previously, sumeerbhola wrote…

we should also panic if the iterator is in prefix iteration mode, or we were reverse iterating, or pointKV is nil. I think all of those violate the NextPrefix contract stated in the comment in the interface.

I don't every iterator to fully assert the correct usage; we have OpCheckIter for that, and we can wrap it around any iterator in invariants mode.

internal/iterv2/interleaving_iter.go line 782 at r4 (raw file):

Previously, sumeerbhola wrote…

why is this calling seekGEHelper?
I was expecting this to be:
		i.pointKV = i.pointIter.NextPrefix()
		if i.pointKV == nil {
			if err := i.pointIter.Error(); err != nil {
				i.setError(err)
				return nil
			}
		}
		return i.resolveForward()
In that all that is happening is that the pointIter is stepping more.

But that could skip over many spans in the spanIter; we need to reseek it. Otherwise we will emit boundaries that SeekGE(succKey) wouldn't. Currently NextPrefix is defined as being equivalent to SeekGE(succKey). I haven't implemented NextPrefix in the upper iterators yet, so I may come back and change the semantics for NextPrefix to include those boundaries.

internal/iterv2/iter.go line 128 at r4 (raw file):

Previously, sumeerbhola wrote…

consider refining this span edge comment to say that the span doesn't contain the synthetic boundary key in the forward direction since it is exclusive and does contain it in the reverse direction.

Done.

internal/iterv2/iter.go line 131 at r4 (raw file):

Previously, sumeerbhola wrote…

I think an iterator is considered positioned after exhaustion e.g. iterator doing forward iteration and exhausted is expected to behave deterministically when Prev is called. It of course makes sense that the *Span would be zero when exhausted.

I think this comment is implying that the iter will always return a non-nil Span. This seems reinforced by the "stash" comment below. Stating this non-nil invariant explicitly would be good.

Done.

internal/iterv2/test_iter.go line 70 at r3 (raw file):

Previously, sumeerbhola wrote…

I could add a separate "spuriousBoundaries" instead if that makes more sense.

Yes, that may be simpler to explain than an empty span.

Done.

internal/iterv2/test_iter.go line 189 at r3 (raw file):

Previously, sumeerbhola wrote…

Ah yes, the disjunction.

And by definition such spans have no keyspan.Keys. A code comment would be useful.

Done.

internal/iterv2/test_iter.go line 28 at r4 (raw file):

Previously, sumeerbhola wrote…

It is a bit worrying that this index applies to both boundaries and spanKeys but the length of spanKeys is 1 less than boundaries. An invariant comment on the range of this spanIdx would be helpful.

Done.

internal/iterv2/test_iter.go line 126 at r4 (raw file):

Previously, sumeerbhola wrote…

// NB: points are already in [startKey, endKey), which is why we are not filtering to [effectiveLower, effectiveUpper).

Done.

internal/iterv2/test_iter.go line 138 at r4 (raw file):

Previously, sumeerbhola wrote…

// NB: spans are already in [startKey, endKey), which is why we are not filtering to [effectiveLower, effectiveUpper).

Done.

internal/iterv2/test_iter.go line 178 at r4 (raw file):

Previously, sumeerbhola wrote…

if effectiveLower == nil

or

t.nilLower

?

Done.

internal/iterv2/test_iter.go line 208 at r4 (raw file):

Previously, sumeerbhola wrote…

why numRegions-1 and not numRegions. A point can be in the last region, yes?
I suspect this works because it will return numRegions-1 when the point is in the last region, since that represents the point not being found. But it would be better to use numRegions as a parameter and assert that spanIdx is < numRegions.

Done. I was avoiding dealing with the potential nil at the end :)

internal/iterv2/test_iter.go line 266 at r4 (raw file):

Previously, sumeerbhola wrote…

why do we need the if-clause given we added nil to the end of boundaries when t.nilUpper. Isn't that automatically handled?

Same for the backward case below.

Done. (this code went back and forth between a few different variants)

internal/iterv2/test_iter.go line 290 at r4 (raw file):

Previously, sumeerbhola wrote…

nit: after setting t.idx = 0, we could delegate to a findForward helper method. That method would work here, seekGEInternal and Next.

Similar helper method for reverse iteration.

Done.

sumeerbhola

@sumeerbhola partially reviewed 1 file and made 1 comment.
Reviewable status: 0 of 36 files reviewed, 44 unresolved discussions (waiting on annrpom and RaduBerinde).

internal/iterv2/interleaving_iter.go line 782 at r4 (raw file):

Previously, RaduBerinde wrote…

But that could skip over many spans in the spanIter; we need to reseek it. Otherwise we will emit boundaries that SeekGE(succKey) wouldn't. Currently NextPrefix is defined as being equivalent to SeekGE(succKey). I haven't implemented NextPrefix in the upper iterators yet, so I may come back and change the semantics for NextPrefix to include those boundaries.

Oh, I keep forgetting that seekGEHelper doesn't seek the pointIter. I see now why we need to seek the FragmentIterator.

This is potentially a place where the old levelIter could "optimize" because it didn't need to seek any iterator to expose the RANGEDEL boundary (though I am now losing track of what happened in the last few months, so I could be remembering a very old state). Of course, that is not a reason to continue with such a hard to understand implementation. It is very likely that the RANGEDEL that the FragmentIterator is currently on is wide enough that succKey also falls in it. We could add a TrySeekUsingNext optimization to it.

RaduBerinde · 2026-04-10T22:50:09Z

internal/iterv2/interleaving_iter.go line 782 at r4 (raw file):

Previously, sumeerbhola wrote…

Oh, I keep forgetting that seekGEHelper doesn't seek the pointIter. I see now why we need to seek the FragmentIterator.

This is potentially a place where the old levelIter could "optimize" because it didn't need to seek any iterator to expose the RANGEDEL boundary (though I am now losing track of what happened in the last few months, so I could be remembering a very old state). Of course, that is not a reason to continue with such a hard to understand implementation. It is very likely that the RANGEDEL that the FragmentIterator is currently on is wide enough that succKey also falls in it. We could add a TrySeekUsingNext optimization to it.

Yes, I plan to implement TrySeekUsingNext soon and that could apply here as well.

I don't quite understand what you are saying about the old levelIter.. Looking at the keyspan.InterleavingIter code, it looks like it would present all span boundaries on the way to the new prefix (so it doesn't re-seek, it just Next()s). Perhaps we do want to present the boundaries here as well, I am not too sure. I'll figure it out when I implement NextPrefix in the merging iter.

sumeerbhola

@sumeerbhola made 12 comments.
Reviewable status: 0 of 36 files reviewed, 56 unresolved discussions (waiting on annrpom and RaduBerinde).

level_iter_v2.go line 313 at r5 (raw file):
So for my earlier example

Say there are two files [c, e), [h, j) and the iteration bounds are [b, m): as described earlier the files will not produce boundaries at b or m. Is that what is being done here?

If looks like we are transferring the responsibility of producting the boundary at m to the interleaving iter. So only the boundary at b remains for the level iter to synthesize?

level_iter_v2.go line 80 at r5 (raw file):

	// The iterator options for the currently open table. If
	// tableOpts.{Lower,Upper}Bound are nil, the corresponding iteration boundary
	// does not lie within the table bounds.

This is the original table bounds and not the ones extended to the start of the next file, yes?

level_iter_v2.go line 128 at r5 (raw file):

	// the next file's SmallestUserKey). The level iterator only synthesizes
	// boundary keys in edge cases: gaps between iteration bounds and file
	// ranges, or when First() emits the first file's start boundary. When set,

I don't quite understand this part. Say there are two files [c, e), [h, j) and the iteration bounds are [b, m): as described earlier the files will not produce boundaries at b or m. Is that what is being done here?
If so, there isn't a "current file iterator" for either case, is there? In that we want to lazily load the [c, e) file during forward iteration after passing the b boundary (and similarly for reverse iteration).

never mind. I see now why this is needed also for seeks that fall between files.

level_iter_v2.go line 271 at r5 (raw file):

			return noFileLoaded
		}
		if l.iterFile != nil {

was this supposed to be l.iter != nil?

I think we need an invariant in the struct declaration on when iterFile and iter are not synchronized, and the nature of that unsynchronization.

level_iter_v2.go line 69 at r5 (raw file):

	// The context is stored here since (a) iterators are expected to be
	// short-lived (since they pin sstables), (b) plumbing a context into every
	// method is very painful, (c) they do not (yet) respect context

I suspect the "do not (yet) respect context cancellation" part is stale due to (a) objstorage.Readable.ReadAt accepting a context, and (b) even though vfs.File.ReadAt does not accept a context, the turn taking in read_shard.go to read the same block does respect context cancellation while waiting to get a turn.

level_iter_v2.go line 266 at r5 (raw file):

}

func (l *levelIterV2) loadFile(file *manifest.TableMetadata) loadFileReturnIndicator {

we should document that file can be nil.

level_iter_v2.go line 229 at r5 (raw file):

}

// Init the iteration bounds for the current table. The table bounds are assumed

for the current table's point iter?

level_iter_v2.go line 363 at r5 (raw file):

		l.invalidateSpan()
		return nil
	}

key < l.upper; and we know the key was in the file bounds (earlier comment). Why would we be here?

I wonder if we specify input invariants carefully, we could make AI try to write invariants for various cases in these functions and simplify the invariant expressions and mark what it thinks are dead paths. Or sub-expressions that are always false. Then we can review carefully for its correctness. Mental exhaustion is the main reason I don't think through all the invariants when writing or reviewing.

level_iter_v2.go line 414 at r5 (raw file):

}

func (l *levelIterV2) SeekLT(key []byte, flags base.SeekLTFlags) (kv *base.InternalKV) {

TODO(sumeer): resume here

level_iter_v2.go line 348 at r5 (raw file):

			return nil
		}
		return l.maybeEmitBoundaryFwd(file)

I think this is an improvement over the current levelIter which will load the file even if we are currently in the gap before the file. Or maybe it doesn't matter since a levelIterV2 positioned at a boundary key equal to the seek key will rise to the top of the heap. We would have an optimization (unsure if it makes a difference) if the boundary key corresponding to the start of the file was generated before the file's iter was created. I'm not suggesting we try to do that.

I think we also need a document an invariant involving {iter, iterFile, files}, since this is an example where we leave iter and iterFile nil, but files is positioned at the next file we want to load.

level_iter_v2.go line 204 at r5 (raw file):

		panic(errors.AssertionFailedf("levelIterV2 does not support RelativeSeek"))
	}
	m := l.files.SeekGE(l.comparer.Compare, key)

do we need a todo to optimize for flags.TrySeekUsingNext?

level_iter_v2.go line 352 at r5 (raw file):

	if l.loadFile(file) == noFileLoaded {
		return nil
	}

// INVARIANT: l.iter != nil; iter is the iterator over file; key >= file.PointKeyBounds.SmallestUserKey(), hence key must be within the file bounds.

sumeerbhola

@sumeerbhola made 4 comments and resolved 1 discussion.
Reviewable status: 0 of 36 files reviewed, 58 unresolved discussions (waiting on annrpom and RaduBerinde).

level_iter_v2.go line 348 at r5 (raw file):

Or maybe it doesn't matter since a levelIterV2 positioned at a boundary key equal to the seek key will rise to the top of the heap. We would have an optimization (unsure if it makes a difference) if the boundary key corresponding to the start of the file was generated before the file's iter was created. I'm not suggesting we try to do that.

What I said was probably wrong. The boundary key is for the start of the file since the file iter isn't responsible for its startKey boundary during forward iteration. So it does seem to be an optimization. Am I correct?

level_iter_v2.go line 565 at r5 (raw file):

		// ourselves. If we have a long run of same-prefix keys, the one seek won't
		// make much difference. If we have a short run, then it's rare that it
		// would cross a file boundary.

interesting observation. I looked at levelIter and it seems to use First which suggests there won't be a bloom filter check on the next file.

level_iter_v2.go line 573 at r5 (raw file):

		return nil
	}
	if kv != nil && l.prefix != nil && kv.K.Kind() == base.InternalKeyKindSpanBoundary && !l.matchesPrefix(kv.K.UserKey) {

will kv ever be nil? Or will we at least get the boundary? I suppose if there is no upperbound specified for this level iter and this is the last file then the boundary is also nil, yes?

level_iter_v2.go line 638 at r5 (raw file):

		// We have to deal with the area (if any) between the effective lower bound
		// and the start of the first file.
		if l.lower != nil && l.comparer.Equal(l.lower, l.currentSpan.Boundary) {

This currentSpan.Boundary is from an emitBoundary call hence the memory is stable even though we have stepped the InterleavingIter, yes?
If so, we could potentially tighten this by having some comments on when the memory is stable or not.
And test everyting with InterleavingIters that are wrapped to aggressively zero the memory when stepped.

RaduBerinde

@RaduBerinde made 7 comments.
Reviewable status: 0 of 36 files reviewed, 58 unresolved discussions (waiting on annrpom and sumeerbhola).

level_iter_v2.go line 204 at r5 (raw file):

Previously, sumeerbhola wrote…

do we need a todo to optimize for flags.TrySeekUsingNext?

Done.

level_iter_v2.go line 271 at r5 (raw file):

Previously, sumeerbhola wrote…

was this supposed to be l.iter != nil?

I think we need an invariant in the struct declaration on when iterFile and iter are not synchronized, and the nature of that unsynchronization.

l.iter is not a pointer. iterFile and iter should always be synchronized (outside of this code that's changing them).

level_iter_v2.go line 313 at r5 (raw file):

Previously, sumeerbhola wrote…

So for my earlier example

Say there are two files [c, e), [h, j) and the iteration bounds are [b, m): as described earlier the files will not produce boundaries at b or m. Is that what is being done here?

If looks like we are transferring the responsibility of producting the boundary at m to the interleaving iter. So only the boundary at b remains for the level iter to synthesize?

Correct. We are setting endKey to nil for the last file, so the last file covers the key space [h, inf), and it is responsible for producing the m boundary.

level_iter_v2.go line 348 at r5 (raw file):

Previously, sumeerbhola wrote…

Or maybe it doesn't matter since a levelIterV2 positioned at a boundary key equal to the seek key will rise to the top of the heap. We would have an optimization (unsure if it makes a difference) if the boundary key corresponding to the start of the file was generated before the file's iter was created. I'm not suggesting we try to do that.

What I said was probably wrong. The boundary key is for the start of the file since the file iter isn't responsible for its startKey boundary during forward iteration. So it does seem to be an optimization. Am I correct?

Normally, the boundary would have been emitted by the previous file's iter. But it would be silly to set up that iterator when we know we're in the gap, so we emit it ourselves.

level_iter_v2.go line 363 at r5 (raw file):

Previously, sumeerbhola wrote…

key < l.upper; and we know the key was in the file bounds (earlier comment). Why would we be here?

I wonder if we specify input invariants carefully, we could make AI try to write invariants for various cases in these functions and simplify the invariant expressions and mark what it thinks are dead paths. Or sub-expressions that are always false. Then we can review carefully for its correctness. Mental exhaustion is the main reason I don't think through all the invariants when writing or reviewing.

We know the key was in the "static' file bounds, but it might not be within the "dynamic" bounds (eg SetBounds).

Conceptually, we should have checked for this annoying case (key >= l.upper) before we did anything. But this is a way to avoid this comparison in the common case. Added a comment.

level_iter_v2.go line 573 at r5 (raw file):

Previously, sumeerbhola wrote…

will kv ever be nil? Or will we at least get the boundary? I suppose if there is no upperbound specified for this level iter and this is the last file then the boundary is also nil, yes?

Right (and we don't emit nil boundary keys).

level_iter_v2.go line 638 at r5 (raw file):

Previously, sumeerbhola wrote…

This currentSpan.Boundary is from an emitBoundary call hence the memory is stable even though we have stepped the InterleavingIter, yes?
If so, we could potentially tighten this by having some comments on when the memory is stable or not.
And test everyting with InterleavingIters that are wrapped to aggressively zero the memory when stepped.

Hm, good catch, I'll give this some thought. I think I didn't consider this carefully. It might be safe because we're either at a synthetic boundary or l.iter was exhausted, so the last span was either it's startKey or the lower bound which are both stable. Not great to rely on this though.

sumeerbhola

@sumeerbhola reviewed 6 files and made 15 comments.
Reviewable status: 6 of 36 files reviewed, 72 unresolved discussions (waiting on annrpom and RaduBerinde).

merging_iter_v2.go line 137 at r6 (raw file):
This relates to my earlier memory stability concern:

Given key "e" is only stable until the span boundary, wouldn't we need to seek at the boundary before stepping away from the span, to avoid copying "e"?

internal/iterv2/interleaving_iter.go line 782 at r4 (raw file):

Perhaps we do want to present the boundaries here as well, I am not too sure.

Presenting boundaries prior to succKey seems unnecessary.

iterator.go line 2751 at r6 (raw file):

				// V2: the batch point and range del iters are wrapped in an
				// InterleavingIter. We can't update them in place, so force
				// a full point iterator reconstruction.

This could use a todo to revisit. When we reuse an iterator for multiple requests in a BatchRequest in CockroachDB, we use SetOptions to reconfigure the iterator (see pebbleIterator.init).

merging_iter_v2.go line 52 at r6 (raw file):

// unshadowed span boundary exists across any level. "Unshadowed" means the
// boundary is not contained within a range deletion from a higher
// (lower-numbered) level. Because no unshadowed boundary exists inside a

If the spans were
L0 [b, f)
L1 [b, e)
L2 [b, d)

then [b, e) does not have a boundary not contained within the higher level [b, f) amd [b, d) does not have a boundary not contained in the 2 higher levels. Then by this phrasing it suggests the slab is [b, f). That is, it is not simply the min of d, e, f. I see why this makes sense since it allows us to seek to f for L1 and L2 if [b, f) contained a RANGEDEL.

But I am not sure if that was what was intended here. If [b, f) is the slab, what is the next slab: [d, f)? So there are overlapping slabs?
And what if [b, f) did not have a RANGEDEL but [b, e) did. The slab end f is awkward since e < f.

I am guessing the phrasing was meant to result in the first slab being [b, d). Say L0 [b, f) has a RANGEDEL

L1 and L2 get parked.
Say the next slab is [d, e). L1 and L2 stay parked.
Say the next slab is [e, f). Ditto.
When we finally get to the boundary f, which will be some future slab, we will unpark L1 and L2 after seeking to f.

Is my understanding correct?

merging_iter_v2.go line 103 at r6 (raw file):

//     d#SET(L0)      return point
//     Next       e#∞,BDRY       boundary at "e" (slab end); consumed internally
//     (cont.)   advance slab   L0 span now empty. L1 unparked, SeekGE("e")→f.

Bit odd to see the SeekGE("e") happening after passing the boundary. Given key "e" is only stable until the span boundary, wouldn't we need to seek at the boundary before stepping away from the span, to avoid copying "e"?

It would require us to keep track of the latest span that resulted in parking a level, so we don't keep doing earlier seeks for no reason.

merging_iter_v2.go line 117 at r6 (raw file):

//
// When the heap root is a boundary key from a level, it always matches the
// slab boundary (nextBoundary), triggering a slab transition

This answers my earlier concern about how the slabs are defined. They are simply the based on the adjacent boundaries encountered in the heap.

merging_iter_v2.go line 150 at r6 (raw file):

// the slab boundary. During direction switches (switchToMinHeap /
// switchToMaxHeap), parked levels are similarly repositioned.
//

Nice to see such a clean conceptual model with V2!

merging_iter_v2.go line 175 at r6 (raw file):

	// reuses slab.nextBoundaryBuf, which would overwrite the boundary key
	// that advanceSlabForward/advanceSlabBackward still needs for unparking
	// levels via SeekGE/SeekLT.

Looks like we do copy the key for the slab boundary, which we avoided in V1. This seems fine, given RANGEDELs tend to be wide.

merging_iter_v2.go line 233 at r6 (raw file):

	m.slab.levels = m.levels
	// Stash the span pointers from each level iter.
	for i := range m.levels {

nit: can this be hoisted up to the earlier for loop?

merging_iter_v2.go line 252 at r6 (raw file):

}

// nextEntry advances the given level's iterator and fixes the heap.

// Level is the current top of the heap.

merging_iter_v2.go line 265 at r6 (raw file):

}

// prevEntry backs up the given level's iterator and fixes the heap.

// Level is the current top of the heap.

merging_iter_v2.go line 279 at r6 (raw file):

// First implements base.InternalIterator.
func (m *mergingIterV2) First() (kv *base.InternalKV) {

TODO(sumeer): resume here.

merging_iter_v2_slab.go line 58 at r6 (raw file):

//
// TODO(radu): avoid building and maintaining the slab in the common case where
// no levels have any spans.

I don't understand this todo. Due to the levels with files, won't there always be spans, even if the iterator doesn't set bounds? Is this saying we should optimize for the case where no spans have rangedels?

merging_iter_v2_slab.go line 102 at r6 (raw file):

			if boundary := sl.span.Boundary; boundary != nil {
				if s.nextBoundary == nil || s.cmp(boundary, s.nextBoundary) < 0 {
					s.nextBoundaryBuf = append(s.nextBoundaryBuf[:0], boundary...)

seems like this copying could be delayed until after the loop, as an optimization.

merging_iter_v2_slab.go line 169 at r6 (raw file):

					s.nextBoundary = s.nextBoundaryBuf
				}
			}

Seems like BuildForward and BuildBackward are almost the same code, apart from the boundary comparison. Perhaps the common parts could be refactored, unless inlining is important.

RaduBerinde

@RaduBerinde made 5 comments.
Reviewable status: 6 of 36 files reviewed, 72 unresolved discussions (waiting on annrpom and sumeerbhola).

merging_iter_v2.go line 52 at r6 (raw file):

Previously, sumeerbhola wrote…

If the spans were
L0 [b, f)
L1 [b, e)
L2 [b, d)

then [b, e) does not have a boundary not contained within the higher level [b, f) amd [b, d) does not have a boundary not contained in the 2 higher levels. Then by this phrasing it suggests the slab is [b, f). That is, it is not simply the min of d, e, f. I see why this makes sense since it allows us to seek to f for L1 and L2 if [b, f) contained a RANGEDEL.

But I am not sure if that was what was intended here. If [b, f) is the slab, what is the next slab: [d, f)? So there are overlapping slabs?
And what if [b, f) did not have a RANGEDEL but [b, e) did. The slab end f is awkward since e < f.

I am guessing the phrasing was meant to result in the first slab being [b, d). Say L0 [b, f) has a RANGEDEL

L1 and L2 get parked.

Say the next slab is [d, e). L1 and L2 stay parked.

Say the next slab is [e, f). Ditto.

When we finally get to the boundary f, which will be some future slab, we will unpark L1 and L2 after seeking to f.

Is my understanding correct?

The slab ends at the min boundary among all non-parked levels. Equivalently, it is the min BOUNDARY key in the heap. BTW we don't really need to know the boundary, other than for assertions. I have some follow-up changes where we only compute it if invariants are enabled.

In your example the slab ends at f and next slab starts at f. If L0 didnt have a rangedel, it would end at e.

merging_iter_v2.go line 103 at r6 (raw file):

Previously, sumeerbhola wrote…

Bit odd to see the SeekGE("e") happening after passing the boundary. Given key "e" is only stable until the span boundary, wouldn't we need to seek at the boundary before stepping away from the span, to avoid copying "e"?

It would require us to keep track of the latest span that resulted in parking a level, so we don't keep doing earlier seeks for no reason.

We are currently using "e" from the span boundary key which is a copy. We have to first see what happens at the higher levels before seeking, in case this level becomes parked.

merging_iter_v2.go line 175 at r6 (raw file):

Previously, sumeerbhola wrote…

Looks like we do copy the key for the slab boundary, which we avoided in V1. This seems fine, given RANGEDELs tend to be wide.

I plan to remove that; I was planning to make a copy only when we advance the boundary and some levels are already parked.

merging_iter_v2_slab.go line 58 at r6 (raw file):

Previously, sumeerbhola wrote…

I don't understand this todo. Due to the levels with files, won't there always be spans, even if the iterator doesn't set bounds? Is this saying we should optimize for the case where no spans have rangedels?

Yeah I meant no spans with keys. But more generally, we can optimize when boundaries are spurious, i.e. same span keys before and after.

I"m open to a different name for iterv2.Span that doesn't conflict with the existing keyspan.Span. Maybe Region? There are pros and cons though since it's the same with a keyspan.Span in non-empty regions.

merging_iter_v2_slab.go line 169 at r6 (raw file):

Previously, sumeerbhola wrote…

Seems like BuildForward and BuildBackward are almost the same code, apart from the boundary comparison. Perhaps the common parts could be refactored, unless inlining is important.

Correct. I have a follow up change which updates it to Build(dir).

sumeerbhola

I'm done with the reading pass, except for the tests.
This is great! Conceptually very clean. And should be easier to optimize since it isn't fragile.

@sumeerbhola reviewed 1 file, made 8 comments, and resolved 1 discussion.
Reviewable status: 7 of 36 files reviewed, 78 unresolved discussions (waiting on annrpom and RaduBerinde).

merging_iter_v2.go line 423 at r6 (raw file):

	// Copy the slab boundary into a stable buffer. BuildForward below will
	// reuse slab.nextBoundaryBuf, invalidating slab.nextBoundary.
	m.slabBoundaryBuf = append(m.slabBoundaryBuf[:0], m.slab.nextBoundary...)

I was confused about the boundary being copied twice. Here and in slabState. Looks like slabState.nextBoundary is only read for assertions, in mergingIterV2. One could potentially steal the slabState.nextBoundary and have 2 buffers, instead of making a copy here (I think we had some microbenchmarks with long keys (Claude could check)).

merging_iter_v2.go line 438 at r6 (raw file):

		// pointer (level.span) is updated in-place by the iterator.
		// TODO(radu): if the level becomes parked, this Next() call is not
		// necessary and it can cause opening a new file in a level iterator.

Interesting. It seems we should be doing some of the logic of BuildForward here, to park levels?

I'm not sure, but I think we end up seeking early in mergingIter.findNextEntry which eventually calls isNextEntryDeleted, so it seems we should optimize this in V2.

merging_iter_v2.go line 453 at r6 (raw file):

	// all levels. All matching-prefix keys have been returned. Don't seek
	// any level past the prefix (preserves TrySeekUsingNext correctness
	// for future seeks to different prefixes).

nit: consider renaming to lastSlabBoundary.

merging_iter_v2.go line 480 at r6 (raw file):

				)
			} else {
				level.iterKV = level.iter.SeekGE(

can this also use TrySeekUsingNext?

merging_iter_v2.go line 664 at r6 (raw file):

		// Advance this level past its boundary (backward). The level's
		// stashed span pointer (level.span) is updated in-place by the
		// iterator.

Do we need a similar todo:

// TODO(radu): if the level becomes parked, this Prev() call is not
// necessary and it can cause opening a new file in a level iterator.

merging_iter_v2.go line 735 at r6 (raw file):

			// to handle both positioned and exhausted (iterKV == nil) levels.
			for l.iterKV = l.iter.Next(); l.iterKV != nil; l.iterKV = l.iter.Next() {
				if base.InternalCompare(m.heap.cmp, key, l.iterKV.K) < 0 {

I am wondering why this would need to Next multiple times.
Looks like mergingIter also has a for loop.

merging_iter_v2.go line 746 at r6 (raw file):

	// Advance the current level past key.
	cur.iterKV = cur.iter.Next()

Why do we need to do this? Isn't it just a switch, while keeping the top the same?
I looked at mergingIter and it does the same. Turns out the name switchToMinHeap has always been misleading, in that it is actually switchToMinHeadAndNext.

RaduBerinde

Thank you so much for the reviews!

@RaduBerinde made 6 comments.
Reviewable status: 7 of 36 files reviewed, 78 unresolved discussions (waiting on annrpom and sumeerbhola).

iterator.go line 2751 at r6 (raw file):

Previously, sumeerbhola wrote…

This could use a todo to revisit. When we reuse an iterator for multiple requests in a BatchRequest in CockroachDB, we use SetOptions to reconfigure the iterator (see pebbleIterator.init).

Yes, all this code needs revisiting from a perf standpoint.

level_iter_v2.go line 80 at r5 (raw file):

Previously, sumeerbhola wrote…

This is the original table bounds and not the ones extended to the start of the next file, yes?

Hm, it should be the extended ones.. I'll clarify.

merging_iter_v2.go line 438 at r6 (raw file):

Previously, sumeerbhola wrote…

Interesting. It seems we should be doing some of the logic of BuildForward here, to park levels?

I'm not sure, but I think we end up seeking early in mergingIter.findNextEntry which eventually calls isNextEntryDeleted, so it seems we should optimize this in V2.

It's fairly simple to do - we just pop off all instances of the same boundary and remember (using a flag inside each level) which levels we need to Next in the loop (if they don't get parked)

merging_iter_v2.go line 480 at r6 (raw file):

Previously, sumeerbhola wrote…

can this also use TrySeekUsingNext?

Yes, coming soon. We need a bit of extra state to avoid doing it if there was a re-seek (and perhaps back and forth movement) since it was first parked.

merging_iter_v2.go line 735 at r6 (raw file):

Previously, sumeerbhola wrote…

I am wondering why this would need to Next multiple times.
Looks like mergingIter also has a for loop.

I will think about it in more detail, I left it alone once I got it working for the first time. I think direction change is not an important path perf-wise, is that correct?

sumeerbhola

@sumeerbhola made 1 comment.
Reviewable status: 7 of 36 files reviewed, 78 unresolved discussions (waiting on annrpom and RaduBerinde).

merging_iter_v2.go line 735 at r6 (raw file):

Previously, RaduBerinde wrote…

I will think about it in more detail, I left it alone once I got it working for the first time. I think direction change is not an important path perf-wise, is that correct?

I don't remember whether we optimize for it now.

We do change direction in CockroachDB in pebbleMVCCScanner in reverse scans. They use SeekGE to land on the correct version and then iterate backwards. And the backwards iteration (in pebbleMVCCScanner) may also do SeekGE if a particular key has too many versions and it runs out of the stepping budget to get to the version it wants.

RaduBerinde

@RaduBerinde made 14 comments, resolved 3 discussions, and dismissed @sumeerbhola from a discussion.
Reviewable status: 5 of 36 files reviewed, 75 unresolved discussions (waiting on annrpom and sumeerbhola).

level_iter_v2.go line 69 at r5 (raw file):

Previously, sumeerbhola wrote…

I suspect the "do not (yet) respect context cancellation" part is stale due to (a) objstorage.Readable.ReadAt accepting a context, and (b) even though vfs.File.ReadAt does not accept a context, the turn taking in read_shard.go to read the same block does respect context cancellation while waiting to get a turn.

Done.

level_iter_v2.go line 80 at r5 (raw file):

Previously, RaduBerinde wrote…

Hm, it should be the extended ones.. I'll clarify.

Oh, no, you were right, it's the bounds for the point iterator. Clarified.

level_iter_v2.go line 229 at r5 (raw file):

Previously, sumeerbhola wrote…

for the current table's point iter?

Done.

level_iter_v2.go line 266 at r5 (raw file):

Previously, sumeerbhola wrote…

we should document that file can be nil.

That was a leftover from some code I got from the old levelIter, cleaned up the code.

level_iter_v2.go line 352 at r5 (raw file):

Previously, sumeerbhola wrote…

// INVARIANT: l.iter != nil; iter is the iterator over file; key >= file.PointKeyBounds.SmallestUserKey(), hence key must be within the file bounds.

Done.

level_iter_v2.go line 638 at r5 (raw file):

Previously, RaduBerinde wrote…

Hm, good catch, I'll give this some thought. I think I didn't consider this carefully. It might be safe because we're either at a synthetic boundary or l.iter was exhausted, so the last span was either it's startKey or the lower bound which are both stable. Not great to rely on this though.

I refactored this code. Using currentSpan was not necessary.

merging_iter_v2.go line 150 at r6 (raw file):

Previously, sumeerbhola wrote…

Nice to see such a clean conceptual model with V2!

🍻

merging_iter_v2.go line 252 at r6 (raw file):

Previously, sumeerbhola wrote…

// Level is the current top of the heap.

Done.

merging_iter_v2.go line 265 at r6 (raw file):

Previously, sumeerbhola wrote…

// Level is the current top of the heap.

Done. And added assertion.

merging_iter_v2.go line 423 at r6 (raw file):

Previously, sumeerbhola wrote…

I was confused about the boundary being copied twice. Here and in slabState. Looks like slabState.nextBoundary is only read for assertions, in mergingIterV2. One could potentially steal the slabState.nextBoundary and have 2 buffers, instead of making a copy here (I think we had some microbenchmarks with long keys (Claude could check)).

I have some upcoming changes here, I'll defer any optimizations

merging_iter_v2.go line 664 at r6 (raw file):

Previously, sumeerbhola wrote…

Do we need a similar todo:

// TODO(radu): if the level becomes parked, this Prev() call is not
// necessary and it can cause opening a new file in a level iterator.

Done.

merging_iter_v2.go line 735 at r6 (raw file):

Previously, sumeerbhola wrote…

I don't remember whether we optimize for it now.

We do change direction in CockroachDB in pebbleMVCCScanner in reverse scans. They use SeekGE to land on the correct version and then iterate backwards. And the backwards iteration (in pebbleMVCCScanner) may also do SeekGE if a particular key has too many versions and it runs out of the stepping budget to get to the version it wants.

I got in the weeds here and it looks like in the memtable iterator, points that were not there before can show up. They will be ignored because of their seqnum, but it might mean we need to Next() here more than one time to get back to where we were when we did the last Prev(). Added a comment.

merging_iter_v2.go line 746 at r6 (raw file):

Previously, sumeerbhola wrote…

Why do we need to do this? Isn't it just a switch, while keeping the top the same?
I looked at mergingIter and it does the same. Turns out the name switchToMinHeap has always been misleading, in that it is actually switchToMinHeadAndNext.

Renamed.

merging_iter_v2_slab.go line 102 at r6 (raw file):

Previously, sumeerbhola wrote…

seems like this copying could be delayed until after the loop, as an optimization.

Good point. But this will become invariants-only code soon, I'll leave it alone for now.

sumeerbhola

@sumeerbhola reviewed 31 files and all commit messages, made 8 comments, and resolved 29 discussions.
Reviewable status: all files reviewed, 49 unresolved discussions (waiting on annrpom and RaduBerinde).

merging_iter_v2.go line 52 at r6 (raw file):

Previously, RaduBerinde wrote…

The slab ends at the min boundary among all non-parked levels. Equivalently, it is the min BOUNDARY key in the heap. BTW we don't really need to know the boundary, other than for assertions. I have some follow-up changes where we only compute it if invariants are enabled.

In your example the slab ends at f and next slab starts at f. If L0 didnt have a rangedel, it would end at e.

A few examples would be very helpful. The nuance that in the above example the slab is [b, f) because it has a RANGEDEL, but wuld be [b, d) if neither L0 or L1 had a RANGEDEL was lost to me when initially reading this.

internal/iterv2/interleaving_iter.go line 280 at r4 (raw file):

Previously, RaduBerinde wrote…

I tried to disallow seeking before startKey but I don't see much benefit and there are some cases where we do this in the level iterator. For example, when we SeekGE, we'd need different cases depending if the file is overlapping or not. I don't think it's worth it. The semantics of these seeks are pretty clear: seeking at a key below startKey is equivalent to seeking at startKey.

The difference between this and seeking below a lower bound is that underlying spans or points below the lower bound might exist, whereas we know they don't exist below startKey.

Ack

internal/iterv2/interleaving_iter.go line 777 at r4 (raw file):

Previously, RaduBerinde wrote…

I don't every iterator to fully assert the correct usage; we have OpCheckIter for that, and we can wrap it around any iterator in invariants mode.

Ack

internal/iterv2/test_iter.go line 160 at r3 (raw file):

Previously, RaduBerinde wrote…

Fixed.

nit: code seems the same

level_iter_v2.go line 106 at r7 (raw file):

	// to iterFile as follows:
	//
	//  - When atSyntheticBoundary is false and iterFile is non-nil,

is atSyntheticBoundary is false and iterFile is nil a valid case? Whether valid or not, I think it needs mention in this comment so that all cases are covered.

internal/iterv2/testdata/interleaving_iter line 73 at r7 (raw file):

seek-lt b
----
seek-ge p: z#inf,BDRY [?, z)

I was expecting u instead of z.
Then I thought maybe I forgot something and only the spans that are completely outside the bounds are omitted but partially overlapping spans are not truncated.
But below, the other boundary is b and not a, which seems inconsistent.

internal/iterv2/testdata/interleaving_iter line 317 at r7 (raw file):


define-spans
c-g:{(#5,RANGEKEYSET)}

will we be using the interleaving iter also for points and range-keys?
I assumed it was only rangedels.
Or is this just demonstrating generality that we don't plan to use?

RaduBerinde

TFTR!

@RaduBerinde made 6 comments.
Reviewable status: all files reviewed, 49 unresolved discussions (waiting on annrpom and sumeerbhola).

level_iter_v2.go line 106 at r7 (raw file):

Previously, sumeerbhola wrote…

is atSyntheticBoundary is false and iterFile is nil a valid case? Whether valid or not, I think it needs mention in this comment so that all cases are covered.

Yes, done.

merging_iter_v2.go line 52 at r6 (raw file):

Previously, sumeerbhola wrote…

A few examples would be very helpful. The nuance that in the above example the slab is [b, f) because it has a RANGEDEL, but wuld be [b, d) if neither L0 or L1 had a RANGEDEL was lost to me when initially reading this.

I asked claude to draw some examples, this should help.

internal/iterv2/test_iter.go line 160 at r3 (raw file):

Previously, sumeerbhola wrote…

nit: code seems the same

Done.

internal/iterv2/testdata/interleaving_iter line 73 at r7 (raw file):

Previously, sumeerbhola wrote…

I was expecting u instead of z.
Then I thought maybe I forgot something and only the spans that are completely outside the bounds are omitted but partially overlapping spans are not truncated.
But below, the other boundary is b and not a, which seems inconsistent.

There was a typo in the iter line, fixed.

internal/iterv2/testdata/interleaving_iter line 317 at r7 (raw file):

Previously, sumeerbhola wrote…

will we be using the interleaving iter also for points and range-keys?
I assumed it was only rangedels.
Or is this just demonstrating generality that we don't plan to use?

Only rangedels, changed.

sumeerbhola

@sumeerbhola reviewed 4 files and all commit messages, made 2 comments, and resolved 27 discussions.
Reviewable status: 35 of 36 files reviewed, 22 unresolved discussions (waiting on annrpom and RaduBerinde).

level_iter_v2.go line 106 at r7 (raw file):

Previously, RaduBerinde wrote…

Yes, done.

Thanks for clarifying iterFile when atSyntheticBoundary is true.

I was asking about the the atSyntheticBoundary is false case. The first bullet considers the sub-case "iterFile is non-nil". Which naturally leads to the question about the other sub-case when iterFile is nil.

We introduce a new iterator abstraction along with a new merging iterator, level iterator, interleaving iterator stack. This abstraction allows presenting the next span boundary at any point, which allows the merging iterator to (conceptually) divide the key space into "slabs". The slab approach will make the addition of new rangedel-type keys much easier. The boundary also naturally allows "stopping points" in the level iterator without a special contract with the merging iterator. The new stack can be used by setting `iterv2.Enabled`. The stack functions correctly (it passed a few hours of the metamorphic test). It is still a prototype in that it does not yet have many of the optimizations the existing stack has (like try-seek-using-next). The integration of the v2 stack also needs some optimization in terms of allocations. Most of the code inside the iterators was written by me. Claude helped with comments and with limited bits and pieces but it was generating much more complicated code for the core logic. In all fairness, I also had to go back and rewrite most of it a few times. Suggestion on what to review: - `iterv2.Iter` semantics; - `iterv2.TestIter` implementation. While this is only used for testing, it serves as a formal codification of the semantics (with a fairly simple implementation). All the new iterators are tested against the `TestIter`, so it's less important to review the actual iterator implementations; - `iterv2.InterleavingIter` struct and `Init()` comments; - `levelIterV2` struct comment; - `mergingIterV2` struct comment; - `slabState` interface (`BuildForward`).

RaduBerinde

@RaduBerinde made 1 comment.
Reviewable status: 34 of 36 files reviewed, 22 unresolved discussions (waiting on annrpom and sumeerbhola).

level_iter_v2.go line 106 at r7 (raw file):

Previously, sumeerbhola wrote…

Thanks for clarifying iterFile when atSyntheticBoundary is true.

I was asking about the the atSyntheticBoundary is false case. The first bullet considers the sub-case "iterFile is non-nil". Which naturally leads to the question about the other sub-case when iterFile is nil.

Updated. files.Current() == iterFile always, unless the iterator is exhausted.

RaduBerinde requested a review from sumeerbhola March 20, 2026 17:11

RaduBerinde requested a review from a team as a code owner March 20, 2026 17:11

RaduBerinde requested a review from annrpom March 20, 2026 17:15

RaduBerinde force-pushed the iterv2-off branch 6 times, most recently from 3863bec to 1f25a51 Compare March 26, 2026 16:12

sumeerbhola requested changes Mar 26, 2026

View reviewed changes

sumeerbhola reviewed Mar 26, 2026

View reviewed changes

RaduBerinde commented Mar 29, 2026

View reviewed changes

RaduBerinde force-pushed the iterv2-off branch 3 times, most recently from 2724091 to a5940ee Compare April 7, 2026 16:23

sumeerbhola requested changes Apr 9, 2026

View reviewed changes

RaduBerinde commented Apr 9, 2026

View reviewed changes

sumeerbhola requested changes Apr 9, 2026

View reviewed changes

RaduBerinde commented Apr 9, 2026

View reviewed changes

sumeerbhola requested changes Apr 9, 2026

View reviewed changes

RaduBerinde force-pushed the iterv2-off branch from a5940ee to 8ca1049 Compare April 10, 2026 05:26

RaduBerinde commented Apr 10, 2026

View reviewed changes

RaduBerinde force-pushed the iterv2-off branch from 8ca1049 to 87a7fe2 Compare April 10, 2026 13:12

sumeerbhola requested changes Apr 10, 2026

View reviewed changes

sumeerbhola requested changes Apr 11, 2026

View reviewed changes

RaduBerinde force-pushed the iterv2-off branch from 87a7fe2 to 4726e9d Compare April 11, 2026 01:22

RaduBerinde commented Apr 11, 2026

View reviewed changes

sumeerbhola requested changes Apr 12, 2026

View reviewed changes

RaduBerinde commented Apr 12, 2026

View reviewed changes

sumeerbhola requested changes Apr 12, 2026

View reviewed changes

RaduBerinde commented Apr 12, 2026

View reviewed changes

sumeerbhola requested changes Apr 13, 2026

View reviewed changes

RaduBerinde force-pushed the iterv2-off branch from 4726e9d to 1b1d773 Compare April 14, 2026 00:00

RaduBerinde commented Apr 14, 2026

View reviewed changes

sumeerbhola approved these changes Apr 15, 2026

View reviewed changes

RaduBerinde force-pushed the iterv2-off branch from 1b1d773 to 3c513e9 Compare April 15, 2026 15:28

RaduBerinde commented Apr 15, 2026

View reviewed changes

sumeerbhola approved these changes Apr 16, 2026

View reviewed changes

RaduBerinde force-pushed the iterv2-off branch from 3c513e9 to 7106356 Compare April 16, 2026 03:43

RaduBerinde commented Apr 16, 2026

View reviewed changes

RaduBerinde merged commit 34b7295 into cockroachdb:master Apr 16, 2026
9 checks passed

RaduBerinde deleted the iterv2-off branch April 16, 2026 03:55

Conversation

RaduBerinde commented Mar 20, 2026

Uh oh!

cockroach-teamcity commented Mar 20, 2026

Uh oh!

RaduBerinde commented Mar 20, 2026

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde commented Apr 10, 2026

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment