colexecerror: allow-list vecindex packages for panic catching by KAmbekar · Pull Request #170164 · cockroachdb/cockroach

KAmbekar · 2026-05-12T06:48:12Z

Previously, panics from pkg/sql/vecindex were not in the colexecerror
allow-list, so a bug on the vector-index foreground path (e.g. a
dimension mismatch in a search) crashed the node instead of returning
a SQL error to the client. Add the package prefix to the allow-list.

Also fixes a latent bug in Manager.Get exposed once panics from this
code path are caught instead of fatal: a panic during index construction
previously left parked waiters deadlocked forever, because the normal
mustWait=false / Broadcast path was skipped. Manager.Get now installs a
cleanup defer that wakes waiters with a wrapped error, removes the
half-built cache entry so subsequent callers retry, and re-panics so
the calling goroutine's colexecerror catch still fires.

The allow-list does not cover panics on cspann.FixupProcessor background
goroutines, which run under stopper.RunAsyncTask and continue to crash
the node via stopper.recover() by design. The safety argument is for
the foreground SQL execution path only.

Caught panics surface as assertion errors with sentry reports,
preserving bug visibility without crashing the server.

Resolves: #146694
Epic: none

Release note (bug fix): Vector-index queries that previously crashed
the node on internal errors (e.g. dimension mismatch) now return a SQL
error to the client instead.

trunk-io · 2026-05-12T06:48:16Z

😎 Merged successfully - details.

blathers-crl · 2026-05-12T06:48:18Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

cockroach-teamcity · 2026-05-12T06:48:28Z

This change is

KAmbekar · 2026-05-12T06:51:47Z

+	colPackagesPrefix        = "github.com/cockroachdb/cockroach/pkg/col"
+	encodingPackagePrefix    = "github.com/cockroachdb/cockroach/pkg/util/encoding"
+	execinfraPackagePrefix   = "github.com/cockroachdb/cockroach/pkg/sql/execinfra"
+	sqlColPackagesPrefix     = "github.com/cockroachdb/cockroach/pkg/sql/col"
+	sqlRowPackagesPrefix     = "github.com/cockroachdb/cockroach/pkg/sql/row"
+	sqlSemPackagesPrefix     = "github.com/cockroachdb/cockroach/pkg/sql/sem"
+	sqlVecindexPackagePrefix = "github.com/cockroachdb/cockroach/pkg/sql/vecindex"


Note: Reformatted by crlfmt

KAmbekar · 2026-05-12T07:34:34Z

What happen when we don't add vecindex in allow list? (which is our current state)
Without the allow-list entry, any panic from vecindex code is treated as an unknown/unsafe panic. The panic recovery logic sees it came from an unrecognized package, assumes it might be dangerous to catch, and re-panics — which crashes the entire CockroachDB node.

By allow-listing vecindex, are we bypassing genuine panics?
All ~24 panics in vecindex are errors.AssertionFailedf — invariant violations / programming bugs. None are triggered by user data directly. But catching them is actually the correct behavior for production:

When CatchVectorizedRuntimeError catches an AssertionFailedf panic from an allow-listed package (lines 140-160 of error.go), it does not silently swallow it. It:

Returns it as an error to the client
Since it has no PG code (pgcode.Uncategorized), wraps it in errors.NewAssertionErrorWithWrappedErrf (line 159)
This triggers a sentry report — so the bug gets reported

When it make sense to crash the server?
When the panicking code was holding a lock or modifying shared state at the time of the panic. If you catch the panic in that situation, the lock stays held forever (deadlock) or shared data is left half-updated (corruption). Other goroutines that depend on that state will silently produce wrong results or hang indefinitely.
Crashing the server is the safer option because the node restarts with clean state — no stuck locks, no corrupted in-memory structures. In a distributed system like CockroachDB, other nodes continue serving traffic, so a single node crash is recoverable.

Where do the vector processors manipulate shared state or hold locks?

Component	Locks in production?	On search path?	Details
vectorSearchProcessor	No	Yes	No mutexes, no atomics
vectorMutationSearchProcessor	No	Yes	No mutexes, no atomics
vecindex.Searcher / MutationSearcher	No	Yes	Simple delegation wrappers
cspann.Index.Search()	No	Yes	No lock acquired for read-only search
cspann.searcher.Next()	No	Yes	No locks held
vecstore.StoreTxn	No	Yes	KV-layer locking only, no in-memory locks
statsManager.ReportSearch()	Yes (defer-released)	Yes	Advisory stats only, lock always released during panic unwinding
memstore	Yes	No (test-only)	Never imported by production code
Manager.Get()	Yes	No (init-only)	Called during processor construction, not during Next()
FixupProcessor.mu	Yes	No (background)	Runs on separate goroutines

Why statsManager.ReportSearch() is safe despite holding a lock:
It uses defer sm.mu.Unlock(), so the lock is always released during panic unwinding before CatchVectorizedRuntimeError's recovery runs. The stats it updates (CV/Z-score) are purely advisory for adaptive search tuning — a partially updated stat cannot corrupt query results or KV state.

~~Why Manager.Get() is not a concern:~~
It is called from newVectorSearchProcessor() / newVectorMutationSearchProcessor() during processor construction, not during Next() execution. CatchVectorizedRuntimeError only wraps the Next() call path, so panics in Manager.Get() never reach the allow-list check.

mw5h

This PR converts a node crash from vecindex panics into a SQL error — that's a real win for users, and the PR-comment breakdown of locking surfaces in vecindex is the right level of due diligence for an allow-list change.

I have one blocking concern, however: I believe the analysis of Manager.Get is incorrect, and as written this PR converts a node crash into a silent per-index hang on the index-construction path.

Manager.Get is reachable inside CatchVectorizedRuntimeError

The PR comment claims:

Manager.Get() is called from newVectorSearchProcessor() / newVectorMutationSearchProcessor() during processor construction, not during Next() execution. CatchVectorizedRuntimeError only wraps the Next() call path, so panics in Manager.Get() never reach the allow-list check.

Under the default vectorize=on, this is not the case. Verified trace:

pkg/sql/colflow/vectorized_flow.go:1137 wraps the body of setupFlow in colexecerror.CatchVectorizedRuntimeError.
Line 1199 calls colbuilder.NewColOperator with ProcessorConstructor: rowexec.NewProcessor (set at line 1186).
colbuilder/execplan.go::supportedNatively has no case for core.VectorSearch, so it returns errCoreUnsupportedNatively.
NewColOperator (line 836) falls into the createAndWrapRowSource branch.
canWrap permits VectorSearch wrapping (line 345, empty case body).
wrapRowSources line 108 (toWrap, err := newToWrap(toWrapInputs)) invokes the closure that calls rowexec.NewProcessor eagerly at flow-setup time, not lazily inside Next().
rowexec.NewProcessor → newVectorSearchProcessor (pkg/sql/rowexec/vector_search.go:65) → getVectorIndexForSearch → Manager.Get.

Steps 1–7 all run on the same goroutine inside the same CatchVectorizedRuntimeError invocation. Under vectorize=off, the row-based flow doesn't wrap setup, so the risk is specific to the default vectorized path.

Cache-corruption consequence

In pkg/sql/vecindex/manager.go::Get (lines 92–173), if a panic occurs inside the inner closure (e.g. from vecstore.New, cspann.NewIndex, quantize.NewRaBitQuantizer, or any runtime panic on that path):

The defer chain releases m.mu cleanly. ✓
But lines 162–172 are skipped, so:
- e.mustWait stays true
- e.waitCond.Broadcast() never fires
- The cache entry is never cleared

Subsequent Get calls for the same (tableID, indexID) reach e.waitCond.Wait() (line 111) and block forever. The previously loud node crash becomes a silent per-index hang requiring a process restart to recover.

Suggested resolutions (pick one)

(recommended) Make Manager.Get panic-safe in this PR. Add a deferred cleanup that, on abnormal exit, sets e.mustWait = false, populates e.err with a sentinel error, calls e.waitCond.Broadcast(), and removes the cache entry. Sketch:

e = &indexEntry{mustWait: true, waitCond: sync.Cond{L: &m.mu}}
m.mu.indexes[idxKey] = e

var idx *cspann.Index
var err error
defer func() {
    if e.mustWait {
        // Abnormal exit (panic). Unblock waiters and clear the entry
        // so the next Get retries.
        e.mustWait = false
        if err == nil {
            err = errors.AssertionFailedf("vector index construction panicked")
        }
        e.err = err
        e.waitCond.Broadcast()
        delete(m.mu.indexes, idxKey)
    }
}()

Narrow the allow-list to specific subpackages (e.g. pkg/sql/vecindex/cspann, pkg/sql/vecindex/vecstore, pkg/sql/vecindex/quantize) so that Manager.Get panics still crash.

Other observations

No regression test. The 4-statement repro from #146693 / #146694 would make a clean logictest. The existing dimension-mismatch coverage in pkg/sql/logictest/testdata/logic_test/vector_index is on a non-vector-index path; the panic case from the linked issue is specifically the VECTOR INDEX + LIMIT shape.
Tangential nit: pkg/sql/vecindex/manager.go:170 writes m.mu.indexes[idxKey] = nil on the error path. This leaves a dead nil *indexEntry in the map (since Go's map index doesn't delete on nil assignment). It happens to work because the subsequent e := m.mu.indexes[idxKey]; if e != nil test treats it as missing, but a delete(m.mu.indexes, idxKey) would be cleaner. Pre-existing; not introduced by this PR.
Naming nit posted inline.

(made with /review-crdb)

mw5h

@mw5h reviewed 1 file and all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on cthumuluru-crdb, DrewKimball, KAmbekar, and ZhouXing19).

KAmbekar

No regression test. The 4-statement repro from #146693 / #146694 would make a clean logictest. The existing dimension-mismatch coverage in pkg/sql/logictest/testdata/logic_test/vector_index is on a non-vector-index path; the panic case from the linked issue is specifically the VECTOR INDEX + LIMIT shape.

To test the allow-list, I added TestVectorIndexPanicCaught. It uses a testing knob to force a panic inside vecindex, then checks the panic is caught instead of crashing the node.

Note: Your check from #146848 already catches this SQL before it reaches vecindex code. The existing test at vector_index:519 covers it. Adding the same SQL again would just test your check, not the allow-list.

Tangential nit: pkg/sql/vecindex/manager.go:170 writes m.mu.indexes[idxKey] = nil on the error path. This leaves a dead nil *indexEntry in the map (since Go's map index doesn't delete on nil assignment). It happens to work because the subsequent e := m.mu.indexes[idxKey]; if e != nil test treats it as missing, but a delete(m.mu.indexes, idxKey) would be cleaner. Pre-existing; not introduced by this PR.

Added.

Naming nit posted inline.

Updated.

@KAmbekar made 1 comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on cthumuluru-crdb, DrewKimball, mw5h, and ZhouXing19).

cockroach-teamcity · 2026-05-25T10:31:31Z

⚪ Sysbench [SQL, 3node, oltp_read_write]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	9.744m ±3%	9.694m ±3%	~	p=0.902 n=15
⚪ allocs/op	6.297k ±0%	6.297k ±0%	~	p=0.878 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/952fb25/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/bin/pkg_sql_tests benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/0c5b33a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/bin/pkg_sql_tests benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=0c5b33a --new=952fb25 --memprofile ./pkg/sql/tests

🔴 Sysbench [KV, 3node, oltp_read_only]

Metric	Old Commit	New Commit	Delta	Note
🔴 sec/op	3.041m ±0%	3.070m ±1%	+0.94%	p=0.001 n=15
⚪ allocs/op	2.101k ±0%	2.101k ±0%	~	p=0.107 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/952fb25/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/bin/pkg_sql_tests benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/0c5b33a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/bin/pkg_sql_tests benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=0c5b33a --new=952fb25 --memprofile ./pkg/sql/tests

⚪ Sysbench [KV, 3node, oltp_write_only]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	2.824m ±1%	2.843m ±0%	+0.64%	p=0.008 n=15
⚪ allocs/op	4.202k ±0%	4.203k ±0%	~	p=0.927 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/952fb25/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/bin/pkg_sql_tests benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/0c5b33a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/bin/pkg_sql_tests benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=0c5b33a --new=952fb25 --memprofile ./pkg/sql/tests

Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/26394125893-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/26394125893-1/\* old/

built with commit: 952fb2538c38f1b48be8c8f0d3d87e3c6deed56e

cockroach-teamcity · 2026-05-25T16:05:48Z

⚪ Sysbench [SQL, 3node, oltp_read_write]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	9.978m ±1%	10.00m ±1%	~	p=0.870 n=15
⚪ allocs/op	6.292k ±0%	6.303k ±1%	~	p=0.099 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/22a88f9/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/22a88f9154d8649d625f9d51e1ad87793e3e01ab/bin/pkg_sql_tests benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/fc8cddc/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/bin/pkg_sql_tests benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=fc8cddc --new=22a88f9 --memprofile ./pkg/sql/tests

⚪ Sysbench [KV, 3node, oltp_read_only]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	3.084m ±1%	3.109m ±1%	~	p=0.161 n=15
⚪ allocs/op	2.102k ±0%	2.102k ±0%	~	p=0.123 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/22a88f9/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/22a88f9154d8649d625f9d51e1ad87793e3e01ab/bin/pkg_sql_tests benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/fc8cddc/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/bin/pkg_sql_tests benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=fc8cddc --new=22a88f9 --memprofile ./pkg/sql/tests

⚪ Sysbench [KV, 3node, oltp_write_only]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	2.837m ±0%	2.835m ±1%	~	p=0.775 n=15
⚪ allocs/op	4.201k ±0%	4.203k ±0%	~	p=0.628 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/22a88f9/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/22a88f9154d8649d625f9d51e1ad87793e3e01ab/bin/pkg_sql_tests benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/fc8cddc/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/bin/pkg_sql_tests benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=fc8cddc --new=22a88f9 --memprofile ./pkg/sql/tests

Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/22a88f9154d8649d625f9d51e1ad87793e3e01ab/26408125903-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/26408125903-1/\* old/

built with commit: 22a88f9154d8649d625f9d51e1ad87793e3e01ab

cockroach-teamcity · 2026-05-25T18:37:40Z

⚪ Sysbench [SQL, 3node, oltp_read_write]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	9.469m ±1%	9.460m ±0%	~	p=0.870 n=15
⚪ allocs/op	6.295k ±0%	6.291k ±0%	~	p=0.197 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/bcddff6/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/bin/pkg_sql_tests benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/3e97b8a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/3e97b8ab41072413abcea3f807fabc09efaee330/bin/pkg_sql_tests benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=3e97b8a --new=bcddff6 --memprofile ./pkg/sql/tests

⚪ Sysbench [KV, 3node, oltp_read_only]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	3.094m ±1%	3.069m ±1%	-0.82%	p=0.011 n=15
⚪ allocs/op	2.102k ±0%	2.101k ±0%	-0.05%	p=0.002 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/bcddff6/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/bin/pkg_sql_tests benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/3e97b8a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/3e97b8ab41072413abcea3f807fabc09efaee330/bin/pkg_sql_tests benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=3e97b8a --new=bcddff6 --memprofile ./pkg/sql/tests

⚪ Sysbench [KV, 3node, oltp_write_only]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	2.914m ±3%	2.837m ±2%	~	p=0.202 n=15
⚪ allocs/op	4.213k ±0%	4.205k ±0%	~	p=0.182 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/bcddff6/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/bin/pkg_sql_tests benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/3e97b8a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/3e97b8ab41072413abcea3f807fabc09efaee330/bin/pkg_sql_tests benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=3e97b8a --new=bcddff6 --memprofile ./pkg/sql/tests

Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/26413783108-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/3e97b8ab41072413abcea3f807fabc09efaee330/26413783108-1/\* old/

built with commit: bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0

mw5h

@mw5h reviewed 11 files and all commit messages, and made 2 comments.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on cthumuluru-crdb, DrewKimball, KAmbekar, and ZhouXing19).

pkg/sql/vecindex/manager.go line 149 at r4 (raw file):

			e.err = errors.NewAssertionErrorWithWrappedErrf(
				cause, "vector index construction panicked")
		case nil:

nit: this is dead code because panic(nil) becomes *runtime.PanicNilError. This whole section could probably be something like:

r := recover()
cause, ok := r.(error)
if !ok {
  // Covers both nil (Goexit-style exit) and non-error panic values.
  cause = errors.Newf("non-error panic value: %v", r)
}
e.err = errors.NewAssertionErrorWithWrappedErrf(cause, "vector index construction panicked")

Previously, panics originating from the vecindex packages were not in the colexecerror allow-list, causing them to crash the server instead of being returned as SQL errors. For example, a dimension mismatch query against a vector index would bring down the node. Catching these panics is safe for two reasons. The vecindex search path holds no in-memory locks (it relies on KV-layer locking via vecstore), so a panic mid-operation cannot tear shared state. And Manager.Get — the one in-memory cache on the path — now cleans up after a panic during cache-miss construction: waiters are woken with an error and the failed entry is removed so the next call retries. Caught panics are returned as assertion errors with sentry reports, preserving bug visibility without crashing the server. Resolves: cockroachdb#146694 Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

cockroach-teamcity · 2026-05-27T07:03:12Z

⚪ Sysbench [SQL, 3node, oltp_read_write]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	10.16m ±8%	10.12m ±8%	~	p=1.000 n=15
⚪ allocs/op	6.301k ±1%	6.292k ±0%	~	p=0.087 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/fd8a66e/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fd8a66e81046db8c6a3faebfce672294b8b8f392/bin/pkg_sql_tests benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/a081a00/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a081a00a50d84e8da201bbe55b898ec41b9b013b/bin/pkg_sql_tests benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=a081a00 --new=fd8a66e --memprofile ./pkg/sql/tests

⚪ Sysbench [KV, 3node, oltp_read_only]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	3.045m ±1%	3.051m ±1%	~	p=0.174 n=15
⚪ allocs/op	2.101k ±0%	2.101k ±0%	~	p=0.671 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/fd8a66e/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fd8a66e81046db8c6a3faebfce672294b8b8f392/bin/pkg_sql_tests benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/a081a00/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a081a00a50d84e8da201bbe55b898ec41b9b013b/bin/pkg_sql_tests benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=a081a00 --new=fd8a66e --memprofile ./pkg/sql/tests

⚪ Sysbench [KV, 3node, oltp_write_only]

Metric	Old Commit	New Commit	Delta	Note
⚪ sec/op	2.823m ±1%	2.842m ±2%	~	p=0.461 n=15
⚪ allocs/op	4.199k ±0%	4.203k ±0%	+0.10%	p=0.018 n=15

Reproduce

benchdiff binaries:

mkdir -p benchdiff/fd8a66e/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fd8a66e81046db8c6a3faebfce672294b8b8f392/bin/pkg_sql_tests benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/a081a00/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a081a00a50d84e8da201bbe55b898ec41b9b013b/bin/pkg_sql_tests benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=a081a00 --new=fd8a66e --memprofile ./pkg/sql/tests

Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/fd8a66e81046db8c6a3faebfce672294b8b8f392/26495148089-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/a081a00a50d84e8da201bbe55b898ec41b9b013b/26495148089-1/\* old/

built with commit: fd8a66e81046db8c6a3faebfce672294b8b8f392

KAmbekar commented May 12, 2026

View reviewed changes

KAmbekar marked this pull request as ready for review May 12, 2026 07:36

KAmbekar requested a review from a team as a code owner May 12, 2026 07:36

KAmbekar requested review from ZhouXing19, cthumuluru-crdb and yuzefovich and removed request for a team May 12, 2026 07:36

yuzefovich requested review from DrewKimball and mw5h and removed request for yuzefovich May 12, 2026 17:45

mw5h requested changes May 14, 2026

View reviewed changes

Comment thread pkg/sql/colexecerror/error.go Outdated

mw5h reviewed May 14, 2026

View reviewed changes

KAmbekar force-pushed the vecidx-panic-allowlist branch from 14da968 to 952fb25 Compare May 25, 2026 09:43

KAmbekar requested a review from a team as a code owner May 25, 2026 09:43

KAmbekar commented May 25, 2026

View reviewed changes

KAmbekar requested a review from mw5h May 25, 2026 10:08

cockroach-teamcity added the X-perf-check Microbenchmarks CI: Added to a PR if a performance regression is detected and should be checked label May 25, 2026

KAmbekar force-pushed the vecidx-panic-allowlist branch from 952fb25 to 22a88f9 Compare May 25, 2026 15:32

KAmbekar force-pushed the vecidx-panic-allowlist branch from 22a88f9 to bcddff6 Compare May 25, 2026 18:08

mw5h approved these changes May 26, 2026

View reviewed changes

KAmbekar force-pushed the vecidx-panic-allowlist branch from bcddff6 to fd8a66e Compare May 27, 2026 06:37

trunk-io Bot merged commit 72db412 into cockroachdb:master May 27, 2026
25 checks passed

celeste-cockroachdb Bot added the target-release-26.3.0 label May 27, 2026

Conversation

KAmbekar commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

trunk-io Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

blathers-crl Bot commented May 12, 2026

Uh oh!

cockroach-teamcity commented May 12, 2026

Uh oh!

KAmbekar May 12, 2026

Choose a reason for hiding this comment

Uh oh!

KAmbekar commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mw5h left a comment

Choose a reason for hiding this comment

Manager.Get is reachable inside CatchVectorizedRuntimeError

Cache-corruption consequence

Suggested resolutions (pick one)

Other observations

Uh oh!

Uh oh!

mw5h left a comment

Choose a reason for hiding this comment

Uh oh!

KAmbekar left a comment

Choose a reason for hiding this comment

Uh oh!

cockroach-teamcity commented May 25, 2026

Uh oh!

cockroach-teamcity commented May 25, 2026

Uh oh!

cockroach-teamcity commented May 25, 2026

Uh oh!

mw5h left a comment

Choose a reason for hiding this comment

Uh oh!

cockroach-teamcity commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

KAmbekar commented May 12, 2026 •

edited

Loading

trunk-io Bot commented May 12, 2026 •

edited

Loading

KAmbekar commented May 12, 2026 •

edited

Loading