Skip to content

colexecerror: allow-list vecindex packages for panic catching#170164

Merged
trunk-io[bot] merged 1 commit into
cockroachdb:masterfrom
KAmbekar:vecidx-panic-allowlist
May 27, 2026
Merged

colexecerror: allow-list vecindex packages for panic catching#170164
trunk-io[bot] merged 1 commit into
cockroachdb:masterfrom
KAmbekar:vecidx-panic-allowlist

Conversation

@KAmbekar
Copy link
Copy Markdown
Member

@KAmbekar KAmbekar commented May 12, 2026

Previously, panics from pkg/sql/vecindex were not in the colexecerror
allow-list, so a bug on the vector-index foreground path (e.g. a
dimension mismatch in a search) crashed the node instead of returning
a SQL error to the client. Add the package prefix to the allow-list.

Also fixes a latent bug in Manager.Get exposed once panics from this
code path are caught instead of fatal: a panic during index construction
previously left parked waiters deadlocked forever, because the normal
mustWait=false / Broadcast path was skipped. Manager.Get now installs a
cleanup defer that wakes waiters with a wrapped error, removes the
half-built cache entry so subsequent callers retry, and re-panics so
the calling goroutine's colexecerror catch still fires.

The allow-list does not cover panics on cspann.FixupProcessor background
goroutines, which run under stopper.RunAsyncTask and continue to crash
the node via stopper.recover() by design. The safety argument is for
the foreground SQL execution path only.

Caught panics surface as assertion errors with sentry reports,
preserving bug visibility without crashing the server.

Resolves: #146694
Epic: none

Release note (bug fix): Vector-index queries that previously crashed
the node on internal errors (e.g. dimension mismatch) now return a SQL
error to the client instead.

@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 12, 2026

😎 Merged successfully - details.

@blathers-crl
Copy link
Copy Markdown

blathers-crl Bot commented May 12, 2026

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Comment thread pkg/sql/colexecerror/error.go Outdated
Comment on lines +173 to +179
colPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/col"
encodingPackagePrefix = "github.com/cockroachdb/cockroach/pkg/util/encoding"
execinfraPackagePrefix = "github.com/cockroachdb/cockroach/pkg/sql/execinfra"
sqlColPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/col"
sqlRowPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/row"
sqlSemPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/sem"
sqlVecindexPackagePrefix = "github.com/cockroachdb/cockroach/pkg/sql/vecindex"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Reformatted by crlfmt

@KAmbekar
Copy link
Copy Markdown
Member Author

KAmbekar commented May 12, 2026

What happen when we don't add vecindex in allow list? (which is our current state)
Without the allow-list entry, any panic from vecindex code is treated as an unknown/unsafe panic. The panic recovery logic sees it came from an unrecognized package, assumes it might be dangerous to catch, and re-panics — which crashes the entire CockroachDB node.

By allow-listing vecindex, are we bypassing genuine panics?
All ~24 panics in vecindex are errors.AssertionFailedf — invariant violations / programming bugs. None are triggered by user data directly. But catching them is actually the correct behavior for production:

When CatchVectorizedRuntimeError catches an AssertionFailedf panic from an allow-listed package (lines 140-160 of error.go), it does not silently swallow it. It:

  • Returns it as an error to the client
  • Since it has no PG code (pgcode.Uncategorized), wraps it in errors.NewAssertionErrorWithWrappedErrf (line 159)
  • This triggers a sentry report — so the bug gets reported

When it make sense to crash the server?
When the panicking code was holding a lock or modifying shared state at the time of the panic. If you catch the panic in that situation, the lock stays held forever (deadlock) or shared data is left half-updated (corruption). Other goroutines that depend on that state will silently produce wrong results or hang indefinitely.
Crashing the server is the safer option because the node restarts with clean state — no stuck locks, no corrupted in-memory structures. In a distributed system like CockroachDB, other nodes continue serving traffic, so a single node crash is recoverable.

Where do the vector processors manipulate shared state or hold locks?

Component Locks in production? On search path? Details
vectorSearchProcessor No Yes No mutexes, no atomics
vectorMutationSearchProcessor No Yes No mutexes, no atomics
vecindex.Searcher / MutationSearcher No Yes Simple delegation wrappers
cspann.Index.Search() No Yes No lock acquired for read-only search
cspann.searcher.Next() No Yes No locks held
vecstore.StoreTxn No Yes KV-layer locking only, no in-memory locks
statsManager.ReportSearch() Yes (defer-released) Yes Advisory stats only, lock always released during panic unwinding
memstore Yes No (test-only) Never imported by production code
Manager.Get() Yes No (init-only) Called during processor construction, not during Next()
FixupProcessor.mu Yes No (background) Runs on separate goroutines

Why statsManager.ReportSearch() is safe despite holding a lock:
It uses defer sm.mu.Unlock(), so the lock is always released during panic unwinding before CatchVectorizedRuntimeError's recovery runs. The stats it updates (CV/Z-score) are purely advisory for adaptive search tuning — a partially updated stat cannot corrupt query results or KV state.

Why Manager.Get() is not a concern:
It is called from newVectorSearchProcessor() / newVectorMutationSearchProcessor() during processor construction, not during Next() execution. CatchVectorizedRuntimeError only wraps the Next() call path, so panics in Manager.Get() never reach the allow-list check.

@KAmbekar KAmbekar marked this pull request as ready for review May 12, 2026 07:36
@KAmbekar KAmbekar requested a review from a team as a code owner May 12, 2026 07:36
@KAmbekar KAmbekar requested review from ZhouXing19, cthumuluru-crdb and yuzefovich and removed request for a team May 12, 2026 07:36
@yuzefovich yuzefovich requested review from DrewKimball and mw5h and removed request for yuzefovich May 12, 2026 17:45
Copy link
Copy Markdown
Contributor

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR converts a node crash from vecindex panics into a SQL error — that's a real win for users, and the PR-comment breakdown of locking surfaces in vecindex is the right level of due diligence for an allow-list change.

I have one blocking concern, however: I believe the analysis of Manager.Get is incorrect, and as written this PR converts a node crash into a silent per-index hang on the index-construction path.

Manager.Get is reachable inside CatchVectorizedRuntimeError

The PR comment claims:

Manager.Get() is called from newVectorSearchProcessor() / newVectorMutationSearchProcessor() during processor construction, not during Next() execution. CatchVectorizedRuntimeError only wraps the Next() call path, so panics in Manager.Get() never reach the allow-list check.

Under the default vectorize=on, this is not the case. Verified trace:

  1. pkg/sql/colflow/vectorized_flow.go:1137 wraps the body of setupFlow in colexecerror.CatchVectorizedRuntimeError.
  2. Line 1199 calls colbuilder.NewColOperator with ProcessorConstructor: rowexec.NewProcessor (set at line 1186).
  3. colbuilder/execplan.go::supportedNatively has no case for core.VectorSearch, so it returns errCoreUnsupportedNatively.
  4. NewColOperator (line 836) falls into the createAndWrapRowSource branch.
  5. canWrap permits VectorSearch wrapping (line 345, empty case body).
  6. wrapRowSources line 108 (toWrap, err := newToWrap(toWrapInputs)) invokes the closure that calls rowexec.NewProcessor eagerly at flow-setup time, not lazily inside Next().
  7. rowexec.NewProcessornewVectorSearchProcessor (pkg/sql/rowexec/vector_search.go:65) → getVectorIndexForSearchManager.Get.

Steps 1–7 all run on the same goroutine inside the same CatchVectorizedRuntimeError invocation. Under vectorize=off, the row-based flow doesn't wrap setup, so the risk is specific to the default vectorized path.

Cache-corruption consequence

In pkg/sql/vecindex/manager.go::Get (lines 92–173), if a panic occurs inside the inner closure (e.g. from vecstore.New, cspann.NewIndex, quantize.NewRaBitQuantizer, or any runtime panic on that path):

  • The defer chain releases m.mu cleanly. ✓
  • But lines 162–172 are skipped, so:
    • e.mustWait stays true
    • e.waitCond.Broadcast() never fires
    • The cache entry is never cleared

Subsequent Get calls for the same (tableID, indexID) reach e.waitCond.Wait() (line 111) and block forever. The previously loud node crash becomes a silent per-index hang requiring a process restart to recover.

Suggested resolutions (pick one)

  1. (recommended) Make Manager.Get panic-safe in this PR. Add a deferred cleanup that, on abnormal exit, sets e.mustWait = false, populates e.err with a sentinel error, calls e.waitCond.Broadcast(), and removes the cache entry. Sketch:

    e = &indexEntry{mustWait: true, waitCond: sync.Cond{L: &m.mu}}
    m.mu.indexes[idxKey] = e
    
    var idx *cspann.Index
    var err error
    defer func() {
        if e.mustWait {
            // Abnormal exit (panic). Unblock waiters and clear the entry
            // so the next Get retries.
            e.mustWait = false
            if err == nil {
                err = errors.AssertionFailedf("vector index construction panicked")
            }
            e.err = err
            e.waitCond.Broadcast()
            delete(m.mu.indexes, idxKey)
        }
    }()
  2. Narrow the allow-list to specific subpackages (e.g. pkg/sql/vecindex/cspann, pkg/sql/vecindex/vecstore, pkg/sql/vecindex/quantize) so that Manager.Get panics still crash.

Other observations

  • No regression test. The 4-statement repro from #146693 / #146694 would make a clean logictest. The existing dimension-mismatch coverage in pkg/sql/logictest/testdata/logic_test/vector_index is on a non-vector-index path; the panic case from the linked issue is specifically the VECTOR INDEX + LIMIT shape.
  • Tangential nit: pkg/sql/vecindex/manager.go:170 writes m.mu.indexes[idxKey] = nil on the error path. This leaves a dead nil *indexEntry in the map (since Go's map index doesn't delete on nil assignment). It happens to work because the subsequent e := m.mu.indexes[idxKey]; if e != nil test treats it as missing, but a delete(m.mu.indexes, idxKey) would be cleaner. Pre-existing; not introduced by this PR.
  • Naming nit posted inline.

(made with /review-crdb)

Comment thread pkg/sql/colexecerror/error.go Outdated
Copy link
Copy Markdown
Contributor

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mw5h reviewed 1 file and all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on cthumuluru-crdb, DrewKimball, KAmbekar, and ZhouXing19).

@KAmbekar KAmbekar force-pushed the vecidx-panic-allowlist branch from 14da968 to 952fb25 Compare May 25, 2026 09:43
@KAmbekar KAmbekar requested a review from a team as a code owner May 25, 2026 09:43
Copy link
Copy Markdown
Member Author

@KAmbekar KAmbekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No regression test. The 4-statement repro from #146693 / #146694 would make a clean logictest. The existing dimension-mismatch coverage in pkg/sql/logictest/testdata/logic_test/vector_index is on a non-vector-index path; the panic case from the linked issue is specifically the VECTOR INDEX + LIMIT shape.

To test the allow-list, I added TestVectorIndexPanicCaught. It uses a testing knob to force a panic inside vecindex, then checks the panic is caught instead of crashing the node.

Note: Your check from #146848 already catches this SQL before it reaches vecindex code. The existing test at vector_index:519 covers it. Adding the same SQL again would just test your check, not the allow-list.

Tangential nit: pkg/sql/vecindex/manager.go:170 writes m.mu.indexes[idxKey] = nil on the error path. This leaves a dead nil *indexEntry in the map (since Go's map index doesn't delete on nil assignment). It happens to work because the subsequent e := m.mu.indexes[idxKey]; if e != nil test treats it as missing, but a delete(m.mu.indexes, idxKey) would be cleaner. Pre-existing; not introduced by this PR.

Added.

Naming nit posted inline.

Updated.

@KAmbekar made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on cthumuluru-crdb, DrewKimball, mw5h, and ZhouXing19).

@KAmbekar KAmbekar requested a review from mw5h May 25, 2026 10:08
@cockroach-teamcity
Copy link
Copy Markdown
Member

⚪ Sysbench [SQL, 3node, oltp_read_write]
Metric Old Commit New Commit Delta Note
sec/op 9.744m ±3% 9.694m ±3% ~ p=0.902 n=15
allocs/op 6.297k ±0% 6.297k ±0% ~ p=0.878 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/952fb25/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/bin/pkg_sql_tests benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/0c5b33a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/bin/pkg_sql_tests benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=0c5b33a --new=952fb25 --memprofile ./pkg/sql/tests
🔴 Sysbench [KV, 3node, oltp_read_only]
Metric Old Commit New Commit Delta Note
🔴 sec/op 3.041m ±0% 3.070m ±1% +0.94% p=0.001 n=15
allocs/op 2.101k ±0% 2.101k ±0% ~ p=0.107 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/952fb25/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/bin/pkg_sql_tests benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/0c5b33a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/bin/pkg_sql_tests benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=0c5b33a --new=952fb25 --memprofile ./pkg/sql/tests
⚪ Sysbench [KV, 3node, oltp_write_only]
Metric Old Commit New Commit Delta Note
sec/op 2.824m ±1% 2.843m ±0% +0.64% p=0.008 n=15
allocs/op 4.202k ±0% 4.203k ±0% ~ p=0.927 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/952fb25/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/bin/pkg_sql_tests benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/0c5b33a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/bin/pkg_sql_tests benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=0c5b33a --new=952fb25 --memprofile ./pkg/sql/tests
Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/26394125893-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/26394125893-1/\* old/

built with commit: 952fb2538c38f1b48be8c8f0d3d87e3c6deed56e

@cockroach-teamcity cockroach-teamcity added the X-perf-check Microbenchmarks CI: Added to a PR if a performance regression is detected and should be checked label May 25, 2026
@KAmbekar KAmbekar force-pushed the vecidx-panic-allowlist branch from 952fb25 to 22a88f9 Compare May 25, 2026 15:32
@cockroach-teamcity
Copy link
Copy Markdown
Member

⚪ Sysbench [SQL, 3node, oltp_read_write]
Metric Old Commit New Commit Delta Note
sec/op 9.978m ±1% 10.00m ±1% ~ p=0.870 n=15
allocs/op 6.292k ±0% 6.303k ±1% ~ p=0.099 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/22a88f9/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/22a88f9154d8649d625f9d51e1ad87793e3e01ab/bin/pkg_sql_tests benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/fc8cddc/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/bin/pkg_sql_tests benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=fc8cddc --new=22a88f9 --memprofile ./pkg/sql/tests
⚪ Sysbench [KV, 3node, oltp_read_only]
Metric Old Commit New Commit Delta Note
sec/op 3.084m ±1% 3.109m ±1% ~ p=0.161 n=15
allocs/op 2.102k ±0% 2.102k ±0% ~ p=0.123 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/22a88f9/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/22a88f9154d8649d625f9d51e1ad87793e3e01ab/bin/pkg_sql_tests benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/fc8cddc/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/bin/pkg_sql_tests benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=fc8cddc --new=22a88f9 --memprofile ./pkg/sql/tests
⚪ Sysbench [KV, 3node, oltp_write_only]
Metric Old Commit New Commit Delta Note
sec/op 2.837m ±0% 2.835m ±1% ~ p=0.775 n=15
allocs/op 4.201k ±0% 4.203k ±0% ~ p=0.628 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/22a88f9/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/22a88f9154d8649d625f9d51e1ad87793e3e01ab/bin/pkg_sql_tests benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/fc8cddc/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/bin/pkg_sql_tests benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=fc8cddc --new=22a88f9 --memprofile ./pkg/sql/tests
Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/22a88f9154d8649d625f9d51e1ad87793e3e01ab/26408125903-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/26408125903-1/\* old/

built with commit: 22a88f9154d8649d625f9d51e1ad87793e3e01ab

@KAmbekar KAmbekar force-pushed the vecidx-panic-allowlist branch from 22a88f9 to bcddff6 Compare May 25, 2026 18:08
@cockroach-teamcity
Copy link
Copy Markdown
Member

⚪ Sysbench [SQL, 3node, oltp_read_write]
Metric Old Commit New Commit Delta Note
sec/op 9.469m ±1% 9.460m ±0% ~ p=0.870 n=15
allocs/op 6.295k ±0% 6.291k ±0% ~ p=0.197 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/bcddff6/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/bin/pkg_sql_tests benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/3e97b8a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/3e97b8ab41072413abcea3f807fabc09efaee330/bin/pkg_sql_tests benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=3e97b8a --new=bcddff6 --memprofile ./pkg/sql/tests
⚪ Sysbench [KV, 3node, oltp_read_only]
Metric Old Commit New Commit Delta Note
sec/op 3.094m ±1% 3.069m ±1% -0.82% p=0.011 n=15
allocs/op 2.102k ±0% 2.101k ±0% -0.05% p=0.002 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/bcddff6/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/bin/pkg_sql_tests benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/3e97b8a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/3e97b8ab41072413abcea3f807fabc09efaee330/bin/pkg_sql_tests benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=3e97b8a --new=bcddff6 --memprofile ./pkg/sql/tests
⚪ Sysbench [KV, 3node, oltp_write_only]
Metric Old Commit New Commit Delta Note
sec/op 2.914m ±3% 2.837m ±2% ~ p=0.202 n=15
allocs/op 4.213k ±0% 4.205k ±0% ~ p=0.182 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/bcddff6/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/bin/pkg_sql_tests benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/3e97b8a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/3e97b8ab41072413abcea3f807fabc09efaee330/bin/pkg_sql_tests benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=3e97b8a --new=bcddff6 --memprofile ./pkg/sql/tests
Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/26413783108-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/3e97b8ab41072413abcea3f807fabc09efaee330/26413783108-1/\* old/

built with commit: bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0

Copy link
Copy Markdown
Contributor

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@mw5h reviewed 11 files and all commit messages, and made 2 comments.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on cthumuluru-crdb, DrewKimball, KAmbekar, and ZhouXing19).


pkg/sql/vecindex/manager.go line 149 at r4 (raw file):

			e.err = errors.NewAssertionErrorWithWrappedErrf(
				cause, "vector index construction panicked")
		case nil:

nit: this is dead code because panic(nil) becomes *runtime.PanicNilError. This whole section could probably be something like:

r := recover()
cause, ok := r.(error)
if !ok {
  // Covers both nil (Goexit-style exit) and non-error panic values.
  cause = errors.Newf("non-error panic value: %v", r)
}
e.err = errors.NewAssertionErrorWithWrappedErrf(cause, "vector index construction panicked")

Previously, panics originating from the vecindex packages were not in
the colexecerror allow-list, causing them to crash the server instead
of being returned as SQL errors. For example, a dimension mismatch
query against a vector index would bring down the node.

Catching these panics is safe for two reasons. The vecindex search
path holds no in-memory locks (it relies on KV-layer locking via
vecstore), so a panic mid-operation cannot tear shared state. And
Manager.Get — the one in-memory cache on the path — now cleans up
after a panic during cache-miss construction: waiters are woken with
an error and the failed entry is removed so the next call retries.

Caught panics are returned as assertion errors with sentry reports,
preserving bug visibility without crashing the server.

Resolves: cockroachdb#146694
Epic: none

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@KAmbekar KAmbekar force-pushed the vecidx-panic-allowlist branch from bcddff6 to fd8a66e Compare May 27, 2026 06:37
@cockroach-teamcity
Copy link
Copy Markdown
Member

⚪ Sysbench [SQL, 3node, oltp_read_write]
Metric Old Commit New Commit Delta Note
sec/op 10.16m ±8% 10.12m ±8% ~ p=1.000 n=15
allocs/op 6.301k ±1% 6.292k ±0% ~ p=0.087 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/fd8a66e/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fd8a66e81046db8c6a3faebfce672294b8b8f392/bin/pkg_sql_tests benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/a081a00/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a081a00a50d84e8da201bbe55b898ec41b9b013b/bin/pkg_sql_tests benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=a081a00 --new=fd8a66e --memprofile ./pkg/sql/tests
⚪ Sysbench [KV, 3node, oltp_read_only]
Metric Old Commit New Commit Delta Note
sec/op 3.045m ±1% 3.051m ±1% ~ p=0.174 n=15
allocs/op 2.101k ±0% 2.101k ±0% ~ p=0.671 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/fd8a66e/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fd8a66e81046db8c6a3faebfce672294b8b8f392/bin/pkg_sql_tests benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/a081a00/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a081a00a50d84e8da201bbe55b898ec41b9b013b/bin/pkg_sql_tests benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=a081a00 --new=fd8a66e --memprofile ./pkg/sql/tests
⚪ Sysbench [KV, 3node, oltp_write_only]
Metric Old Commit New Commit Delta Note
sec/op 2.823m ±1% 2.842m ±2% ~ p=0.461 n=15
allocs/op 4.199k ±0% 4.203k ±0% +0.10% p=0.018 n=15
Reproduce

benchdiff binaries:

mkdir -p benchdiff/fd8a66e/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fd8a66e81046db8c6a3faebfce672294b8b8f392/bin/pkg_sql_tests benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/a081a00/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a081a00a50d84e8da201bbe55b898ec41b9b013b/bin/pkg_sql_tests benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests

benchdiff command:

# NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=a081a00 --new=fd8a66e --memprofile ./pkg/sql/tests
Artifacts

download:

mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/fd8a66e81046db8c6a3faebfce672294b8b8f392/26495148089-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/a081a00a50d84e8da201bbe55b898ec41b9b013b/26495148089-1/\* old/

built with commit: fd8a66e81046db8c6a3faebfce672294b8b8f392

@trunk-io trunk-io Bot merged commit 72db412 into cockroachdb:master May 27, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

target-release-26.3.0 X-perf-check Microbenchmarks CI: Added to a PR if a performance regression is detected and should be checked

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vecindex: consider allow-listing the relevant packages in colexecerror to catch panics from

3 participants