
Cap RequestErrUnprepared retry recursion on query and batch paths#1952

Open
arloliu wants to merge 1 commit into apache:trunk from arloliu:fix/cap-unprep-retry-recursion



arloliu commented May 17, 2026


Summary

Conn.executeQuery and Conn.executeBatch both respond to a *RequestErrUnprepared by evicting the prepared-statement cache entry and recursing on themselves with no upper bound.
When the server persistently re-reports the same statement as unprepared after re-prepare (for example, a coordinator thrashing its prepared-statement cache under high statement cardinality, or a misbehaving proxy/fork), the recursion never terminates. The goroutine stack eventually exceeds the limit set by runtime/debug.SetMaxStack (1 GB on 64-bit systems by default), at which point the Go runtime aborts the entire process with a fatal stack-overflow error. No recover() can intercept it.

The failure mode is reachable from any prepared-statement workload, and the batch path sits on the hot write path. No attacker is required: ordinary prepared-statement cache thrash on a Cassandra coordinator is sufficient.

This PR caps re-prepare retries on both paths at 5 and surfaces a descriptive error wrapping the underlying *RequestErrUnprepared so errors.As continues to work.

Approach

  • New unexported helpers executeQueryWithUnprepRetries / executeBatchWithUnprepRetries take an unprepAttempt int and increment it on each retry.
  • Existing entry points (executeQuery, executeBatch) become thin wrappers that start the counter at 0.
  • When the counter reaches maxUnprepRetries = 5, an Iter is returned whose err is fmt.Errorf("…after N re-prepare attempts: %w", serverErr).
  • Behavior is otherwise unchanged: queries and batches that succeed within 5 re-prepare attempts behave exactly as before. Only the pathological no-progress case differs.

Tests

Test fake-server (conn_test.go) gains:

  • an always-unprep opPrepare case returning id=99,
  • an opBatch arm that replies ErrCodeUnprepared when any statement carries id=99,
  • case-insensitive verb-trimming in the opPrepare query-name parser (select|insert|update|delete) so DML in batches can reach the always-unprep case.
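The two fake-server behaviors in the list above (verb trimming in the opPrepare name parser, and the opBatch check for id=99) can be sketched as follows. This is a minimal sketch with several assumptions: the fake is assumed to identify a test case by the first word after the verb, the always-unprep ID is assumed to be the single byte 99, and `queryName`, `alwaysUnprepID`, `batchHasAlwaysUnprep`, and the `always_unprep` query name are hypothetical, not the helpers in conn_test.go.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// dmlVerbs are the leading verbs the hypothetical parser strips.
var dmlVerbs = []string{"select", "insert", "update", "delete"}

// queryName returns the first word after a leading DML verb, matched
// case-insensitively; otherwise it returns the first word of the query.
func queryName(query string) string {
	fields := strings.Fields(query)
	if len(fields) == 0 {
		return ""
	}
	if len(fields) > 1 {
		for _, verb := range dmlVerbs {
			if strings.EqualFold(fields[0], verb) {
				return fields[1]
			}
		}
	}
	return fields[0]
}

// alwaysUnprepID is assumed to be encoded as the single byte 99.
var alwaysUnprepID = []byte{99}

// batchHasAlwaysUnprep reports whether any statement in a batch carries
// alwaysUnprepID, in which case the fake would reply ErrCodeUnprepared.
func batchHasAlwaysUnprep(stmtIDs [][]byte) bool {
	for _, id := range stmtIDs {
		if bytes.Equal(id, alwaysUnprepID) {
			return true
		}
	}
	return false
}

func main() {
	for _, q := range []string{"SELECT always_unprep", "insert always_unprep", "always_unprep"} {
		fmt.Printf("%q -> %q\n", q, queryName(q))
	}
	fmt.Println(batchHasAlwaysUnprep([][]byte{{1}, {99}})) // prints: true
}
```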

unprep_retry_test.go (new):

  • TestExecuteQuery_UnprepRetryIsCapped drives the always-unprep path through Query.Exec.
  • TestExecuteBatch_UnprepRetryIsCapped drives it through Session.ExecuteBatch.
  • Each test asserts that the call terminates (no infinite recursion), the error mentions "re-prepare attempts", errors.As into a *RequestErrUnprepared target succeeds, and the server received exactly maxUnprepRetries + 1 prepare/execute (respectively prepare/batch) pairs.

Test plan

  • make check clean
  • make test-unit green (both new tests pass)
  • CI on GitHub Actions

Notes

No CASSGO ticket attached; happy to file one and amend the commit / CHANGELOG entry if the committer prefers.

Conn.executeQuery and Conn.executeBatch both responded to a
*RequestErrUnprepared by evicting the prepared-statement cache entry
and recursing on themselves with no upper bound. When the server
persistently re-reports the same statement as unprepared after
re-prepare (a coordinator thrashing its prepared-statement cache,
or a misbehaving proxy/fork), the recursion never terminates. The
goroutine stack eventually exceeds the limit set by
runtime/debug.SetMaxStack (1 GB on 64-bit systems by default), at
which point the Go runtime aborts the entire process with a fatal
stack overflow; no recover() can intercept it.
The failure mode is reachable from any prepared-statement workload;
the batch path is on the hot write path.

Cap recursion on both paths at maxUnprepRetries = 5 by threading an
unprepAttempt counter through unexported helpers
executeQueryWithUnprepRetries and executeBatchWithUnprepRetries. When
the cap fires, return an Iter whose err is

  fmt.Errorf("...after N re-prepare attempts: %w", serverErr)

The %w wrap means callers can:

  - Detect cap-driven failures by matching the error message.
  - Recover the underlying *RequestErrUnprepared with errors.As to
    inspect the StatementId the server kept rejecting.

Behavior is otherwise unchanged: queries and batches that succeed
within 5 re-prepare attempts behave exactly as before. Only the
pathological no-progress case is changed.

Test fake-server (conn_test.go) gains an always-unprep opPrepare
case returning id=99, an opBatch arm that replies ErrCodeUnprepared
when any statement carries id=99, and case-insensitive verb-trimming
in the opPrepare query-name parser (select/insert/update/delete) so
DML in batches can reach the always-unprep case. Two new tests in
unprep_retry_test.go drive the always-unprep path through Query.Exec
and Session.ExecuteBatch respectively, assert no infinite recursion,
verify the wrap is recoverable via errors.As, and check that the
server received exactly maxUnprepRetries+1 prepare/execute (resp.
prepare/batch) pairs.

Patch by Arlo Liu