Skip to content

Draft: fix COMMIT/ROLLBACK retry data-loss footgun (PostgreSQL & MySQL)#1216

Closed
skunkworker wants to merge 6 commits into
jruby:72-stablefrom
skunkworker:fix_retry_rollback_commit_dataloss
Closed

Draft: fix COMMIT/ROLLBACK retry data-loss footgun (PostgreSQL & MySQL)#1216
skunkworker wants to merge 6 commits into
jruby:72-stablefrom
skunkworker:fix_retry_rollback_commit_dataloss

Conversation

@skunkworker

@skunkworker skunkworker commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

NOTE: Used Opus 4.8 to identify this specific bug and add a test.

Fix: COMMIT/ROLLBACK retry data-loss footgun (PostgreSQL & MySQL)

The bug

The shared ArJdbc::Abstract::TransactionSupport mixin (lib/arjdbc/abstract/transaction_support.rb) implements every transaction control statement — BEGIN, COMMIT,
ROLLBACK, savepoints — with the same options:

  def commit_db_transaction
    log('COMMIT', 'TRANSACTION') do
      with_raw_connection(allow_retry: true, materialize_transactions: false) do |conn|
        conn.commit
      end
    end
  end

allow_retry: true tells ActiveRecord's with_raw_connection machinery that the operation is safe to replay if the connection drops. That's fine for BEGIN
(idempotent) but dangerous for COMMIT on a networked database. The failure sequence:

  1. App opens a transaction, writes rows.
  2. The backend connection is dropped right at commit time (pgbouncer reaping an idle/server connection, network blip, server restart).
  3. With allow_retry: true, AR catches the retryable ConnectionFailed, calls reconnect!(restore_transactions: true) — which replays BEGIN and savepoints but not
    the data writes (they died with the old backend).
  4. AR re-runs COMMIT on the fresh connection, against an empty transaction. It succeeds.
  5. The caller is told the commit succeeded. The writes are silently gone.

This diverges from ActiveRecord's own adapters, which deliberately use allow_retry: false for COMMIT/ROLLBACK on both PostgreSQL and MySQL (only SQLite, a local
file, uses true).

Why it became newly reachable on PostgreSQL

The recent pgbouncer patch (06e7afa) started translating backend disconnects into a retryable ActiveRecord::ConnectionFailed. Before that translation, a dropped
COMMIT surfaced as a raw PSQLException that AR never retried — so the footgun was latent. After it, the retry path is live, making silent data loss reachable in
normal pgbouncer deployments.

The fix

Override commit_db_transaction and exec_rollback_db_transaction in each adapter to force allow_retry: false, matching ActiveRecord's native adapters.
(materialize_transactions: true also matches AR's native contract for these methods.)

PostgreSQL — lib/arjdbc/postgresql/adapter.rb:585. The overrides live in the ArJdbc::PostgreSQL module, which is included after TransactionSupport, so they win in
the method resolution order.

MySQL — lib/arjdbc/mysql/adapter.rb:215. Here ArJdbc::MySQL is a native Java module, so the overrides are placed directly in the Mysql2Adapter class body
(class-level definitions override the included mixin).

Both adapters now use:

  def commit_db_transaction
    log('COMMIT', 'TRANSACTION') do
      with_raw_connection(allow_retry: false, materialize_transactions: true) do |conn|
        conn.commit
      end
    end
  end

  def exec_rollback_db_transaction
    log('ROLLBACK', 'TRANSACTION') do
      with_raw_connection(allow_retry: false, materialize_transactions: true) do |conn|
        conn.rollback
      end
    end
  end

BEGIN and savepoint methods are intentionally left with allow_retry: true — replaying them is safe and is what enables clean reconnection after an idle-pool drop.

MySQL caveat (why its fix is defensive)

Empirically verified: on MySQL/MariaDB a dropped-socket COMMIT currently raises a plain ActiveRecord::JDBCError, not a retryable ConnectionFailed. Since
retryable_connection_error? only matches ConnectionFailed/ConnectionNotEstablished, AR won't retry a failed MySQL commit today regardless of the flag — the
footgun is masked on MySQL. The MySQL fix therefore pins the safe contract so it can't be silently reintroduced if MySQL ever gains ConnectionFailed translation
(as PostgreSQL just did).

Also patch mysql

The gap: Last week's Postgres patch (06e7afa) translates backend disconnects into a retryable ActiveRecord::ConnectionFailed. MySQL had no equivalent —
Mysql2Adapter#translate_exception only matched two message patterns then called super. super (native AbstractMysqlAdapter) dispatches on error_number, so it only
catches the vendor "server gone" codes (2006/2013/…). But Connector/J and the MariaDB driver raise communications failures with SQLState class 08 (e.g. 08S01
"Communications link failure") and errorCode 0 — which neither path caught. So a proxy/server dropping an idle connection surfaced as a raw
JDBCError/StatementInvalid and AR's reconnect-and-retry never fired. (This is also why the earlier COMMIT-retry fix was latent on MySQL.)

The fix (lib/arjdbc/mysql/adapter.rb), mirroring the PG structure:

  • A connection_lost?(exception) helper checking, in order: SQLState class 08 (08000/08001/08003/08004/08006/08007/08S01), the server-gone vendor codes (2006,
    2013, 1053, 1927, 4031), connection-failure message patterns, and a wrapped SQLRecoverableException/SQLNonTransientConnectionException cause.
  • translate_exception now short-circuits to ConnectionFailed when connection_lost? is true, before the existing message cases and super.

Also fix: PostgreSQL connection leak on failed establishment

PostgreSQLRubyJdbcConnection#newConnection opened a physical backend via super.newConnection() and then ran connection.unwrap(PGConnection.class) plus six pgConnection.addDataType(...) registrations with no cleanup. If any of that post-connect setup threw, the already-open server-side connection was leaked — the caller only sees the exception and never receives a handle it could close.

Leaked backends accumulate against PostgreSQL's max_connections, which is exactly the kind of self-poisoning that produces a "too many clients already" / ConnectionTimeoutError cascade across a test suite or a long-running app.

The fix (src/java/arjdbc/postgresql/PostgreSQLRubyJdbcConnection.java) wraps the post-connect setup in a try/catch that closes the physical connection (suppressing any close error onto the original exception) before re-raising — mirroring how the native adapter discards a connection it could not fully establish.

Verified: a full PostgreSQL Rails test-suite run shows zero "too many clients already" occurrences (the only timeout markers trace to connection_pool_test.rb, a test that deliberately uses a 0.4s timeout).

Also add: SQL warnings support (AR 7.2 db_warnings_action)

The JDBC PostgreSQL adapter does not inherit the native pg adapter, so libpq's set_notice_receiver warning path never existed — db_warnings_action was a no-op and warning-related tests failed with undefined method 'to_a' for 0:Integer. PostgreSQL RAISE WARNING / NOTICE messages land on the JDBC Statement's warning chain and are lost when the statement closes, so they must be captured on the Java side.

  • RubyJdbcConnection#execute now captures statement.getWarnings() before the statement closes and exposes them via #last_warnings.
  • PostgreSQLRubyJdbcConnection#newWarning maps each PSQLWarning's ServerErrorMessage to [message, sqlstate, severity] (clean primary message, SQLSTATE, and "WARNING"/"NOTICE" level).
  • The PG raw_execute override (lib/arjdbc/postgresql/database_statements.rb) dispatches captured warnings through handle_warnings to ActiveRecord.db_warnings_action, applies warning_ignored? (NOTICE-level and the db_warnings_ignore allowlist are skipped), and normalises the Integer update-count return to [] for PG::Result API parity.

This clears the db_warnings_action log/raise/proc/allowlist tests. (The jar lib/arjdbc/jdbc/adapter_java.jar is gitignored and rebuilt during packaging/CI, so only the Java/Ruby source is committed.)


Also: reduce per-connection PostgreSQL OID type-map cost + forward exec retry kwargs

Profiling the lazy-connection path surfaced two adapter-side costs unrelated to the transaction fixes above. Ported from the 71-stable lazy-connection remediation work.

Background. Under AR 7.2's lazy-connection model the PostgreSQL adapter connects on first use and re-runs configure_connection on every reconnect. configure_connection ends with reload_type_map, which runs three pg_type/pg_range catalog sweeps to build the OID type map — and nothing shared those rows across connections. Under JRuby (no GVL, a genuinely concurrent pool, full-cost reconnects with no cheap PQreset) this expensive metadata bootstrap is paid repeatedly, on the request path.

A1 — build the OID type map once per adapter instance. @type_map is created in #initialize, and the automatic #initialize_type_map in #configure_connection is guarded by @type_map_initialized so reconnects reuse the map (OIDs are stable per server). #reload_type_map remains the explicit refresh for schema/type changes (extensions, enums, create_enum, etc.). It deliberately calls initialize_type_map (not reload_type_map) in the guarded path so a fresh adapter's first connect does not invalidate the shared cache another connection to the same database already populated.

A2 — share the catalog rows across connections to the same database. The first connection runs the three catalog sweeps and caches the resulting row-sets; every other connection builds its own isolated type_map from the shared copy with zero DB round-trips. The cache is scoped per-database — [host, port, database, jdbc_url], recovering the database name from the URL path for url-only configs — and invalidated by #reload_type_map. Thread-safety (JRuby has no GVL): a Mutex guards the cache Hash for all get/store/clear, the catalog queries run outside the lock (no blocking, no deadlock on re-entry, first-writer-wins on store), and cached rows are deep-frozen so concurrent readers need no lock and TypeMapInitializer#run (which only reads) can never mutate the shared copy.

Also — forward allow_retry: / materialize_transactions: from internal_exec_query. The method's signature accepted these kwargs but silently dropped them — line 47 called with_raw_connection do |conn| without forwarding, unlike raw_execute. Now forwarded, so retry/materialization intent is honored for queries that route through internal_exec_query.

Files: lib/arjdbc/postgresql/adapter.rb (A1), lib/arjdbc/postgresql/oid_types.rb (A2), lib/arjdbc/abstract/database_statements.rb (kwarg forwarding).

Verified: PostgreSQL suite unchanged at 311 tests; the 1 failure + 1 error are pre-existing and reproduce on the base branch (Ruby 3.4 File.exists? removal in a rake test; a prepared-statement timezone assertion) — this change introduces no new failures.

@skunkworker

Copy link
Copy Markdown
Contributor Author

Looks like the specs are failing because of the i18n v1.15.0 Fiber bug.

skunkworker and others added 3 commits June 18, 2026 18:43
…handling

Transaction control statements were defined in the shared
ArJdbc::Abstract::TransactionSupport mixin with allow_retry: true. That is
safe for the idempotent BEGIN/savepoint statements but dangerous for
COMMIT/ROLLBACK: if the backend connection drops at commit time (pgbouncer
reaping a connection, a network blip, a server restart), AR's
with_raw_connection machinery reconnects, replays BEGIN (but not the data
writes, which died with the old backend) and re-runs COMMIT against an empty
transaction - reporting success while silently losing the writes. This
became reachable on PostgreSQL once backend disconnects started being
translated to a retryable ActiveRecord::ConnectionFailed.

Override commit_db_transaction / exec_rollback_db_transaction in the
PostgreSQL and MySQL adapters to force allow_retry: false (matching AR's
native adapters); BEGIN and savepoints intentionally keep allow_retry: true.

MySQL additionally lacked translation of dropped connections: a lost backend
surfaced as a plain JDBCError (SQLState class 08 / errorCode 0), which AR's
retry path never matched. Add a connection_lost? helper (SQLState 08*,
server-gone vendor codes, message patterns, and wrapped
SQLRecoverableException/SQLNonTransientConnectionException causes) and
short-circuit translate_exception to ConnectionFailed.

Also: cap i18n < 1.15.0 (that release needs Ruby 3.2+, which we can't assume
across supported JRubies) and fix a timezone test in test/simple.rb where
with_timezone_config must wrap Time.use_zone to avoid the global-state leak
guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r JDBC

Correct savepoint behaviour in the shared transaction support and fix the
PostgreSQL#client_min_messages getter, which did not work under the JDBC
adapter (it does not inherit the native pg adapter implementation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…upport

1.1 Connection leak: PostgreSQLRubyJdbcConnection#newConnection opened a
physical backend via super.newConnection() but then ran unwrap()/addDataType()
without cleanup. Any failure there leaked the open server-side connection,
since the caller only saw the exception and never got a handle to close it.
Wrap the post-connect setup so the connection is closed (suppressing close
errors) before re-raising, mirroring the native adapter discarding a
connection it could not fully establish.

1.2 SQL warnings (AR 7.2 db_warnings_action): capture statement.getWarnings()
in RubyJdbcConnection#execute before the statement closes and expose them via
#last_warnings. PostgreSQLRubyJdbcConnection#newWarning maps PSQLWarning's
ServerErrorMessage to [message, sqlstate, severity]. The PG raw_execute
override dispatches them through handle_warnings to ActiveRecord
.db_warnings_action and normalises the Integer update-count return to [] for
PG::Result API parity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@skunkworker skunkworker force-pushed the fix_retry_rollback_commit_dataloss branch from c3bbd43 to 5bd86de Compare June 19, 2026 01:43
skunkworker and others added 3 commits June 23, 2026 10:29
Quality cleanup of the COMMIT/ROLLBACK retry fix plus added regression
coverage. No behavior change for the PostgreSQL/MySQL paths the original
fix targeted.

- Consolidate COMMIT/ROLLBACK no-retry into the shared TransactionSupport
  mixin (allow_retry: false, materialize_transactions: true), matching the
  savepoint precedent and AR's native adapters, and delete the byte-for-byte
  identical overrides from the PostgreSQL and MySQL adapters. BEGIN stays
  retryable (idempotent). This also extends the safe contract to the sqlite3
  and generic JDBC adapters, which inherit the mixin.

- Capture SQL warnings lazily: RubyJdbcConnection#execute now stores the raw
  SQLWarning chain (cheap reference) and #last_warnings maps it to Ruby
  tuples on demand, avoiding a per-execute array allocation on the common
  path where ActiveRecord.db_warnings_action is disabled.

- Collapse client_min_messages to a safe-navigation one-liner.

- Add tests: sqlite3 transaction_no_retry_test pins the mixin contract
  (commit/rollback/savepoints no-retry; BEGIN still retryable) and PG
  warnings_test covers warning capture/dispatch plus the NOTICE-filtered
  and no-warning negative cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under AR 7.2's lazy-connection model the PostgreSQL adapter rebuilt the
entire pg_type OID type map on every reconnect (configure_connection ends
with reload_type_map) and never shared the catalog rows across connections.
Under JRuby (no GVL, genuinely concurrent pool, full-cost reconnects) that
is an expensive metadata bootstrap paid on the request path.

A1 - build the OID type map once per adapter instance: create @type_map in
  #initialize and guard the automatic #initialize_type_map in
  #configure_connection with @type_map_initialized so reconnects reuse the
  map. #reload_type_map remains the explicit refresh for schema/type changes.

A2 - share the pg_type catalog rows across connections to the same database.
  The first connection runs the three catalog sweeps and caches the rows;
  other connections build their own isolated type_map from the shared copy
  with no DB round-trips. Cache is scoped per-database (host/port/name + JDBC
  url, recovering name from the url path for url-only configs) and invalidated
  by #reload_type_map. Thread-safe: a Mutex guards the cache Hash, the catalog
  queries run outside the lock, and cached rows are deep-frozen so concurrent
  readers need no lock.

Also forward allow_retry:/materialize_transactions: from
internal_exec_query into with_raw_connection (the signature accepted them but
silently dropped them, unlike raw_execute).

Ported from the 71-stable lazy-connection remediation work. PostgreSQL suite
unchanged: 311 tests, the 1 failure + 1 error are pre-existing (Ruby 3.4
File.exists? removal; prepared-statement timezone handling).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@headius

headius commented Jun 25, 2026

Copy link
Copy Markdown
Member

There's a large amount of extra information here that I can't easily verify. Without being able to independently verify these statements, I'll need some additional follow-up here.

Breaking it down

The important parts seem to be:

This diverges from ActiveRecord's own adapters, which deliberately use allow_retry: false for COMMIT/ROLLBACK on both PostgreSQL and MySQL

Override commit_db_transaction and exec_rollback_db_transaction in each adapter to force allow_retry: false, matching ActiveRecord's native adapters.

Matching the mainstream adapters is almost always the correct fix unless there's a good reason why the JDBC versions should not follow. In this case I would like to see correlation with lines in the CRuby adapters before merging. A glance at the changes seem to fit the description of the bug and fix, and logically the description makes sense.

Going off the rails

The MySQL fix therefore pins the safe contract so it can't be silently reintroduced if MySQL ever gains ConnectionFailed translation
(as PostgreSQL just did).

I assume that you are pasting multiple phases of your LLM dev session here, but this all ends up making the individual fixes very confusing.

Those patches go on to actually make the same change for MySQL, so we're looking at several incremental LLM changes balled up into a single description. That is significantly more difficult to verify independently. We need a summary by you, the submitter of each change being made, and if there are multiple commits in a single pull request they should have a good reason for being combined.

This is not to say the changes made are incorrect, but the logic to reach them can't be followed anymore. We do want the equivalent error message intelligence applied to MySQL, but it would be better done and easier to review as a separate patch.

The rest of this section's description and justification appears to be a retread of #1215, which we already established was a good fix that should be applied to MySQL. It is noise here.

And from here, the additional patches here are largely unrelated to the foot-gun and to each other. They should all be submitted as separate pull requests with independent justification.

LLM policy

As with JRuby, we will accept LLM-driven pull requests but they must be accompanied by discussion with the submitted (sans LLM, i.e. you must be able to justify all changes and own them). The changes made should be of a common topic and not a ball of "fix everything I found in this session".

Thank you for putting the time into driving your LLM to make these analyses and fixes. Break them up into topic PRs and we can start discussing and merging them.

@skunkworker skunkworker changed the title fix COMMIT/ROLLBACK retry data-loss footgun (PostgreSQL & MySQL) Draft: fix COMMIT/ROLLBACK retry data-loss footgun (PostgreSQL & MySQL) Jun 25, 2026
@headius

headius commented Jun 27, 2026

Copy link
Copy Markdown
Member

Closing in favor of finer-grained PRs #1218 and #1219 (already merged).

@headius headius closed this Jun 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants