Draft: fix COMMIT/ROLLBACK retry data-loss footgun (PostgreSQL & MySQL)#1216
Draft: fix COMMIT/ROLLBACK retry data-loss footgun (PostgreSQL & MySQL)#1216skunkworker wants to merge 6 commits into
Conversation
|
Looks like the specs are failing because of the i18n |
…handling Transaction control statements were defined in the shared ArJdbc::Abstract::TransactionSupport mixin with allow_retry: true. That is safe for the idempotent BEGIN/savepoint statements but dangerous for COMMIT/ROLLBACK: if the backend connection drops at commit time (pgbouncer reaping a connection, a network blip, a server restart), AR's with_raw_connection machinery reconnects, replays BEGIN (but not the data writes, which died with the old backend) and re-runs COMMIT against an empty transaction - reporting success while silently losing the writes. This became reachable on PostgreSQL once backend disconnects started being translated to a retryable ActiveRecord::ConnectionFailed. Override commit_db_transaction / exec_rollback_db_transaction in the PostgreSQL and MySQL adapters to force allow_retry: false (matching AR's native adapters); BEGIN and savepoints intentionally keep allow_retry: true. MySQL additionally lacked translation of dropped connections: a lost backend surfaced as a plain JDBCError (SQLState class 08 / errorCode 0), which AR's retry path never matched. Add a connection_lost? helper (SQLState 08*, server-gone vendor codes, message patterns, and wrapped SQLRecoverableException/SQLNonTransientConnectionException causes) and short-circuit translate_exception to ConnectionFailed. Also: cap i18n < 1.15.0 (that release needs Ruby 3.2+, which we can't assume across supported JRubies) and fix a timezone test in test/simple.rb where with_timezone_config must wrap Time.use_zone to avoid the global-state leak guard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r JDBC Correct savepoint behaviour in the shared transaction support and fix the PostgreSQL#client_min_messages getter, which did not work under the JDBC adapter (it does not inherit the native pg adapter implementation). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…upport 1.1 Connection leak: PostgreSQLRubyJdbcConnection#newConnection opened a physical backend via super.newConnection() but then ran unwrap()/addDataType() without cleanup. Any failure there leaked the open server-side connection, since the caller only saw the exception and never got a handle to close it. Wrap the post-connect setup so the connection is closed (suppressing close errors) before re-raising, mirroring the native adapter discarding a connection it could not fully establish. 1.2 SQL warnings (AR 7.2 db_warnings_action): capture statement.getWarnings() in RubyJdbcConnection#execute before the statement closes and expose them via #last_warnings. PostgreSQLRubyJdbcConnection#newWarning maps PSQLWarning's ServerErrorMessage to [message, sqlstate, severity]. The PG raw_execute override dispatches them through handle_warnings to ActiveRecord .db_warnings_action and normalises the Integer update-count return to [] for PG::Result API parity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
c3bbd43 to
5bd86de
Compare
Quality cleanup of the COMMIT/ROLLBACK retry fix plus added regression coverage. No behavior change for the PostgreSQL/MySQL paths the original fix targeted. - Consolidate COMMIT/ROLLBACK no-retry into the shared TransactionSupport mixin (allow_retry: false, materialize_transactions: true), matching the savepoint precedent and AR's native adapters, and delete the byte-for-byte identical overrides from the PostgreSQL and MySQL adapters. BEGIN stays retryable (idempotent). This also extends the safe contract to the sqlite3 and generic JDBC adapters, which inherit the mixin. - Capture SQL warnings lazily: RubyJdbcConnection#execute now stores the raw SQLWarning chain (cheap reference) and #last_warnings maps it to Ruby tuples on demand, avoiding a per-execute array allocation on the common path where ActiveRecord.db_warnings_action is disabled. - Collapse client_min_messages to a safe-navigation one-liner. - Add tests: sqlite3 transaction_no_retry_test pins the mixin contract (commit/rollback/savepoints no-retry; BEGIN still retryable) and PG warnings_test covers warning capture/dispatch plus the NOTICE-filtered and no-warning negative cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under AR 7.2's lazy-connection model the PostgreSQL adapter rebuilt the entire pg_type OID type map on every reconnect (configure_connection ends with reload_type_map) and never shared the catalog rows across connections. Under JRuby (no GVL, genuinely concurrent pool, full-cost reconnects) that is an expensive metadata bootstrap paid on the request path. A1 - build the OID type map once per adapter instance: create @type_map in #initialize and guard the automatic #initialize_type_map in #configure_connection with @type_map_initialized so reconnects reuse the map. #reload_type_map remains the explicit refresh for schema/type changes. A2 - share the pg_type catalog rows across connections to the same database. The first connection runs the three catalog sweeps and caches the rows; other connections build their own isolated type_map from the shared copy with no DB round-trips. Cache is scoped per-database (host/port/name + JDBC url, recovering name from the url path for url-only configs) and invalidated by #reload_type_map. Thread-safe: a Mutex guards the cache Hash, the catalog queries run outside the lock, and cached rows are deep-frozen so concurrent readers need no lock. Also forward allow_retry:/materialize_transactions: from internal_exec_query into with_raw_connection (the signature accepted them but silently dropped them, unlike raw_execute). Ported from the 71-stable lazy-connection remediation work. PostgreSQL suite unchanged: 311 tests, the 1 failure + 1 error are pre-existing (Ruby 3.4 File.exists? removal; prepared-statement timezone handling). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
There's a large amount of extra information here that I can't easily verify. Without being able to independently verify these statements, I'll need some additional follow-up here. Breaking it downThe important parts seem to be:
Matching the mainstream adapters is almost always the correct fix unless there's a good reason why the JDBC versions should not follow. In this case I would like to see correlation with lines in the CRuby adapters before merging. A glance at the changes seem to fit the description of the bug and fix, and logically the description makes sense. Going off the rails
I assume that you are pasting multiple phases of your LLM dev session here, but this all ends up making the individual fixes very confusing. Those patches go on to actually make the same change for MySQL, so we're looking at several incremental LLM changes balled up into a single description. That is significantly more difficult to verify independently. We need a summary by you, the submitter of each change being made, and if there are multiple commits in a single pull request they should have a good reason for being combined. This is not to say the changes made are incorrect, but the logic to reach them can't be followed anymore. We do want the equivalent error message intelligence applied to MySQL, but it would be better done and easier to review as a separate patch. The rest of this section's description and justification appears to be a retread of #1215, which we already established was a good fix that should be applied to MySQL. It is noise here. And from here, the additional patches here are largely unrelated to the foot-gun and to each other. They should all be submitted as separate pull requests with independent justification. LLM policyAs with JRuby, we will accept LLM-driven pull requests but they must be accompanied by discussion with the submitted (sans LLM, i.e. you must be able to justify all changes and own them). The changes made should be of a common topic and not a ball of "fix everything I found in this session". Thank you for putting the time into driving your LLM to make these analyses and fixes. Break them up into topic PRs and we can start discussing and merging them. |
NOTE: Used Opus 4.8 to identify this specific bug and add a test.
Fix: COMMIT/ROLLBACK retry data-loss footgun (PostgreSQL & MySQL)
The bug
The shared ArJdbc::Abstract::TransactionSupport mixin (lib/arjdbc/abstract/transaction_support.rb) implements every transaction control statement — BEGIN, COMMIT,
ROLLBACK, savepoints — with the same options:
allow_retry: truetells ActiveRecord's with_raw_connection machinery that the operation is safe to replay if the connection drops. That's fine for BEGIN(idempotent) but dangerous for COMMIT on a networked database. The failure sequence:
llow_retry: true, AR catches the retryable ConnectionFailed, callsreconnect!(restore_transactions: true) — which replays BEGIN and savepoints but notthe data writes (they died with the old backend).
This diverges from ActiveRecord's own adapters, which deliberately use
allow_retry: falsefor COMMIT/ROLLBACK on both PostgreSQL and MySQL (only SQLite, a localfile, uses true).
Why it became newly reachable on PostgreSQL
The recent pgbouncer patch (06e7afa) started translating backend disconnects into a retryable ActiveRecord::ConnectionFailed. Before that translation, a dropped
COMMIT surfaced as a raw PSQLException that AR never retried — so the footgun was latent. After it, the retry path is live, making silent data loss reachable in
normal pgbouncer deployments.
The fix
Override commit_db_transaction and exec_rollback_db_transaction in each adapter to force
allow_retry: false, matching ActiveRecord's native adapters.(materialize_transactions: true also matches AR's native contract for these methods.)
PostgreSQL — lib/arjdbc/postgresql/adapter.rb:585. The overrides live in the ArJdbc::PostgreSQL module, which is included after TransactionSupport, so they win in
the method resolution order.
MySQL — lib/arjdbc/mysql/adapter.rb:215. Here ArJdbc::MySQL is a native Java module, so the overrides are placed directly in the Mysql2Adapter class body
(class-level definitions override the included mixin).
Both adapters now use:
BEGIN and savepoint methods are intentionally left with
allow_retry: true— replaying them is safe and is what enables clean reconnection after an idle-pool drop.MySQL caveat (why its fix is defensive)
Empirically verified: on MySQL/MariaDB a dropped-socket COMMIT currently raises a plain ActiveRecord::JDBCError, not a retryable ConnectionFailed. Since
retryable_connection_error? only matches ConnectionFailed/ConnectionNotEstablished, AR won't retry a failed MySQL commit today regardless of the flag — the
footgun is masked on MySQL. The MySQL fix therefore pins the safe contract so it can't be silently reintroduced if MySQL ever gains ConnectionFailed translation
(as PostgreSQL just did).
Also patch mysql
The gap: Last week's Postgres patch (06e7afa) translates backend disconnects into a retryable ActiveRecord::ConnectionFailed. MySQL had no equivalent —
Mysql2Adapter#translate_exception only matched two message patterns then called super. super (native AbstractMysqlAdapter) dispatches on error_number, so it only
catches the vendor "server gone" codes (2006/2013/…). But Connector/J and the MariaDB driver raise communications failures with SQLState class 08 (e.g. 08S01
"Communications link failure") and errorCode 0 — which neither path caught. So a proxy/server dropping an idle connection surfaced as a raw
JDBCError/StatementInvalid and AR's reconnect-and-retry never fired. (This is also why the earlier COMMIT-retry fix was latent on MySQL.)
The fix (lib/arjdbc/mysql/adapter.rb), mirroring the PG structure:
2013, 1053, 1927, 4031), connection-failure message patterns, and a wrapped SQLRecoverableException/SQLNonTransientConnectionException cause.
Also fix: PostgreSQL connection leak on failed establishment
PostgreSQLRubyJdbcConnection#newConnectionopened a physical backend viasuper.newConnection()and then ranconnection.unwrap(PGConnection.class)plus sixpgConnection.addDataType(...)registrations with no cleanup. If any of that post-connect setup threw, the already-open server-side connection was leaked — the caller only sees the exception and never receives a handle it could close.Leaked backends accumulate against PostgreSQL's
max_connections, which is exactly the kind of self-poisoning that produces a "too many clients already" /ConnectionTimeoutErrorcascade across a test suite or a long-running app.The fix (
src/java/arjdbc/postgresql/PostgreSQLRubyJdbcConnection.java) wraps the post-connect setup in atry/catchthat closes the physical connection (suppressing any close error onto the original exception) before re-raising — mirroring how the native adapter discards a connection it could not fully establish.Verified: a full PostgreSQL Rails test-suite run shows zero "too many clients already" occurrences (the only timeout markers trace to
connection_pool_test.rb, a test that deliberately uses a 0.4s timeout).Also add: SQL warnings support (AR 7.2
db_warnings_action)The JDBC PostgreSQL adapter does not inherit the native pg adapter, so libpq's
set_notice_receiverwarning path never existed —db_warnings_actionwas a no-op and warning-related tests failed withundefined method 'to_a' for 0:Integer. PostgreSQLRAISE WARNING/ NOTICE messages land on the JDBC Statement's warning chain and are lost when the statement closes, so they must be captured on the Java side.RubyJdbcConnection#executenow capturesstatement.getWarnings()before the statement closes and exposes them via#last_warnings.PostgreSQLRubyJdbcConnection#newWarningmaps eachPSQLWarning'sServerErrorMessageto[message, sqlstate, severity](clean primary message, SQLSTATE, and "WARNING"/"NOTICE" level).raw_executeoverride (lib/arjdbc/postgresql/database_statements.rb) dispatches captured warnings throughhandle_warningstoActiveRecord.db_warnings_action, applieswarning_ignored?(NOTICE-level and thedb_warnings_ignoreallowlist are skipped), and normalises the Integer update-count return to[]forPG::ResultAPI parity.This clears the
db_warnings_actionlog/raise/proc/allowlist tests. (The jarlib/arjdbc/jdbc/adapter_java.jaris gitignored and rebuilt during packaging/CI, so only the Java/Ruby source is committed.)Also: reduce per-connection PostgreSQL OID type-map cost + forward exec retry kwargs
Profiling the lazy-connection path surfaced two adapter-side costs unrelated to the transaction fixes above. Ported from the 71-stable lazy-connection remediation work.
Background. Under AR 7.2's lazy-connection model the PostgreSQL adapter connects on first use and re-runs
configure_connectionon every reconnect.configure_connectionends withreload_type_map, which runs threepg_type/pg_rangecatalog sweeps to build the OID type map — and nothing shared those rows across connections. Under JRuby (no GVL, a genuinely concurrent pool, full-cost reconnects with no cheapPQreset) this expensive metadata bootstrap is paid repeatedly, on the request path.A1 — build the OID type map once per adapter instance.
@type_mapis created in#initialize, and the automatic#initialize_type_mapin#configure_connectionis guarded by@type_map_initializedso reconnects reuse the map (OIDs are stable per server).#reload_type_mapremains the explicit refresh for schema/type changes (extensions, enums,create_enum, etc.). It deliberately callsinitialize_type_map(notreload_type_map) in the guarded path so a fresh adapter's first connect does not invalidate the shared cache another connection to the same database already populated.A2 — share the catalog rows across connections to the same database. The first connection runs the three catalog sweeps and caches the resulting row-sets; every other connection builds its own isolated
type_mapfrom the shared copy with zero DB round-trips. The cache is scoped per-database —[host, port, database, jdbc_url], recovering the database name from the URL path for url-only configs — and invalidated by#reload_type_map. Thread-safety (JRuby has no GVL): aMutexguards the cacheHashfor all get/store/clear, the catalog queries run outside the lock (no blocking, no deadlock on re-entry, first-writer-wins on store), and cached rows are deep-frozen so concurrent readers need no lock andTypeMapInitializer#run(which only reads) can never mutate the shared copy.Also — forward
allow_retry:/materialize_transactions:frominternal_exec_query. The method's signature accepted these kwargs but silently dropped them — line 47 calledwith_raw_connection do |conn|without forwarding, unlikeraw_execute. Now forwarded, so retry/materialization intent is honored for queries that route throughinternal_exec_query.Files:
lib/arjdbc/postgresql/adapter.rb(A1),lib/arjdbc/postgresql/oid_types.rb(A2),lib/arjdbc/abstract/database_statements.rb(kwarg forwarding).Verified: PostgreSQL suite unchanged at 311 tests; the 1 failure + 1 error are pre-existing and reproduce on the base branch (Ruby 3.4
File.exists?removal in a rake test; a prepared-statement timezone assertion) — this change introduces no new failures.