fix(jdbc): add Hikari keepalive and bounded socket timeout#41851
fix(jdbc): add Hikari keepalive and bounded socket timeout#41851wyattwalter wants to merge 3 commits into
Conversation
A production pod recently hung for 66 hours with every Postgres query timing out at the 10s action timeout, recovering only on pod replacement. The root cause is the HikariCP connection-adder thread blocking indefinitely inside pgjdbc's auth read because socketTimeout defaults to 0 (infinite). Since addConnectionExecutor is single-threaded, one stuck creator wedges the entire pool. Two complementary changes across 4 JDBC plugins (Postgres, MSSQL, Oracle, Redshift): - Hikari setKeepaliveTime(150s): sends an isValid() probe on idle connections, well under AWS NAT Gateway's 350s idle-eviction default. Prevents the most common trigger that forces the pool to create a new connection in the first place. - Driver-level socket read timeout, default 600s, exposed via ConnectionPoolConfig.getSocketTimeoutSeconds() and configurable via appsmith.plugin.jdbc.socket-timeout-seconds (env: APPSMITH_PLUGIN_JDBC_SOCKET_TIMEOUT_SECONDS). Bounds every socket read including the auth-time read where the connection-adder was observed hung in the thread dump, so the pool can always recover. Driver-specific property mapping: - Postgres / Redshift (pgjdbc-derived): socketTimeout (seconds) - MSSQL: socketTimeout (milliseconds) - Oracle: oracle.net.READ_TIMEOUT (milliseconds) - Snowflake: keepalive only (driver uses a different property mechanism; full coverage deferred to a follow-up) Oracle and Redshift previously had no ConnectionPoolConfig injection; constructor-injected here following the existing Postgres/MSSQL pattern. Verified via toxiproxy + iptables harness that reproduces the wedge locally; thread dump shows zero CPU progress across snapshots. With the fix, the pool recovers automatically once the network path is restored. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
/build-deploy-preview skip-tests=true |
WalkthroughAdds a reactive socket-timeout configuration and applies it to Hikari/datasource setup across SQL Server, Oracle, Postgres, and Redshift plugins; Snowflake receives TCP keepalive tuning. Tests updated to provide mock socket-timeout values. ChangesSocket Timeout Configuration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Deploying Your Preview: https://github.com/appsmithorg/appsmith/actions/runs/26537670169. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@app/server/appsmith-plugins/oraclePlugin/src/main/java/com/external/plugins/utils/OracleDatasourceUtils.java`:
- Around line 452-455: The current code sets the socket read timeout with
config.addDataSourceProperty("oracle.net.READ_TIMEOUT",
String.valueOf(socketTimeoutSeconds * 1000)); but the Oracle thin driver expects
the documented property key "oracle.jdbc.ReadTimeout" (milliseconds). Change the
property key passed to config.addDataSourceProperty to "oracle.jdbc.ReadTimeout"
and keep the millisecond conversion using socketTimeoutSeconds * 1000
(reference: OracleDatasourceUtils, the config.addDataSourceProperty call and the
socketTimeoutSeconds variable).
In
`@app/server/appsmith-server/src/main/java/com/appsmith/server/configurations/connectionpool/ConnectionPoolConfigCEImpl.java`:
- Around line 27-30: getSocketTimeoutSeconds in ConnectionPoolConfigCEImpl
returns raw socketTimeoutSeconds which allows 0 or negative values to
disable/break read timeouts; validate and clamp or reject non-positive values
before returning by checking socketTimeoutSeconds in getSocketTimeoutSeconds and
returning Mono.just(...) with Math.max(socketTimeoutSeconds,
DEFAULT_SOCKET_TIMEOUT_SECONDS) (or throw a validation error) so the method
always yields a positive bounded timeout; reference the getSocketTimeoutSeconds
method, socketTimeoutSeconds field, and DEFAULT_SOCKET_TIMEOUT_SECONDS when
making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 026d4f9c-8282-4b88-ad9e-612e22feed33
📒 Files selected for processing (9)
app/server/appsmith-interfaces/src/main/java/com/appsmith/external/configurations/connectionpool/ConnectionPoolConfigCE.javaapp/server/appsmith-plugins/mssqlPlugin/src/main/java/com/external/plugins/MssqlPlugin.javaapp/server/appsmith-plugins/oraclePlugin/src/main/java/com/external/plugins/OraclePlugin.javaapp/server/appsmith-plugins/oraclePlugin/src/main/java/com/external/plugins/utils/OracleDatasourceUtils.javaapp/server/appsmith-plugins/postgresPlugin/src/main/java/com/external/plugins/PostgresPlugin.javaapp/server/appsmith-plugins/redshiftPlugin/src/main/java/com/external/plugins/RedshiftPlugin.javaapp/server/appsmith-plugins/redshiftPlugin/src/main/java/com/external/utils/RedshiftDatasourceUtils.javaapp/server/appsmith-plugins/snowflakePlugin/src/main/java/com/external/plugins/SnowflakePlugin.javaapp/server/appsmith-server/src/main/java/com/appsmith/server/configurations/connectionpool/ConnectionPoolConfigCEImpl.java
Three follow-up fixes from CI and review:
1. Unit-test compile errors: adding the ConnectionPoolConfig constructor
parameter to OraclePluginExecutor and RedshiftPluginExecutor broke
five test files that were instantiating them with no-arg. Updated
those tests (4 Oracle + 1 Redshift with 4 spy calls) to pass a local
MockConnectionPoolConfig matching the existing Postgres/MSSQL pattern.
2. Updated the new getSocketTimeoutSeconds() method on every existing
MockConnectionPoolConfig in test code (Postgres, MSSQL, 3x MySQL) so
the interface change does not break compilation.
3. CodeRabbit feedback applied:
- Oracle JDBC thin driver property is oracle.jdbc.ReadTimeout
(milliseconds), not oracle.net.READ_TIMEOUT. The latter is a
TNS/listener setting and would have been silently ignored.
- Clamp non-positive socket-timeout overrides to the default so a
stray 0 or negative value cannot re-introduce the infinite-read
wedge this config exists to prevent.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
This PR has not seen activitiy for a while. It will be closed in 7 days unless further activity is detected. |
|
/build-deploy-preview skip-tests=true |
|
Deploying Your Preview: https://github.com/appsmithorg/appsmith/actions/runs/27284912122. |
|
Deploy-Preview-URL: https://ce-41851.dp.appsmith.com |
|
This PR has not seen activitiy for a while. It will be closed in 7 days unless further activity is detected. |
|
This PR has been closed because of inactivity. |
Summary
Resolves APP-15267.
A production pod recently hung for ~66 hours with every Postgres query timing out at the 10s action timeout, recovering only on pod replacement. Root cause: HikariCP's connection-adder thread blocks indefinitely inside pgjdbc's auth read because
socketTimeoutdefaults to0(infinite). SinceaddConnectionExecutoris single-threaded, one stuck creator wedges the entire pool — every subsequent borrow fails at the action timeout because no replacement connection is ever produced.Two complementary changes across 4 JDBC plugins (Postgres, MSSQL, Oracle, Redshift):
setKeepaliveTime(150_000)—Connection.isValid()probe on idle connections every 150s, comfortably under AWS NAT's 350s idle-eviction default. Prevents the most common trigger that forces the pool to create new connections.appsmith.plugin.jdbc.socket-timeout-seconds(envAPPSMITH_PLUGIN_JDBC_SOCKET_TIMEOUT_SECONDS). 600s leaves plenty of headroom for legitimately slow queries (default action timeout is 10s) while ensuring the pool can always recover.Driver-specific property mapping:
socketTimeoutsocketTimeoutoracle.net.READ_TIMEOUTOracle and Redshift previously had no
ConnectionPoolConfiginjection; constructor-injected here following the existing Postgres/MSSQL pattern.Reproduction and validation
Reproduced the wedge end-to-end with toxiproxy + iptables against a local Appsmith v2.0 container. Thread dump confirmed the hang in
HikariPool-N connection adderatdoAuthentication.receiveChar, with byte-identical CPU time across snapshots ~7 minutes apart — definitive proof the thread was parked indefinitely. With the fix, the pool recovers automatically once the network path is restored.Test plan
APPSMITH_PLUGIN_JDBC_SOCKET_TIMEOUT_SECONDSis unsetFollow-ups (not in this PR)
networkTimeoutplumbing (its driver uses a different property mechanism that lives in upstream Properties)Connection.setNetworkTimeout(actionTimeout + buffer)inexecuteCommonfor more precise per-query bounding (defense in depth)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements
Tests
Warning
Tests have not run on the HEAD fedd1e4 yet
Tue, 09 Jun 2026 21:00:59 UTC