Skip to content

fix(jdbc): add Hikari keepalive and bounded socket timeout#41851

Closed
wyattwalter wants to merge 3 commits into
releasefrom
fix/jdbc-pool-keepalive-and-socket-timeout
Closed

fix(jdbc): add Hikari keepalive and bounded socket timeout#41851
wyattwalter wants to merge 3 commits into
releasefrom
fix/jdbc-pool-keepalive-and-socket-timeout

Conversation

@wyattwalter

@wyattwalter wyattwalter commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Resolves APP-15267.

A production pod recently hung for ~66 hours with every Postgres query timing out at the 10s action timeout, recovering only on pod replacement. Root cause: HikariCP's connection-adder thread blocks indefinitely inside pgjdbc's auth read because socketTimeout defaults to 0 (infinite). Since addConnectionExecutor is single-threaded, one stuck creator wedges the entire pool — every subsequent borrow fails at the action timeout because no replacement connection is ever produced.

Two complementary changes across 4 JDBC plugins (Postgres, MSSQL, Oracle, Redshift):

  • Hikari setKeepaliveTime(150_000)Connection.isValid() probe on idle connections every 150s, comfortably under AWS NAT's 350s idle-eviction default. Prevents the most common trigger that forces the pool to create new connections.
  • Driver-level socket read timeout — bounds every socket read including the auth-time read where the connection-adder was observed hung. Default 600s (10 min), configurable via appsmith.plugin.jdbc.socket-timeout-seconds (env APPSMITH_PLUGIN_JDBC_SOCKET_TIMEOUT_SECONDS). 600s leaves plenty of headroom for legitimately slow queries (default action timeout is 10s) while ensuring the pool can always recover.

Driver-specific property mapping:

Plugin Property Unit
Postgres / Redshift socketTimeout seconds
MSSQL socketTimeout milliseconds
Oracle oracle.net.READ_TIMEOUT milliseconds
Snowflake keepalive only (driver uses a different property mechanism; full coverage deferred)

Oracle and Redshift previously had no ConnectionPoolConfig injection; constructor-injected here following the existing Postgres/MSSQL pattern.

Reproduction and validation

Reproduced the wedge end-to-end with toxiproxy + iptables against a local Appsmith v2.0 container. Thread dump confirmed the hang in HikariPool-N connection adder at doAuthentication.receiveChar, with byte-identical CPU time across snapshots ~7 minutes apart — definitive proof the thread was parked indefinitely. With the fix, the pool recovers automatically once the network path is restored.

Test plan

  • Deploy preview comes up cleanly and Postgres datasources connect normally
  • Verify default 600s applies when APPSMITH_PLUGIN_JDBC_SOCKET_TIMEOUT_SECONDS is unset
  • Verify env-var override is honored (set to a different value, observe behavior)
  • Confirm a normal Postgres query under 600s server-side latency executes successfully
  • (Manual repro) Simulate connection-adder wedge via toxiproxy + iptables; confirm pool recovers within ~600s instead of hanging indefinitely

Follow-ups (not in this PR)

  • Snowflake networkTimeout plumbing (its driver uses a different property mechanism that lives in upstream Properties)
  • Consider per-query Connection.setNetworkTimeout(actionTimeout + buffer) in executeCommon for more precise per-query bounding (defense in depth)
  • Production observability: periodic Hikari pool-stats logging (not just on successful query) so the next recurrence — if any — surfaces faster

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Configurable socket timeout for database connections with a 600-second default, exposed via configuration.
  • Improvements

    • Connection pools across multiple database plugins now respect the configured socket timeout and enable TCP keepalive probes to reduce idle connection drops.
  • Tests

    • Updated tests to provide and validate the new socket timeout and pool-size parameters.

Review Change Stack

Warning

Tests have not run on the HEAD fedd1e4 yet


Tue, 09 Jun 2026 21:00:59 UTC

A production pod recently hung for 66 hours with every Postgres query
timing out at the 10s action timeout, recovering only on pod replacement.
The root cause is the HikariCP connection-adder thread blocking
indefinitely inside pgjdbc's auth read because socketTimeout defaults to
0 (infinite). Since addConnectionExecutor is single-threaded, one stuck
creator wedges the entire pool.

Two complementary changes across 4 JDBC plugins (Postgres, MSSQL, Oracle,
Redshift):

- Hikari setKeepaliveTime(150s): sends an isValid() probe on idle
  connections, well under AWS NAT Gateway's 350s idle-eviction default.
  Prevents the most common trigger that forces the pool to create a
  new connection in the first place.

- Driver-level socket read timeout, default 600s, exposed via
  ConnectionPoolConfig.getSocketTimeoutSeconds() and configurable via
  appsmith.plugin.jdbc.socket-timeout-seconds (env:
  APPSMITH_PLUGIN_JDBC_SOCKET_TIMEOUT_SECONDS). Bounds every socket read
  including the auth-time read where the connection-adder was observed
  hung in the thread dump, so the pool can always recover.

Driver-specific property mapping:
- Postgres / Redshift (pgjdbc-derived): socketTimeout (seconds)
- MSSQL: socketTimeout (milliseconds)
- Oracle: oracle.net.READ_TIMEOUT (milliseconds)
- Snowflake: keepalive only (driver uses a different property mechanism;
  full coverage deferred to a follow-up)

Oracle and Redshift previously had no ConnectionPoolConfig injection;
constructor-injected here following the existing Postgres/MSSQL pattern.

Verified via toxiproxy + iptables harness that reproduces the wedge
locally; thread dump shows zero CPU progress across snapshots. With the
fix, the pool recovers automatically once the network path is restored.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@wyattwalter

Copy link
Copy Markdown
Contributor Author

/build-deploy-preview skip-tests=true

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Adds a reactive socket-timeout configuration and applies it to Hikari/datasource setup across SQL Server, Oracle, Postgres, and Redshift plugins; Snowflake receives TCP keepalive tuning. Tests updated to provide mock socket-timeout values.

Changes

Socket Timeout Configuration

Layer / File(s) Summary
Socket timeout contract and server configuration
app/server/appsmith-interfaces/src/main/java/com/appsmith/external/configurations/connectionpool/ConnectionPoolConfigCE.java, app/server/appsmith-server/src/main/java/com/appsmith/server/configurations/connectionpool/ConnectionPoolConfigCEImpl.java
Adds getSocketTimeoutSeconds() to the connection-pool interface; implements it with Spring-injected appsmith.plugin.jdbc.socket-timeout-seconds (default 600s).
SQL Server plugin socket timeout integration
app/server/appsmith-plugins/mssqlPlugin/src/main/java/com/external/plugins/MssqlPlugin.java
Defers pool creation with Mono.zip(maxPoolSize, socketTimeoutSeconds) and extends createConnectionPool to set TCP keepalive and a JDBC socketTimeout property (ms).
Postgres plugin socket timeout integration
app/server/appsmith-plugins/postgresPlugin/src/main/java/com/external/plugins/PostgresPlugin.java
Zips max pool size and socket timeout from config, updates createConnectionPool(...) signature, and sets setKeepaliveTime(150_000) plus socketTimeout datasource property.
Oracle plugin socket timeout integration
app/server/appsmith-plugins/oraclePlugin/src/main/java/com/external/plugins/OraclePlugin.java, app/server/appsmith-plugins/oraclePlugin/src/main/java/com/external/plugins/utils/OracleDatasourceUtils.java
Constructor-injects ConnectionPoolConfig, retrieves socket timeout, passes it to createConnectionPool(datasourceConfiguration, socketTimeoutSeconds), and sets Oracle JDBC oracle.jdbc.ReadTimeout plus keepalive.
Redshift plugin socket timeout integration
app/server/appsmith-plugins/redshiftPlugin/src/main/java/com/external/plugins/RedshiftPlugin.java, app/server/appsmith-plugins/redshiftPlugin/src/main/java/com/external/utils/RedshiftDatasourceUtils.java
Injects ConnectionPoolConfig into executor, reads socket timeout, updates createConnectionPool to accept it, and configures keepalive and driver socketTimeout property.
Snowflake plugin TCP keepalive configuration
app/server/appsmith-plugins/snowflakePlugin/src/main/java/com/external/plugins/SnowflakePlugin.java
Adds config.setKeepaliveTime(150_000) to Hikari setup to probe idle connections.
Test updates: MockConnectionPoolConfig implementations
app/server/appsmith-plugins/*/src/test/java/...
Multiple plugin tests add or update MockConnectionPoolConfig to return Mono.just(600) for getSocketTimeoutSeconds() and construct executors with the mock config.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

ok-to-test

Suggested reviewers

  • sondermanish

Poem

⏳ Socket timeouts now flow,
Across pools where JDBC goes,
Keepalives knock when links sleep,
Milliseconds bound each read,
Tests mock the tide so builders know.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title concisely and accurately describes the main change: adding Hikari keepalive configuration and socket timeout bounds to JDBC plugins.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is comprehensive and follows the template structure with clear motivation, technical details, issue reference, and test plan.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/jdbc-pool-keepalive-and-socket-timeout

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown

Deploying Your Preview: https://github.com/appsmithorg/appsmith/actions/runs/26537670169.
Workflow: On demand build Docker image and deploy preview.
skip-tests: true.
env: ``.
PR: 41851.
recreate: .
base-image-tag: .

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@app/server/appsmith-plugins/oraclePlugin/src/main/java/com/external/plugins/utils/OracleDatasourceUtils.java`:
- Around line 452-455: The current code sets the socket read timeout with
config.addDataSourceProperty("oracle.net.READ_TIMEOUT",
String.valueOf(socketTimeoutSeconds * 1000)); but the Oracle thin driver expects
the documented property key "oracle.jdbc.ReadTimeout" (milliseconds). Change the
property key passed to config.addDataSourceProperty to "oracle.jdbc.ReadTimeout"
and keep the millisecond conversion using socketTimeoutSeconds * 1000
(reference: OracleDatasourceUtils, the config.addDataSourceProperty call and the
socketTimeoutSeconds variable).

In
`@app/server/appsmith-server/src/main/java/com/appsmith/server/configurations/connectionpool/ConnectionPoolConfigCEImpl.java`:
- Around line 27-30: getSocketTimeoutSeconds in ConnectionPoolConfigCEImpl
returns raw socketTimeoutSeconds which allows 0 or negative values to
disable/break read timeouts; validate and clamp or reject non-positive values
before returning by checking socketTimeoutSeconds in getSocketTimeoutSeconds and
returning Mono.just(...) with Math.max(socketTimeoutSeconds,
DEFAULT_SOCKET_TIMEOUT_SECONDS) (or throw a validation error) so the method
always yields a positive bounded timeout; reference the getSocketTimeoutSeconds
method, socketTimeoutSeconds field, and DEFAULT_SOCKET_TIMEOUT_SECONDS when
making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 026d4f9c-8282-4b88-ad9e-612e22feed33

📥 Commits

Reviewing files that changed from the base of the PR and between 99b2c28 and c00a042.

📒 Files selected for processing (9)
  • app/server/appsmith-interfaces/src/main/java/com/appsmith/external/configurations/connectionpool/ConnectionPoolConfigCE.java
  • app/server/appsmith-plugins/mssqlPlugin/src/main/java/com/external/plugins/MssqlPlugin.java
  • app/server/appsmith-plugins/oraclePlugin/src/main/java/com/external/plugins/OraclePlugin.java
  • app/server/appsmith-plugins/oraclePlugin/src/main/java/com/external/plugins/utils/OracleDatasourceUtils.java
  • app/server/appsmith-plugins/postgresPlugin/src/main/java/com/external/plugins/PostgresPlugin.java
  • app/server/appsmith-plugins/redshiftPlugin/src/main/java/com/external/plugins/RedshiftPlugin.java
  • app/server/appsmith-plugins/redshiftPlugin/src/main/java/com/external/utils/RedshiftDatasourceUtils.java
  • app/server/appsmith-plugins/snowflakePlugin/src/main/java/com/external/plugins/SnowflakePlugin.java
  • app/server/appsmith-server/src/main/java/com/appsmith/server/configurations/connectionpool/ConnectionPoolConfigCEImpl.java

Three follow-up fixes from CI and review:

1. Unit-test compile errors: adding the ConnectionPoolConfig constructor
   parameter to OraclePluginExecutor and RedshiftPluginExecutor broke
   five test files that were instantiating them with no-arg. Updated
   those tests (4 Oracle + 1 Redshift with 4 spy calls) to pass a local
   MockConnectionPoolConfig matching the existing Postgres/MSSQL pattern.

2. Updated the new getSocketTimeoutSeconds() method on every existing
   MockConnectionPoolConfig in test code (Postgres, MSSQL, 3x MySQL) so
   the interface change does not break compilation.

3. CodeRabbit feedback applied:
   - Oracle JDBC thin driver property is oracle.jdbc.ReadTimeout
     (milliseconds), not oracle.net.READ_TIMEOUT. The latter is a
     TNS/listener setting and would have been silently ignored.
   - Clamp non-positive socket-timeout overrides to the default so a
     stray 0 or negative value cannot re-introduce the infinite-read
     wedge this config exists to prevent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

This PR has not seen activitiy for a while. It will be closed in 7 days unless further activity is detected.

@github-actions github-actions Bot added the Stale label Jun 6, 2026
@wyattwalter

Copy link
Copy Markdown
Contributor Author

/build-deploy-preview skip-tests=true

@github-actions

Copy link
Copy Markdown

Deploying Your Preview: https://github.com/appsmithorg/appsmith/actions/runs/27284912122.
Workflow: On demand build Docker image and deploy preview.
skip-tests: true.
env: ``.
PR: 41851.
recreate: .
base-image-tag: .

@github-actions

Copy link
Copy Markdown

Deploy-Preview-URL: https://ce-41851.dp.appsmith.com

@github-actions github-actions Bot removed the Stale label Jun 10, 2026
@github-actions

Copy link
Copy Markdown

This PR has not seen activitiy for a while. It will be closed in 7 days unless further activity is detected.

@github-actions github-actions Bot added the Stale label Jun 18, 2026
@github-actions

Copy link
Copy Markdown

This PR has been closed because of inactivity.

@github-actions github-actions Bot closed this Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant