Skip to content

fix: implement connection tracking in metrics#4672

Open
mkleczek wants to merge 1 commit intoPostgREST:mainfrom
mkleczek:refactor/connection-tracking
Open

fix: implement connection tracking in metrics#4672
mkleczek wants to merge 1 commit intoPostgREST:mainfrom
mkleczek:refactor/connection-tracking

Conversation

@mkleczek
Copy link
Copy Markdown
Collaborator

@mkleczek mkleczek commented Feb 25, 2026

DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Right now metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events.

This is problematic because Hasql pool reports various connection events in various states that make it impossible to predict the state change from the received event. The connection state machine is not simple and to precisely report the number of connections in various states, it is necessary to track their lifecycles.

Fixes #4622

@steve-chavez
Copy link
Copy Markdown
Member

Fixes #4622

Fix should have an entry on the CHANGELOG

Comment thread src/PostgREST/Metrics.hs Outdated
Comment thread src/PostgREST/Metrics.hs
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 5 times, most recently from 0f475e3 to 592e5a3 Compare March 10, 2026 14:53
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 2 times, most recently from 1f3b4bc to bce1f6e Compare March 13, 2026 06:49
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 8 times, most recently from 7383b59 to 9b3c004 Compare April 2, 2026 14:45
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch from 9b3c004 to 6af7b30 Compare April 8, 2026 12:33
-- (there should be none but we need to verify that)
threadDelay $ 1 * sec

it "Should track in use connections" $ do
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Does this reproduce the problem on #4622? If not, could we add a new repro test or is that too complicated?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Does this reproduce the problem on #4622? If not, could we add a new repro test or is that too complicated?

As far as I understand it might be possible to reproduce once we have toxiproxy in place. The issue happens when there connectivity issues.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For that perhaps you could restart #4674? Should be easier I think now that we don't have slocat.

I also wonder if there's a simpler/focused way by just killing the connection like mentioned on #1766 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a case like this, I wouldn't block on adding an exactly-reproducing test of the original issue: We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection. I find it's unlikely that someone accidentally introduces a regression here.

That being said, I think we absolutely need some commentary in Metrics.hs about why we are doing the connection tracking and which problems this prevents. With such commentary, someone looking at this later will know what to look for when rewriting the whole thing again.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For that perhaps you could restart #4674? Should be easier I think now that we don't have slocat.

See #4833 #4836

Copy link
Copy Markdown
Collaborator Author

@mkleczek mkleczek Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, I think we absolutely need some commentary in Metrics.hs

Done (see inline comment in Metrics.observationMetrics). I am not 100% sure if the comment is in the right place so happy to move it if you think it is necessary.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a case like this, I wouldn't block on adding an exactly-reproducing test of the original issue: We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection.

Yeah, I was thinking we should take this chance to add the required infra to test otherwise we'll end up with a "test debt" (like on #1766).

But I guess we still have #4670 as another chance to test an unreliable network.

So I won't block here if you approve here @wolfgangwalther

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection. I find it's unlikely that someone accidentally introduces a regression here.

In light of #4672 (comment), I think there is a chance that we'll rewrite this whole thing again when we have hasql-pool vendored. Thus, I agree with:

Yeah, I was thinking we should take this chance to add the required infra to test otherwise we'll end up with a "test debt" (like on #1766).

@steve-chavez
Copy link
Copy Markdown
Member

@mkleczek Since this is a fix, let's change the commit, PR title and add a fix entry in the CHANGELOG.

@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 5 times, most recently from 27d9a1d to e438eed Compare April 14, 2026 07:02
@steve-chavez
Copy link
Copy Markdown
Member

@mkleczek Could you please address the feedback here? I've repeated the same twice now 😩

@mkleczek
Copy link
Copy Markdown
Collaborator Author

mkleczek commented Apr 14, 2026

@mkleczek Could you please address the feedback here? I've repeated the same twice now 😩

Done

@mkleczek mkleczek force-pushed the refactor/connection-tracking branch from e438eed to 41b439e Compare April 14, 2026 19:47
@mkleczek mkleczek requested a review from steve-chavez April 14, 2026 19:48
@mkleczek mkleczek changed the title refactor: implement connection tracking in metrics fix: implement connection tracking in metrics Apr 14, 2026
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 2 times, most recently from 3d14884 to decc1a0 Compare April 15, 2026 08:02
Comment thread CHANGELOG.md Outdated
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 2 times, most recently from d541ff0 to c351c30 Compare April 20, 2026 08:56
Comment thread CHANGELOG.md Outdated
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 2 times, most recently from c3aa110 to 43e0d89 Compare April 21, 2026 04:54
Comment thread postgrest.cabal
-- (there should be none but we need to verify that)
threadDelay $ 1 * sec

it "Should track in use connections" $ do
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a case like this, I wouldn't block on adding an exactly-reproducing test of the original issue: We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection. I find it's unlikely that someone accidentally introduces a regression here.

That being said, I think we absolutely need some commentary in Metrics.hs about why we are doing the connection tracking and which problems this prevents. With such commentary, someone looking at this later will know what to look for when rewriting the whole thing again.

@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 2 times, most recently from b0079b3 to b66bc15 Compare April 21, 2026 17:28
Comment thread src/PostgREST/Metrics.hs
Comment on lines +70 to +78
-- Handle pool observations with connection tracking
-- this is necessary because it is not possible
-- to accurately maintain open/in use conneciton counts
-- statelessly based only on pool observation events.
-- The reason is that hasql-pool emits TerminatedConnectionStatus
-- both for connections successfully established and failed when connecting.
-- When receiving TerminatedConnectionStatus we have to find out
-- if we can decrement established connection count. To do that we have to track
-- established connections.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment - this helps a lot.

Q: Would this reasoning be different once we vendor hasql, including hasql-pool, and could change the respective events? Would this make the fix here simpler, because we could keep using the same events?

Copy link
Copy Markdown
Collaborator Author

@mkleczek mkleczek Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment - this helps a lot.

Q: Would this reasoning be different once we vendor hasql, including hasql-pool, and could change the respective events? Would this make the fix here simpler, because we could keep using the same events?

Yes.
Quite frankly - hasql-pool should expose the metrics itself because it already tracks this information anyway. Publishing events is the wrong abstraction here.

EDIT: I mean hasql-pool should expose functions providing the actual numbers. Exposing them as Prometheus metrics should be done similar to how ConnTrack is exposed in this PR.

@mkleczek mkleczek force-pushed the refactor/connection-tracking branch 8 times, most recently from 0563abd to 423afeb Compare April 28, 2026 12:26
Right now metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events. This is problematic because Hasql pool reports various connection events in multiple phases. The connection state machine is not simple and to precisely report the number of connections in various states, it is necessary to track their lifecycles.

This change adds a ConnTrack data structure and logic to track database connections lifecycles. At the moment it supports "connected" and "inUse" connection counts precisely. The "pgrst_db_pool_available" metric is implemented on top of ConnTrack instead of a simple Gauge.
@mkleczek mkleczek force-pushed the refactor/connection-tracking branch from 423afeb to c18acfa Compare April 29, 2026 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in pool_available metric causes negative values during network instability

3 participants