fix: implement connection tracking in metrics#4672
fix: implement connection tracking in metrics#4672mkleczek wants to merge 1 commit intoPostgREST:mainfrom
Conversation
Fix should have an entry on the CHANGELOG |
0f475e3 to
592e5a3
Compare
1f3b4bc to
bce1f6e
Compare
7383b59 to
9b3c004
Compare
9b3c004 to
6af7b30
Compare
| -- (there should be none but we need to verify that) | ||
| threadDelay $ 1 * sec | ||
|
|
||
| it "Should track in use connections" $ do |
There was a problem hiding this comment.
Q: Does this reproduce the problem on #4622? If not, could we add a new repro test or is that too complicated?
There was a problem hiding this comment.
Q: Does this reproduce the problem on #4622? If not, could we add a new repro test or is that too complicated?
As far as I understand it might be possible to reproduce once we have toxiproxy in place. The issue happens when there connectivity issues.
There was a problem hiding this comment.
For that perhaps you could restart #4674? Should be easier I think now that we don't have slocat.
I also wonder if there's a simpler/focused way by just killing the connection like mentioned on #1766 (comment)
There was a problem hiding this comment.
For a case like this, I wouldn't block on adding an exactly-reproducing test of the original issue: We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection. I find it's unlikely that someone accidentally introduces a regression here.
That being said, I think we absolutely need some commentary in Metrics.hs about why we are doing the connection tracking and which problems this prevents. With such commentary, someone looking at this later will know what to look for when rewriting the whole thing again.
There was a problem hiding this comment.
That being said, I think we absolutely need some commentary in
Metrics.hs
Done (see inline comment in Metrics.observationMetrics). I am not 100% sure if the comment is in the right place so happy to move it if you think it is necessary.
There was a problem hiding this comment.
For a case like this, I wouldn't block on adding an exactly-reproducing test of the original issue: We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection.
Yeah, I was thinking we should take this chance to add the required infra to test otherwise we'll end up with a "test debt" (like on #1766).
But I guess we still have #4670 as another chance to test an unreliable network.
So I won't block here if you approve here @wolfgangwalther
There was a problem hiding this comment.
We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection. I find it's unlikely that someone accidentally introduces a regression here.
In light of #4672 (comment), I think there is a chance that we'll rewrite this whole thing again when we have hasql-pool vendored. Thus, I agree with:
Yeah, I was thinking we should take this chance to add the required infra to test otherwise we'll end up with a "test debt" (like on #1766).
|
@mkleczek Since this is a |
27d9a1d to
e438eed
Compare
e438eed to
41b439e
Compare
3d14884 to
decc1a0
Compare
d541ff0 to
c351c30
Compare
c3aa110 to
43e0d89
Compare
| -- (there should be none but we need to verify that) | ||
| threadDelay $ 1 * sec | ||
|
|
||
| it "Should track in use connections" $ do |
There was a problem hiding this comment.
For a case like this, I wouldn't block on adding an exactly-reproducing test of the original issue: We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection. I find it's unlikely that someone accidentally introduces a regression here.
That being said, I think we absolutely need some commentary in Metrics.hs about why we are doing the connection tracking and which problems this prevents. With such commentary, someone looking at this later will know what to look for when rewriting the whole thing again.
b0079b3 to
b66bc15
Compare
| -- Handle pool observations with connection tracking | ||
| -- this is necessary because it is not possible | ||
| -- to accurately maintain open/in use conneciton counts | ||
| -- statelessly based only on pool observation events. | ||
| -- The reason is that hasql-pool emits TerminatedConnectionStatus | ||
| -- both for connections successfully established and failed when connecting. | ||
| -- When receiving TerminatedConnectionStatus we have to find out | ||
| -- if we can decrement established connection count. To do that we have to track | ||
| -- established connections. |
There was a problem hiding this comment.
Thanks for the comment - this helps a lot.
Q: Would this reasoning be different once we vendor hasql, including hasql-pool, and could change the respective events? Would this make the fix here simpler, because we could keep using the same events?
There was a problem hiding this comment.
Thanks for the comment - this helps a lot.
Q: Would this reasoning be different once we vendor hasql, including hasql-pool, and could change the respective events? Would this make the fix here simpler, because we could keep using the same events?
Yes.
Quite frankly - hasql-pool should expose the metrics itself because it already tracks this information anyway. Publishing events is the wrong abstraction here.
EDIT: I mean hasql-pool should expose functions providing the actual numbers. Exposing them as Prometheus metrics should be done similar to how ConnTrack is exposed in this PR.
0563abd to
423afeb
Compare
Right now metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events. This is problematic because Hasql pool reports various connection events in multiple phases. The connection state machine is not simple and to precisely report the number of connections in various states, it is necessary to track their lifecycles. This change adds a ConnTrack data structure and logic to track database connections lifecycles. At the moment it supports "connected" and "inUse" connection counts precisely. The "pgrst_db_pool_available" metric is implemented on top of ConnTrack instead of a simple Gauge.
423afeb to
c18acfa
Compare
DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.Right now metrics observation handler does not track database connections but updates a single Gauge based on HasqlPoolObs events.
This is problematic because Hasql pool reports various connection events in various states that make it impossible to predict the state change from the received event. The connection state machine is not simple and to precisely report the number of connections in various states, it is necessary to track their lifecycles.
Fixes #4622