1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -46,6 +46,7 @@ All notable changes to this project will be documented in this file. From versio
### Fixed

- Fix unnecessary connection pool flushes during schema cache reloading by @mkleczek in #4645
- Fix race condition in pool_available metric causing negative values during network instability by @mkleczek in #4622

## [14.9] - 2026-04-10

1 change: 1 addition & 0 deletions postgrest.cabal
@@ -162,6 +162,7 @@ library
, stm-hamt >= 1.2 && < 2
, focus >= 1.0 && < 2
, some >= 1.0.4.1 && < 2
, uuid >= 1.3 && < 2
Comment thread: wolfgangwalther marked this conversation as resolved.
-- -fno-spec-constr may help keep compile time memory use in check,
-- see https://gitlab.haskell.org/ghc/ghc/issues/16017#note_219304
-- -optP-Wno-nonportable-include-path
65 changes: 53 additions & 12 deletions src/PostgREST/Metrics.hs
@@ -5,7 +5,10 @@
-}
module PostgREST.Metrics
( init
, ConnTrack
, ConnStats (..)
, MetricsState (..)
, connectionCounts
, observationMetrics
, metricsToText
) where
@@ -19,13 +22,18 @@

import PostgREST.Observation


import Protolude
import Control.Arrow ((&&&))
import Data.Bitraversable (bisequenceA)
import Data.Tuple.Extra (both)
import Data.UUID (UUID)
import qualified Focus
import Protolude
import qualified StmHamt.SizedHamt as SH

data MetricsState =
MetricsState {
poolTimeouts :: Counter,
poolAvailable :: Gauge,
connTrack :: ConnTrack,
poolWaiting :: Gauge,
poolMaxSize :: Gauge,
schemaCacheLoads :: Vector Label1 Counter,
@@ -40,7 +48,7 @@
whenM getRTSStatsEnabled $ void $ register PMG.ghcMetrics
metricState <- MetricsState <$>
register (counter (Info "pgrst_db_pool_timeouts_total" "The total number of pool connection timeouts")) <*>
register (gauge (Info "pgrst_db_pool_available" "Available connections in the pool")) <*>
register (Metric ((identity &&& dbPoolAvailable) <$> connectionTracker)) <*>
register (gauge (Info "pgrst_db_pool_waiting" "Requests waiting to acquire a pool connection")) <*>
register (gauge (Info "pgrst_db_pool_max" "Max pool connections")) <*>
register (vector "status" $ counter (Info "pgrst_schema_cache_loads_total" "The total number of times the schema cache was loaded")) <*>
@@ -50,20 +58,28 @@
register (counter (Info "pgrst_jwt_cache_evictions_total" "The total number of JWT cache evictions"))
setGauge (poolMaxSize metricState) (fromIntegral configDbPoolSize)
pure metricState
where
dbPoolAvailable = (pure . noLabelsGroup (Info "pgrst_db_pool_available" "Available connections in the pool") GaugeType . calcAvailable <$>) . connectionCounts
where
calcAvailable = liftA2 (-) connected inUse
toSample name labels = Sample name labels . encodeUtf8 . show
noLabelsGroup info sampleType = SampleGroup info sampleType . pure . toSample (metricName info) mempty
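
A note on the pgrst_db_pool_available registration above: instead of a stateful Gauge updated on every event, the PR hands prometheus-client a sampling action, so the available count (connected - inUse, per calcAvailable) is computed from the tracker at scrape time. Below is a minimal sketch of this pull-style pattern; it assumes the Metric, SampleGroup and Sample constructors exported by prometheus-client as they are used in this diff, and pullGauge/readValue are hypothetical names, not part of the PR:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Prometheus

-- Hypothetical helper: a gauge whose value is read only when scraped.
pullGauge :: T.Text -> T.Text -> IO Double -> Metric ()
pullGauge name help readValue = Metric $ pure ((), sample)
  where
    info = Info name help
    sample = do
      v <- readValue  -- computed at scrape time, never stale
      pure [SampleGroup info GaugeType
             [Sample name [] (encodeUtf8 (T.pack (show v)))]]

Registering it with register (pullGauge "pgrst_db_pool_available" "Available connections in the pool" readAvail) would sample readAvail on every scrape, which mirrors how the ConnTrack-backed metric above stays in sync with reality instead of drifting.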

-- Only some observations are used as metrics
observationMetrics :: MetricsState -> ObservationHandler
observationMetrics MetricsState{..} obs = case obs of
PoolAcqTimeoutObs -> do
incCounter poolTimeouts
(HasqlPoolObs (SQL.ConnectionObservation _ status)) -> case status of
SQL.ReadyForUseConnectionStatus _ -> do
incGauge poolAvailable
SQL.InUseConnectionStatus -> do
decGauge poolAvailable
SQL.TerminatedConnectionStatus _ -> do
decGauge poolAvailable
SQL.ConnectingConnectionStatus -> pure ()
-- Handle pool observations with connection tracking.
-- This is necessary because it is not possible to accurately
-- maintain open/in-use connection counts statelessly, based only
-- on pool observation events: hasql-pool emits
-- TerminatedConnectionStatus both for connections that were
-- successfully established and for those that failed while
-- connecting. So when receiving TerminatedConnectionStatus we have
-- to find out whether the established-connection count may be
-- decremented, and for that we have to track established
-- connections.
Comment on lines +73 to +81

Member:

Thanks for the comment - this helps a lot.

Q: Would this reasoning be different once we vendor hasql, including hasql-pool, and could change the respective events? Would this make the fix here simpler, because we could keep using the same events?

mkleczek (Collaborator, Author) commented on Apr 21, 2026:

> Q: Would this reasoning be different once we vendor hasql, including hasql-pool, and could change the respective events? Would this make the fix here simpler, because we could keep using the same events?

Yes. Quite frankly - hasql-pool should expose the metrics itself because it already tracks this information anyway. Publishing events is the wrong abstraction here.

EDIT: I mean hasql-pool should expose functions providing the actual numbers. Exposing them as Prometheus metrics should be done similarly to how ConnTrack is exposed in this PR.
(HasqlPoolObs sqlObs) -> trackConnections connTrack sqlObs
PoolRequest ->
incGauge poolWaiting
PoolRequestFullfilled ->
@@ -81,3 +97,28 @@

metricsToText :: IO LBS.ByteString
metricsToText = exportMetricsAsText

data ConnStats = ConnStats {
connected :: Int,
inUse :: Int
} deriving (Eq, Show)

Check warning (Codecov / codecov/patch): added line src/PostgREST/Metrics.hs#L104 was not covered by tests.

data ConnTrack = ConnTrack { connTrackConnected :: SH.SizedHamt UUID, connTrackInUse :: SH.SizedHamt UUID }

connectionTracker :: IO ConnTrack
connectionTracker = ConnTrack <$> SH.newIO <*> SH.newIO

trackConnections :: ConnTrack -> SQL.Observation -> IO ()
trackConnections ConnTrack{..} (SQL.ConnectionObservation uuid status) = case status of
SQL.ReadyForUseConnectionStatus _ -> atomically $
SH.insert identity uuid connTrackConnected *>
SH.focus Focus.delete identity uuid connTrackInUse
SQL.TerminatedConnectionStatus _ -> atomically $
SH.focus Focus.delete identity uuid connTrackConnected *>
SH.focus Focus.delete identity uuid connTrackInUse
SQL.InUseConnectionStatus -> atomically $
SH.insert identity uuid connTrackInUse
_ -> mempty

connectionCounts :: ConnTrack -> IO ConnStats
connectionCounts = atomically . fmap (uncurry ConnStats) . bisequenceA . both SH.size . (connTrackConnected &&& connTrackInUse)
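
To make the rationale comment in observationMetrics concrete, here is a standalone sketch of the failure mode it describes; main and failedConn are illustrative names, not part of the PR. A connection that fails while being established is still reported as terminated: the old code unconditionally decremented the gauge, whereas deleting a never-inserted UUID from the tracked set is a harmless no-op. Note also that connectionCounts reads both HAMTs in a single STM transaction, so the connected/inUse pair can never be observed torn.

import qualified Data.UUID as UUID
import qualified Focus
import Protolude
import qualified StmHamt.SizedHamt as SH

main :: IO ()
main = do
  connected <- SH.newIO
  let failedConn = UUID.fromWords 0 0 0 1 -- illustrative UUID
  -- hasql-pool reports the failed attempt as terminated even though
  -- the connection was never ready; the delete finds nothing
  atomically $ SH.focus Focus.delete identity failedConn connected
  n <- atomically $ SH.size connected
  print n -- prints 0; the old decGauge arithmetic would report -1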
Comment thread: steve-chavez marked this conversation as resolved.
46 changes: 37 additions & 9 deletions test/observability/Observation/MetricsSpec.hs
@@ -6,17 +6,21 @@

module Observation.MetricsSpec where

import Data.List (lookup)
import Network.Wai (Application)
import Data.List (lookup)
import qualified Hasql.Pool.Observation as SQL
import Network.Wai (Application)
import ObsHelper
import qualified PostgREST.AppState as AppState
import PostgREST.Config (AppConfig (configDbSchemas))
import qualified PostgREST.Metrics as Metrics
import qualified PostgREST.AppState as AppState
import PostgREST.Config (AppConfig (configDbSchemas))
import PostgREST.Metrics (ConnStats (..),
MetricsState (..),
connectionCounts)
import PostgREST.Observation
import Prometheus (getCounter, getVectorWith)
import Protolude
import Test.Hspec (SpecWith, describe, it)
import Test.Hspec.Wai (getState)
import Prometheus (getCounter, getVectorWith)
import Test.Hspec (SpecWith, describe, it)
import Test.Hspec.Wai (getState)

import Protolude

spec :: SpecWith (SpecState, Application)
spec = describe "Server started with metrics enabled" $ do
@@ -71,9 +75,33 @@ spec = describe "Server started with metrics enabled" $ do
-- (there should be none but we need to verify that)
threadDelay $ 1 * sec

it "Should track in use connections" $ do
Member:

Q: Does this reproduce the problem on #4622? If not, could we add a new repro test or is that too complicated?

Collaborator Author:

> Q: Does this reproduce the problem on #4622? If not, could we add a new repro test or is that too complicated?

As far as I understand it might be possible to reproduce once we have toxiproxy in place. The issue happens when there are connectivity issues.

Member:

For that perhaps you could restart #4674? Should be easier I think now that we don't have slocat.

I also wonder if there's a simpler/focused way by just killing the connection like mentioned on #1766 (comment)

Member:

For a case like this, I wouldn't block on adding an exactly-reproducing test of the original issue: We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection. I find it unlikely that someone accidentally introduces a regression here.

That being said, I think we absolutely need some commentary in Metrics.hs about why we are doing the connection tracking and which problems this prevents. With such commentary, someone looking at this later will know what to look for when rewriting the whole thing again.

Collaborator Author:

> For that perhaps you could restart #4674? Should be easier I think now that we don't have slocat.

See #4833 #4836

mkleczek (Collaborator, Author) commented on Apr 21, 2026:

> That being said, I think we absolutely need some commentary in Metrics.hs

Done (see inline comment in Metrics.observationMetrics). I am not 100% sure the comment is in the right place, so happy to move it if you think it is necessary.

Member:

> For a case like this, I wouldn't block on adding an exactly-reproducing test of the original issue: We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection.

Yeah, I was thinking we should take this chance to add the required infra to test, otherwise we'll end up with a "test debt" (like on #1766).

But I guess we still have #4670 as another chance to test an unreliable network.

So I won't block here if you approve here @wolfgangwalther

Member:

> We really changed the entire approach to calculating this metric, from simply reacting to certain events to tracking each connection. I find it unlikely that someone accidentally introduces a regression here.

In light of #4672 (comment), I think there is a chance that we'll rewrite this whole thing again when we have hasql-pool vendored. Thus, I agree with:

> Yeah, I was thinking we should take this chance to add the required infra to test, otherwise we'll end up with a "test debt" (like on #1766).
SpecState{specAppState = appState, specMetrics = metrics, specObsChan} <- getState
let waitFor = waitForObs specObsChan

liftIO $ checkState' metrics [
-- we expect in use connections to be the same once finished
inUseConnections (+ 0)
] $ do
signal <- newEmptyMVar
-- make sure waiting thread is signaled
bracket_ (pure ()) (putMVar signal ()) $
-- expecting one more connection in use
checkState' metrics [
inUseConnections (+ 1)
] $ do
-- start a thread hanging on a single connection until signaled
void $ forkIO $ void $ AppState.usePool appState $ liftIO (readMVar signal)
-- main thread waits for ConnectionObservation with InUseConnectionStatus
-- after which used connections count should be incremented
waitFor (1 * sec) "InUseConnectionStatus" $ \x -> [ o | o@(HasqlPoolObs (SQL.ConnectionObservation _ SQL.InUseConnectionStatus)) <- pure x]

-- hanging thread was signaled and should return the connection
waitFor (1 * sec) "ReadyForUseConnectionStatus" $ \x -> [ o | o@(HasqlPoolObs (SQL.ConnectionObservation _ (SQL.ReadyForUseConnectionStatus _))) <- pure x]
Comment on lines +79 to +100

Member:

I'm looking at this test case... and it just feels like an imperative language like Python is much better to represent such a step-by-step test.

I can't even make full sense of the test, but I believe pytest + toxiproxy + metrics endpoint should allow the same test, right?


where
-- the prometheus-client API for handling vectors is convoluted
schemaCacheLoads label = expectField @"schemaCacheLoads" $
fmap (maybe (0::Int) round . lookup label) . (`getVectorWith` getCounter)
inUseConnections = expectField @"connTrack" ((inUse <$>) . connectionCounts)
sec = 1000000