Skip to content

feat: produce OpenTelemetry traces with hs-opentelemetry#3140

Open
develop7 wants to merge 1 commit into
PostgREST:mainfrom
develop7:feat_opentelemetry-traces
Open

feat: produce OpenTelemetry traces with hs-opentelemetry#3140
develop7 wants to merge 1 commit into
PostgREST:mainfrom
develop7:feat_opentelemetry-traces

Conversation

@develop7
Copy link
Copy Markdown
Collaborator

@develop7 develop7 commented Jan 4, 2024

This PR introduces producing OpenTelemetry traces containing, among others, metrics same as in ServerTiming header from before.

TODO:

Running:

I sort of gave up deploying and configuring all the moving bits locally, so you'd need to create the honeycomb.io account for this one (or ask me for the invite). After that, it's quite straightforward:

  1. Build PostgREST executable with stack build, and get its path with stack exec -- which postgrest
  2. Get a PostgreSQL server running (e.g. run nix-shell, then postgrest-with-postgresql-15 --fixture ./test/load/fixture.sql -- cat). Note the server URL, you'll need it when running PostgREST server
  3. get a JWT token with default secret by running postgrest-jwt --exp 36000 postgrest_test_anonymous
  4. Run PostgREST server with
    OTEL_EXPORTER_OTLP_ENDPOINT='https://api.honeycomb.io/' \
    OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=<honeycomb_api_key>"  \
    OTEL_SERVICE_NAME='PostgREST' OTEL_LOG_LEVEL='debug' OTEL_TRACES_SAMPLER='always_on' \
    PGRST_DB_URI='<postgresql_server_url>'   \
    postgrest-run 
  5. request some data using the JWT token from above and check the honeycomb dashboard for the traces:

image

Tests

hspec tests are also instrumented, for those to produce traces you need to set OTEL_* vars only:

OTEL_EXPORTER_OTLP_ENDPOINT='https://api.honeycomb.io/' \
OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=<honeycomb_api_key>"  \
OTEL_SERVICE_NAME='PostgREST' OTEL_LOG_LEVEL='debug' OTEL_TRACES_SAMPLER='always_on' \
postgrest-test-spec

@steve-chavez
Copy link
Copy Markdown
Member

Awesome work! 🔥 🔥

I sort of gave up deploying and configuring all the moving bits locally, so you'd need to create the honeycomb.io account for this one

Found this Nix flake that contains an OTel GUI: https://flakestry.dev/flake/github/FriendsOfOpenTelemetry/opentelemetry-nix/1.0.1

I'll try to integrate that once the PR is ready for review.

@develop7 develop7 force-pushed the feat_opentelemetry-traces branch from 8c0e16a to 64a0ee9 Compare January 29, 2024 17:01
@develop7
Copy link
Copy Markdown
Collaborator Author

The recent problem I'm seemingly stuck with is hs-opentelemetry is using UnliftIO, which seems not quite composable with our (implicit, correct?) monad stack. So the deeper into the call stack the instrumented code is (the one I'm trying to wrap with inSpan), the more ridiculously complex it should be changed to be instrumented, i.e. https://github.com/PostgREST/postgrest/pull/3140/files#diff-5de3ff2b2d013b33dccece6ead9aeb61feffeb0fbd6e38779750511394cf9701R156-R157, up to the point I have no idea how to proceed further (e.g. wrapping App.handleRequests cases with their own spans, which is semantically correct)

There's a more straightforward MonadIO-involving opentelemetry library, with less activity and quite different approach to the telemetry data export (GHC eventlog → file/pipe by the GHC runtime). It looks less invasive approach, refactoring-wise, but requires more hoops to jump to actually deliver traces to Honeycomb/Lightstep/whatnot (pull eventlog → convert it to zipkin/jaeger/b3 → upload somewhere for analysis).

It also seems to boil down to the conceptual choice between online and offline traces' delivery-wise, or push and pull model.

@steve-chavez @wolfgangwalther @laurenceisla what do you think guys?

@steve-chavez
Copy link
Copy Markdown
Member

@develop7 Would vault help? It was introduced on #1988, I recall it helped with IORef handling.

It's still used on

jwtDurKey :: Vault.Key Double
jwtDurKey = unsafePerformIO Vault.newKey
{-# NOINLINE jwtDurKey #-}
getJwtDur :: Wai.Request -> Maybe Double
getJwtDur = Vault.lookup jwtDurKey . Wai.vault

I'm still not that familiar with OTel but the basic idea I had was to store these traces on AppState and export them async.

@develop7 develop7 force-pushed the feat_opentelemetry-traces branch from 6b891c2 to 586e7a1 Compare February 12, 2024 14:26
@steve-chavez
Copy link
Copy Markdown
Member

@develop7 Recently merged #3213, which logs schema cache stats to stderr. Perhaps that can be used for introductory OTel integration instead? It might be easier since the scache stats are already in IO space.

@develop7
Copy link
Copy Markdown
Collaborator Author

Would vault help?

hs-opentelemetry is using it already

basic idea I had was to store these traces on AppState and export them async

Not only that, you want traces in tests too, for one.

The good news is hs-opentelemetry-utils-exceptions seems to be just what we need, let me try it.

Perhaps that can be used for introductory OTel integration instead?

Good call @steve-chavez, thank you for the suggestion. Will try too.

@develop7
Copy link
Copy Markdown
Collaborator Author

image

it works!

@steve-chavez
Copy link
Copy Markdown
Member

steve-chavez commented Feb 21, 2024

Since now we have an observer function and Observation module

handleRequest :: AuthResult -> AppConfig -> AppState.AppState -> Bool -> Bool -> PgVersion -> ApiRequest -> SchemaCache ->
Maybe Double -> Maybe Double -> (Observation -> IO ()) -> Handler IO Wai.Response
handleRequest AuthResult{..} conf appState authenticated prepared pgVer apiReq@ApiRequest{..} sCache jwtTime parseTime observer =

data Observation
= AdminStartObs (Maybe Int)
| AppStartObs ByteString
| AppServerPortObs NS.PortNumber

Perhaps we can add some observations for the timings?

Also the Logger is now used like:

logObservation :: LoggerState -> Observation -> IO ()
logObservation loggerState obs = logWithZTime loggerState $ observationMessage obs

CmdRun -> App.run appState (Logger.logObservation loggerState))

For OTel, maybe the following would make sense:

otelState <- Otel.init

App.run appState (Logger.logObservation loggerState >> OTel.tracer otelState)) 

@develop7 develop7 force-pushed the feat_opentelemetry-traces branch from dc882f1 to 7794848 Compare February 23, 2024 15:44
@develop7
Copy link
Copy Markdown
Collaborator Author

Perhaps we can add some observations for the timings?

Agreed, server timings definitely belong there.

@develop7 develop7 force-pushed the feat_opentelemetry-traces branch from 7794848 to 398206b Compare February 23, 2024 16:04
@develop7 develop7 force-pushed the feat_opentelemetry-traces branch from 398206b to 4cd99c6 Compare March 7, 2024 14:58
@develop7 develop7 requested a review from steve-chavez March 11, 2024 15:37
@develop7 develop7 marked this pull request as ready for review March 11, 2024 15:38
@develop7
Copy link
Copy Markdown
Collaborator Author

Okay, the PR is in the cooking for long enough, let's pull the plug and start small. Let's have it reviewed while I'm fixing the remaining CI failures.

@develop7 develop7 force-pushed the feat_opentelemetry-traces branch from 4cd99c6 to 94d2b9b Compare March 11, 2024 15:49
Comment thread cabal.project Outdated
@wolfgangwalther
Copy link
Copy Markdown
Member

hs-opentelemetry is, according to the repo, in alpha state. According to the TODO list above, the issue tracker and the repo description, it does not support:

  • GHC 9.8.x
  • Windows
  • Metrics or Logging

I don't think we depend on this in the current state. And we should certainly not depend on an even-less-maintained fork of the same.

So to go forward here, there needs to be some effort put into the upstream package first, to make it usable for us.

@develop7 develop7 force-pushed the feat_opentelemetry-traces branch 2 times, most recently from 590d142 to e809a65 Compare March 12, 2024 16:31
@develop7 develop7 marked this pull request as draft March 29, 2024 16:43
@develop7
Copy link
Copy Markdown
Collaborator Author

A status update:

  • GHC 9.8: hs-opentelemetry-sdk doesn't build against 9.8 because of hs-opentelemetry-exporter-otlpproto-lens chain. Given the upstream of the latter being bit unresponsive for the suggestions to bump upper bounds, I've managed to make the latter build for 9.8 in develop7/proto-lens@985290f, but haven't figured out how to pick it up to the project since it depends on the google's protobuf compiler installed and the protobuf's source checked out. Another approach is to not use hs-o-sdk and hs-o-e-otlp altogether, which I probably should've tried way before.

@wolfgangwalther
Copy link
Copy Markdown
Member

  • GHC 9.8: hs-opentelemetry-sdk doesn't build against 9.8 because of hs-opentelemetry-exporter-otlpproto-lens chain. Given the upstream of the latter being bit unresponsive for the suggestions to bump upper bounds, I've managed to make the latter build for 9.8 in develop7/proto-lens@985290f,

Hm. I looked at your fork. It depends on support for GHC 9.8 in ghc-source-gen. This repo has a PR, which just was updated 3 days ago. I wouldn't call that "unresponsive", yet. Once ghc-source-gen is GHC 9.8 compatible, you could open a PR to update bounds in proto-lens itself. But since the last release for GHC 9.6 support was in December... I would not expect this to take too long to get responded to. It certainly doesn't look like it's unmaintained.

I guess for GHC 9.8 support it's just a matter of time.

What about the other issues mentioned above? Were you able to make progress on those?

@mkleczek
Copy link
Copy Markdown
Collaborator

mkleczek commented Apr 4, 2024

The recent problem I'm seemingly stuck with is hs-opentelemetry is using UnliftIO, which seems not quite composable with our (implicit, correct?) monad stack. So the deeper into the call stack the instrumented code is (the one I'm trying to wrap with inSpan), the more ridiculously complex it should be changed to be instrumented, i.e. https://github.com/PostgREST/postgrest/pull/3140/files#diff-5de3ff2b2d013b33dccece6ead9aeb61feffeb0fbd6e38779750511394cf9701R156-R157, up to the point I have no idea how to proceed further (e.g. wrapping App.handleRequests cases with their own spans, which is semantically correct)

There's a more straightforward MonadIO-involving opentelemetry library, with less activity and quite different approach to the telemetry data export (GHC eventlog → file/pipe by the GHC runtime). It looks less invasive approach, refactoring-wise, but requires more hoops to jump to actually deliver traces to Honeycomb/Lightstep/whatnot (pull eventlog → convert it to zipkin/jaeger/b3 → upload somewhere for analysis).

It also seems to boil down to the conceptual choice between online and offline traces' delivery-wise, or push and pull model.

@steve-chavez @wolfgangwalther @laurenceisla what do you think guys?

In my prototype I actually played with replacing HASQL Session with an https://github.com/haskell-effectful/effectful based monad to make it extensible:

https://github.com/mkleczek/hasql-api/blob/master/src/Hasql/Api/Eff/Session.hs#L37

Using it in PostgREST required some mixins usage in Cabal:

29b946e#diff-eb6a76805a0bd3204e7abf68dcceb024912d0200dee7e4e9b9bce3040153f1e1R140

Some work was required in PostgREST startup/configuration code to set-up appropriate effect handlers and middlewares but the changes were quite well isolated.

At the end of the day I think basing your monad stack on an effect library (effectful, cleff etc.) is the way forward as it makes the solution highly extensible and configurable.

@develop7 develop7 force-pushed the feat_opentelemetry-traces branch from e809a65 to 4697009 Compare October 23, 2024 17:01
@develop7 develop7 force-pushed the feat_opentelemetry-traces branch from 650d008 to ac33872 Compare October 31, 2024 13:40
@develop7
Copy link
Copy Markdown
Collaborator Author

develop7 commented Nov 4, 2024

Update: rebased the PR against latest master, updated hs-opentelemetry (with the Windows support merged!) & asked hs-opentelemetry maintainers to cut a new release in iand675/hs-opentelemetry#154 so we don't have to depend on forks again.

@steve-chavez
Copy link
Copy Markdown
Member

An idea to test this.

Could we have opentelemetry-collector (included in Nixpkgs) collect all the generated OTel traces and then use its fileexporter to match the expected outputs?

I'm imagining the test could be similar to how we capture our schema cache snapshots (ref).

@laurenceisla
Copy link
Copy Markdown
Member

Could we have opentelemetry-collector (included in Nixpkgs) collect all the generated OTel traces and then use its fileexporter to match the expected outputs?

@develop7 is this something that can be done here? The manual tests are working right now as mentioned in #3140 (comment), so maybe we could implement this test later on if it's not that feasible to do here? (cc. @steve-chavez)

Comment thread CHANGELOG.md Outdated
@develop7
Copy link
Copy Markdown
Collaborator Author

re: tests with actual otel collector — I've managed to prototype collector-including tests that compile, fail and and now I'm working on making them pass and be useful.

@develop7
Copy link
Copy Markdown
Collaborator Author

the test is in, chose to manage collector binary from Haskell for a change

Comment thread test/observability/fixtures/otel-traces.snapshot.json
Comment thread test/observability/Observation/OpenTelemetry.hs Outdated
Comment thread test/observability/ObsHelper.hs
@develop7
Copy link
Copy Markdown
Collaborator Author

develop7 commented Apr 6, 2026

Discovered a way to relay metrics to OpenTelemetry collector with the otelcol's own https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver, added a rough config draft to the OpenTelemetry page; should help work the lack of native metrics in hs-otel around for now.

@steve-chavez @wolfgangwalther I've realized opentelemetry.rst is too small for separate page, but for a section at observability.rst, it should be fine, WDYT?

@steve-chavez
Copy link
Copy Markdown
Member

Discovered a way to relay metrics to OpenTelemetry collector with the otelcol's own https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver, added a rough config draft to the OpenTelemetry page; should help work the lack of native metrics in hs-otel around for now.

Cool, that looks good.

@steve-chavez @wolfgangwalther I've realized opentelemetry.rst is too small for separate page, but for a section at observability.rst, it should be fine, WDYT?

Yes, it would fit better since observability is inside References too; integrations currently don't contain PostgREST features.

Copy link
Copy Markdown
Member

@wolfgangwalther wolfgangwalther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nix build seems to need some fixes for some of the opentelemetry dependencies. You can try to mark unbroken first, then see what the build error is etc.

I will have a look at the nixpkgs side of this as well.

Comment thread default.nix Outdated
Comment thread postgrest.cabal Outdated
Comment thread stack.yaml Outdated
@develop7
Copy link
Copy Markdown
Collaborator Author

You can try to mark unbroken first, then see what the build error is etc.

@wolfgangwalther it's tasty-quickcheck being upgraded in the snapshot to 0.11 and hs-otel-api wants < 0.11, see https://hydra.nixos.org/build/285112733/nixlog/1

Actually there's hs-opentelemetry-api v0.3.0.1 released sometime ago that would also let me drop my homemade workaround; I'll try with upgrading it first then.

UPD: all right, how do I unbreak a nix package for us?

@develop7
Copy link
Copy Markdown
Collaborator Author

@wolfgangwalther I've updated hs-otel deps to latest upstream versions, unbreaking the broken ones as a byproduct; upstream is a bit sloppy TBH — while they do update package sources and bump deps, but do not publish all the packages on hackage, ugh

@wolfgangwalther
Copy link
Copy Markdown
Member

while they do update package sources and bump deps, but do not publish all the packages on hackage, ugh

Yeah, I noticed that while working on the Nixpkgs side to unbreak these deps.

I've updated hs-otel deps to latest upstream versions, unbreaking the broken ones as a byproduct;

Yeah, that's a fine approach for now, we will want to switch back to the Nixpkgs-provided versions eventually, though.

Comment thread nix/tools/tests.nix Outdated
Comment thread nix/tools/tests.nix Outdated
@develop7
Copy link
Copy Markdown
Collaborator Author

all right, cabal builds are green as well now, had to add few VCS deps to the cabal.project

Copy link
Copy Markdown
Member

@wolfgangwalther wolfgangwalther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most things are smaller nits. I guess we need to get this over the line now.

Comment thread docs/references/configuration.rst Outdated
Comment thread docs/references/observability.rst Outdated
Comment thread docs/references/observability.rst Outdated
Comment thread docs/postgrest.dict Outdated
Comment thread src/PostgREST/AppState.hs Outdated
Comment thread src/PostgREST/AppState.hs
Comment thread src/PostgREST/CLI.hs Outdated
Comment on lines +35 to +37
let tracer' = if configServerOtelEnabled conf && tracerIsEnabled tracer
then Just tracer
else Nothing
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which scenario is config-server-otel-enabled = true and tracerIsEnabled = false? Aka can it happen that one tries to enable otel via the config option, but otel is effectively still disabled? How? Do / should we log this or throw an error or so?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which scenario is config-server-otel-enabled = true and tracerIsEnabled = false? Aka can it happen that one tries to enable otel via the config option, but otel is effectively still disabled? How?

tracerIsEnabled == true when otel is enabled (it is by default) but no trace collectors to post them to have been configured, per the iand675/hs-opentelemetry#201. This results in resources being consumed for tracing but no traces actually make their way to a consumer. That's why I'm taking this into account and plugging up the case with an explicit noop.

Do / should we log this or throw an error or so?

We do not, but this is a great idea, and yes, we should, now that we have our own otel-related config parameter. Thank you for pointing that out. IMO this warrants for a warning rather than an hard error (so the user wouldn't have to configure the whole otel stack in order to continue playing around with postgrest). Let me add the thing in a jiff.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there seems no way to log a warning and continue, there's either "here's config value I've parsed" or fail "something bad happened". Failure it is then. UPD: OR I could move the tracerIsEnabled value to AppState, still emit a warning, just later on during app initialization and let us control otel-enabled in runtime which might come in handy for end user given how resource-consuming OTel currently is. @wolfgangwalther @steve-chavez WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let us control otel-enabled in runtime which might come in handy for end user given how resource-consuming OTel currently is

You mean, I could start with otel-enabled = false, but provide all the setup for otel environment variables themselves (so tracerIsEnabled = true). Then, when I have specific use-case I want to look at, I enable otel-enabled = true via reloading and the data starts flowing?

Sounds useful to me.

Comment thread src/PostgREST/OpenTelemetry.hs Outdated
Comment thread test/io/test_io.py Outdated
Comment thread cabal.project Outdated
@wolfgangwalther
Copy link
Copy Markdown
Member

wolfgangwalther commented Apr 22, 2026

The nix build seems to need some fixes for some of the opentelemetry dependencies. You can try to mark unbroken first, then see what the build error is etc.

I will have a look at the nixpkgs side of this as well.

PR here: NixOS/nixpkgs#512285

@wolfgangwalther
Copy link
Copy Markdown
Member

I rebased, resolved conflicts and addressed most of my own feedback from above. There's still some comments to look at - and some CI results, not sure whether I was able to test everything locally.

@wolfgangwalther
Copy link
Copy Markdown
Member

wolfgangwalther commented May 1, 2026

The following comments are still open:

What's the plan for getting these resolved?

@develop7
Copy link
Copy Markdown
Collaborator Author

develop7 commented May 2, 2026

@wolfgangwalther they're being addressed, will push updates on Monday

@develop7
Copy link
Copy Markdown
Collaborator Author

develop7 commented May 5, 2026 via email

Comment thread src/PostgREST/App.hs Outdated
postgrest :: LogLevel -> AppState.AppState -> IO () -> Wai.Application
postgrest :: HasCallStack => LogLevel -> AppState.AppState -> IO () -> Wai.Application
postgrest logLevel appState connWorker =
OTel.middleware appState .
Copy link
Copy Markdown
Member

@steve-chavez steve-chavez May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we merged #4884, I think this middleware shouldn't be needed anymore?

Maybe we can root cause the perf issues once we integrate with the new ResponseObs too.

Copy link
Copy Markdown
Collaborator Author

@develop7 develop7 May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This middleware's purpose is to pick up Trace ID from the request headers and initialize our tracer's parent_trace_id with it, so we'll produce traces that are part of traces' chain, thus ensuring their continuity so the trace viewer will be able to know they are the part of a certain transaction.

That could improve the performance though, now that hs-otel is the only Vault client

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So if we could get rid of the traceHeaderMiddleware then the OTel middleware wouldn't be necessary right?

Comment thread src/PostgREST/AppState.hs Outdated
Comment thread postgrest.cabal Outdated
Comment thread test/spec/Main.hs
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all those changes to test/spec left-overs from before rebasing on top of the move to test/observability? It doesn't look like we need the changes to that extend - just passing Nothing in as the tracer should be enough here?

Comment thread cabal.project Outdated
Comment thread stack.yaml Outdated
Comment thread docs/references/configuration.rst Outdated
=============== =================================
**Type** Boolean
**Default** False
**Reloadable** N
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here we say this config option is not reloadable, yet in #3140 (comment) we discussed making it so. Which of these is true?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, my bad

Comment thread src/PostgREST/Logger.hs
Comment on lines +238 to +239
OTelNoConfigSupplied ->
pure $ "No OpenTelemetry configuration via environment variables is supplied. " <> "Please refer https://docs.postgrest.org/en/stable/references/observability.html#opentelemetry for details."
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can I trigger this?

I tried PGRST_OTEL_ENABLED=true postgrest-with-pg-18 -t test/spec/fixtures/load.sql postgrest-run, but that doesn't log this. I also tried with log-level debug, nope.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw OTEL_SDK_DISABLED=true in, that's when tracerIsEnabled tracer == false

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is, though, it prevents propagating TraceID downstream. Luckily, we don't have to, yet.

More importantly, a default OTel configuration does collect all the traces (because OTEL_TRACES_SAMPLER's default parentbased_always_on means "same as for parent span, always_on if there's none), sends them to default OTEL_EXPORTER_OTLP_ENDPOINTofhttp://localhost:4317`and does propagate (OTEL_PROPAGATORS=…) trace ID, parent span ID and sampling type (tracecontext), along with custom metadata (baggage) downstream.

* Introduces producing OpenTelemetry traces with hs-opentelemetry.
* Adds OTel spans over the whole application
  loop and over each request processing phase
* Preliminary OTel tracing support in spec tests
* Disables tracing in load and memory tests
fix deps
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

7 participants