feat: produce OpenTelemetry traces with hs-opentelemetry#3140
Conversation
|
Awesome work! 🔥 🔥
Found this Nix flake that contains an OTel GUI: https://flakestry.dev/flake/github/FriendsOfOpenTelemetry/opentelemetry-nix/1.0.1 I'll try to integrate that once the PR is ready for review. |
8c0e16a to
64a0ee9
Compare
|
The recent problem I'm seemingly stuck with is There's a more straightforward It also seems to boil down to the conceptual choice between online and offline traces' delivery-wise, or push and pull model. @steve-chavez @wolfgangwalther @laurenceisla what do you think guys? |
|
@develop7 Would vault help? It was introduced on #1988, I recall it helped with IORef handling. It's still used on postgrest/src/PostgREST/Auth.hs Lines 160 to 165 in d2fb67f I'm still not that familiar with OTel but the basic idea I had was to store these traces on AppState and export them async. |
6b891c2 to
586e7a1
Compare
Not only that, you want traces in tests too, for one. The good news is
Good call @steve-chavez, thank you for the suggestion. Will try too. |
0830a45 to
dc882f1
Compare
|
Since now we have an postgrest/src/PostgREST/App.hs Lines 170 to 172 in 229bc77 postgrest/src/PostgREST/Observation.hs Lines 15 to 18 in 229bc77 Perhaps we can add some observations for the timings? Also the Logger is now used like: postgrest/src/PostgREST/Logger.hs Lines 53 to 54 in 7c6c056 postgrest/src/PostgREST/CLI.hs Line 50 in 7c6c056 For OTel, maybe the following would make sense: otelState <- Otel.init
App.run appState (Logger.logObservation loggerState >> OTel.tracer otelState)) |
dc882f1 to
7794848
Compare
Agreed, server timings definitely belong there. |
7794848 to
398206b
Compare
398206b to
4cd99c6
Compare
|
Okay, the PR is in the cooking for long enough, let's pull the plug and start small. Let's have it reviewed while I'm fixing the remaining CI failures. |
4cd99c6 to
94d2b9b
Compare
|
I don't think we depend on this in the current state. And we should certainly not depend on an even-less-maintained fork of the same. So to go forward here, there needs to be some effort put into the upstream package first, to make it usable for us. |
590d142 to
e809a65
Compare
|
A status update:
|
Hm. I looked at your fork. It depends on support for GHC 9.8 in I guess for GHC 9.8 support it's just a matter of time. What about the other issues mentioned above? Were you able to make progress on those? |
In my prototype I actually played with replacing HASQL Session with an https://github.com/haskell-effectful/effectful based monad to make it extensible: https://github.com/mkleczek/hasql-api/blob/master/src/Hasql/Api/Eff/Session.hs#L37 Using it in PostgREST required some mixins usage in Cabal: 29b946e#diff-eb6a76805a0bd3204e7abf68dcceb024912d0200dee7e4e9b9bce3040153f1e1R140 Some work was required in PostgREST startup/configuration code to set-up appropriate effect handlers and middlewares but the changes were quite well isolated. At the end of the day I think basing your monad stack on an effect library (effectful, cleff etc.) is the way forward as it makes the solution highly extensible and configurable. |
e809a65 to
4697009
Compare
650d008 to
ac33872
Compare
|
Update: rebased the PR against latest |
|
An idea to test this. Could we have I'm imagining the test could be similar to how we capture our schema cache snapshots (ref). |
@develop7 is this something that can be done here? The manual tests are working right now as mentioned in #3140 (comment), so maybe we could implement this test later on if it's not that feasible to do here? (cc. @steve-chavez) |
|
re: tests with actual otel collector — I've managed to prototype collector-including tests that compile, fail and and now I'm working on making them pass and be useful. |
|
the test is in, chose to manage collector binary from Haskell for a change |
|
Discovered a way to relay metrics to OpenTelemetry collector with the otelcol's own https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver, added a rough config draft to the OpenTelemetry page; should help work the lack of native metrics in hs-otel around for now. @steve-chavez @wolfgangwalther I've realized |
Cool, that looks good.
Yes, it would fit better since |
wolfgangwalther
left a comment
There was a problem hiding this comment.
The nix build seems to need some fixes for some of the opentelemetry dependencies. You can try to mark unbroken first, then see what the build error is etc.
I will have a look at the nixpkgs side of this as well.
@wolfgangwalther it's Actually there's hs-opentelemetry-api v0.3.0.1 released sometime ago that would also let me drop my homemade workaround; I'll try with upgrading it first then. UPD: all right, how do I unbreak a nix package for us? |
|
@wolfgangwalther I've updated |
Yeah, I noticed that while working on the Nixpkgs side to unbreak these deps.
Yeah, that's a fine approach for now, we will want to switch back to the Nixpkgs-provided versions eventually, though. |
|
all right, cabal builds are green as well now, had to add few VCS deps to the cabal.project |
wolfgangwalther
left a comment
There was a problem hiding this comment.
Most things are smaller nits. I guess we need to get this over the line now.
| let tracer' = if configServerOtelEnabled conf && tracerIsEnabled tracer | ||
| then Just tracer | ||
| else Nothing |
There was a problem hiding this comment.
Which scenario is config-server-otel-enabled = true and tracerIsEnabled = false? Aka can it happen that one tries to enable otel via the config option, but otel is effectively still disabled? How? Do / should we log this or throw an error or so?
There was a problem hiding this comment.
Which scenario is config-server-otel-enabled = true and tracerIsEnabled = false? Aka can it happen that one tries to enable otel via the config option, but otel is effectively still disabled? How?
tracerIsEnabled == true when otel is enabled (it is by default) but no trace collectors to post them to have been configured, per the iand675/hs-opentelemetry#201. This results in resources being consumed for tracing but no traces actually make their way to a consumer. That's why I'm taking this into account and plugging up the case with an explicit noop.
Do / should we log this or throw an error or so?
We do not, but this is a great idea, and yes, we should, now that we have our own otel-related config parameter. Thank you for pointing that out. IMO this warrants for a warning rather than an hard error (so the user wouldn't have to configure the whole otel stack in order to continue playing around with postgrest). Let me add the thing in a jiff.
There was a problem hiding this comment.
there seems no way to log a warning and continue, there's either "here's config value I've parsed" or fail "something bad happened". Failure it is then. UPD: OR I could move the tracerIsEnabled value to AppState, still emit a warning, just later on during app initialization and let us control otel-enabled in runtime which might come in handy for end user given how resource-consuming OTel currently is. @wolfgangwalther @steve-chavez WDYT?
There was a problem hiding this comment.
let us control
otel-enabledin runtime which might come in handy for end user given how resource-consuming OTel currently is
You mean, I could start with otel-enabled = false, but provide all the setup for otel environment variables themselves (so tracerIsEnabled = true). Then, when I have specific use-case I want to look at, I enable otel-enabled = true via reloading and the data starts flowing?
Sounds useful to me.
PR here: NixOS/nixpkgs#512285 |
|
I rebased, resolved conflicts and addressed most of my own feedback from above. There's still some comments to look at - and some CI results, not sure whether I was able to test everything locally. |
|
The following comments are still open:
What's the plan for getting these resolved? |
|
@wolfgangwalther they're being addressed, will push updates on Monday |
|
yes, precisely
|
| postgrest :: LogLevel -> AppState.AppState -> IO () -> Wai.Application | ||
| postgrest :: HasCallStack => LogLevel -> AppState.AppState -> IO () -> Wai.Application | ||
| postgrest logLevel appState connWorker = | ||
| OTel.middleware appState . |
There was a problem hiding this comment.
Now that we merged #4884, I think this middleware shouldn't be needed anymore?
Maybe we can root cause the perf issues once we integrate with the new ResponseObs too.
There was a problem hiding this comment.
This middleware's purpose is to pick up Trace ID from the request headers and initialize our tracer's parent_trace_id with it, so we'll produce traces that are part of traces' chain, thus ensuring their continuity so the trace viewer will be able to know they are the part of a certain transaction.
That could improve the performance though, now that hs-otel is the only Vault client
There was a problem hiding this comment.
I see. So if we could get rid of the traceHeaderMiddleware then the OTel middleware wouldn't be necessary right?
There was a problem hiding this comment.
Are all those changes to test/spec left-overs from before rebasing on top of the move to test/observability? It doesn't look like we need the changes to that extend - just passing Nothing in as the tracer should be enough here?
| =============== ================================= | ||
| **Type** Boolean | ||
| **Default** False | ||
| **Reloadable** N |
There was a problem hiding this comment.
So here we say this config option is not reloadable, yet in #3140 (comment) we discussed making it so. Which of these is true?
| OTelNoConfigSupplied -> | ||
| pure $ "No OpenTelemetry configuration via environment variables is supplied. " <> "Please refer https://docs.postgrest.org/en/stable/references/observability.html#opentelemetry for details." |
There was a problem hiding this comment.
How can I trigger this?
I tried PGRST_OTEL_ENABLED=true postgrest-with-pg-18 -t test/spec/fixtures/load.sql postgrest-run, but that doesn't log this. I also tried with log-level debug, nope.
There was a problem hiding this comment.
throw OTEL_SDK_DISABLED=true in, that's when tracerIsEnabled tracer == false
There was a problem hiding this comment.
The problem is, though, it prevents propagating TraceID downstream. Luckily, we don't have to, yet.
More importantly, a default OTel configuration does collect all the traces (because OTEL_TRACES_SAMPLER's default parentbased_always_on means "same as for parent span, always_on if there's none), sends them to default OTEL_EXPORTER_OTLP_ENDPOINTofhttp://localhost:4317`and does propagate (OTEL_PROPAGATORS=…) trace ID, parent span ID and sampling type (tracecontext), along with custom metadata (baggage) downstream.
* Introduces producing OpenTelemetry traces with hs-opentelemetry. * Adds OTel spans over the whole application loop and over each request processing phase * Preliminary OTel tracing support in spec tests * Disables tracing in load and memory tests fix deps

This PR introduces producing OpenTelemetry traces containing, among others, metrics same as in
ServerTimingheader from before.TODO:
build with Nix as well (for now Stack only)DONE for freemake an example of exporting log messageshs-opentelemetrydoesn't support logging, per Logging roadmap iand675/hs-opentelemetry#100makeseems like impossible without major refactoringgetTraceravailable globally: we're interested in using as many different spans as it makes sense, sogetTracershould be available everywhere, as described inhs-opentelemetry-sdk's READMEhs-opentelemetry-waimiddlewarelook into failing Windows buildshs-opentelemetry-sdkdepends onunix, tracking in Windows support iand675/hs-opentelemetry#109Running:
I sort of gave up deploying and configuring all the moving bits locally, so you'd need to create the honeycomb.io account for this one (or ask me for the invite). After that, it's quite straightforward:
stack build, and get its path withstack exec -- which postgrestnix-shell, thenpostgrest-with-postgresql-15 --fixture ./test/load/fixture.sql -- cat). Note the server URL, you'll need it when running PostgREST serverpostgrest-jwt --exp 36000 postgrest_test_anonymousTests
hspec tests are also instrumented, for those to produce traces you need to set
OTEL_*vars only: