nix(loadtest): group results by URL / report minimum latency / merge errors+mixed by wolfgangwalther · Pull Request #4902 · PostgREST/postgrest

wolfgangwalther · 2026-05-08T20:29:07Z

Quite a few changes at once, but I think this is a big improvement to how we can work with the loadtests.

I already pushed two commits two days ago to make the jwt-rsa-cache-worst job optional in CI and avoid cancelling all others when that one fails.

This PR then adds a series of commits on top, which improve the reporting for the loadtest. With the more detailed reporting per URL, we can merge the errors + mixed jobs. I envision something similar for the JWT loadtests, so that we will end up with only two loadtest suites: one with statically defined targets and one with dynamically created ones. It might even be possible to merge these two as well, but that's for much later.

This does not provide any automatic result checking by CI - that's something to add later in a different PR.

A lot of text and information in the commit messages. Make sure to read those, too.

steve-chavez · 2026-05-18T17:04:03Z

> Using the minimum latency is an estimation of how fast the code can run
> in the best case. This might not be a number relevant for production,
> but it's much more directly related to the code we write.

Interesting, so we're changing to "less is better" now with the "minimum latency" right? Before it was "more is better" with the rate.

Looking at https://github.com/PostgREST/postgrest/actions/runs/26036478763?pr=4902#summary-76536097215:

There is a perf drop for /films?columns=id,title,year,runtime,genres,director,actors,plot,posterUrl

It would be nice to include the HTTP method in the results.

Ultimately, we only look at the `rate` column, so we can just as well remove all other columns. This makes the next step, when we split results by request type, much less noisy.

Different requests hit different code paths and perform very differently. By looking at each request type separately, we should be able to get a much better idea of what kind of change in performance we're looking at and where the root cause might be. It will hopefully also allow us to migrate some of the other test-cases into the main loadtest.

We previously used "rate", i.e. the number of requests per second, as the primary metric to judge loadtest results. However, this has always varied quite a bit from run to run, especially in CI where other jobs possibly run on the same VM host. The run-to-run variance has massively increased after splitting the results up per request.

Example run in CI with rate on the PR introducing this change (on which we would expect no change at all):

| rate [1/s]                         |   main |   head |    Δ |
|:-----------------------------------|-------:|-------:|-----:|
| /                                  |    9.4 |    9.5 |   1% |
| /actors                            |  870.4 | 1023.0 |  18% |
| /actors?actor=eq.1                 |  188.5 |  198.6 |   5% |
| /actors?actor=eq.1&columns=name    |  197.3 |  167.1 | -15% |
| /actors?select=*,roles(*,films(*)) |  153.9 |  144.9 |  -6% |
| /films?columns=id,title            |  157.9 |  182.6 |  16% |
| /films?columns=id,title,year,...   |   87.0 |   87.1 |   0% |
| /roles                             |  204.5 |  267.3 |  31% |
| /rpc/call_me                       |  231.3 |  208.8 | -10% |
| /rpc/call_me?name=John             |  212.2 |  201.7 |  -5% |

From the data we can easily tell that rate as a parameter has only worked so far because the data was *heavily* dominated by the requests on the root endpoint for OpenAPI. The longer duration makes that request much less vulnerable to concurrent activity. For all other requests it's essentially not possible to judge the effect of a PR this way.

One way to counter this would be to massively increase the time the loadtest runs. More samples will result in a smoother average. However, that's not practical for usability of CI.

In the original PR PostgREST#1812 I already evaluated using the *minimum latency* as the most reliable criterion, but this has never really caught on. The theory behind this is: The variation in timings between requests happens because of concurrent activity, priority chosen by the scheduler, availability of resources and such - all factors *outside* our control, and *irrelevant* to the Haskell code we're writing.

Using the minimum latency is an estimation of how fast the code can run *in the best case*. This might not be a number relevant for production, but it's much more directly related to the code we write.

Here's how variation becomes *much* smaller with minimum latency as the parameter:

| min latency [μs]                   |   main |   head |    Δ |
|:-----------------------------------|-------:|-------:|-----:|
| /                                  | 1275.3 | 1263.6 |  -1% |
| /actors                            |   10.0 |    9.9 |  -1% |
| /actors?actor=eq.1                 |   50.7 |   48.3 |  -5% |
| /actors?actor=eq.1&columns=name    |   54.1 |   54.0 |   0% |
| /actors?select=*,roles(*,films(*)) |   63.2 |   61.9 |  -2% |
| /films?columns=id,title            |   51.1 |   50.7 |  -1% |
| /films?columns=id,title,year,...   |  121.9 |  121.8 |   0% |
| /roles                             |   42.9 |   42.6 |  -1% |
| /rpc/call_me                       |   45.6 |   45.4 |   0% |
| /rpc/call_me?name=John             |   44.4 |   44.2 |   0% |

Since we're separating results per request now, we can only sensibly focus on *one* parameter - otherwise this would get really clunky UI-wise. Especially for automated CI failures, minimum latency is the logical choice.

Because we separate loadtest results per URL now, we can move the error tests into the regular mixed bag of loadtests - we will be able to tell from the misspelled URLs when we hit a regression in that area. We should be able to do similar things for JWT tests, but we'll need more infrastructure here.

This gives us a much better idea immediately when looking at the report. It also allows us to differentiate between different return codes on the same URI, which might come in handy when dealing with expired JWT and such. A nice side-effect: All the error-related requests are now grouped together in the 4xx section.
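
The grouping described in these commit messages could be sketched roughly as follows. This is only an illustration under assumptions: the record fields (`code`, `method`, `url`, `latency_us`) and the helper names are hypothetical and are not taken from the actual loadtest scripts.

```python
from collections import defaultdict

def group_latencies(results):
    """Group latency samples by (status code, HTTP method, URL)."""
    groups = defaultdict(list)
    for r in results:
        groups[(r["code"], r["method"], r["url"])].append(r["latency_us"])
    return dict(groups)

def summarize(results):
    """Return report rows sorted by status code: (label, min latency, sample count)."""
    rows = []
    for (code, method, url), samples in sorted(group_latencies(results).items()):
        rows.append((f"{code} {method} {url}", min(samples), len(samples)))
    return rows

# Hypothetical samples; the same URL can appear with different status codes.
results = [
    {"code": 200, "method": "GET", "url": "/actors", "latency_us": 10.4},
    {"code": 404, "method": "GET", "url": "/actoxs?actor=eq.1", "latency_us": 48.0},
    {"code": 200, "method": "GET", "url": "/actors", "latency_us": 9.9},
    {"code": 401, "method": "GET", "url": "/actors", "latency_us": 30.2},
]

for label, best, n in summarize(results):
    print(f"{label:30} min={best:6.1f} us  n={n}")
```

Sorting by the `(code, method, url)` key is what places all 4xx rows next to each other, and it keeps a 200 and a 401 on the same URI as separate rows.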

wolfgangwalther · 2026-05-18T17:42:56Z

> Interesting, so we're changing to "less is better" now with the "minimum latency" right? Before it was "more is better" with the rate.

Correct.

> There is a perf drop for /films?columns=id,title,year,runtime,genres,director,actors,plot,posterUrl

Interesting. I have started bisecting this.

> it would be nice to include the HTTP method on the results.

Right. I also wanted the returned status code to be part of it. Added both, I think that works well.

steve-chavez

> Right. I also wanted the returned status code to be part of it. Added both, I think that works well.

Looks much better 👍

wolfgangwalther · 2026-05-18T18:13:25Z

> There is a perf drop for /films?columns=id,title,year,runtime,genres,director,actors,plot,posterUrl

> Interesting. I have started bisecting this.

And I'm led to baebacf as the first bad commit. Maybe one of the changed dependencies causes this? Also note that the commit did change the memory limits for exactly the same kind of request, so that matches the observation.

mkleczek · 2026-05-18T18:28:09Z

> There is a perf drop for /films?columns=id,title,year,runtime,genres,director,actors,plot,posterUrl

> Interesting. I have started bisecting this.

> And I'm led to baebacf as the first bad commit. Maybe one of the changed dependencies causes this?

That is quite possible - the most suspect is warp. We've been on quite an old version before.

wolfgangwalther · 2026-05-18T18:34:27Z

> the most suspect is warp.

Currently bisecting this as well (at least by version).

mkleczek

I remember @steve-chavez opposed adding error cases to mixed load test and decided to have a separate errors load test.

I personally support having a wider spectrum of cases in the mixed load test but want to make sure the reasons given earlier by @steve-chavez are no longer relevant.

wolfgangwalther · 2026-05-18T18:39:16Z

I believe the concern was that it was not easy to tell whether a regression was caused by error cases or regular requests? The grouping by URL would solve that concern.

mkleczek

I think minimum is the wrong metric - it is really not something you can trust - it is too noisy. Usually the best performance metric is Pxx latency: most of the time P90 or P95 (sometimes even P99 but that is also noisy).

Also - we really want to measure "worst case" - not "best case". If out of 100 requests taking 10 s each, one takes 0.01 s, then the minimum gives a very poor picture of how performant the software is.

wolfgangwalther · 2026-05-18T18:49:27Z

> I think minimum is the wrong metric - it is really not something you can trust - it is too noisy. Usually the best performance metric is Pxx latency: most of the time P90 or P95 (sometimes even P99 but that is also noisy).

I do agree that this applies when you expect there to be a "true mean" and the noise happening in both directions.

That's not the case here. The underlying theory, which I put in the commit message, is that in the absence of any other running process, there is an optimal execution of the code we write. The best-case. Noise by other processes, resource contention and such all only make it slower.

Or in practical terms: I tried many things, but minimum latency was the most reliable for me (in my subjective judgement!).

> Also - we really want to measure "worst case" - not "best case". If out of 100 requests taking 10 s each, one takes 0.01 s, then the minimum gives a very poor picture of how performant the software is.

No, given the reasoning above we are not interested in worst case at all. This would essentially just mean to test how performant GitHub Actions are.

These loadtests are not (and were never!) designed to give production-like numbers, where you can say how PostgREST performs in the environment it's in. That's what you need stuff like postgrest-benchmark or other things for. These tests are meant as a protection against performance regressions caused by writing inefficient Haskell code itself. There is a large number of other scenarios that these tests are not designed to test, most importantly everything that involves multi-threading: The loadtest runs with pool=1, not more.

wolfgangwalther · 2026-05-18T19:01:12Z

> That's not the case here. The underlying theory, which I put in the commit message, is that in the absence of any other running process, there is an optimal execution of the code we write. The best-case. Noise by other processes, resource contention and such all only make it slower.

Or let me put it differently: The data is not normally distributed, because the minimum is not open-ended. There is a theoretical lower bound (the one where all other processes just play along perfectly and the scheduler does all the right things) under which we cannot fall. Thus, we're really looking for an approximation to that, and the minimum is the best approximation we can get.
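
The lower-bound argument can be illustrated with a small simulation. This is a sketch under assumptions and not a measurement: the noise model (a fixed 50 μs floor, exponential jitter with a 20 μs mean, and a 1% chance of a 2000 μs stall) and all names are hypothetical.

```python
import random
from statistics import mean

def simulate_run(floor_us, n, rng):
    """Latency samples: a fixed floor plus non-negative, heavy-tailed noise."""
    samples = []
    for _ in range(n):
        jitter = rng.expovariate(1 / 20.0)                # mean 20 us, never negative
        stall = 2000.0 if rng.random() < 0.01 else 0.0    # rare scheduler stall
        samples.append(floor_us + jitter + stall)
    return samples

def spread(values):
    """Relative run-to-run spread: (max - min) / min."""
    return (max(values) - min(values)) / min(values)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
runs = [simulate_run(50.0, 2000, rng) for _ in range(20)]
mins = [min(run) for run in runs]
means = [mean(run) for run in runs]

print(f"spread of per-run minimum: {spread(mins):.4%}")
print(f"spread of per-run mean:    {spread(means):.4%}")
```

Because the noise in this model can only add time, every per-run minimum sits just above the floor, while the mean moves with the number of stalls a run happens to contain. Whether real CI noise follows this model is exactly the assumption under discussion.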

wolfgangwalther · 2026-05-18T19:05:47Z

> the most suspect is warp.

> Currently bisecting this as well (at least by version).

It seems this happened from warp 3.4.12 to 3.4.13.

Changes for 3.4.13: yesodweb/wai@warp-tls-3.4.12...cd6b6e1

Quite a few.

mkleczek · 2026-05-18T19:17:27Z

> No, given the reasoning above we are not interested in worst case at all. This would essentially just mean to test how performant GitHub Actions are.

But what information does minimum give us? You have two runs of two different versions - comparing minimum latency between them is meaningless.

The only reasonable data point to assess regressions is Pxx latency - ie. we want to ensure we do not make the software worse. Minimum won't tell you that for sure.

wolfgangwalther · 2026-05-18T19:24:31Z

> But what information does minimum give us? You have two runs of two different versions - comparing minimum latency between them is meaningless.
>
> The only reasonable data point to assess regressions is Pxx latency - ie. we want to ensure we do not make the software worse. Minimum won't tell you that for sure.

Well, what else can I tell you except that it works very well?

Have a look at the PR introducing the loadtest, this comment #1812 (comment). This is how noisy minimum based metrics are (the numbers are "hypothetical requests per second based on minimum latency", essentially just 1 / minimum latency that I calculated for convenience back then "more is better"). The minimum was perfectly able to identify all our issues. It is perfectly able to identify the issue now. Back when I introduced this, I looked at all the numbers. Minimum, mean, median, Psomething. The minimum appeared to be the most reliable. I'm happy to be proven wrong, if you can show me that Psomething is indeed better, I'm all for it. But I already did all the data collection back then, and I'm not keen on doing it again today.

mkleczek · 2026-05-18T19:43:34Z

> But what information does minimum give us? You have two runs of two different versions - comparing minimum latency between them is meaningless.
> The only reasonable data point to assess regressions is Pxx latency - ie. we want to ensure we do not make the software worse. Minimum won't tell you that for sure.

> Well, what else can I tell you except that it works very well?

It doesn't really. Let me explain below.

> Have a look at the PR introducing the loadtest, this comment #1812 (comment). This is how noisy minimum based metrics are (the numbers are "hypothetical requests per second based on minimum latency", essentially just 1 / minimum latency that I calculated for convenience back then "more is better"). The minimum was perfectly able to identify all our issues.

The problem is that these numbers are really not very meaningful, for example they would stay the same if you introduced a global lock that only allows one concurrent request at a time making others wait (minimum latency would stay the same).

The performance issues from this comment would be perfectly well identified by Pxx latency as well but Pxx would show you regressions that minimum won't catch.

With all due respect, measuring P9x latency is a well established and standard practice. If you have links to any papers arguing for measuring minimum latency to assess software performance, I am happy to read them and change my mind.

wolfgangwalther · 2026-05-18T19:58:14Z

> The problem is that these numbers are really not very meaningful, for example they would stay the same if you introduced a global lock that only allows one concurrent request at a time making others wait (minimum latency would stay the same).

That global lock is essentially already in place, because vegeta does not attack in parallel (1 worker only) and we only have a pool size of 1. Again, the loadtest is not designed for anything multi-threaded.

mkleczek · 2026-05-18T20:06:14Z

> The problem is that these numbers are really not very meaningful, for example they would stay the same if you introduced a global lock that only allows one concurrent request at a time making others wait (minimum latency would stay the same).

> That global lock is essentially already in place, because vegeta does not attack in parallel (1 worker only) and we only have a pool size of 1. Again, the loadtest is not designed for anything multi-threaded.

Didn't know about that and that is a problem in itself. But anyway that was just one example really, a different one that shows the same for effectively single-threaded software is excessive memory usage causing garbage collection pauses - again minimum is going to stay the same.

Pxx will show you the same regressions and more - it is simply a more meaningful measure.
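
The garbage-collection example can be made concrete with a small constructed data set. This is a sketch with made-up numbers, not loadtest output: `percentile` is a hypothetical nearest-rank helper, and the 500 μs pause on every tenth request is an assumption chosen purely for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile for integer p; p=0 is the minimum, p=100 the maximum."""
    ordered = sorted(samples)
    if p <= 0:
        return ordered[0]
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

# Baseline: 100 requests with latencies cycling through 50..59 us.
baseline = [50.0 + (i % 10) for i in range(100)]

# Hypothetical regression: every 10th request additionally hits a 500 us pause.
with_pauses = [x + (500.0 if i % 10 == 9 else 0.0) for i, x in enumerate(baseline)]

for p in (0, 50, 95):
    print(f"P{p}: baseline={percentile(baseline, p)} with_pauses={percentile(with_pauses, p)}")
```

In this constructed case P0 stays at 50.0 and P50 stays at 54.0, while P95 moves from 59.0 to 559.0. Only the tail percentile registers the pauses here, since they affect just 10% of the requests.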

wolfgangwalther · 2026-05-18T21:34:42Z

> Didn't know about that and that is a problem in itself.

The problem is that we don't have many resources to work with in GitHub Actions and the data is inherently noisy because of that. Not reducing scope to essentially single-threaded execution will make the results unusable.

> But anyway that was just one example really, a different one that shows the same for effectively single-threaded software is excessive memory usage causing garbage collection pauses - again minimum is going to stay the same.

Thanks, that I understand.

> Pxx will show you the same regressions and more - it is simply a more meaningful measure.

I did some tests again. I ran the loadtest twice, once head vs main, and once head vs 14.11. We expect to see no difference whatsoever on the first test... and possibly a difference on the comparison to v14.11, at least according to our earlier observations. I then compared the percentage difference between different metrics, from minimum all the way up to maximum. Positive numbers are better, negative numbers are worse (because I have a -1 in my calculation there...).

| head / main ratio | Δ P0 | Δ P5 | Δ P50 | Δ P95 | Δ P100 |
|:------------------|-----:|-----:|------:|------:|-------:|
| `200 GET /` | 0.1% | 0.1% | 0.6% | 1.6% | -216.1% |
| `200 GET /actors?select=*,roles(*,films(*))` | -2.6% | -1.4% | -0.6% | 2.6% | -23.8% |
| `200 GET /rpc/call_me?name=John` | 2.0% | 0.0% | 0.3% | 1.0% | -114.6% |
| `200 HEAD /actors?actor=eq.1` | -2.4% | -3.6% | -2.0% | 1.4% | 44.9% |
| `200 OPTIONS /actors` | 7.7% | -1.7% | -1.4% | 0.0% | -38.6% |
| `200 POST /rpc/call_me` | 4.9% | 0.4% | 0.0% | 1.5% | -465.8% |
| `201 POST /films?columns=id,title` | -2.9% | -0.9% | -0.5% | 2.5% | 37.7% |
| `201 POST /films?columns=id,title,year,runtime,genres,director,actors,plot,posterUrl` | 0.3% | 0.1% | 0.3% | 2.3% | 68.0% |
| `204 DELETE /roles` | 2.1% | 0.0% | 0.0% | 1.4% | -17.4% |
| `204 PATCH /actors?actor=eq.1` | 0.3% | 0.0% | -0.3% | -6.2% | 50.0% |
| `204 PUT /actors?actor=eq.1&columns=name` | -0.6% | 0.3% | 0.5% | 3.3% | 25.8% |
| `400 GET /actors?select=*,rolws(*,films(*))` | 1.1% | 0.0% | -0.8% | 1.3% | 78.7% |
| `401 GET /actors_1` | -1.8% | 0.4% | 0.0% | 0.6% | 80.0% |
| `404 GET /actoxs?actor=eq.1` | 1.2% | -3.3% | -1.4% | -0.6% | -12.2% |
| `404 GET /rpc/call_me_x?name=John` | 0.0% | 1.2% | -1.0% | 2.3% | 35.6% |

| head / v14.11 ratio | Δ P0 | Δ P5 | Δ P50 | Δ P95 | Δ P100 |
|:--------------------|-----:|-----:|------:|------:|-------:|
| `200 GET /` | 0.0% | -0.9% | -1.0% | -1.0% | 1.1% |
| `200 GET /actors?select=*,roles(*,films(*))` | -1.0% | 3.7% | 4.0% | 3.4% | -163.2% |
| `200 GET /rpc/call_me?name=John` | -2.0% | -2.2% | -2.6% | -2.7% | -47.4% |
| `200 HEAD /actors?actor=eq.1` | -7.2% | -8.2% | -10.5% | -8.4% | 68.3% |
| `200 OPTIONS /actors` | -4.1% | 1.7% | 1.4% | -1.1% | -11.8% |
| `200 POST /rpc/call_me` | -3.3% | -3.0% | -3.2% | -2.1% | -116.7% |
| `201 POST /films?columns=id,title` | -3.3% | -3.6% | -4.9% | -2.4% | -50.9% |
| `201 POST /films?columns=id,title,year,runtime,genres,director,actors,plot,posterUrl` | -8.7% | -5.8% | -5.6% | -4.0% | 21.5% |
| `204 DELETE /roles` | -4.9% | -3.6% | -3.8% | -2.9% | -163.3% |
| `204 PATCH /actors?actor=eq.1` | -2.9% | -3.6% | -4.3% | -2.9% | 57.3% |
| `204 PUT /actors?actor=eq.1&columns=name` | -2.8% | -3.5% | -3.1% | -2.2% | -198.4% |
| `400 GET /actors?select=*,rolws(*,films(*))` | 5.2% | 9.8% | 8.4% | 15.4% | 2.6% |
| `401 GET /actors_1` | -0.9% | -2.6% | -1.9% | -2.5% | -240.4% |
| `404 GET /actoxs?actor=eq.1` | -5.2% | -6.1% | -3.8% | -7.7% | -27.1% |
| `404 GET /rpc/call_me_x?name=John` | 2.8% | 2.4% | 2.9% | 3.1% | 4.6% |

My takeaways:

- It should be easy to agree that P100 / maximum is a really bad metric :D. I did not expect anything different, though.
- It's very important for us to report those percentage differences, I believe. When Steve looked at the data earlier, he assumed a performance regression in the specific POST case with many columns. Looking at this data, it seems like we have a performance regression throughout all requests. Maybe a tiny bit more pronounced on that POST request - but even more so on the HEAD request. And we missed that one, likely due to a smaller absolute difference. I guess warp just got a tad slower across everything?
- Two of the four error cases seem to be improved. I'm not sure whether that's coincidence and noisy data or a real pattern.
- The P95 data seems to be less noisy than the minimum. Even less noisy is P50, the median.

With this in mind, I have no problem switching from minimum to another, reasonable, Psomething. The more important thing is to switch away from the mean (which is the same as "rate"), because it's influenced by the max so heavily. The data here suggests using P50. If we were to implement CI failures based on these, I'd probably start by failing when there is a regression worse than -4% on any single endpoint.
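
A check along these lines could look roughly like the sketch below. It rests on assumptions: the sign convention is taken to be `1 - head / base` (positive is better, as in the tables above), the latencies are invented, and the function names are hypothetical rather than part of the loadtest tooling.

```python
def delta_percent(base_us, head_us):
    """Relative change in latency in percent; positive means head is faster."""
    return (1 - head_us / base_us) * 100

def regressions(base, head, threshold=-4.0):
    """Endpoints whose delta falls below the threshold, with the rounded delta."""
    found = []
    for endpoint in sorted(base.keys() & head.keys()):
        delta = delta_percent(base[endpoint], head[endpoint])
        if delta < threshold:
            found.append((endpoint, round(delta, 1)))
    return found

# Hypothetical P50 latencies in microseconds for two builds.
base = {
    "200 GET /actors": 100.0,
    "200 HEAD /actors?actor=eq.1": 200.0,
    "204 DELETE /roles": 80.0,
}
head = {
    "200 GET /actors": 101.0,
    "200 HEAD /actors?actor=eq.1": 221.0,
    "204 DELETE /roles": 76.0,
}

failing = regressions(base, head)
print(failing if failing else "no endpoint regressed by more than 4%")
```

With these invented numbers only the HEAD request is flagged, at -10.5%. The -1% change on `/actors` stays within the tolerance and the DELETE request improved.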

steve-chavez · 2026-05-18T22:18:10Z

> I personally support having a wider spectrum of cases in the mixed load test but want to make sure the reasons given earlier by @steve-chavez are no longer relevant.

> I believe the concern was that it was not easy to tell whether a regression was caused by error cases or regular requests? The grouping by URL would solve that concern.

Yes, correct. Now we can see the misspelled URLs plus the 404, so it's fine to combine both now that we have a clearer report.

wolfgangwalther · 2026-05-19T08:06:03Z

(just pushed a change to do p95 to see how these numbers look in CI - will do p50 as well)

wolfgangwalther · 2026-05-19T10:58:53Z

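So some runs in CI:

min: https://github.com/PostgREST/postgrest/actions/runs/26050154038#summary-76584461724

p50: https://github.com/PostgREST/postgrest/actions/runs/26091664272#summary-76718651615

p95: https://github.com/PostgREST/postgrest/actions/runs/26084532776#summary-76694135627

When I imagine building automated CI failures on top of this, the p50 run looks like the most useful to me. Other opinions?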

mkleczek · 2026-05-19T12:21:19Z

> So some runs in CI:
>
> min: https://github.com/PostgREST/postgrest/actions/runs/26050154038#summary-76584461724
>
> p50: https://github.com/PostgREST/postgrest/actions/runs/26091664272#summary-76718651615
>
> p95: https://github.com/PostgREST/postgrest/actions/runs/26084532776#summary-76694135627
>
> When I imagine building automated CI failures on top of this, the p50 run looks like the most useful to me. Other opinions?

Can we present all and take P50 as a measure to decide CI failures for now?
I guess we need to spend some time looking at the data and adjust our choice. The tricky part is going to be accounting for noise (ie. what change in the measure should cause CI failure - 5%, 10%) and making sure we catch tail latencies at the same time. That requires "learning" the data and seeing what is expected vs outside of expected bounds.

So let's start with P50 +/- 5% as a signal - we'll see how it works. WDYT?

wolfgangwalther force-pushed the loadtests branch 3 times, most recently from b0cb802 to eb2af7e on May 16, 2026 08:29

wolfgangwalther changed the title ~~nix(loadtest): group results by URL~~ nix(loadtest): group results by URL / report minimum latency / merge errors+mixed May 17, 2026

wolfgangwalther marked this pull request as ready for review May 17, 2026 14:39

wolfgangwalther requested a review from steve-chavez May 17, 2026 14:43

wolfgangwalther force-pushed the loadtests branch 5 times, most recently from cf64216 to 5467f57 on May 18, 2026 13:27

wolfgangwalther added 5 commits May 18, 2026 19:34

nix(loadtest): remove noise from report

e6d2a5a

Ultimately, we only look at the `rate` column, so we can just as well remove all other columns. This makes the next step, when we split results by request type, much less noisy.

wolfgangwalther force-pushed the loadtests branch from 5467f57 to 3cd0934 on May 18, 2026 17:42

steve-chavez approved these changes May 18, 2026

wolfgangwalther mentioned this pull request May 18, 2026

fix: shutdown should wait for in flight requests #4702

Merged

mkleczek reviewed May 18, 2026

p95

3f8c060

p50

5f3b6b5
