nix(loadtest): group results by URL / report minimum latency / merge errors+mixed#4902
nix(loadtest): group results by URL / report minimum latency / merge errors+mixed#4902wolfgangwalther wants to merge 7 commits into
Conversation
b0cb802 to
eb2af7e
Compare
cf64216 to
5467f57
Compare
Interesting, so we're changing to "less is better" now with the "minimum latency" right? Before it was "more is better" with the rate. Looking at https://github.com/PostgREST/postgrest/actions/runs/26036478763?pr=4902#summary-76536097215:
|
Ultimately, we only look at the `rate` column, so we can just as well remove all other columns. This makes the next step, when we split results by request type, much less noisy.
Different requests hit different code paths and perform very differently. By looking at each request type separately, we should be able to get a much better idea of what kind of change in performance we're looking at and where the root cause might be. It will hopefully also allow us to migrate some of the other test-cases into the main loadtest.
We previously used "rate", i.e. number of requests per second, as the primary metric to judge loadtest results. However, this has always been varying from run to run quite a bit, especially in CI where other jobs possibly run on the same VM host. The run-to-run variance has massively increased after splitting the results up per request. Example run in CI with rate on the PR introducing this change (on which we would expect no change at all): | rate [1/s] | main | head | Δ | |:-----------------------------------|-------:|-------:|-----:| | / | 9.4 | 9.5 | 1% | | /actors | 870.4 | 1023.0 | 18% | | /actors?actor=eq.1 | 188.5 | 198.6 | 5% | | /actors?actor=eq.1&columns=name | 197.3 | 167.1 | -15% | | /actors?select=*,roles(*,films(*)) | 153.9 | 144.9 | -6% | | /films?columns=id,title | 157.9 | 182.6 | 16% | | /films?columns=id,title,year,... | 87.0 | 87.1 | 0% | | /roles | 204.5 | 267.3 | 31% | | /rpc/call_me | 231.3 | 208.8 | -10% | | /rpc/call_me?name=John | 212.2 | 201.7 | -5% | From the data we can easily tell that the very reason that rate as a paramter has only worked, so far, because the data was *heavily* dominated by the requests on the root endpoint for OpenAPI. The longer duration makes the request much less vulnerable for concurrent activity. For all other requests its essentially not possible to judge the effect of a PR this way. One way to counter this would be to massively increase the time the loadtest runs. More samples will result in a smoother average. However, that's not practical for usability of CI. In the original PR PostgREST#1812 I already evaluated using the *minimum latency* as the most reliable criteriumi, but this has never really caught on. The theory behind this is: The variation in timings between requests is happening because of concurrent activity, priority chosen by the scheduler, availability of resources and such - all factors *outside* our control, and *irrelevant* to the Haskell code we're writing. Using the minimum latency is an estimation of how fast the code can run *in the best case*. This might not be a number relevant for production, but it's much more directly related to the code we write. Here's to show how variation becomes *much* smaller with minimum latency as the parameter: | min latency [μs] | main | head | Δ | |:-----------------------------------|---------:|-------:|-----:| | / | 1275.3 | 1263.6 | -1% | | /actors | 10.0 | 9.9 | -1% | | /actors?actor=eq.1 | 50.7 | 48.3 | -5% | | /actors?actor=eq.1&columns=name | 54.1 | 54.0 | 0% | | /actors?select=*,roles(*,films(*)) | 63.2 | 61.9 | -2% | | /films?columns=id,title | 51.1 | 50.7 | -1% | | /films?columns=id,title,year,... | 121.9 | 121.8 | 0% | | /roles | 42.9 | 42.6 | -1% | | /rpc/call_me | 45.6 | 45.4 | 0% | | /rpc/call_me?name=John | 44.4 | 44.2 | 0% | Since we're separating results per request now, we can only sensibly focus on *one* parameter - otherwise this would get really clunky UI-wise. Especially for automated CI failures, minimum latency is the logical choice.
Because we separate loadtest results per URL now, we can move the error tests into the regular mixed bag of loadtests - we will be able to tell from the misspelled URLs when we hit a regression in that area. We should be able to do similar things for JWT tests, but we'll need more infrastructure here.
This gives us a much better idea immediately when looking at the report. It also allows us to differentiate between different return codes on the same URI, which might come in handy when dealing with expired JWT and such. A nice side-effect: All the error-related requests are now grouped together in the 4xx section.
5467f57 to
3cd0934
Compare
Correct.
Interesting. I have started bisecting this.
Right. I also wanted the returned status code to be part of it. Added both, I think that works well. |
steve-chavez
left a comment
There was a problem hiding this comment.
Right. I also wanted the returned status code to be part of it. Added both, I think that works well.
Looks much better 👍
And I'm lead to baebacf as the first bad commit. Maybe one of the changed dependencies causes this? Also note that the commit did change the memory limits for exactly the same kind of request, so that matches the observation. |
That is quite possible - the most suspect is |
Currently bisecting this as well (at least by version). |
mkleczek
left a comment
There was a problem hiding this comment.
I remember @steve-chavez opposed adding error cases to mixed load test and decided to have a separate errors load test.
I personally support having a wider spectrum of cases in the mixed load test but want to make sure the reasons given earlier by @steve-chavez are no longer relevant.
|
I believe the concern was, that it was not easy to tell whether a regression was caused by error cases or regular requests? The grouping by URL would solve that concern. |
There was a problem hiding this comment.
I think minimum is wrong metric - it is really not something you can trust - it is too noisy. Usually the best performance metric is Pxx latency: most of the times P90 or P95 (sometimes even P99 but that is also noisy).
Also - we really want to measure "worst case" - not "best case". If out of 100 requests 10s each, one is 0.01s then minimum really badly shows how performant is the software.
I do agree that this applies when you expect there to be a "true mean" and the noise happening in both directions. That's not the case here. The underlying theory, which I put in the commit message, is that in the absence of any other running process, there is an optimal execution of the code we write. The best-case. Noise by other processes, resource contention and such all only make it slower. Or in practical terms: I tried many things, but minimum latency was the most reliable for me (in my subjective judgement!).
No, given the reasoning above we are not interested in worst case at all. This would essentially just mean to test how performant GitHub Actions are. These loadtests are not (and were never!) designed to give production-like numbers, where you can say how PostgREST performs in the environment its in. That's what you need stuff like |
Or let me put it in differently: The data is not normally distributed, because the minimum is not open-ended. There is a theoretical lower bound (the one where all other processes just play along perfectly and the scheduler does all the right things) under which we can not fall. Thus, we're really looking for an approximation to that, the minimum is the best approximation we can get. |
It seems this happened from warp 3.4.12 to 3.4.13. Changes for 3.4.13: yesodweb/wai@warp-tls-3.4.12...cd6b6e1 Quite a few. |
But what information does minimum give us? You have two runs of two different versions - comparing minimum latency between them is meaningless. The only reasonable data point to assess regressions is |
Well, what else can I tell you except that it works very well? Have a look at the PR introducing the loadtest, this comment #1812 (comment). This is how noisy minimum based metrics are (the numbers are "hypothetical requests per second based on minimum latency", essentially just |
It doesn't really. Let me explain below.
The problem is that these numbers are really not very meaningful, for example they would stay the same if you introduced a global lock that only allows one concurrent request at a time making others wait (minimum latency would stay the same). The performance issues from this comment would be perfectly well identified by With all due respect, measuring |
That global lock is essentially already in place, because vegeta does not attack in parallel (1 worker only) and we only have a pool size of 1. Again, the loadtest is not designed for anything multi-threaded. |
Didn't know about that and that is a problem in itself. But anyway that was just one example really, a different one that shows the same for effectively single-threaded software is excessive memory usage causing garbage collection pauses - again miminum is going to stay the same.
|
The problem is that we don't have many resources to work with in GitHub Actions and the data is inherently noisy because of that. Not reducing scope to essentially single-threaded execution will make the results unusable.
Thanks, that I understand.
I did some tests again. I ran the loadtest twice, once head vs main, and once head vs 14.11. We expect to see no difference whatsoever on the first test... and possibly a difference on the comparison to v14.11, at least to our earlier observations. I then compared the percentage difference between different metrics, from minimum all the way up to maximum. Positive numbers are better, negative numbers are worse (because I have a
My takeaways:
With this in mind, I have no problem switching from minimum to another, reasonable, Psomething. The more important thing is to switch away from the mean (which is the same as "rate"), because it's influenced by the max so heavily. The data here suggests to use P50. If we were to implement CI failures based on these, I'd probably start trying to fail when it's more than -4% regression on any single endpoint. |
Yes, correct. Now we can see the misspelling URLs plus the 404, so it's fine to combine both now that we have a clearer report. |
|
(just pushed a change to do p95 to see how these numbers look in CI - will do p50 as well) |
|
So some runs in CI:
When I imagine to build automated CI failures on top of it, to me, it looks like the p50 run is the most useful. Other opinions? |
Can we present all and take P50 as a measure to decide CI failures for now? So let's start with P50 +/- 5% as a signal - we'll see how it works. WDYT? |
Quite a few changes at once, but I think this is a big improvement to how we can work with the loadtests.
I already pushed two commits two days ago to make the
jwt-rsa-cache-worstjob optional in CI and avoid cancelling all others when that one fails.This PR then adds a series of commits on top, which improves the reporting for the loadtest. With the more detailed reporting per URL, we can merge the errors + mixed jobs. I envision something similar for the JWT loadtests, so that we will end up with only two loadtest suites: One with statically defined targets and one with dynamically created ones. It might even be possible to merge these two as well, but that's for much later.
This does not provide any automatic result checking by CI - that's something to add later in a different PR.
A lot of text and information in the commit messages. Make sure to read those, too.