Skip to content

Fix high response latency with multiple API-requests targeting non-existent hosts or services#10883

Open
jschmidt-icinga wants to merge 3 commits into
masterfrom
better-json-error-diagnostic-information
Open

Fix high response latency with multiple API-requests targeting non-existent hosts or services#10883
jschmidt-icinga wants to merge 3 commits into
masterfrom
better-json-error-diagnostic-information

Conversation

@jschmidt-icinga

@jschmidt-icinga jschmidt-icinga commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

This fixes the high response latency observed in #10875 when multiple requests target non-existent hosts in parallel. The reason is that for almost all API endpoints that target config objects diagnostic information including a stack trace is generated, regardless if it is actually returned (which only happens when the "verbose" parameter is given). The solution is to pass the exception_ptr into SendJsonError() and only generate the DiagnosticInformation from it when needed.

Solution

SendJsonError() now takes a std::exception_ptr argument instead of a dagnosticInformation string that is generated at the call site. This is cheap and doesn't cause any copies. Then, where the actual decision is made to include the diagnostic_information (i.e. dependent on the "verbose" parameter), DiagnosticInformation(<eptr>) is run. The overload already existed, so there wasn't even a need to rethrow the exception and process the string from the result.

This is a clean solution, and the change makes sense even without the latency amplification in mind, since pointless work is avoided in any case.

However, some yak-shaving was required to feed SendJsonError() with an exception pointer gathered with std::current_exception() at the call-site. I had to make the entire code-base use std::exception_ptr instead of boost::exception_ptr. On the newest boost version, this is just a typedef, on previous versions the boost one could be constructed from the std one, but there's also old versions we still support which didn't have that conversion. And that refactor in turn uncovered a small bug in the DiagnosticInformation(<eptr>) which had a dead-code fallback that then failed to compile and is now fixed.

Testing

Using the test-script that was provided in #10875, the improvement is clearly observable:

Before

With existent host/service:

$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procs --no-verify-ssl
Sending 5000 requests to https://localhost:5665
Max in-flight: 200
Press Ctrl+C to stop.

Progress: 3097/5000 sent (3097 ok, 0 errors) (host: master-1) (service: procs) — min: 0.001s, avg: 0.312s, p95: 1.758s, max: 2.564s

Total test duration: 7.200s
Done: 5000 ok, 0 errors out of 5000 total requests.
Request duration — min: 0.001s, max: 2.564s, avg: 0.282s, p95: 0.904s

With non-existent host/service:

$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procsasdf --no-verify-ssl
Sending 5000 requests to https://localhost:5665
Max in-flight: 200
Press Ctrl+C to stop.

Progress: 566/5000 sent (566 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.116s, avg: 0.864s, p95: 2.527s, max: 4.867s
Progress: 1496/5000 sent (1496 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.012s, avg: 1.046s, p95: 5.539s, max: 9.837s
Progress: 2419/5000 sent (2419 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.012s, avg: 0.998s, p95: 4.867s, max: 13.447s
Progress: 3336/5000 sent (3336 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.012s, avg: 1.015s, p95: 4.797s, max: 18.046s
Progress: 4198/5000 sent (4198 ok, 0 errors) (host: master-1) (service: procsasdf) — min: 0.012s, avg: 1.075s, p95: 5.001s, max: 22.430s

Total test duration: 29.636s
Done: 5000 ok, 0 errors out of 5000 total requests.
Request duration — min: 0.012s, max: 22.430s, avg: 1.160s, p95: 5.629s
jschmidt@ws-jschmidt:~/Projects/dev-docker$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procsasdf --no-verify-ssl

After

With existent host/service:

$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procs --no-verify-ssl
Sending 5000 requests to https://localhost:5665
Max in-flight: 200
Press Ctrl+C to stop.

Progress: 3161/5000 sent (3161 ok, 0 errors) (host: master-1) (service: procs) — min: 0.002s, avg: 0.304s, p95: 1.796s, max: 2.792s

Total test duration: 6.844s
Done: 5000 ok, 0 errors out of 5000 total requests.
Request duration — min: 0.002s, max: 2.792s, avg: 0.269s, p95: 0.799s

With non-existent host/service:

$ ulimit -n 100000 && python3 wp.py --host https://localhost:5665 --user root --password icinga --total 5000 --parallel 200 --target-host master-1 --target-service procsasdf --no-verify-ssl
Sending 5000 requests to https://localhost:5665
Sending 5000 requests to https://localhost:5665
Max in-flight: 200
Press Ctrl+C to stop.


Total test duration: 2.545s
Done: 5000 ok, 0 errors out of 5000 total requests.
Request duration — min: 0.002s, max: 2.126s, avg: 0.096s, p95: 0.212s
Sending: command not found

You can see that the error responses for non-existant targets are now actually much faster than successful requests, as one would expect.

Fixes #10875.

@cla-bot cla-bot Bot added the cla/signed label Jun 17, 2026
@jschmidt-icinga jschmidt-icinga force-pushed the better-json-error-diagnostic-information branch from f83e8c7 to c14e5de Compare June 17, 2026 09:07
@jschmidt-icinga jschmidt-icinga added this to the 2.17.0 milestone Jun 17, 2026
@jschmidt-icinga jschmidt-icinga added consider backporting Should be considered for inclusion in a bugfix release bug Something isn't working labels Jun 17, 2026
`DiagnosticInformation()` wasn't able to take a `std::exception_ptr` due
to the missing conversion on older boost versions, so now everything uses
the std::exception_ptr instead. There are still a few reasons to use
`boost::exception` in some places, but for exception pointers, the standard
one should be better in most cases and almost never requires to include an
extra header.
@jschmidt-icinga jschmidt-icinga force-pushed the better-json-error-diagnostic-information branch from c14e5de to 83f42f4 Compare June 17, 2026 13:14
This was likely meant as a fallback for non-`std::exception`
exception pointers, but it would never have been reached, because
`(std|boost)::rethrow_exception()` would have thrown those out of
the function scope.

This fixes the fallback by making it a catch handler and also using
the more direct way of getting this fallback diagnostic info inside
a catch handler.
The `diagnostic_information` string will now generated if and
when it is required (i.e. the "verbose" parameter is true).
@jschmidt-icinga jschmidt-icinga force-pushed the better-json-error-diagnostic-information branch from 83f42f4 to 1c40368 Compare June 18, 2026 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cla/signed consider backporting Should be considered for inclusion in a bugfix release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parallel process-check-result requests for non-existing objects degrade Icinga 2 performance and cause overdue checks

1 participant