ci: add facility for benchmarking as part of CI#4745

Merged
lgritz merged 1 commit into AcademySoftwareFoundation:main from lgritz:lg-cibench on May 19, 2025

Conversation

@lgritz lgritz commented May 10, 2025

Make it so that CI test cases that set the GHA variable "benchmark" to 1 add a benchmarking step to the workflow, which runs a new ci-benchmark.bash script.
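
The wiring is conceptually just a conditional invocation of the script. A rough sketch of the idea (the "benchmark" variable name is from this PR, but how it reaches the shell and where the script lives are assumptions for illustration):

```bash
# Hypothetical sketch of the gating; assumes the workflow exports the
# "benchmark" GHA variable into the step's environment, and that the
# script lives in src/build-scripts/ like the other ci-*.bash scripts.
if [[ "${benchmark:-0}" == "1" ]]; then
    bash src/build-scripts/ci-benchmark.bash
fi
```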

The script runs selected unit tests containing benchmarks (currently only image_span_test, but we can amend the list later as needed). Each designated test is run, and its output is both echoed to the log for that step and written to build/benchmarks/TESTNAME, which is saved as a build artifact for optional download.
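
To make that behavior concrete, here is a minimal sketch of the shape of the script (not the actual contents -- the test binary location and build layout are assumptions):

```bash
#!/usr/bin/env bash
# Minimal sketch of what ci-benchmark.bash does, per the description
# above; the binary location and exact layout are assumptions.
set -e

mkdir -p build/benchmarks

# The designated benchmark-containing unit tests -- currently just one,
# but more can be appended to this list later.
for test in image_span_test ; do
    echo "=== Benchmark: $test ==="
    # tee echoes the output to the step log while also capturing it to
    # a per-test file that the workflow uploads as a build artifact.
    ./build/bin/"$test" 2>&1 | tee "build/benchmarks/$test"
done
```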

Most test cases will not turn benchmarking on -- it will probably end up adding a few minutes, so do it very selectively (once per major platform or compiler version is plenty).

I would previously have guessed that any attempt at benchmarking on GHA runners was doomed, but in practice, I'm surprised to find almost as much run-to-run consistency as I get doing casual benchmarks on my own machine. As such, I think this can be a handy way to do some rough benchmarking using CI -- to compare platforms or compilers, or to verify that changes we want to make don't introduce performance regressions.

Caveats to remember in the future:

  • Take it all with a big grain of salt, and watch the benchmark numbers for the trial-to-trial range of times -- wide variation means that the numbers probably can't be trusted.
  • The GH runners themselves may change without warning, so be wary of benchmark stability over time, and of the possibility that they run pools of heterogeneous machine generations/configurations.
  • While my results indicate a decent amount of timing reliability for purely computational tests, I assume that there will be enormous run-to-run variation in anything involving I/O or networking. So this is unlikely to be a fruitful way of testing for performance regressions in image format I/O speed (but probably is useful for a variety of in-memory operations).
  • As we add more unit tests to what we benchmark in the future, keep an eye on how much time we're spending running these benchmarks. A few minutes on a small subset of the test jobs is probably fine, but I wouldn't want the overall wait for a full CI run to become substantially longer because of it.

lgritz commented May 15, 2025

Any objections or comments before I merge this?

@lgritz lgritz added the build / testing / port / CI label May 17, 2025

lgritz commented May 19, 2025

Over a week in review, no objections, CI-only change ==> merge

@lgritz lgritz merged commit cc794f2 into AcademySoftwareFoundation:main May 19, 2025
32 checks passed
@lgritz lgritz deleted the lg-cibench branch May 20, 2025 04:44
lgritz added a commit to lgritz/OpenImageIO that referenced this pull request May 20, 2025
ci: add facility for benchmarking as part of CI (AcademySoftwareFoundation#4745)

zachlewis pushed a commit to zachlewis/OpenImageIO that referenced this pull request Aug 1, 2025
ci: add facility for benchmarking as part of CI (AcademySoftwareFoundation#4745)