
ci: remove "max-workers" and test with our gradle defaults #15307

Merged
rmuir merged 7 commits into apache:main from rmuir:ci-test-defaults
Oct 9, 2025

Conversation

@rmuir
Member

@rmuir rmuir commented Oct 8, 2025

If our defaults need improvement, let's improve them to make tests run faster. We want to take advantage of the hardware.

Relates: #15212

@github-actions
Contributor

github-actions Bot commented Oct 8, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@rmuir
Member Author

rmuir commented Oct 8, 2025

I'll force a couple of re-builds here to see how our mac runner does with the change. I don't want to make things temporarily worse.

@rmuir
Member Author

rmuir commented Oct 8, 2025

The results are not promising. I'd rather fix other slowdowns before merging this one. I do think it is the correct approach, but we can't afford any slowdown on the mac runners.

@dweiss I see very different slow tests in these CI runs; one big difference is that they write the test data to the "real" filesystem. Who knows how slow the disks are.

Some ideas:

  • use tmpfs, this is what I do :) It makes an entire class of slow tests "go away".
  • fix the DisableFsyncFS logic to apply 100% of the time, at least on OS X, where "fsync" is EXTREMELY slow on a real disk: it is fsync() + fcntl(F_FULLFSYNC)
  • actually debug and address this whole class of tests that behaves badly on the filesystem
The slowest tests during this run:
  46.63s TestLucene90DocValuesFormat.testNumericDocValuesWithSkipperMedium (:lucene:core)
  18.61s TestDocIdsWriter.testCrash (:lucene:core)
  15.99s TestSimpleTextDocValuesFormat.testNumericDocValuesWithSkipperMedium (:lucene:codecs)
  15.21s TestLucene90TermVectorsFormat.testMixedOptions (:lucene:core)
  14.05s TestSimpleTextCompoundFormat.testLargeCFS (:lucene:codecs)
  12.73s TestBKD.test2DLongOrdsOffline (:lucene:core)
  11.12s TestLucene90DocValuesFormat.testThreads (:lucene:core)
  10.12s TestBKD.testWithExceptions (:lucene:core)
   9.51s TestAddIndexes.testCascadingMergesTriggered (:lucene:core)
   9.01s TestBKD.testRandomBinaryMedium (:lucene:core)

@dweiss
Contributor

dweiss commented Oct 8, 2025

The tmpfs trick is neat and I'm all for it, but it won't help people who can't use it for some reason. So it'd be better to fix the tests so that they run faster, but that's a humongous task, given the size of these tests.

@rmuir
Member Author

rmuir commented Oct 8, 2025

@dweiss I see it differently: it's simply a matter of respecting my $TMPDIR environment variable to pick an appropriate location for temporary files, rather than creating the temporary files inside the git checkout's build/ directories.

I just happened to have configured my system to use tmpfs for these temporary files.

@dweiss
Contributor

dweiss commented Oct 8, 2025

We can change the defaults to respect TMPDIR... It's the way it is now because I thought it'd be easier to locate the generated files if they reside within the project. But maybe it's not the best idea. Don't know.

    Provider<Directory> workDirOption =
        buildOptions.addDirOption(
            "tests.workDir",
            "Working directory for forked test JVMs.",
            buildDirectory.dir("tests-cwd"));
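A TMPDIR-respecting default, as suggested here, could look something like the following sketch. This is purely illustrative and not Lucene's actual build code; the class, the `defaultWorkDir` method, and the `lucene-tests-cwd` directory name are all hypothetical, assuming the default should prefer $TMPDIR when it is set:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class WorkDirDefault {
  // Hypothetical default: prefer $TMPDIR when set, otherwise fall back to
  // the current in-project location (build/tests-cwd).
  static Path defaultWorkDir(String tmpdirEnv, Path buildDir) {
    if (tmpdirEnv == null || tmpdirEnv.isBlank()) {
      return buildDir.resolve("tests-cwd");
    }
    return Paths.get(tmpdirEnv).resolve("lucene-tests-cwd");
  }

  public static void main(String[] args) {
    // On a machine with TMPDIR set, this prints a path under it;
    // otherwise it prints the in-project fallback.
    System.out.println(defaultWorkDir(System.getenv("TMPDIR"), Paths.get("build")));
  }
}
```

Either default would still be overridable through the existing `tests.workDir` build option.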

Currently these don't use all the CPUs:
* Ubuntu and Windows runners have 4 CPUs
* MacOS X has 3 CPUs (!)

Dividing by two (assuming hyperthreads?) doesn't yield a speedup either.
I set the value to 4 on a 2-core system for this reason.

@rmuir
Member Author

rmuir commented Oct 8, 2025

@dweiss this needs a second look, you may not like it. Here is the problem with the previous logic:

jshell> var cpus = 3; // Github Mac OS X runners
cpus ==> 3

jshell> var testsJvms = (int) Math.max(1d, Math.min(cpus * 0.5d, 12));
testsJvms ==> 1

Up to 67% of mac runner hardware was not used.
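The jshell session above can be distilled into a small standalone sketch (same expression; the class and method names are chosen here just for illustration), showing how the old heuristic collapses to a single forked JVM on small runners:

```java
public class JvmHeuristic {
  // The old heuristic quoted above: half the CPUs, capped at 12, floor of 1.
  static int oldTestJvms(int cpus) {
    return (int) Math.max(1d, Math.min(cpus * 0.5d, 12));
  }

  public static void main(String[] args) {
    for (int cpus : new int[] {3, 4, 8, 32}) {
      System.out.println(cpus + " cpus -> " + oldTestJvms(cpus) + " test JVM(s)");
    }
  }
}
```

On the 3-CPU mac runners this yields a single test JVM, i.e. two of the three CPUs sit idle during test execution.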



@rmuir rmuir marked this pull request as draft October 8, 2025 22:49
@rmuir
Member Author

rmuir commented Oct 8, 2025

Set back to draft. We can't really benchmark the build times right now, they're so erratic; the bottleneck seems to be something else, and it looks to be around the filesystem.

@rmuir rmuir marked this pull request as ready for review October 9, 2025 05:23
@dweiss
Contributor

dweiss commented Oct 9, 2025

These CPU count defaults date back to hyperthreading; that's why they're halved (saturating the CPUs wasn't helping much, plus there are CPUs actively used by the GC, and some slack was left for the filesystem).

Times have changed. Maybe this patch is indeed more realistic since on VMs the cpu count is actually what's available and on beefy hardware I/O saturates much more quickly.

@rmuir
Member Author

rmuir commented Oct 9, 2025

OK, I'm back. The reason this change didn't work the first time was C2 being enabled by the $CI env var. We now actually just "set the defaults" (instead of CI defaults), which uses all cores for testing rather than constant JIT recompilation. The C2 was really not helping the situation, and all runners have Java 25.0.0, which I believe suffers from bugs that make this worse, such as JDK-8368071.

It is my opinion that Java 25 pushed the CI system over the edge and started causing the timeouts.

MacOS gets some tweaks because it is a crazy operating system. It is not totally tamed here; I'm just trying to reduce the variance in the worst-case builds (where you get the crazy slow runners). I am happy if it always stays under 20 minutes, which is what I'm seeing now. I checked, there are no tmpfs mounts, so we create one.

Combined with #15312 I am seeing much more stable and faster results, e.g.:

ubuntu-latest: 7m 33s
windows-latest: 11m 11s
macos-latest: 10m 9s (but when you get a "slow" runner, it can be upwards of 20)

@rmuir
Member Author

rmuir commented Oct 9, 2025

These CPU count defaults date back to hyperthreading

Those are also good defaults if you are running Lucene tests with C2, which we were doing before. You need to spare half your CPUs so they can handle all the JIT compilation :)


@rmuir
Member Author

rmuir commented Oct 9, 2025

I've separately got a nice list of the tests that were slowest on the macos runs (after painfully running it many times LOL). The same ones keep popping up, so they are doing something to make macos or the ARM angry.

Tomorrow I can create an issue; maybe it's best to just temp-@Nightly the worst offenders if we need to shave off a few more minutes for the worst-case macos builds.

@rmuir
Member Author

rmuir commented Oct 9, 2025

I will open a followup PR to clean up the macos part of this. I'm sorry it's a bit ugly; it is just putting out a fire for now. Any concerns, just revert it, but this PR should give us some breathing room.

@rmuir rmuir merged commit c67f8d3 into apache:main Oct 9, 2025
15 checks passed
mkdir /tmp/tmpfs
sudo mount_tmpfs -o noowners -s 1g /tmp/tmpfs    # mount a 1 GB tmpfs on macOS
sudo sysctl debug.lowpri_throttle_enabled=0      # disable macOS low-priority I/O throttling
echo "tests.workDir=/tmp/tmpfs/lucene" >> build-options.local.properties
Thank you for using these local build option overrides.
