Merged
66 commits
7a0ef63
EH: CS-1188 control daemons with systemd
jgabler-hpc Apr 23, 2025
409639a
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Apr 24, 2025
37b83fb
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Apr 25, 2025
b50f535
avoid endless loop in case an invalid slice is given in the autoinsta…
jgabler-hpc Apr 25, 2025
2676098
EH: CS-1192 at startup of daemons output the cgroups slice the servic…
jgabler-hpc Apr 25, 2025
c78de60
fixed type "deamon"
jgabler-hpc Apr 25, 2025
da3faee
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Apr 25, 2025
8000239
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Apr 30, 2025
70eefbc
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc May 6, 2025
cd8468f
EH: CS-1223 with systemd integration, move sge_shepherd processes out…
jgabler-hpc May 6, 2025
1a6ef99
sd_bus method StartTransientUnit does only start a job creating the unit
jgabler-hpc May 7, 2025
4737c83
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc May 8, 2025
6ed8628
- do not report systemd as init system on ulx-* as we cannot build sy…
May 11, 2025
6157537
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc May 21, 2025
e04a6cd
* sd_bus error was not reported to caller
jgabler-hpc May 27, 2025
fc690c5
fixed non-unique message ids
jgabler-hpc May 27, 2025
e7259f7
EH: CS-1291 move shepherd child to its own scope
jgabler-hpc May 30, 2025
347af11
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc May 30, 2025
fa8efe2
shepherd tried to use systemd on host having systemd library but not …
jgabler-hpc May 30, 2025
2a9a4ad
EH: CS-1292 get job online usage information via systemd
jgabler-hpc Jun 4, 2025
5979a12
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Jun 4, 2025
c109c7c
tried to connect to systemd on host not having systemd
jgabler-hpc Jun 4, 2025
d51820f
errors in StartTransientUnit were not always propagated to caller
jgabler-hpc Jun 5, 2025
4cfe7b9
EH: CS-1294 set job limits via systemd
jgabler-hpc Jun 6, 2025
41f7cc3
EH: CS-1315 set binding via systemd
jgabler-hpc Jun 9, 2025
6727f78
cleanup
jgabler-hpc Jun 10, 2025
4950f0d
EH CS-1295 set device isolation via systemd
jgabler-hpc Jun 12, 2025
97a55f7
EH: CS-1241 add profiling information for systemd operations
jgabler-hpc Jun 12, 2025
d046528
- execd profiling could not be disabled again
jgabler-hpc Jun 12, 2025
7e8a303
EH: CS-1318 allow to run jobs under systemd control even if sge_execd…
jgabler-hpc Jun 13, 2025
9441539
EH: CS-1319 make running jobs under systemd control configurable
jgabler-hpc Jun 13, 2025
a2bed4f
added ENABLE_SYSTEMD to sge_conf.5 man page
jgabler-hpc Jun 13, 2025
55bb35f
EH: CS-1322 the job specific scopes need to contain the toplevel slic…
jgabler-hpc Jun 16, 2025
9421c4f
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Jun 16, 2025
15dc306
EH: CS-1300 do not add and handle the additional group id for jobs ru…
jgabler-hpc Jun 17, 2025
a3d7a43
BF: CS-1325 possible race condition between calling StartTransientUni…
jgabler-hpc Jun 18, 2025
510e385
EH: CS-1296 kill jobs via systemd
jgabler-hpc Jun 18, 2025
b06f77d
EH: CS-1321 allow to configure a hybrid usage data collection (both v…
jgabler-hpc Jun 21, 2025
3d6164b
fixed memory leaks
jgabler-hpc Jun 23, 2025
bbe7aa0
BF: CS-1335 need special handling for interrupted system call
jgabler-hpc Jun 27, 2025
9d9c82f
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Jun 28, 2025
43c1e59
EH: CS-1342 add systemd specific settings (toplevel slice name) to th…
jgabler-hpc Jun 28, 2025
f25d751
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Jul 2, 2025
2a06fb2
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Jul 2, 2025
1f5c142
cleanup and added systemd integration to the release notes
jgabler-hpc Jul 3, 2025
b62275e
cleanup
jgabler-hpc Jul 3, 2025
10bcc17
- addressed review comments
jgabler-hpc Jul 4, 2025
07b73cb
added more details of the systemd integration to the release notes
jgabler-hpc Jul 6, 2025
c7c8ce6
addressed review comments
jgabler-hpc Jul 8, 2025
7297125
refactoring and documentation with Doxygen headers
jgabler-hpc Jul 8, 2025
0b0c426
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Jul 10, 2025
5b4b2bd
EH: CS-1408 USAGE_COLLECTION mode must be kept consistent for running…
jgabler-hpc Jul 10, 2025
bdcea5b
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Jul 11, 2025
ef56099
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Jul 13, 2025
89af5a8
EH: CS-1419 disable systemd integration if sge_execd is started as no…
jgabler-hpc Jul 14, 2025
98d90e3
with HYBRID usage collection non systemd hosts didn't report cpu and rss
jgabler-hpc Jul 15, 2025
d6bf281
reprioritization code was broken by systemd integration
jgabler-hpc Jul 15, 2025
0ddf0eb
- improved diagnostics when ptf job / osjob cannot be found
jgabler-hpc Jul 17, 2025
53f6d99
BF: CS-1019 sge_execd logs errors when running tightly integrated par…
jgabler-hpc Jul 17, 2025
d353736
BF: CS-1425 backup/restore does not handle $SGE_ROOT/$SGE_CELL/slice_…
jgabler-hpc Jul 18, 2025
4e7b8d3
BF: CS-1429 sge_qmaster can segfault on qdel -f
jgabler-hpc Jul 19, 2025
130b556
BF: CS-1019 sge_execd logs errors when running tightly integrated par…
jgabler-hpc Jul 21, 2025
03d8dd7
BF: CS-1430 running tightly integrated parallel jobs leaves systemd s…
jgabler-hpc Jul 21, 2025
698fdea
fix to the fix for CS-1019
jgabler-hpc Jul 22, 2025
4bbc2bc
Merge branch 'master' of https://github.com/hpc-gridware/clustersched…
jgabler-hpc Jul 22, 2025
8080ed8
added missing files
jgabler-hpc Jul 22, 2025
5 changes: 5 additions & 0 deletions CMakeLists.txt
@@ -88,6 +88,7 @@ option(WITH_LCOV "Enable code coverage analysis with lcov" OFF)
option(WITH_PYTHON "Enable Python external bindings" OFF)
option(WITH_BOOST "Enable Boost framework" OFF)
option(WITH_MUNGE "Enable Munge authentication" ON)
option(WITH_SYSTEMD "Enable systemd support" OFF)

# private extensions
set(PROJECT_EXTENSIONS "None" CACHE STRING "directory of private extensions")
@@ -188,6 +189,10 @@ if (WITH_MUNGE)
add_compile_definitions("OCS_WITH_MUNGE")
endif()

if (WITH_SYSTEMD)
add_compile_definitions("OCS_WITH_SYSTEMD")
endif()

#if (SGE_ARCH MATCHES "darwin-arm64" OR SGE_ARCH MATCHES "fbsd-amd64")
if (NOT WITH_SPOOL_BERKELEYDB AND NOT WITH_SPOOL_DYNAMIC)
set(SPOOLING_LIBS spoolloader spoolc_static spool)
37 changes: 18 additions & 19 deletions cmake/ArchitectureSpecificSettings.cmake
@@ -61,25 +61,7 @@ function(architecture_specific_settings)
message("Build with extensions is enabled")
endif()

if (SGE_ARCH MATCHES "lx-riscv64")
# Linux RiscV
message(STATUS "We are on Linux: ${SGE_ARCH}")
set(CMAKE_C_FLAGS "-Wall -Werror -pedantic" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS "-Wall -Werror -pedantic" CACHE STRING "" FORCE)

add_compile_definitions(LINUX _GNU_SOURCE GETHOSTBYNAME_R6 GETHOSTBYADDR_R8 HAS_IN_PORT_T SPOOLING_dynamic __SGE_COMPILE_WITH_GETTEXT__)
add_compile_options(-fPIC)
add_compile_options(-pthread)
add_link_options(-pthread -rdynamic)

set(TIRPC_INCLUDES /usr/include/tirpc PARENT_SCOPE)
set(TIRPC_LIB tirpc PARENT_SCOPE)
message(STATUS "using libtirpc")

set(WITH_JEMALLOC OFF PARENT_SCOPE)
set(WITH_MTMALLOC OFF PARENT_SCOPE)
set(JNI_ARCH "linux" PARENT_SCOPE)
elseif (SGE_ARCH MATCHES "lx-.*" OR SGE_ARCH MATCHES "ulx-.*" OR SGE_ARCH MATCHES "xlx-.*")
if (SGE_ARCH MATCHES "lx-.*" OR SGE_ARCH MATCHES "ulx-.*" OR SGE_ARCH MATCHES "xlx-.*")
# master is not supported on CentOS 6. Execd is deprecated and will be removed in the future.
if (SGE_ARCH STREQUAL "xlx-.*")
set(INSTALL_SGE_BIN_MASTER OFF CACHE BOOL "Install master daemon binaries" FORCE)
@@ -166,6 +148,23 @@ function(architecture_specific_settings)
message(STATUS "no libtirpc or libntirpc found")
endif ()

# build with systemd?
# @todo we might want to check the api version, we need at least
# - 235: here FreezeUnit and ThawUnit were added (not required, we work around this not being available)
# - 231: 240? here sd_bus_process() was added (not required, we work around this)
# - 221: here StopUnit was added
# Our build hosts are OK as it is (RHEL-8 compatible for lx-* has a recent enough version,
# RHEL-7 compatible for ulx-* does not have it at all)
if (EXISTS /usr/include/systemd/sd-bus.h)
set(WITH_SYSTEMD ON PARENT_SCOPE CACHE STRING "" FORCE)
message(STATUS "systemd development files found")
endif()

if (SGE_ARCH MATCHES "lx-riscv64")
# Linux RiscV
add_compile_options(-fPIC)
set(WITH_JEMALLOC OFF PARENT_SCOPE)
endif()
if (SGE_ARCH STREQUAL "lx-x86" OR SGE_ARCH STREQUAL "ulx-x86" OR SGE_ARCH STREQUAL "xlx-x86")
# we need patchelf for setting the run path in the db_* tools
# but patchelf is not available on CentOS 7 x86
22 changes: 20 additions & 2 deletions doc/markdown/man/man5/sge_conf.md
@@ -1044,11 +1044,19 @@ completely.

***ENABLE_BINDING***

If this parameter is set then xxQS_NAMExx enables the core binding module within the execution daemon to apply
binding parameters that are specified during submission time of a job. This parameter is not set per default and
If this parameter is set, then xxQS_NAMExx enables the core binding module within the execution daemon to apply
binding parameters that are specified during submission time of a job. This parameter is not set per default, and
therefore all binding related information will be ignored. Find more information for job to core binding in the
section `-binding` of qsub(1).

***ENABLE_SYSTEMD***

If this parameter is set and an execution host supports systemd, then jobs will be started in a systemd scope.
This allows the execution daemon to manage the job's processes as a group, which is useful for resource management
and job control.

This parameter is set to true by default, meaning that on hosts that support systemd, jobs will be started in a systemd scope. If a host does not support systemd, then this parameter will be ignored.

***SCRIPT_TIMEOUT***

This parameter allows to configure the allowed runtime of execution side scripts like prolog, epilog, and the PE
@@ -1060,6 +1068,15 @@ in one load report interval. The default for *execd_params* is none.

The global configuration entry for this value may be overwritten by the execution host local configuration.

***USAGE_COLLECTION***

This parameter controls how xxqs_name_sxx_execd collects the online usage information of jobs. The following values are recognized:

- *FALSE* : No online usage information is collected. Use with care: this also disables limit enforcement for *s_cpu*, *h_cpu*, *s_rss*, *h_rss*, *s_vmem*, and *h_vmem*.
- *PDC* : Online usage information is collected by the PDC (Portable Data Collector), even if systemd is available.
- *HYBRID* : Online usage information is gathered both via systemd (if available) and via the PDC. Use this mode when your jobs are controlled by systemd but you also want usage values that systemd does not provide, e.g., vmem, maxvmem, io, and iow.
- *TRUE* : This is the default mode. Online usage information is collected via systemd if the host supports systemd and *ENABLE_SYSTEMD* is set to *TRUE* (which is the default). It is collected by the PDC if the host does not support systemd or if *ENABLE_SYSTEMD* is set to *FALSE*.
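
The interaction between *USAGE_COLLECTION* and *ENABLE_SYSTEMD* described above can be summarized as a small decision table. The following Python sketch is only an illustration of the documented behaviour, not the sge_execd implementation: the names `UsageCollection` and `collectors` are hypothetical, and it assumes that "if available" in *HYBRID* mode means the host supports systemd and *ENABLE_SYSTEMD* is not disabled.

```python
from enum import Enum


class UsageCollection(Enum):
    """Documented values of the USAGE_COLLECTION parameter."""
    FALSE = "FALSE"
    PDC = "PDC"
    HYBRID = "HYBRID"
    TRUE = "TRUE"


def collectors(mode, host_has_systemd, enable_systemd=True):
    """Return the usage sources implied by the documented modes.

    Hypothetical helper for illustration only. enable_systemd defaults
    to True because the man page states that this is the default.
    """
    # Assumption: systemd counts as usable only if the host supports it
    # and ENABLE_SYSTEMD has not been switched off.
    systemd_active = host_has_systemd and enable_systemd

    if mode is UsageCollection.FALSE:
        return set()  # no online usage, hence no usage-based limit enforcement
    if mode is UsageCollection.PDC:
        return {"pdc"}  # PDC only, even if systemd is available
    if mode is UsageCollection.HYBRID:
        return {"systemd", "pdc"} if systemd_active else {"pdc"}
    # UsageCollection.TRUE, the default mode
    return {"systemd"} if systemd_active else {"pdc"}
```

For example, `collectors(UsageCollection.TRUE, host_has_systemd=False)` yields only the PDC, which matches the fallback described for hosts without systemd.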

## gdi_request_limits

This value is a global configuration parameter only, and is used to prevent denial-of-service attacks on the xxqs_name_sxx_qmaster(8) process.
@@ -1375,3 +1392,4 @@ xxqs_name_sxx_shepherd*(8), cron(8),
# COPYRIGHT

See xxqs_name_sxx_intro(1) for a full statement of rights and permissions.

7 changes: 6 additions & 1 deletion doc/markdown/manual/development-guide/00_overview.md
@@ -34,7 +34,12 @@ Tags and branches before `V9` will also not be described here.
| | V900p1\_TAG | patch to the 9.0.0 making it work on GCP (CS-663) |
| | | |
| V90\_BRANCH | | maintenance of 9.0 |
| | V903\_TAG | third 9.0 patch |
| | V904\_TAG | fourth 9.0 patch |
| | V905\_TAG | fifth 9.0 patch |
| | V906\_TAG | sixth 9.0 patch |
| | V907\_TAG | seventh 9.0 patch |
| | | |

[//]: # (Each file has to end with two emty lines)
[//]: # (Each file has to end with two empty lines)

@@ -267,5 +267,5 @@ git clone https://github.com/hpc-gridware/gcs-extensions
git clone https://github.com/hpc-gridware/gcs-testsuite
```

[//]: # (Each file has to end with two emty lines)
[//]: # (Each file has to end with two empty lines)

@@ -188,5 +188,5 @@ Here we use *CLion* as example because it provides full integration with CMake t

Next step is to build and install xxQS_NAMExx.

[//]: # (Eeach file has to end with two emty lines)
[//]: # (Eeach file has to end with two empty lines)

@@ -21,5 +21,5 @@ make install
You can now either install the product (follow the instructions in the *Installation Guide*) or you can continue to
setup the automated test environment as described in the next chapter.

[//]: # (Eeach file has to end with two emty lines)
[//]: # (Eeach file has to end with two empty lines)

@@ -101,5 +101,5 @@ Instead, the job execution is just simulated.

@todo add details

[//]: # (Eeach file has to end with two emty lines)
[//]: # (Eeach file has to end with two empty lines)

@@ -296,5 +296,5 @@ we can switch off a few potentially expensive features and just rely on scheduli
* do not configure queue load_thresholds and suspend_thresholds
* do not use load adjustments (in the scheduler config)

[//]: # (Eeach file has to end with two emty lines)
[//]: # (Eeach file has to end with two empty lines)

@@ -280,7 +280,7 @@ Please refer to the next question for more information.

### Where is the spooling area for the master service located?

For HA-setups, it must be a shared network location; otherwise, it can be the local filesystem of the host
For HA setups, it must be a shared network location; otherwise, it can be the local filesystem of the host
running the master service.

Ensure that the spooling location meets the requirements of the spooling mechanism. Classic spooling can be done on
@@ -359,4 +359,5 @@ If this is your first time installing xxQS_NAMExx, we suggest a manual installat
Automatic installation is recommended if you need to install or reinstall a cluster multiple times or if you plan
to install multiple clusters with slightly different settings.

[//]: # (Eeach file has to end with two emty lines)
[//]: # (Eeach file has to end with two empty lines)

2 changes: 1 addition & 1 deletion doc/markdown/manual/installation-guide/02_download.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,5 +54,5 @@ Once you have downloaded all packages, you can test and install them at the desi

4. If your `<install-dir>` is located on a shared filesystem available on all hosts in the cluster then you can start the installation process.

[//]: # (Eeach file has to end with two emty lines)
[//]: # (Eeach file has to end with two empty lines)
