Releases: hpc-gridware/clusterscheduler
OCS/GCS v9.0.12
Open Cluster Scheduler and Gridware Cluster Scheduler v9.0.12 are available for download from the HPC-Gridware Download Page.
OCS/GCS v9.0.11
Open Cluster Scheduler and Gridware Cluster Scheduler v9.0.11 are available for download from the HPC-Gridware Download Page.
OCS/GCS v9.0.10
Open Cluster Scheduler and Gridware Cluster Scheduler v9.0.10 are available for download from the HPC-Gridware Download Page.
OCS/GCS v9.0.9
Open Cluster Scheduler and Gridware Cluster Scheduler v9.0.9 are available for download from the HPC-Gridware Download Page.
OCS/GCS v9.0.8
Open Cluster Scheduler and Gridware Cluster Scheduler v9.0.8 are available for download: https://www.hpc-gridware.com/download-main/
OCS/GCS v9.0.7
Monthly patch release.
OCS/GCS v9.0.6
V906_TAG OCS/GCS 9.0.6 release
OCS/GCS v9.0.5
Major Enhancements
v9.0.5
qtelemetry (Developer Preview in GCS)
This release introduces qtelemetry, a new metrics exporter for Gridware Cluster Scheduler (GCS). It allows administrators to easily collect and expose cluster metrics for monitoring and observability purposes.
Features:
- Simple integration with Prometheus and Grafana
- Export cluster metrics, including:
  - Host metrics (CPU load, GPU availability, memory usage, and many more)
  - Job metrics (queued, running, errored, waiting time, and many more)
  - qmaster statistics (CPU/memory usage of sge_qmaster, spooling filesystem information)
- Optional per-job metric export for detailed insights (recommended only for very small workloads)
- Built-in support for a pre-configured Grafana dashboard:
  - Grafana dashboard example.
Quick Start:
By default, qtelemetry exports metrics on port 9464 from the /metrics endpoint:
./qtelemetry start
Enable additional metrics sources using command-line flags:
# Export exec host and qmaster metrics
./qtelemetry start --enableExecd --enableMaster
# Export individual job-level metrics (for smaller systems)
./qtelemetry start --singleJobs
(Available in Gridware Cluster Scheduler only)
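To check what the exporter publishes without setting up Prometheus first, the endpoint can be read directly. The following is a minimal sketch, assuming qtelemetry runs on the local host with the default port 9464 and the /metrics path mentioned above, and that it emits the standard Prometheus text exposition format. The metric names in the sample are hypothetical placeholders, not names taken from qtelemetry.

```python
from urllib.request import urlopen

# Assumption: qtelemetry runs on this host; port and path are the defaults
# named in the Quick Start above.
DEFAULT_URL = "http://localhost:9464/metrics"


def parse_metrics(text):
    """Parse Prometheus text-format samples into a {series: value} dict."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            end = line.rfind("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            fields = line.split()
            series, rest = fields[0], fields[1:]
        if rest:
            # The first field after the series is the value; an optional
            # timestamp may follow and is ignored here.
            samples[series] = float(rest[0])
    return samples


def fetch_metrics(url=DEFAULT_URL, timeout=5):
    """Download the metrics page and return the parsed samples."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical sample text; real metric names will differ.
    sample = (
        "# HELP example_jobs_running Number of running jobs\n"
        "example_jobs_running 3\n"
        'example_host_load{host="node01"} 0.42\n'
    )
    for series, value in sorted(parse_metrics(sample).items()):
        print(series, value)
```

With a running exporter, calling fetch_metrics() returns the same kind of dictionary for the live endpoint.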
Out-of-the-Box Support for Various MPI Distributions
The $SGE_ROOT/mpi directory contains templates of the PE configuration for the following MPI distributions:
- Intel MPI
- mpich
- mvapich
- openmpi
To add one, call qconf -Ap <path to template>. This creates the PE configuration for running jobs with the given MPI in tight integration.
In addition, build scripts for mpich, mvapich, and openmpi show how each MPI distribution can be built and installed. The build scripts are located in $SGE_ROOT/mpi/<mpi name>/build.sh.
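When several templates are to be added at once, the qconf -Ap calls can be scripted. The following is a rough sketch, assuming qconf is on the PATH of an admin user. The template file name in the example is a hypothetical placeholder, since the actual file names under $SGE_ROOT/mpi are not listed here.

```python
import subprocess


def qconf_add_pe_commands(template_paths):
    """Build one 'qconf -Ap <template>' command per PE template file."""
    return [["qconf", "-Ap", str(path)] for path in template_paths]


def apply_templates(template_paths, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for command in qconf_add_pe_commands(template_paths):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if qconf reports a failure.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical template path; substitute the files found in $SGE_ROOT/mpi.
    apply_templates(["/opt/cs/mpi/openmpi/openmpi.pe"])
```

The dry run is the default so that the commands can be reviewed before any PE is added to the cluster.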
$SGE_ROOT/mpi/examples contains an MPI example written in C.
It can be run as a tightly integrated parallel job with any of the MPI distributions mentioned above
and supports checkpointing and restart.
It comes with documentation, a build script, a job script, and a template of a checkpointing environment.
(Available in Open Cluster Scheduler and Gridware Cluster Scheduler)
Easier Creation of Configuration Templates
Configuration objects can now contain the additional special variables $sge_root and $sge_cell for
paths to scripts, e.g. for
- prolog and epilog in the global config and queue configurations
- starter_method, suspend_method, resume_method, and terminate_method in the queue configuration
- start_proc_args and stop_proc_args in the parallel environment configuration
- ckpt_command, migr_command, restart_command, and clean_command in the checkpointing environment
This makes it possible to write configuration templates that can be used in different environments
without modifying the paths before applying the configuration.
A list of all special variables is given in the sge_conf.5 man page in the prolog section.
(Available in Open Cluster Scheduler and Gridware Cluster Scheduler)
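The effect of the two variables on a template value can be illustrated with a small sketch. This is not how the scheduler performs the expansion internally; it only mimics the visible result using Python's string.Template. The script path and the installation values are hypothetical, and $job_id stands in for any other special variable that is resolved elsewhere and is therefore left untouched.

```python
from string import Template


def expand_special_variables(value, sge_root, sge_cell):
    """Replace $sge_root and $sge_cell; leave other $variables as they are."""
    # safe_substitute keeps unknown placeholders (e.g. $job_id) unchanged
    # instead of raising KeyError.
    return Template(value).safe_substitute(sge_root=sge_root, sge_cell=sge_cell)


if __name__ == "__main__":
    # Hypothetical prolog setting as it might appear in a configuration template.
    prolog = "$sge_root/$sge_cell/scripts/prolog.sh $job_id"
    # The same template resolves differently per installation.
    print(expand_special_variables(prolog, "/opt/cs", "default"))
    print(expand_special_variables(prolog, "/shared/cluster", "test"))
```

The same template string yields a different absolute path in each installation, which is what removes the need to edit paths by hand.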
Full List of Fixes
Release notes - Cluster Scheduler
v9.0.5
Improvement
CS-342 provide an openmpi integration
CS-343 provide an example and test program using MPI
CS-791 sge_root should be available as special variable in the configuration of prolog, epilog, queue, pe, ckpt
CS-914 Make ARCH script more robust
CS-1090 qstat -r shall report resource requests by scope
CS-1094 Update sge_pe.md to better explain PE_HOSTFILE
CS-1114 Add GPU monitoring examples to qtelemetry Grafana dashboard
CS-1115 Build qtelemetry in containers for lx-amd64 and lx-arm64
CS-1126 in the environment of tasks of tightly integrated parallel jobs set the pe_task_id
CS-1128 Add enroot to worker GPU VM image for GCP
CS-1143 provide a MPICH integration
CS-1144 provide a MVAPICH integration
CS-1145 provide an Intel MPI integration
CS-1146 cleanup and document the ssh wrapper MPI template and scripts
CS-1152 add a checktree_mpi to testsuite with configuration and tests making use of the various MPI integrations
CS-1158 Add qtelemetry Grafana dashboard to public Grafana Cloud Dashboards
New Feature
CS-1091 Clearly document the slots syntax in man5 sge_queue_conf.md
Sub-task
CS-697 Jenkins: enable issue_3013
CS-698 Jenkins: enable issue_3179
Task
CS-662 verify delayed job reporting of sge_execd after reconnecting to sge_qmaster
CS-1117 Add qtelemetry as developer preview to GCS distribution
CS-1118 Create a packer file which builds a GPU enabled VM with and without GCS for fast deployment on GCP
CS-1125 Provide basic examples of how enroot can be used with the GPU integration
CS-1134 message cutoff after 8 characters
CS-1136 add checktree_qtelemetry to all build environments + Jenkins setup
Bug
CS-430 booking of resources into advance reservations needs to distinguish between host and queue resources
CS-722 env_list in qstat should show NONE if not set
CS-1028 qtelemetry should support NVIDIA loadsensor values for hosts
CS-1085 BDB build error on lx-riscv64 after OS update.
CS-1096 USE_QSUB_GID functionality fails on FreeBSD 14
CS-1111 minimum and maximum thread counts in the bootstrap.5 man page are incorrect
CS-1131 wallclock time reported for tasks of a tightly integrated parallel job is incorrect
CS-1139 job deletion via JAPI/DRMAA fails if job ID exceeds INT_MAX
CS-1140 termination of event client via JAPI fails if event client ID exceeds INT_MAX
CS-1141 MacOS build broken due to unavailability of getgrouplist()
CS-1163 when a queue is signalled then additional invalid entries are created in the berkeleydb spooling database
OCS/GCS v9.0.4
v9.0.4
IT IS STRONGLY RECOMMENDED TO UPGRADE TO PATCH v9.0.4
We fixed several critical bugs that caused
- the sge_qmaster to crash
- issues in the internal bookkeeping of the scheduler
- jobs to be stuck in the system without being able to delete them
- ...
Find the full list of fixes in the release notes: https://www.hpc-gridware.com/download/10333/?tmstv=1741200897
OCS/GCS v9.0.3
Patch release. Prebuilt packages are available here: https://www.hpc-gridware.com/download-main/