
[Core][DO NOT MERGE] Improve OpenMP loop scheduling and chunk partitioning configurability in ParallelUtilities #14273

Draft
loumalouomega wants to merge 25 commits into master from core/dynamic-scheduling-omp

@loumalouomega (Member) commented Mar 10, 2026

📝 Description

This PR updates shared-memory loop execution in ParallelUtilities to improve load balancing and runtime tunability. Dynamic scheduling stays the default because it was the fastest option in local benchmarking.

This is essentially a retake of #12923.

NOTE: After discussing with @pooyan-dadvand we need further changes, and I may retake this later.

🚀Benchmarking

Intel® Core™ Ultra 9 Processor 285HX on Windows 11

Performance improves for large sets; for small sets it is equal or slightly worse.

[Figure: Figure_1_new]

Intel® Core™ Ultra 9 Processor 285HX on Ubuntu 24.04 via WSL

Consistently 8x or better performance.

[Figure: Figure_1_linux_chunk_size_256]

🔀 What changed

  • Added ChunkPartitioningScheme with two modes:
    • DIVIDE_BY_NUMBER_OF_CHUNKS
    • DIVIDE_BY_CHUNK_SIZE
  • Extended BlockPartition and IndexPartition templates to accept the partitioning scheme as a template parameter.
  • Updated partition constructors to accept a generic N (interpreted as chunk count or chunk size depending on scheme).
  • Added global tuning knobs:
    • ParallelUtilitiesMaxChunkSize (default 1024)
    • ParallelUtilitiesMaxNumberOfChunks (default Globals::MaxAllowedThreads, then adapted at init)
  • Added env-based runtime configuration in InitializeNumberOfThreads():
    • KRATOS_PARALLEL_MAX_CHUNK_SIZE
    • KRATOS_PARALLEL_MAX_CHUNKS
  • If KRATOS_PARALLEL_MAX_CHUNKS is not set, the default is adjusted to min(4 * num_threads, ParallelUtilitiesMaxNumberOfChunks).
  • Set OpenMP loop scheduling to schedule(dynamic) in all relevant loops in parallel_utilities.h.
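To make the two partitioning modes and the chunk-count default concrete, here is a minimal standalone sketch. It is not the code from this PR: `MakeChunks`, `DefaultMaxChunks`, and `ReadPositiveEnv` are hypothetical helpers, and the clamping and fallback behavior for invalid input are assumptions. Only the enum values, the environment variable names, and the `min(4 * num_threads, ...)` rule come from the description above. It assumes C++17 for `if constexpr`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Stand-in for the scheme enum named in the PR description.
enum class ChunkPartitioningScheme { DIVIDE_BY_NUMBER_OF_CHUNKS, DIVIDE_BY_CHUNK_SIZE };

// Hypothetical helper: splits [0, size) into half-open [begin, end) chunks.
// n is read as a chunk size or as a (maximum) chunk count, depending on Scheme.
template <ChunkPartitioningScheme Scheme>
std::vector<std::pair<std::size_t, std::size_t>> MakeChunks(std::size_t size, std::size_t n)
{
    assert(n > 0);
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (size == 0) return chunks;

    std::size_t chunk_size = n;
    if constexpr (Scheme == ChunkPartitioningScheme::DIVIDE_BY_NUMBER_OF_CHUNKS) {
        const std::size_t count = std::min(n, size);  // never more chunks than items
        chunk_size = (size + count - 1) / count;      // ceiling division
    }
    for (std::size_t begin = 0; begin < size; begin += chunk_size) {
        chunks.emplace_back(begin, std::min(begin + chunk_size, size));
    }
    return chunks;
}

// Default chunk cap when KRATOS_PARALLEL_MAX_CHUNKS is unset, as stated in the PR text.
inline std::size_t DefaultMaxChunks(std::size_t num_threads, std::size_t max_number_of_chunks)
{
    return std::min(4 * num_threads, max_number_of_chunks);
}

// Hypothetical helper: parses a positive integer from an environment variable.
// Falling back on unset or malformed values is an assumption, not the PR's behavior.
inline std::size_t ReadPositiveEnv(const char* name, std::size_t fallback)
{
    const char* raw = std::getenv(name);
    if (raw == nullptr) return fallback;
    char* end = nullptr;
    const long long value = std::strtoll(raw, &end, 10);
    if (end == raw || *end != '\0' || value <= 0) return fallback;
    return static_cast<std::size_t>(value);
}
```

In this sketch, 10 items with `N = 4` give three chunks under DIVIDE_BY_CHUNK_SIZE (sizes 4, 4, 2) and four chunks under DIVIDE_BY_NUMBER_OF_CHUNKS (sizes 3, 3, 3, 1). A call such as `ReadPositiveEnv("KRATOS_PARALLEL_MAX_CHUNK_SIZE", 1024)` would mirror the documented default of 1024 when the variable is unset.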

📒Notes

  • schedule(dynamic) is intentionally kept hardcoded for now based on benchmark results; switching to schedule(runtime) can be done later if runtime policy control is preferred.
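The difference between the two clauses can be illustrated with a small chunked reduction. This is a sketch under assumptions, not the loop from parallel_utilities.h: `SumOfSquares` is a hypothetical function, and the chunk loop only imitates how a partitioned range might be iterated. Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: sums squares over fixed-size chunks of the input.
inline double SumOfSquares(const std::vector<double>& values, std::size_t chunk_size)
{
    assert(chunk_size > 0);
    const long long size = static_cast<long long>(values.size());
    const long long step = static_cast<long long>(chunk_size);
    const long long num_chunks = (size + step - 1) / step;  // ceiling division
    double total = 0.0;

    // schedule(dynamic): a thread takes the next chunk index as soon as it is free.
    // Writing schedule(runtime) instead would defer the policy to OMP_SCHEDULE
    // or omp_set_schedule(), which is the later option mentioned in the note.
    #pragma omp parallel for schedule(dynamic) reduction(+ : total)
    for (long long c = 0; c < num_chunks; ++c) {
        const long long begin = c * step;
        const long long end = (begin + step < size) ? begin + step : size;
        for (long long i = begin; i < end; ++i) {
            const double v = values[static_cast<std::size_t>(i)];
            total += v * v;
        }
    }
    return total;
}
```

Hardcoding the clause keeps behavior independent of the user's environment, while `schedule(runtime)` trades that predictability for tunability without recompiling.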

😅 TODO

  • Benchmark on other operating systems
  • Benchmark on other CPUs

🆕 Changelog

@loumalouomega added the Kratos Core, Performance, and Parallel-SMP (Shared memory parallelism with OpenMP or C++ Threads) labels Mar 10, 2026
@loumalouomega (Member, Author) commented

This may conflict with the C++ implementation.

@loumalouomega changed the title from "[Core] Improve OpenMP loop scheduling and chunk partitioning configurability in ParallelUtilities" to "[Core][DO NOT MERGE] Improve OpenMP loop scheduling and chunk partitioning configurability in ParallelUtilities" Mar 11, 2026