Skip to content

Commit 8756e6d

Browse files
Merge pull request #328 from KernelTuner/parallel_runner
Add parallel tuning on multiple remote GPUs using Ray
2 parents 6c2adc2 + 3b622f4 commit 8756e6d

40 files changed

Lines changed: 1602 additions & 350 deletions

doc/source/contents.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,7 @@ The Kernel Tuner documentation
3636
optimization
3737
metrics
3838
observers
39+
parallel
3940

4041
.. toctree::
4142
:maxdepth: 1

doc/source/launch_ray.sh

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
#!/usr/bin/env bash
2+
set -euo pipefail
3+
4+
# Get SLURM variables
5+
NODELIST="${SLURM_STEP_NODELIST:-${SLURM_JOB_NODELIST:-}}"
6+
NUM_NODES="${SLURM_STEP_NUM_NODES:-${SLURM_JOB_NUM_NODES:-}}"
7+
8+
if [[ -z "$NODELIST" || -z "$NUM_NODES" ]]; then
9+
echo "ERROR: Not running under Slurm (missing SLURM_* vars)."
10+
exit 1
11+
fi
12+
13+
# Get head node
14+
NODES=$(scontrol show hostnames "$NODELIST")
15+
NODES_ARRAY=($NODES)
16+
RAY_IP="${NODES_ARRAY[0]}"
17+
RAY_PORT="${RAY_PORT:-6379}"
18+
RAY_ADDRESS="${RAY_IP}:${RAY_PORT}"
19+
20+
# Ensure command exists (Ray >= 2.49 per docs)
21+
if ! ray symmetric-run --help >/dev/null 2>&1; then
22+
echo "ERROR: 'ray symmetric-run' not available. Check Ray installation (needs Ray 2.49+)."
23+
exit 1
24+
fi
25+
26+
# Launch cluster!
27+
echo "Ray head node: $RAY_ADDRESS"
28+
29+
exec ray symmetric-run \
30+
--address "$RAY_ADDRESS" \
31+
--min-nodes "$NUM_NODES" \
32+
-- \
33+
"$@"
34+

doc/source/parallel.rst

Lines changed: 150 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,150 @@
1+
Parallel and Remote Tuning
2+
==========================
3+
4+
By default, Kernel Tuner benchmarks GPU kernel configurations sequentially on a single local GPU.
5+
While this works well for small tuning problems, it can become a bottleneck for larger search spaces.
6+
7+
.. image:: parallel_runner.png
8+
:width: 700px
9+
:alt: Example of sequential versus parallel tuning.
10+
11+
12+
Kernel Tuner also supports **parallel tuning**, allowing multiple GPUs to evaluate kernel configurations in parallel.
13+
The same mechanism can be used for **remote tuning**, where Kernel Tuner runs on a host system while one or more GPUs are located on remote machines.
14+
15+
Parallel/remote tuning is implemented using `Ray <https://docs.ray.io/en/latest/>`_ and works on both local multi-GPU systems and distributed clusters.
16+
17+
How to use
18+
----------
19+
20+
To enable parallel tuning, pass the ``parallel`` argument to ``tune_kernel``:
21+
22+
.. code-block:: python
23+
24+
kernel_tuner.tune_kernel(
25+
"vector_add",
26+
kernel_string,
27+
size,
28+
args,
29+
tune_params,
30+
parallel=True,
31+
)
32+
33+
If ``parallel`` is set to ``True``, Kernel Tuner will use all available Ray workers for tuning.
34+
The ``parallel`` option can also be set to an integer ``n`` to use exactly ``n`` workers.
35+
36+
Alternatively, define the environment variable ``KERNEL_TUNER_PARALLEL`` to enable parallel execution without modifying your Python code.
37+
38+
.. code-block:: bash
39+
40+
$ KERNEL_TUNER_PARALLEL=true python3 my_tuning_script.py
41+
42+
43+
44+
Parallel tuning and optimization strategies
45+
-------------------------------------------
46+
47+
The achievable speedup from using multiple GPUs depends in part on the **optimization strategy** used during tuning.
48+
49+
Some optimization strategies support **maximum parallelism** and can evaluate all configurations independently.
50+
Other strategies support **limited parallelism**, typically by repeatly evaluating a fixed-size population of configurations in parallel.
51+
Finally, some strategies are **inherently sequential** and always evaluate configurations one by one, providing no parallelism.
52+
53+
The current optimization strategies can be grouped as follows:
54+
55+
* **Maximum parallelism**:
56+
``brute_force``, ``random_sample``
57+
58+
* **Limited parallelism**:
59+
``genetic_algorithm``, ``pso``, ``diff_evo``, ``firefly_algorithm``
60+
61+
* **No parallelism**:
62+
``minimize``, ``basinhopping``, ``greedy_mls``, ``ordered_greedy_mls``,
63+
``greedy_ils``, ``dual_annealing``, ``mls``,
64+
``simulated_annealing``, ``bayes_opt``
65+
66+
67+
68+
Setting up Ray
69+
--------------
70+
71+
Kernel Tuner uses `Ray <https://docs.ray.io/en/latest/>`_ to distribute kernel evaluations across multiple GPUs.
72+
Ray is an open-source framework for distributed computing in Python.
73+
74+
To use parallel tuning, you must first install Ray itself:
75+
76+
.. code-block:: bash
77+
78+
$ pip install ray
79+
80+
Next, you must set up a Ray cluster.
81+
Kernel Tuner will internally attempt to connect to an existing cluster by calling:
82+
83+
.. code-block:: python
84+
85+
ray.init(address="auto")
86+
87+
Refer to the Ray documentation for details on how ``ray.init()`` connects to a local or remote cluster
88+
(`documentation <https://docs.ray.io/en/latest/ray-core/api/doc/ray.init.html>`_).
89+
For example, you can set the ``RAY_ADDRESS`` environment variable to point to the address of a remote Ray head node.
90+
Alternatively, you may manually call ``ray.init(address="your_head_node_ip:6379")`` before calling ``tune_kernel``.
91+
92+
Here are some common ways to set up your cluster:
93+
94+
95+
Local multi-GPU machine
96+
***********************
97+
98+
By default, on a machine with multiple GPUs, Ray will start a temporary local cluster and automatically detect all available GPUs.
99+
Kernel Tuner can then use these GPUs in parallel for tuning.
100+
101+
102+
Distributed cluster with SLURM (easy, Ray ≥2.49)
103+
************************************************
104+
105+
The most straightforward way to use Ray on a SLURM cluster is to use the ``ray symmetric-run`` command, available from Ray **2.49** onwards.
106+
This launches a Ray environment, runs your script, and then shuts it down again.
107+
108+
Consider the following script ``launch_ray.sh``.
109+
110+
.. literalinclude:: launch_ray.sh
111+
:language: bash
112+
113+
Next, run your Kernel Tuner script using ``srun``.
114+
The exact command depends on your cluster.
115+
In the example below, ``-N4`` indicates 4 nodes and ``--gres=gpu:1`` indicates 1 GPU per node.
116+
117+
.. code-block:: bash
118+
119+
$ srun -N4 --gres=gpu:1 launch_ray.sh python3 my_tuning_script.py
120+
121+
122+
Distributed Cluster with SLURM (manual, Ray <2.49)
123+
**************************************************
124+
125+
An alternative way to use Ray on SLURM is to launch a Ray cluster, obtain the IP address of the head node, and the connect to it remotely.
126+
127+
Consider the following sbatch script ``submit_ray.sh``.
128+
129+
.. literalinclude:: submit_ray.sh
130+
:language: bash
131+
132+
Next, submit your job using ``sbatch``.
133+
134+
.. code-block:: bash
135+
136+
$ sbatch submit_ray.sh
137+
Submitted batch job 1223577
138+
139+
After this, inspect the file `slurm-1223577.out` and search for the following line:
140+
141+
.. code-block::
142+
143+
$ grep RAY_ADDRESS slurm-1223577.out
144+
Launching head node: RAY_ADDRESS=145.184.221.164:6379
145+
146+
Finally, launch your application using:
147+
148+
.. code-block::
149+
150+
RAY_ADDRESS=145.184.221.164:6379 python my_tuning_script.py

doc/source/parallel_runner.png

195 KB
Loading

doc/source/submit_ray.sh

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
#!/bin/bash
2+
#SBATCH --time=00:10:00
3+
#SBATCH --nodes=2
4+
#SBATCH --ntasks-per-node=1
5+
#SBATCH --gpus-per-task=1
6+
set -euo pipefail
7+
8+
HEAD_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
9+
HEAD_NODE_IP=$(srun -N1 -n1 -w "$HEAD_NODE" bash -lc 'hostname -I | awk "{print \$1}"')
10+
RAY_PORT=6379
11+
RAY_ADDRESS="${HEAD_NODE_IP}:${RAY_PORT}"
12+
13+
echo "Launching head node: RAY_ADDRESS=$RAY_ADDRESS"
14+
srun --nodes=1 --ntasks=1 -w "$HEAD_NODE" \
15+
ray start --head --node-ip-address="$HEAD_NODE_IP" --port="$RAY_PORT" --block &
16+
sleep 5
17+
18+
NUM_WORKERS=$((SLURM_JOB_NUM_NODES - 1))
19+
echo "Launching ${NUM_WORKERS} worker node(s)"
20+
if [[ "$NUM_WORKERS" -gt 0 ]]; then
21+
srun -n "$NUM_WORKERS" --nodes="$NUM_WORKERS" --ntasks-per-node=1 --exclude "$HEAD_NODE" \
22+
ray start --address "$RAY_ADDRESS" --block &
23+
fi
24+
25+
# Keep job alive (or replace with running your workload on the head)
26+
wait

examples/cuda/sepconv_parallel.py

Lines changed: 88 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,88 @@
1+
#!/usr/bin/env python
2+
import numpy
3+
from kernel_tuner import tune_kernel
4+
from collections import OrderedDict
5+
6+
7+
def tune():
8+
with open("convolution.cu", "r") as f:
9+
kernel_string = f.read()
10+
11+
# setup tunable parameters
12+
tune_params = OrderedDict()
13+
tune_params["filter_height"] = [i for i in range(3, 19, 2)]
14+
tune_params["filter_width"] = [i for i in range(3, 19, 2)]
15+
tune_params["block_size_x"] = [16 * i for i in range(1, 65)]
16+
tune_params["block_size_y"] = [2**i for i in range(6)]
17+
tune_params["tile_size_x"] = [i for i in range(1, 11)]
18+
tune_params["tile_size_y"] = [i for i in range(1, 11)]
19+
20+
tune_params["use_padding"] = [0, 1] # toggle the insertion of padding in shared memory
21+
tune_params["read_only"] = [0, 1] # toggle using the read-only cache
22+
23+
# limit the search to only use padding when its effective, and at least 32 threads in a block
24+
restrict = ["use_padding==0 or (block_size_x % 32 != 0)", "block_size_x*block_size_y >= 32"]
25+
26+
# setup input and output dimensions
27+
problem_size = (4096, 4096)
28+
size = numpy.prod(problem_size)
29+
largest_fh = max(tune_params["filter_height"])
30+
largest_fw = max(tune_params["filter_width"])
31+
input_size = (problem_size[0] + largest_fw - 1) * (problem_size[1] + largest_fh - 1)
32+
33+
# create input data
34+
output_image = numpy.zeros(size).astype(numpy.float32)
35+
input_image = numpy.random.randn(input_size).astype(numpy.float32)
36+
filter_weights = numpy.random.randn(largest_fh * largest_fw).astype(numpy.float32)
37+
38+
# setup kernel arguments
39+
cmem_args = {"d_filter": filter_weights}
40+
args = [output_image, input_image, filter_weights]
41+
42+
# tell the Kernel Tuner how to compute grid dimensions
43+
grid_div_x = ["block_size_x", "tile_size_x"]
44+
grid_div_y = ["block_size_y", "tile_size_y"]
45+
46+
# start tuning separable convolution (row)
47+
tune_params["filter_height"] = [1]
48+
tune_params["tile_size_y"] = [1]
49+
results_row = tune_kernel(
50+
"convolution_kernel",
51+
kernel_string,
52+
problem_size,
53+
args,
54+
tune_params,
55+
grid_div_y=grid_div_y,
56+
grid_div_x=grid_div_x,
57+
cmem_args=cmem_args,
58+
verbose=False,
59+
restrictions=restrict,
60+
parallel=True,
61+
cache="convolution_kernel_row",
62+
)
63+
64+
# start tuning separable convolution (col)
65+
tune_params["filter_height"] = tune_params["filter_width"][:]
66+
tune_params["file_size_y"] = tune_params["tile_size_x"][:]
67+
tune_params["filter_width"] = [1]
68+
tune_params["tile_size_x"] = [1]
69+
results_col = tune_kernel(
70+
"convolution_kernel",
71+
kernel_string,
72+
problem_size,
73+
args,
74+
tune_params,
75+
grid_div_y=grid_div_y,
76+
grid_div_x=grid_div_x,
77+
cmem_args=cmem_args,
78+
verbose=False,
79+
restrictions=restrict,
80+
parallel_runner=1024,
81+
cache="convolution_kernel_col",
82+
)
83+
84+
return results_row, results_col
85+
86+
87+
if __name__ == "__main__":
88+
results_row, results_col = tune()
Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
#!/usr/bin/env python
2+
3+
import numpy
4+
from kernel_tuner import tune_kernel
5+
from pprint import pprint
6+
7+
8+
def tune():
9+
kernel_string = """
10+
__global__ void vector_add(float *c, float *a, float *b, int n) {
11+
int base = ((blockIdx.x * block_size_x) + threadIdx.x) * elements_per_thread;
12+
13+
#pragma unroll unroll_factor
14+
for (int offset = 0; offset < elements_per_thread; offset++) {
15+
int i = base + offset;
16+
17+
if ( i < n ) {
18+
c[i] = a[i] + b[i];
19+
}
20+
}
21+
}
22+
"""
23+
24+
size = 10000000
25+
26+
a = numpy.random.randn(size).astype(numpy.float32)
27+
b = numpy.random.randn(size).astype(numpy.float32)
28+
c = numpy.zeros_like(b)
29+
n = numpy.int32(size)
30+
31+
args = [c, a, b, n]
32+
33+
tune_params = dict()
34+
tune_params["block_size_x"] = [32 * i for i in range(1, 33)]
35+
tune_params["elements_per_thread"] = [1, 2, 3, 4, 5, 6, 7, 8]
36+
tune_params["unroll_factor"] = [1, 2, 3, 4, 5, 6, 7, 8]
37+
38+
results, env = tune_kernel("vector_add", kernel_string, size, args, tune_params, parallel=True)
39+
40+
pprint(env)
41+
return results
42+
43+
44+
if __name__ == "__main__":
45+
tune()

kernel_tuner/backends/cupy.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
from __future__ import print_function
33

44
import numpy as np
5+
import uuid
56

67
from kernel_tuner.backends.backend import GPUBackend
78
from kernel_tuner.observers.cupy import CupyRuntimeObserver
@@ -74,12 +75,17 @@ def __init__(self, device=0, iterations=7, compiler_options=None, observers=None
7475
s.split(":")[0].strip(): s.split(":")[1].strip() for s in cupy_info
7576
}
7677
env["device_name"] = info_dict[f"Device {device} Name"]
78+
env["pci_bus_id"] = info_dict[f"Device {device} PCI Bus ID"]
7779

7880
env["cuda_version"] = cp.cuda.runtime.driverGetVersion()
7981
env["compute_capability"] = self.cc
8082
env["iterations"] = self.iterations
8183
env["compiler_options"] = compiler_options
8284
env["device_properties"] = self.devprops
85+
86+
props = cp.cuda.runtime.getDeviceProperties(device)
87+
env["uuid"] = str(uuid.UUID(bytes=props["uuid"]))
88+
8389
self.env = env
8490
self.name = env["device_name"]
8591

0 commit comments

Comments
 (0)