|
| 1 | +Parallel and Remote Tuning |
| 2 | +========================== |
| 3 | + |
| 4 | +By default, Kernel Tuner benchmarks GPU kernel configurations sequentially on a single local GPU. |
| 5 | +While this works well for small tuning problems, it can become a bottleneck for larger search spaces. |
| 6 | + |
| 7 | +.. image:: parallel_runner.png |
| 8 | + :width: 700px |
| 9 | + :alt: Example of sequential versus parallel tuning. |
| 10 | + |
| 11 | + |
| 12 | +Kernel Tuner also supports **parallel tuning**, allowing multiple GPUs to evaluate kernel configurations in parallel. |
| 13 | +The same mechanism can be used for **remote tuning**, where Kernel Tuner runs on a host system while one or more GPUs are located on remote machines. |
| 14 | + |
| 15 | +Parallel/remote tuning is implemented using `Ray <https://docs.ray.io/en/latest/>`_ and works on both local multi-GPU systems and distributed clusters. |
| 16 | + |
| 17 | +How to use |
| 18 | +---------- |
| 19 | + |
| 20 | +To enable parallel tuning, pass the ``parallel`` argument to ``tune_kernel``: |
| 21 | + |
| 22 | +.. code-block:: python |
| 23 | +
|
| 24 | + kernel_tuner.tune_kernel( |
| 25 | + "vector_add", |
| 26 | + kernel_string, |
| 27 | + size, |
| 28 | + args, |
| 29 | + tune_params, |
| 30 | + parallel=True, |
| 31 | + ) |
| 32 | +
|
| 33 | +If ``parallel`` is set to ``True``, Kernel Tuner will use all available Ray workers for tuning. |
| 34 | +The ``parallel`` option can also be set to an integer ``n`` to use exactly ``n`` workers. |
| 35 | + |
| 36 | +Alternatively, define the environment variable ``KERNEL_TUNER_PARALLEL`` to enable parallel execution without modifying your Python code. |
| 37 | + |
| 38 | +.. code-block:: bash |
| 39 | +
|
| 40 | + $ KERNEL_TUNER_PARALLEL=true python3 my_tuning_script.py |
| 41 | +
|
| 42 | +
|
| 43 | +
|
| 44 | +Parallel tuning and optimization strategies |
| 45 | +------------------------------------------- |
| 46 | + |
| 47 | +The achievable speedup from using multiple GPUs depends in part on the **optimization strategy** used during tuning. |
| 48 | + |
| 49 | +Some optimization strategies support **maximum parallelism** and can evaluate all configurations independently. |
| 50 | +Other strategies support **limited parallelism**, typically by repeatedly evaluating a fixed-size population of configurations in parallel.
| 51 | +Finally, some strategies are **inherently sequential** and always evaluate configurations one by one, providing no parallelism. |
| 52 | + |
| 53 | +The current optimization strategies can be grouped as follows: |
| 54 | + |
| 55 | +* **Maximum parallelism**: |
| 56 | + ``brute_force``, ``random_sample`` |
| 57 | + |
| 58 | +* **Limited parallelism**: |
| 59 | + ``genetic_algorithm``, ``pso``, ``diff_evo``, ``firefly_algorithm`` |
| 60 | + |
| 61 | +* **No parallelism**: |
| 62 | + ``minimize``, ``basinhopping``, ``greedy_mls``, ``ordered_greedy_mls``, |
| 63 | + ``greedy_ils``, ``dual_annealing``, ``mls``, |
| 64 | + ``simulated_annealing``, ``bayes_opt`` |
| 65 | + |
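As a rough sketch, the grouping above can be written as a small lookup table, for example to warn before requesting many workers for a sequential strategy. The names ``PARALLELISM`` and ``parallelism_of`` are hypothetical helpers for illustration and are not part of Kernel Tuner's API.

.. code-block:: python

    # Hypothetical helper: mirrors the grouping listed above.
    # It is not part of Kernel Tuner and must be updated by hand.
    PARALLELISM = {
        "maximum": ["brute_force", "random_sample"],
        "limited": ["genetic_algorithm", "pso", "diff_evo", "firefly_algorithm"],
        "none": [
            "minimize", "basinhopping", "greedy_mls", "ordered_greedy_mls",
            "greedy_ils", "dual_annealing", "mls",
            "simulated_annealing", "bayes_opt",
        ],
    }


    def parallelism_of(strategy):
        """Return "maximum", "limited", or "none" for a strategy name."""
        for level, names in PARALLELISM.items():
            if strategy in names:
                return level
        raise ValueError(f"unknown strategy: {strategy!r}")

For instance, ``parallelism_of("pso")`` returns ``"limited"``, which suggests that adding workers beyond the population size is unlikely to help.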
| 66 | + |
| 67 | + |
| 68 | +Setting up Ray |
| 69 | +-------------- |
| 70 | + |
| 71 | +Kernel Tuner uses `Ray <https://docs.ray.io/en/latest/>`_ to distribute kernel evaluations across multiple GPUs. |
| 72 | +Ray is an open-source framework for distributed computing in Python. |
| 73 | + |
| 74 | +To use parallel tuning, you must first install Ray itself: |
| 75 | + |
| 76 | +.. code-block:: bash |
| 77 | +
|
| 78 | + $ pip install ray |
| 79 | +
|
| 80 | +Next, you must set up a Ray cluster. |
| 81 | +Kernel Tuner will internally attempt to connect to an existing cluster by calling: |
| 82 | + |
| 83 | +.. code-block:: python |
| 84 | +
|
| 85 | + ray.init(address="auto") |
| 86 | +
|
| 87 | +Refer to the Ray documentation for details on how ``ray.init()`` connects to a local or remote cluster |
| 88 | +(`documentation <https://docs.ray.io/en/latest/ray-core/api/doc/ray.init.html>`_). |
| 89 | +For example, you can set the ``RAY_ADDRESS`` environment variable to point to the address of a remote Ray head node. |
| 90 | +Alternatively, you may manually call ``ray.init(address="your_head_node_ip:6379")`` before calling ``tune_kernel``. |
| 91 | + |
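When connecting manually, it can help to confirm that the cluster actually exposes GPUs before starting a long tuning run. The sketch below assumes Ray is installed and a head node is reachable; ``count_gpus`` and ``connect_and_count_gpus`` are hypothetical helper names, while ``ray.init`` and ``ray.cluster_resources`` are regular Ray calls.

.. code-block:: python

    def count_gpus(resources):
        """Number of GPUs in a resource dict such as ray.cluster_resources()."""
        # Ray reports resources as floats, e.g. {"CPU": 8.0, "GPU": 4.0}.
        return int(resources.get("GPU", 0))


    def connect_and_count_gpus(address):
        """Connect to a running Ray cluster and return its total GPU count."""
        # Imported lazily so count_gpus() is usable without Ray installed.
        import ray

        ray.init(address=address)
        return count_gpus(ray.cluster_resources())

Calling ``connect_and_count_gpus("your_head_node_ip:6379")`` before ``tune_kernel`` and getting ``0`` back would indicate that no GPU workers have joined the cluster yet.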
| 92 | +Here are some common ways to set up your cluster: |
| 93 | + |
| 94 | + |
| 95 | +Local multi-GPU machine |
| 96 | +*********************** |
| 97 | + |
| 98 | +By default, on a machine with multiple GPUs, Ray will start a temporary local cluster and automatically detect all available GPUs. |
| 99 | +Kernel Tuner can then use these GPUs in parallel for tuning. |
| 100 | + |
| 101 | + |
| 102 | +Distributed cluster with SLURM (easy, Ray ≥2.49) |
| 103 | +************************************************ |
| 104 | + |
| 105 | +The most straightforward way to use Ray on a SLURM cluster is to use the ``ray symmetric-run`` command, available from Ray **2.49** onwards. |
| 106 | +This launches a Ray environment, runs your script, and then shuts it down again. |
| 107 | + |
| 108 | +Consider the following script ``launch_ray.sh``. |
| 109 | + |
| 110 | +.. literalinclude:: launch_ray.sh |
| 111 | + :language: bash |
| 112 | + |
| 113 | +Next, run your Kernel Tuner script using ``srun``. |
| 114 | +The exact command depends on your cluster. |
| 115 | +In the example below, ``-N4`` indicates 4 nodes and ``--gres=gpu:1`` indicates 1 GPU per node. |
| 116 | + |
| 117 | +.. code-block:: bash |
| 118 | +
|
| 119 | + $ srun -N4 --gres=gpu:1 launch_ray.sh python3 my_tuning_script.py |
| 120 | +
|
| 121 | +
|
| 122 | +Distributed cluster with SLURM (manual, Ray <2.49)
| 123 | +************************************************** |
| 124 | + |
| 125 | +An alternative way to use Ray on SLURM is to launch a Ray cluster, obtain the IP address of the head node, and then connect to it remotely.
| 126 | + |
| 127 | +Consider the following sbatch script ``submit_ray.sh``. |
| 128 | + |
| 129 | +.. literalinclude:: submit_ray.sh |
| 130 | + :language: bash |
| 131 | + |
| 132 | +Next, submit your job using ``sbatch``. |
| 133 | + |
| 134 | +.. code-block:: bash |
| 135 | +
|
| 136 | + $ sbatch submit_ray.sh |
| 137 | + Submitted batch job 1223577 |
| 138 | +
|
| 139 | +After this, inspect the file ``slurm-1223577.out`` and search for the line containing ``RAY_ADDRESS``:
| 140 | + |
| 141 | +.. code-block:: bash
| 142 | +
|
| 143 | + $ grep RAY_ADDRESS slurm-1223577.out |
| 144 | + Launching head node: RAY_ADDRESS=145.184.221.164:6379 |
| 145 | +
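The same lookup can be scripted. The sketch below assumes the log line has the form shown above (``RAY_ADDRESS=<host>:<port>``); the function name ``find_ray_address`` is a hypothetical helper, and the exact log format depends on your ``submit_ray.sh``.

.. code-block:: python

    import re
    import sys

    # Matches "RAY_ADDRESS=<host>:<port>" as in the example log line above.
    RAY_ADDRESS_RE = re.compile(r"RAY_ADDRESS=(\S+:\d+)")


    def find_ray_address(log_text):
        """Return the first RAY_ADDRESS value in the log text, or None."""
        match = RAY_ADDRESS_RE.search(log_text)
        return match.group(1) if match else None


    if __name__ == "__main__":
        # Usage: python3 find_ray_address.py slurm-1223577.out
        with open(sys.argv[1]) as log_file:
            address = find_ray_address(log_file.read())
        if address is None:
            sys.exit("no RAY_ADDRESS line found")
        print(address)

If the head node has not started yet, the log will not contain the line and the script exits with an error message, so it may need to be re-run after a short wait.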
|
| 146 | +Finally, launch your application using: |
| 147 | + |
| 148 | +.. code-block:: bash
| 149 | +
|
| 150 | + $ RAY_ADDRESS=145.184.221.164:6379 python3 my_tuning_script.py