Commit 90227c4

Add user guides for dpnp.tensor module

1 parent 29cd39c commit 90227c4

5 files changed: +584 −0 lines changed

doc/index.rst

Lines changed: 1 addition & 0 deletions

@@ -12,6 +12,7 @@ Data Parallel Extension for NumPy*

     overview
     quick_start_guide
+    user_guides/index
     reference/index

 .. toctree::

doc/user_guides/dlpack.rst

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@

.. _dpnp_tensor_dlpack_support:

DLPack exchange of USM allocated arrays
=======================================

DLPack overview
---------------

`DLPack <dlpack_docs_>`_ is a commonly used C-ABI compatible data structure that allows data exchange
between major frameworks. DLPack strives to be minimal and intentionally leaves allocator and
device APIs out of scope.

Data shared via DLPack are owned by the producer, who provides a deleter function stored in the
`DLManagedTensor <dlpack_managed_tensor_>`_, and are only accessed by the consumer.
The Python semantics of using the structure are `explained in the DLPack docs <dlpack_python_spec_>`_.

DLPack specifies the data location in memory via the ``void * data`` field of the `DLTensor <dlpack_dltensor_>`_ struct, and via its ``DLDevice device`` field.
The `DLDevice <dlpack_dldevice_>`_ struct has two members: an enumeration ``device_type`` and an integer ``device_id``.

DLPack reserves the enumeration value ``DLDeviceType::kDLOneAPI`` for sharing SYCL USM allocations.
It is not named ``kDLSycl`` since importing a USM-allocated tensor with this device type relies on the oneAPI SYCL extensions
``sycl_ext_oneapi_filter_selector`` and ``sycl_ext_oneapi_default_platform_context`` to operate.

.. _dlpack_docs: https://dmlc.github.io/dlpack/latest/
.. _dlpack_managed_tensor: https://dmlc.github.io/dlpack/latest/c_api.html#c.DLManagedTensor
.. _dlpack_dltensor: https://dmlc.github.io/dlpack/latest/c_api.html#c.DLTensor
.. _dlpack_dldevice: https://dmlc.github.io/dlpack/latest/c_api.html#c.DLDevice
.. _dlpack_python_spec: https://dmlc.github.io/dlpack/latest/python_spec.html
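On the Python side, the ``(device_type, device_id)`` pair described above is exposed through the ``__dlpack_device__`` protocol method. A minimal sketch follows; the class and attribute names are hypothetical, and only the returned tuple shape follows the DLPack Python specification (``kDLOneAPI`` is 14 in the ``DLDeviceType`` enumeration):

```python
# Sketch of the __dlpack_device__ protocol for a USM-backed array.
# The class is illustrative, not a dpnp/dpctl API.
kDLOneAPI = 14  # DLDeviceType value reserved for SYCL USM allocations


class USMArraySketch:
    def __init__(self, device_id: int):
        self._device_id = device_id

    def __dlpack_device__(self):
        # Consumers use this pair to locate the allocation before importing
        return (kDLOneAPI, self._device_id)


arr = USMArraySketch(device_id=0)
print(arr.__dlpack_device__())  # (14, 0)
```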
Exporting USM allocation to DLPack
----------------------------------

When sharing a USM allocation (of any ``sycl::usm::kind``) with ``void * ptr`` bound to ``sycl::context ctx``:

.. code-block:: cpp
    :caption: Protocol for exporting USM allocation as DLPack

    // Input: void *ptr:
    //            USM allocation pointer
    //        sycl::context ctx:
    //            context the pointer is bound to

    // Get device where allocation was originally made.
    // Keep in mind, the device may be a sub-device.
    const sycl::device &ptr_dev = sycl::get_pointer_device(ptr, ctx);

    #if SYCL_KHR_DEFAULT_CONTEXT
    const sycl::context &default_ctx = ptr_dev.get_platform().khr_get_default_context();
    #else
    static_assert(false, "ext_oneapi_default_context extension is required");
    #endif

    // Assert that ctx is the default platform context, or throw
    if (ctx != default_ctx) {
        throw pybind11::type_error(
            "Can not export USM allocations not "
            "bound to default platform context."
        );
    }

    // Find parent root device if ptr_dev is a sub-device
    const sycl::device &parent_root_device = get_parent_root_device(ptr_dev);

    // Find position of parent_root_device in sycl::device::get_devices()
    const auto &all_root_devs = sycl::device::get_devices();
    auto beg = std::begin(all_root_devs);
    auto end = std::end(all_root_devs);
    auto selector_fn = [parent_root_device](const sycl::device &root_d) -> bool {
        return parent_root_device == root_d;
    };
    auto pos = std::find_if(beg, end, selector_fn);

    if (pos == end) {
        throw pybind11::type_error("Could not produce DLPack: failed finding device_id");
    }
    std::ptrdiff_t dev_idx = std::distance(beg, pos);

    // Check that dev_idx fits into int32_t if needed
    int32_t device_id = static_cast<int32_t>(dev_idx);

    // Populate DLTensor with DLDeviceType::kDLOneAPI and the computed device_id
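The ``device_id`` computation above is a linear search for the parent root device in the flat, platform-ordered list of root devices. The same logic can be sketched in plain Python; the function name is illustrative and the list stands in for ``sycl::device::get_devices()``:

```python
def compute_dlpack_device_id(parent_root_device, all_root_devices):
    """Return the position of parent_root_device in the flat list of
    root devices, as used for the DLPack ``device_id`` field."""
    for idx, dev in enumerate(all_root_devices):
        if dev == parent_root_device:
            return idx
    raise TypeError("Could not produce DLPack: failed finding device_id")


# Illustrative stand-ins for sycl::device objects
devices = ["level_zero:gpu:0", "level_zero:gpu:1", "opencl:cpu:0"]
print(compute_dlpack_device_id("level_zero:gpu:1", devices))  # 1
```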
Importing DLPack with ``device_type == kDLOneAPI``
--------------------------------------------------

.. code-block:: cpp
    :caption: Protocol for recognizing DLPack as a valid USM allocation

    // Input: ptr       = dlm_tensor->dl_tensor.data
    //        device_id = dlm_tensor->dl_tensor.device.device_id

    // Get root_device from device_id
    const auto &device_vector = sycl::device::get_devices();
    const sycl::device &root_device = device_vector.at(device_id);

    // Check if the backend of the device is supported by the consumer.
    // Perhaps for certain backends (CUDA, HIP, etc.) we should dispatch
    // different DLPack importers.

    // Alternatively:
    // sycl::device root_device = sycl::device(
    //     sycl::ext::oneapi::filter_selector{ std::to_string(device_id) }
    // );

    // Get default platform context
    #if SYCL_KHR_DEFAULT_CONTEXT
    const sycl::context &default_ctx = root_device.get_platform().khr_get_default_context();
    #else
    static_assert(false, "ext_oneapi_default_context extension is required");
    #endif

    // Check that the pointer is known in the context
    const sycl::usm::kind alloc_type = sycl::get_pointer_type(ptr, default_ctx);

    if (alloc_type == sycl::usm::kind::unknown) {
        throw pybind11::type_error(
            "Data pointer in DLPack is not bound to the "
            "default platform context of specified device"
        );
    }

    // Perform check that the USM allocation type is supported by the consumer if needed

    // Get the sycl::device where the data was allocated
    const sycl::device &ptr_dev = sycl::get_pointer_device(ptr, default_ctx);

    // Create object of consumer's library from ptr, ptr_dev, default_ctx
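The import-side validation can be simulated in plain Python. Here the registry dict stands in for the SYCL runtime's pointer bookkeeping (``sycl::get_pointer_type``), and all names are illustrative assumptions, not dpnp or dpctl API:

```python
def validate_dlpack_import(ptr, device_id, devices, pointer_registry):
    """Mirror the import-side checks: resolve the device from device_id,
    then verify the pointer is a known USM allocation in that device's
    default platform context (simulated by pointer_registry)."""
    try:
        root_device = devices[device_id]
    except IndexError:
        raise TypeError("DLPack device_id out of range")
    # pointer_registry maps ptr -> (usm_kind, device); "unknown" plays the
    # role of sycl::usm::kind::unknown for unbound pointers
    alloc_type, alloc_dev = pointer_registry.get(ptr, ("unknown", None))
    if alloc_type == "unknown":
        raise TypeError(
            "Data pointer in DLPack is not bound to the "
            "default platform context of specified device"
        )
    return root_device, alloc_type, alloc_dev


registry = {0x1000: ("device", "gpu:0")}
print(validate_dlpack_import(0x1000, 0, ["gpu:0"], registry))
```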
Support of DLPack with ``kDLOneAPI`` device type
------------------------------------------------

:py:mod:`dpnp.tensor` supports DLPack v0.8. Exchange of USM allocations made using the Level-Zero backend
is supported with ``torch.Tensor(device='xpu')`` for PyTorch when using `intel-extension-for-pytorch <intel_ext_for_torch_>`_,
as well as for TensorFlow when `intel-extension-for-tensorflow <intel_ext_for_tf_>`_ is used.

.. _intel_ext_for_torch: https://github.com/intel/intel-extension-for-pytorch
.. _intel_ext_for_tf: https://github.com/intel/intel-extension-for-tensorflow
doc/user_guides/execution_model.rst

Lines changed: 146 additions & 0 deletions

@@ -0,0 +1,146 @@
.. _dpnp_execution_model:

########################
oneAPI programming model
########################

oneAPI library and its Python interface
=======================================

Using oneAPI libraries, a user calls functions that take a ``sycl::queue`` and a collection of
``sycl::event`` objects among other arguments. For example:

.. code-block:: cpp
    :caption: Prototypical call signature of a oneMKL function

    sycl::event
    compute(
        sycl::queue &exec_q,
        ...,
        const std::vector<sycl::event> &dependent_events
    );

The function ``compute`` inserts computational tasks into the queue ``exec_q`` for the DPC++ runtime to
execute on the device the queue targets. The execution may begin only after other tasks, whose
execution status is represented by the ``sycl::event`` objects in the provided ``dependent_events``
vector, complete. If the vector is empty, the runtime begins the execution as soon as the device is
ready. The function returns a ``sycl::event`` object representing completion of the set of
computational tasks submitted by the ``compute`` function.
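The ordering semantics just described (a task waits on its dependent events; the returned event signals its completion) can be modeled with a toy pure-Python scheduler. This is an illustration of the semantics only, assuming nothing about SYCL API:

```python
class Event:
    """Toy stand-in for sycl::event: a completion flag."""
    def __init__(self):
        self.complete = False


class Queue:
    """Toy stand-in for sycl::queue: runs tasks once their dependencies complete."""
    def __init__(self):
        self._tasks = []

    def submit(self, fn, dependent_events=()):
        ev = Event()
        self._tasks.append((fn, tuple(dependent_events), ev))
        return ev

    def wait(self):
        # Repeatedly run tasks whose dependent events have all completed
        while self._tasks:
            ready = [t for t in self._tasks if all(e.complete for e in t[1])]
            if not ready:
                raise RuntimeError("unsatisfiable dependency")
            for fn, _, ev in ready:
                fn()
                ev.complete = True
            self._tasks = [t for t in self._tasks if not t[2].complete]


q = Queue()
order = []
e1 = q.submit(lambda: order.append("first"))
e2 = q.submit(lambda: order.append("second"), dependent_events=(e1,))
q.wait()
print(order)  # ['first', 'second']
```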
Hence, in the oneAPI programming model, the execution **queue** is used to specify which device the
function will execute on. To create a queue, one must specify a device to target.

In :mod:`dpctl`, the ``sycl::queue`` is represented by the :class:`dpctl.SyclQueue` Python type,
and a Python API to call such a function might look like

.. code-block:: python

    def call_compute(
        exec_q : dpctl.SyclQueue,
        ...,
        dependent_events : List[dpctl.SyclEvent] = []
    ) -> dpctl.SyclEvent:
        ...

When building a Python API for a SYCL offloading function, even if you choose to
map the SYCL API to a different API on the Python side, it must still translate to a
similar call under the hood.

The arguments to the function must be suitable for use in the offloading functions.
Typically these are Python scalars, or objects representing USM allocations, such as
:class:`dpnp.tensor.usm_ndarray`, :class:`dpctl.memory.MemoryUSMDevice` and friends.

.. note::
    The USM allocations these objects represent must not get deallocated before
    offloaded tasks that access them complete.

    This is something authors of DPC++-based Python extensions must take care of,
    and something users of such extensions may safely assume is assured.
USM allocations and compute-follows-data
========================================

To make a USM allocation on a device in SYCL, one needs to specify the ``sycl::device`` in whose
memory the allocation is made, and the ``sycl::context`` to which the allocation
is bound.

A ``sycl::queue`` object is often used instead. In such cases the ``sycl::context`` and ``sycl::device`` associated
with the queue are used to make the allocation.

.. important::
    :mod:`dpnp.tensor` associates a queue object with every USM allocation.

    The associated queue may be queried using the ``.sycl_queue`` property of the
    Python type representing the USM allocation.

This design choice gives :mod:`dpnp.tensor` a preferred queue to use when operating on any single
USM allocation. For example:

.. code-block:: python

    def unary_func(x : dpnp.tensor.usm_ndarray):
        code1
        _ = _func_impl(x.sycl_queue, ...)
        code2

When combining several objects representing USM allocations, the
:ref:`programming model <dpnp_tensor_compute_follows_data>`
adopted in :mod:`dpnp.tensor` insists that the queues associated with the objects all be the same, in which
case that common queue is used as the execution queue; otherwise :exc:`dpctl.utils.ExecutionPlacementError` is raised.

.. code-block:: python

    def binary_func(
        x1 : dpnp.tensor.usm_ndarray,
        x2 : dpnp.tensor.usm_ndarray
    ):
        exec_q = dpctl.utils.get_execution_queue((x1.sycl_queue, x2.sycl_queue))
        if exec_q is None:
            raise dpctl.utils.ExecutionPlacementError
        ...
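The deduction performed by ``dpctl.utils.get_execution_queue`` can be summarized in a few lines of plain Python: return the common queue if all inputs carry equal queues, and ``None`` otherwise. This is a sketch of the semantics, not dpctl's implementation:

```python
def get_execution_queue_sketch(queues):
    """Return the shared queue if all entries compare equal, else None."""
    queues = list(queues)
    if not queues:
        return None
    first = queues[0]
    return first if all(q == first for q in queues[1:]) else None


print(get_execution_queue_sketch(["q1", "q1"]))  # q1
print(get_execution_queue_sketch(["q1", "q2"]))  # None
```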
To ensure that compute-follows-data works seamlessly out of the box, :mod:`dpnp.tensor` maintains
a cache, keyed by context and device, of the queues used by the :class:`dpnp.tensor.Device` class.

.. code-block:: python

    >>> import dpctl
    >>> from dpnp import tensor

    >>> sycl_dev = dpctl.SyclDevice("cpu")
    >>> d1 = tensor.Device.create_device(sycl_dev)
    >>> d2 = tensor.Device.create_device("cpu")
    >>> d3 = tensor.Device.create_device(dpctl.select_cpu_device())

    >>> d1.sycl_queue == d2.sycl_queue, d1.sycl_queue == d3.sycl_queue, d2.sycl_queue == d3.sycl_queue
    (True, True, True)
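Such a cache can be sketched as a dictionary keyed by the (context, device) pair; ``make_queue`` below is a hypothetical constructor standing in for queue creation, not a dpctl function:

```python
_queue_cache = {}


def make_queue(context, device):
    # Hypothetical queue constructor; returns a fresh, distinct object
    return object()


def cached_queue(context, device):
    """Return the one queue associated with (context, device), creating it
    on first use so repeated requests compare equal."""
    key = (context, device)
    if key not in _queue_cache:
        _queue_cache[key] = make_queue(context, device)
    return _queue_cache[key]


q_a = cached_queue("ctx0", "cpu")
q_b = cached_queue("ctx0", "cpu")
print(q_a is q_b)  # True: same cached queue for the same key
```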
Since the :class:`dpnp.tensor.Device` class is used by all :ref:`array creation functions <dpnp_tensor_creation_functions>`
in :mod:`dpnp.tensor`, using the same value as the ``device`` keyword argument results in array instances that
can be combined together in accordance with the compute-follows-data programming model.

.. code-block:: python

    >>> from dpnp import tensor
    >>> import dpctl

    >>> # queue for default-constructed device is used
    >>> x1 = tensor.arange(100, dtype="int32")
    >>> x2 = tensor.zeros(100, dtype="int32")
    >>> x12 = tensor.concat((x1, x2))
    >>> x12.sycl_queue == x1.sycl_queue, x12.sycl_queue == x2.sycl_queue
    (True, True)
    >>> # the default constructor of the SyclQueue class creates a different queue instance each time
    >>> q1 = dpctl.SyclQueue()
    >>> q2 = dpctl.SyclQueue()
    >>> q1 == q2
    False
    >>> y1 = tensor.arange(100, dtype="int32", sycl_queue=q1)
    >>> y2 = tensor.zeros(100, dtype="int32", sycl_queue=q2)
    >>> # this call raises ExecutionPlacementError since compute-follows-data
    >>> # rules are not met
    >>> tensor.concat((y1, y2))

Please refer to the :ref:`array migration <dpnp_tensor_array_migration>` section of the introduction to
:mod:`dpnp.tensor` for examples of how to resolve ``ExecutionPlacementError`` exceptions.

doc/user_guides/index.rst

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@

.. _user_guides:

***********
User Guides
***********

.. toctree::
    :maxdepth: 2

    tensor_intro
    execution_model
    dlpack
