Commit 90227c4

Add user guides for dpnp.tensor module

1 parent 29cd39c commit 90227c4

5 files changed: +584 −0 lines changed

doc/index.rst

Lines changed: 1 addition & 0 deletions

@@ -12,6 +12,7 @@ Data Parallel Extension for NumPy*

     overview
     quick_start_guide
+    user_guides/index
     reference/index

 .. toctree::

doc/user_guides/dlpack.rst

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@

.. _dpnp_tensor_dlpack_support:

DLPack exchange of USM allocated arrays
=======================================

DLPack overview
---------------

`DLPack <dlpack_docs_>`_ is a commonly used C-ABI compatible data structure that allows data exchange
between major frameworks. DLPack strives to be minimal and intentionally leaves allocator and
device APIs out of scope.

Data shared via DLPack are owned by the producer, who provides a deleter function stored in the
`DLManagedTensor <dlpack_managed_tensor_>`_, and are only accessed by the consumer.
The Python semantics of using the structure are `explained in the DLPack docs <dlpack_python_spec_>`_.

DLPack specifies the data location in memory via the ``void * data`` field of the `DLTensor <dlpack_dltensor_>`_ struct, and via its ``DLDevice device`` field.
The `DLDevice <dlpack_dldevice_>`_ struct has two members: an enumeration ``device_type`` and an integer ``device_id``.

DLPack reserves the enumeration value ``DLDeviceType::kDLOneAPI`` for sharing SYCL USM allocations.
It is not named ``kDLSycl`` since importing a USM-allocated tensor with this device type relies on the oneAPI SYCL extensions
``sycl_ext_oneapi_filter_selector`` and ``sycl_ext_oneapi_default_platform_context`` to operate.

.. _dlpack_docs: https://dmlc.github.io/dlpack/latest/
.. _dlpack_managed_tensor: https://dmlc.github.io/dlpack/latest/c_api.html#c.DLManagedTensor
.. _dlpack_dltensor: https://dmlc.github.io/dlpack/latest/c_api.html#c.DLTensor
.. _dlpack_dldevice: https://dmlc.github.io/dlpack/latest/c_api.html#c.DLDevice
.. _dlpack_python_spec: https://dmlc.github.io/dlpack/latest/python_spec.html
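On the Python side, the ``(device_type, device_id)`` pair described above is exposed through the ``__dlpack_device__`` protocol method. A minimal sketch follows; the class and attribute names are hypothetical, and only the returned tuple shape follows the DLPack Python specification (``kDLOneAPI`` is 14 in the ``DLDeviceType`` enumeration):

```python
# Sketch of the __dlpack_device__ protocol for a USM-backed array.
# The class is illustrative, not a dpnp/dpctl API.
kDLOneAPI = 14  # DLDeviceType value reserved for SYCL USM allocations


class USMArraySketch:
    def __init__(self, device_id: int):
        self._device_id = device_id

    def __dlpack_device__(self):
        # Consumers use this pair to locate the allocation before importing
        return (kDLOneAPI, self._device_id)


arr = USMArraySketch(device_id=0)
print(arr.__dlpack_device__())  # (14, 0)
```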
Exporting USM allocation to DLPack
----------------------------------

When sharing a USM allocation (of any ``sycl::usm::kind``) with ``void * ptr`` bound to ``sycl::context ctx``:

.. code-block:: cpp
    :caption: Protocol for exporting USM allocation as DLPack

    // Input: void *ptr:
    //            USM allocation pointer
    //        sycl::context ctx:
    //            context the pointer is bound to

    // Get device where allocation was originally made.
    // Keep in mind, the device may be a sub-device.
    const sycl::device &ptr_dev = sycl::get_pointer_device(ptr, ctx);

    #if SYCL_KHR_DEFAULT_CONTEXT
    const sycl::context &default_ctx = ptr_dev.get_platform().khr_get_default_context();
    #else
    static_assert(false, "ext_oneapi_default_context extension is required");
    #endif

    // Assert that ctx is the default platform context, or throw
    if (ctx != default_ctx) {
        throw pybind11::type_error(
            "Can not export USM allocations not "
            "bound to default platform context."
        );
    }

    // Find parent root device if ptr_dev is a sub-device
    const sycl::device &parent_root_device = get_parent_root_device(ptr_dev);

    // Find position of parent_root_device in sycl::device::get_devices()
    const auto &all_root_devs = sycl::device::get_devices();
    auto beg = std::begin(all_root_devs);
    auto end = std::end(all_root_devs);
    auto selector_fn = [parent_root_device](const sycl::device &root_d) -> bool {
        return parent_root_device == root_d;
    };
    auto pos = std::find_if(beg, end, selector_fn);

    if (pos == end) {
        throw pybind11::type_error("Could not produce DLPack: failed finding device_id");
    }
    std::ptrdiff_t dev_idx = std::distance(beg, pos);

    // Check that dev_idx fits into int32_t if needed
    int32_t device_id = static_cast<int32_t>(dev_idx);

    // Populate DLTensor with DLDeviceType::kDLOneAPI and the computed device_id
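The ``device_id`` computation above is a linear search for the parent root device in the flat, platform-ordered list of root devices. The same logic can be sketched in plain Python; the function name is illustrative and the list stands in for ``sycl::device::get_devices()``:

```python
def compute_dlpack_device_id(parent_root_device, all_root_devices):
    """Return the position of parent_root_device in the flat list of
    root devices, as used for the DLPack ``device_id`` field."""
    for idx, dev in enumerate(all_root_devices):
        if dev == parent_root_device:
            return idx
    raise TypeError("Could not produce DLPack: failed finding device_id")


# Illustrative stand-ins for sycl::device objects
devices = ["level_zero:gpu:0", "level_zero:gpu:1", "opencl:cpu:0"]
print(compute_dlpack_device_id("level_zero:gpu:1", devices))  # 1
```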
Importing DLPack with ``device_type == kDLOneAPI``
--------------------------------------------------

.. code-block:: cpp
    :caption: Protocol for recognizing DLPack as a valid USM allocation

    // Input: ptr       = dlm_tensor->dl_tensor.data
    //        device_id = dlm_tensor->dl_tensor.device.device_id

    // Get root_device from device_id
    const auto &device_vector = sycl::device::get_devices();
    const sycl::device &root_device = device_vector.at(device_id);

    // Check if the backend of the device is supported by the consumer.
    // Perhaps for certain backends (CUDA, HIP, etc.) we should dispatch
    // different DLPack importers.

    // Alternatively:
    // sycl::device root_device = sycl::device(
    //     sycl::ext::oneapi::filter_selector{ std::to_string(device_id) }
    // );

    // Get default platform context
    #if SYCL_KHR_DEFAULT_CONTEXT
    const sycl::context &default_ctx = root_device.get_platform().khr_get_default_context();
    #else
    static_assert(false, "ext_oneapi_default_context extension is required");
    #endif

    // Check that the pointer is known in the context
    const sycl::usm::kind alloc_type = sycl::get_pointer_type(ptr, default_ctx);

    if (alloc_type == sycl::usm::kind::unknown) {
        throw pybind11::type_error(
            "Data pointer in DLPack is not bound to the "
            "default platform context of specified device"
        );
    }

    // Perform check that the USM allocation type is supported by the consumer if needed

    // Get the sycl::device where the data was allocated
    const sycl::device &ptr_dev = sycl::get_pointer_device(ptr, default_ctx);

    // Create object of consumer's library from ptr, ptr_dev, default_ctx
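The import-side validation can be simulated in plain Python. Here the registry dict stands in for the SYCL runtime's pointer bookkeeping (``sycl::get_pointer_type``), and all names are illustrative assumptions, not dpnp or dpctl API:

```python
def validate_dlpack_import(ptr, device_id, devices, pointer_registry):
    """Mirror the import-side checks: resolve the device from device_id,
    then verify the pointer is a known USM allocation in that device's
    default platform context (simulated by pointer_registry)."""
    try:
        root_device = devices[device_id]
    except IndexError:
        raise TypeError("DLPack device_id out of range")
    # pointer_registry maps ptr -> (usm_kind, device); "unknown" plays the
    # role of sycl::usm::kind::unknown for unbound pointers
    alloc_type, alloc_dev = pointer_registry.get(ptr, ("unknown", None))
    if alloc_type == "unknown":
        raise TypeError(
            "Data pointer in DLPack is not bound to the "
            "default platform context of specified device"
        )
    return root_device, alloc_type, alloc_dev


registry = {0x1000: ("device", "gpu:0")}
print(validate_dlpack_import(0x1000, 0, ["gpu:0"], registry))
```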
Support of DLPack with ``kDLOneAPI`` device type
------------------------------------------------

:py:mod:`dpnp.tensor` supports DLPack v0.8. Exchange of USM allocations made using the Level-Zero backend
is supported with ``torch.Tensor(device='xpu')`` for PyTorch when using `intel-extension-for-pytorch <intel_ext_for_torch_>`_,
as well as for TensorFlow when `intel-extension-for-tensorflow <intel_ext_for_tf_>`_ is used.

.. _intel_ext_for_torch: https://github.com/intel/intel-extension-for-pytorch
.. _intel_ext_for_tf: https://github.com/intel/intel-extension-for-tensorflow
doc/user_guides/execution_model.rst

Lines changed: 146 additions & 0 deletions

@@ -0,0 +1,146 @@
.. _dpnp_execution_model:

########################
oneAPI programming model
########################

oneAPI library and its Python interface
=======================================

Using oneAPI libraries, a user calls functions that take a ``sycl::queue`` and a collection of
``sycl::event`` objects among other arguments. For example:

.. code-block:: cpp
    :caption: Prototypical call signature of a oneMKL function

    sycl::event
    compute(
        sycl::queue &exec_q,
        ...,
        const std::vector<sycl::event> &dependent_events
    );

The function ``compute`` inserts computational tasks into the queue ``exec_q`` for the DPC++ runtime to
execute on the device the queue targets. The execution may begin only after other tasks, whose
execution status is represented by the ``sycl::event`` objects in the provided ``dependent_events``
vector, complete. If the vector is empty, the runtime begins the execution as soon as the device is
ready. The function returns a ``sycl::event`` object representing completion of the set of
computational tasks submitted by the ``compute`` function.
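The ordering semantics just described (a task waits on its dependent events; the returned event signals its completion) can be modeled with a toy pure-Python scheduler. This is an illustration of the semantics only, assuming nothing about SYCL API:

```python
class Event:
    """Toy stand-in for sycl::event: a completion flag."""
    def __init__(self):
        self.complete = False


class Queue:
    """Toy stand-in for sycl::queue: runs tasks once their dependencies complete."""
    def __init__(self):
        self._tasks = []

    def submit(self, fn, dependent_events=()):
        ev = Event()
        self._tasks.append((fn, tuple(dependent_events), ev))
        return ev

    def wait(self):
        # Repeatedly run tasks whose dependent events have all completed
        while self._tasks:
            ready = [t for t in self._tasks if all(e.complete for e in t[1])]
            if not ready:
                raise RuntimeError("unsatisfiable dependency")
            for fn, _, ev in ready:
                fn()
                ev.complete = True
            self._tasks = [t for t in self._tasks if not t[2].complete]


q = Queue()
order = []
e1 = q.submit(lambda: order.append("first"))
e2 = q.submit(lambda: order.append("second"), dependent_events=(e1,))
q.wait()
print(order)  # ['first', 'second']
```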
Hence, in the oneAPI programming model, the execution **queue** is used to specify which device the
function will execute on. To create a queue, one must specify a device to target.

In :mod:`dpctl`, the ``sycl::queue`` is represented by the :class:`dpctl.SyclQueue` Python type,
and a Python API to call such a function might look like

.. code-block:: python

    def call_compute(
        exec_q : dpctl.SyclQueue,
        ...,
        dependent_events : List[dpctl.SyclEvent] = []
    ) -> dpctl.SyclEvent:
        ...

When building a Python API for a SYCL offloading function, even if you choose to
map the SYCL API to a different API on the Python side, it must still translate to a
similar call under the hood.

The arguments to the function must be suitable for use in the offloading functions.
Typically these are Python scalars, or objects representing USM allocations, such as
:class:`dpnp.tensor.usm_ndarray`, :class:`dpctl.memory.MemoryUSMDevice` and friends.

.. note::
    The USM allocations these objects represent must not get deallocated before
    offloaded tasks that access them complete.

    This is something authors of DPC++-based Python extensions must take care of,
    and something users of such extensions may safely assume is assured.
USM allocations and compute-follows-data
========================================

To make a USM allocation on a device in SYCL, one needs to specify the ``sycl::device`` in whose
memory the allocation is made, and the ``sycl::context`` to which the allocation
is bound.

A ``sycl::queue`` object is often used instead. In such cases the ``sycl::context`` and ``sycl::device`` associated
with the queue are used to make the allocation.

.. important::
    :mod:`dpnp.tensor` associates a queue object with every USM allocation.

    The associated queue may be queried using the ``.sycl_queue`` property of the
    Python type representing the USM allocation.

This design choice gives :mod:`dpnp.tensor` a preferred queue to use when operating on any single
USM allocation. For example:

.. code-block:: python

    def unary_func(x : dpnp.tensor.usm_ndarray):
        code1
        _ = _func_impl(x.sycl_queue, ...)
        code2

When combining several objects representing USM allocations, the
:ref:`programming model <dpnp_tensor_compute_follows_data>`
adopted in :mod:`dpnp.tensor` insists that the queues associated with the objects all be the same, in which
case that common queue is used as the execution queue; otherwise :exc:`dpctl.utils.ExecutionPlacementError` is raised.

.. code-block:: python

    def binary_func(
        x1 : dpnp.tensor.usm_ndarray,
        x2 : dpnp.tensor.usm_ndarray
    ):
        exec_q = dpctl.utils.get_execution_queue((x1.sycl_queue, x2.sycl_queue))
        if exec_q is None:
            raise dpctl.utils.ExecutionPlacementError
        ...
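The deduction performed by ``dpctl.utils.get_execution_queue`` can be summarized in a few lines of plain Python: return the common queue if all inputs carry equal queues, and ``None`` otherwise. This is a sketch of the semantics, not dpctl's implementation:

```python
def get_execution_queue_sketch(queues):
    """Return the shared queue if all entries compare equal, else None."""
    queues = list(queues)
    if not queues:
        return None
    first = queues[0]
    return first if all(q == first for q in queues[1:]) else None


print(get_execution_queue_sketch(["q1", "q1"]))  # q1
print(get_execution_queue_sketch(["q1", "q2"]))  # None
```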
To ensure that compute-follows-data works seamlessly out of the box, :mod:`dpnp.tensor` maintains
a cache, keyed by context and device, of the queues used by the :class:`dpnp.tensor.Device` class.

.. code-block:: python

    >>> import dpctl
    >>> from dpnp import tensor

    >>> sycl_dev = dpctl.SyclDevice("cpu")
    >>> d1 = tensor.Device.create_device(sycl_dev)
    >>> d2 = tensor.Device.create_device("cpu")
    >>> d3 = tensor.Device.create_device(dpctl.select_cpu_device())

    >>> d1.sycl_queue == d2.sycl_queue, d1.sycl_queue == d3.sycl_queue, d2.sycl_queue == d3.sycl_queue
    (True, True, True)
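Such a cache can be sketched as a dictionary keyed by the (context, device) pair; ``make_queue`` below is a hypothetical constructor standing in for queue creation, not a dpctl function:

```python
_queue_cache = {}


def make_queue(context, device):
    # Hypothetical queue constructor; returns a fresh, distinct object
    return object()


def cached_queue(context, device):
    """Return the one queue associated with (context, device), creating it
    on first use so repeated requests compare equal."""
    key = (context, device)
    if key not in _queue_cache:
        _queue_cache[key] = make_queue(context, device)
    return _queue_cache[key]


q_a = cached_queue("ctx0", "cpu")
q_b = cached_queue("ctx0", "cpu")
print(q_a is q_b)  # True: same cached queue for the same key
```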
Since the :class:`dpnp.tensor.Device` class is used by all :ref:`array creation functions <dpnp_tensor_creation_functions>`
in :mod:`dpnp.tensor`, using the same value as the ``device`` keyword argument results in array instances that
can be combined together in accordance with the compute-follows-data programming model.

.. code-block:: python

    >>> from dpnp import tensor
    >>> import dpctl

    >>> # queue for default-constructed device is used
    >>> x1 = tensor.arange(100, dtype="int32")
    >>> x2 = tensor.zeros(100, dtype="int32")
    >>> x12 = tensor.concat((x1, x2))
    >>> x12.sycl_queue == x1.sycl_queue, x12.sycl_queue == x2.sycl_queue
    (True, True)
    >>> # the default constructor of the SyclQueue class creates a different queue instance each time
    >>> q1 = dpctl.SyclQueue()
    >>> q2 = dpctl.SyclQueue()
    >>> q1 == q2
    False
    >>> y1 = tensor.arange(100, dtype="int32", sycl_queue=q1)
    >>> y2 = tensor.zeros(100, dtype="int32", sycl_queue=q2)
    >>> # this call raises ExecutionPlacementError since compute-follows-data
    >>> # rules are not met
    >>> tensor.concat((y1, y2))

Please refer to the :ref:`array migration <dpnp_tensor_array_migration>` section of the introduction to
:mod:`dpnp.tensor` for examples of how to resolve ``ExecutionPlacementError`` exceptions.

doc/user_guides/index.rst

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@

.. _user_guides:

***********
User Guides
***********

.. toctree::
    :maxdepth: 2

    tensor_intro
    execution_model
    dlpack
