Commit a427d2e

reflow
1 parent c6e47e0 commit a427d2e

1 file changed

Lines changed: 29 additions & 11 deletions

src/offload/internals.md

@@ -4,26 +4,44 @@ This module is under active development.
44
Once upstream, it should allow Rust developers to run Rust code on GPUs.
55
We aim to develop a `rusty` GPU programming interface, which is safe, convenient and sufficiently fast by default.
66
This includes automatic data movement to and from the GPU, in an efficient way.
7-
We will (later)
8-
also offer more advanced, possibly unsafe, interfaces which allow a higher degree of control.
7+
We will (later) also offer more advanced,
8+
possibly unsafe, interfaces which allow a higher degree of control.
99

10-
The implementation is based on LLVM's "offload" project, which is already used by OpenMP to run Fortran or C++ code on GPUs.
11-
While the project is under development, users will need to call other compilers like clang to finish the compilation process.
10+
The implementation is based on LLVM's "offload" project,
11+
which is already used by OpenMP to run Fortran or C++ code on GPUs.
12+
While the project is under development,
13+
users will need to call other compilers like clang to finish the compilation process.
1214

1315
## High-level compilation design:
16+
1417
We use a single-source, two-pass compilation approach.
1518

16-
First we compile all functions that should be offloaded for the device (e.g nvptx64, amdgcn-amd-amdhsa, intel in the future).
19+
First we compile all functions that should be offloaded for the device
20+
(e.g nvptx64, amdgcn-amd-amdhsa, intel in the future).
1721
Currently we require cumbersome `#[cfg(target_os = "")]` annotations, but we intend to recognize those in the future based on our offload intrinsic.
1822
This first compilation does not yet leverage rustc's internal query system, so it always recompiles your kernels.
1923
This should be easy to fix, but we prioritize features and runtime performance improvements at the moment.
2024
Please reach out if you want to implement it, though!
2125

22-
We then compile the code for the host (e.g. x86-64), where most of the offloading logic happens. On the host side, we generate calls to the openmp offload runtime, to inform it about the layout of the types (a simplified version of the autodiff TypeTrees). We also use the type system to figure out whether kernel arguments have to be moved only to the device (e.g. `&[f32;1024]`), from the device, or both (e.g. `&mut [f64]`). We then launch the kernel, after which we inform the runtime to end this environment and move data back (as far as needed).
26+
We then compile the code for the host (e.g. x86-64), where most of the offloading logic happens.
27+
On the host side, we generate calls to the OpenMP offload runtime,
28+
to inform it about the layout of the types (a simplified version of the autodiff TypeTrees).
29+
We also use the type system to figure out whether kernel arguments have to be moved only to the device (e.g. `&[f32;1024]`),
30+
from the device, or both (e.g. `&mut [f64]`).
31+
We then launch the kernel,
32+
after which we inform the runtime to end this environment and move data back (as far as needed).
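
The type-directed classification described in the lines above can be illustrated with a minimal sketch. It is not the implementation in rustc: the `Transfer` enum, the `KernelArg` trait, and the `transfer_of` helper are hypothetical names introduced here, and the sketch only covers the two cases the text gives examples for (shared references move to the device, mutable references move both ways). The text gives no example type for the device-to-host-only case, so no type is mapped to it here.

```rust
/// Hypothetical classification of how a kernel argument is moved.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Transfer {
    /// Copied to the device before the launch only.
    ToDevice,
    /// Copied back from the device after the launch only
    /// (no type is mapped to this case in the sketch).
    FromDevice,
    /// Copied to the device and back again.
    Both,
}

/// Hypothetical trait that derives the transfer direction from the type.
trait KernelArg {
    const TRANSFER: Transfer;
}

// A shared reference cannot be written through, so it only moves to the device.
impl<'a, T: ?Sized> KernelArg for &'a T {
    const TRANSFER: Transfer = Transfer::ToDevice;
}

// A mutable reference may be read and written, so it moves both ways.
impl<'a, T: ?Sized> KernelArg for &'a mut T {
    const TRANSFER: Transfer = Transfer::Both;
}

/// Look up the transfer direction for an argument type.
fn transfer_of<A: KernelArg>() -> Transfer {
    A::TRANSFER
}

fn main() {
    println!("{:?}", transfer_of::<&[f32; 1024]>());
    println!("{:?}", transfer_of::<&mut [f64]>());
}
```

The lookup happens entirely at compile time through an associated constant, which mirrors the idea of deciding data movement from the type system rather than from runtime information.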
2333

2434
The second pass for the host will load the kernel artifacts from the previous compilation.
25-
rustc in general may not "guess" or hardcode the build directory layout, and as such it must be told the path to the kernel artifacts in the second invocation.
26-
The logic for this could be integrated into cargo, but it also only requires a trivial cargo wrapper, which we could trivially provide via crates.io till we see larger adoption.
27-
28-
It might seem tempting to think about a single-source, single pass compilation approach. However, a lot of the rustc frontend (e.g. AST) will drop any dead code (e.g. code behind an inactive `cfg`). Getting the frontend to expand and lower code for two targets naively will result in multiple definitions of the same symbol (and other issues). Trying to teach the whole rustc middle and backend to be aware that any symbol now might contain two implementations is a large undertaking, and it is questionable why we should make the whole compiler more complex, if the alternative is a ~5 line cargo wrapper. We still control the full compilation pipeline and have both host and device code available, therefore there shouldn't be a runtime performance difference between the two approaches.
29-
35+
rustc in general may not "guess" or hardcode the build directory layout,
36+
and as such it must be told the path to the kernel artifacts in the second invocation.
37+
The logic for this could be integrated into cargo,
38+
but it also only requires a trivial cargo wrapper,
39+
which we could provide via crates.io until we see larger adoption.
40+
41+
It might seem tempting to think about a single-source, single-pass compilation approach.
42+
However, a lot of the rustc frontend (e.g. AST) will drop any dead code (e.g. code behind an inactive `cfg`).
43+
Getting the frontend to expand and lower code for two targets naively will result in multiple definitions of the same symbol (and other issues).
44+
Trying to teach the whole rustc middle and backend to be aware that any symbol now might contain two implementations is a large undertaking,
45+
and it is questionable why we should make the whole compiler more complex, if the alternative is a ~5 line cargo wrapper.
46+
We still control the full compilation pipeline and have both host and device code available,
47+
therefore there shouldn't be a runtime performance difference between the two approaches.
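
As an illustration of the `cfg` argument in the diff above, here is a small sketch in plain Rust, unrelated to the offload implementation itself. The function name `flavor` and the choice of `target_os = "linux"` are assumptions made for this example. In any single compilation only one of the two definitions survives `cfg` expansion; the other is dropped before later compiler stages see it. Keeping both definitions active at once, as a naive single-pass approach would need to, yields a duplicate-definition error.

```rust
// Only one of the two `flavor` definitions exists after cfg expansion.
#[cfg(target_os = "linux")]
fn flavor() -> &'static str {
    "linux"
}

#[cfg(not(target_os = "linux"))]
fn flavor() -> &'static str {
    "not linux"
}

fn main() {
    // Prints whichever definition was kept for the current target.
    println!("compiled flavor: {}", flavor());
}
```

Compiling the same source once per target, as the two-pass design does, sidesteps the problem: each pass sees exactly one definition of every symbol.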
