Merged
33 changes: 33 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,33 @@
---
name: Bug Report
about: Report a bug or unexpected behavior
title: ''
labels: bug
assignees: ''
---

**Describe the bug**
A clear and concise description of what the bug is.

**To reproduce**
Steps to reproduce the behavior:

1. ...
2. ...

**Expected behavior**
A clear and concise description of what you expected to happen.

**Environment**

- OS: [e.g. Windows 11, Ubuntu 22.04]
- GPU: [e.g. RTX 3060]
- CUDA Toolkit version: [e.g. 13.2]
- cuDNN version (if applicable): [e.g. 9.x]
- Rust toolchain: [output of `rustc --version`]

**Error output**
If applicable, paste the full error message or log output.

**Additional context**
Add any other context about the problem here.
19 changes: 19 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,19 @@
---
name: Feature Request
about: Suggest an enhancement or new feature
title: ''
labels: enhancement
assignees: ''
---

**Is your feature request related to a problem?**
A clear and concise description of the problem. E.g. "I'm always frustrated when..."

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context, links to CUDA documentation, or references here.
20 changes: 20 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,20 @@
## Summary

Brief description of what this PR does and why.

Closes #ISSUE_NUMBER

## Changes

- ...
- ...

## Testing

- [ ] `cargo build` passes
- [ ] `cargo clippy --workspace` passes
- [ ] Tested on: [OS, GPU, CUDA version]

## Notes

Any additional context for reviewers.
190 changes: 190 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,190 @@
# Contributing to Rust CUDA

Welcome! We're glad you're interested in contributing to the Rust CUDA project. We welcome
contributions from people of all backgrounds who are interested in making great software with us.

## Getting Help

For questions, clarifications, and general help:

1. Search existing [GitHub issues](https://github.com/Rust-GPU/rust-cuda/issues)
2. If you can't find the answer, open a new issue or start a discussion

## Prerequisites

### Required

- **CUDA Toolkit** (12.x or 13.x recommended). Install from
[NVIDIA's website](https://developer.nvidia.com/cuda-downloads).
- **Rust nightly toolchain** -- the project pins a specific nightly via
[`rust-toolchain.toml`](rust-toolchain.toml). Running any `cargo` command in the repo
will automatically install the correct version if you have `rustup`.
- **LLVM tools** -- installed automatically by `rustup` as part of the pinned toolchain
components.
- A **CUDA-capable GPU** with compute capability >= 3.0.

### Optional

- **cuDNN** -- required only if you're building the `cudnn` / `cudnn-sys` crates. Install
from [NVIDIA cuDNN](https://developer.nvidia.com/cudnn).
- **mdBook** -- required to build the guide locally. Install with
`cargo install mdbook`.

### Windows-Specific Notes

- Ensure the CUDA Toolkit `bin` directory is on your `PATH` (e.g.
`C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.2\bin`).
- The MSVC build tools are required. Install via
[Visual Studio Build Tools](https://visualstudio.microsoft.com/downloads/) with the
"Desktop development with C++" workload.
- If using cuDNN, place the cuDNN files in your CUDA Toolkit directory or set
`CUDNN_PATH` to point to the cuDNN installation.
- Some crates require `advapi32` for linking (handled automatically by build scripts).

### Linux-Specific Notes

- Ensure `nvcc` is on your `PATH` and `LD_LIBRARY_PATH` includes the CUDA lib directory.
- The project provides container images for CI; see
`.github/workflows/ci_linux.yml` for reference.
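
As a sketch, the environment setup described above might look like the following in a shell profile. The `CUDA_HOME` prefix is an assumption for a default Linux install; adjust it to your toolkit version and location:

```sh
# Assumed install prefix; adjust to match your CUDA version/location.
export CUDA_HOME=/usr/local/cuda
# Make nvcc discoverable and let the loader find the CUDA libraries.
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
```

You can verify the setup with `nvcc --version` once the toolkit is installed.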

## Building

Build the entire workspace:

```sh
cargo build
```

Build a specific crate:

```sh
cargo build -p cust
cargo build -p cudnn
```

Run clippy:

```sh
cargo clippy --workspace
```

Run tests (requires a CUDA-capable GPU):

```sh
cargo test --workspace
```

### Building the Guide

The user-facing documentation is an [mdBook](https://rust-lang.github.io/mdBook/) located
in the `guide/` directory.

```sh
# Install mdBook (one-time)
cargo install mdbook

# Build and serve locally
mdbook serve guide --open
```

## Running Examples

Examples live in the `examples/` and `samples/` directories:

```sh
# Vector addition
cargo run -p vecadd

# Matrix multiplication (GEMM)
cargo run -p gemm
```

See [`examples/README.md`](examples/README.md) for the full list.

## Issues

### Feature Requests

If you have ideas for improvements, suggest features by opening a GitHub issue. Include
details about the feature and describe any use cases it would enable.

### Bug Reports

When reporting a bug, make sure your issue describes:

- Steps to reproduce the behavior
- Your platform (OS, GPU, CUDA version, Rust toolchain version)
- Any error messages or logs

### Wontfix

Issues may be closed as `wontfix` if they are misaligned with the project vision or out of
scope. We will comment on the issue with detailed reasoning.

## Contribution Workflow

### Finding Work

Start by looking at open issues tagged as
[`help wanted`](https://github.com/Rust-GPU/rust-cuda/issues?q=is%3Aopen+is%3Aissue+label%3A%22help+wanted%22)
or
[`good first issue`](https://github.com/Rust-GPU/rust-cuda/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).

Comment on the issue to let others know you're working on it.

### Pull Request Process

1. **Fork** the repository.
2. **Create a new feature branch** from `main`.
3. **Make your changes.** Ensure there are no build errors by running `cargo build` and
`cargo clippy --workspace` locally.
4. **Open a pull request** with a clear title and description of what you did.
5. A maintainer will review your pull request and may ask you to make changes.

### Commit Messages

This project follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/)
specification. Each commit message should have the format:

```
<type>(<scope>): <description>
```

**Types:** `feat`, `fix`, `docs`, `chore`, `ci`, `test`, `refactor`, `perf`, `style`

**Scopes** (common examples): `cust`, `cudnn`, `cudnn-sys`, `cust_raw`, `cuda_std`,
`nvvm`, `vecadd`, `guide`, `windows`

**Examples:**

```
feat(cudnn): add batch normalization forward/backward
fix(cust_raw): correct Windows CUDA path discovery
docs(guide): add Windows getting-started section
ci(windows): include vecadd in workspace build
```

## Project Structure

| Directory | Description |
| --- | --- |
| `crates/cust` | High-level safe wrapper around the CUDA Driver API |
| `crates/cust_core` | Core `DeviceCopy` trait shared between host and device |
| `crates/cust_raw` | Low-level `bindgen` bindings to CUDA SDK |
| `crates/cudnn` | Type-safe cuDNN wrapper |
| `crates/cudnn-sys` | Low-level `bindgen` bindings to cuDNN |
| `crates/cuda_std` | GPU-side standard library |
| `crates/cuda_std_macros` | Proc macros (`#[kernel]`, `#[gpu_only]`, etc.) |
| `crates/cuda_builder` | Build-time helper for compiling GPU kernels |
| `crates/rustc_codegen_nvvm` | Custom rustc backend targeting NVVM/PTX |
| `crates/nvvm` | Wrapper around NVIDIA's libNVVM |
| `crates/blastoff` | cuBLAS bindings |
| `examples/` | Example programs |
| `samples/` | Ports of NVIDIA CUDA samples |
| `guide/` | mdBook source for the Rust CUDA Guide |

## Licensing

This project is dual-licensed under Apache-2.0 or MIT, at your option. Unless you
explicitly state otherwise, any contribution intentionally submitted for inclusion in the
work shall be dual-licensed as above, without any additional terms or conditions.
3 changes: 3 additions & 0 deletions crates/cuda_builder/src/lib.rs
@@ -809,6 +809,9 @@ fn invoke_rustc(builder: &CudaBuilder) -> Result<PathBuf, CudaBuilderError> {

let cargo_encoded_rustflags = join_checking_for_separators(rustflags, "\x1f");

// HACK(fee1-dead): didn't seem like there was a better way to disable f16/f128; the `target_config` did not work for some reason.
cargo.env("CARGO_FEATURE_NO_F16_F128", "1");

let build = cargo
.stderr(Stdio::inherit())
.current_dir(&builder.path_to_crate)
3 changes: 1 addition & 2 deletions crates/cudnn/src/backend/graph.rs
@@ -1,4 +1,4 @@
use crate::{
CudnnContext, CudnnError,
backend::{Descriptor, Operation},
};
@@ -39,7 +39,6 @@ impl GraphBuilder {
let descriptors = operations
.iter()
.map(|op| match op {
Operation::ConvBwdData { raw, .. } => raw.inner(),
Operation::ConvBwdFilter { raw, .. } => raw.inner(),
Operation::ConvFwd { raw, .. } => raw.inner(),
40 changes: 33 additions & 7 deletions crates/rustc_codegen_nvvm/src/builder.rs
@@ -526,8 +526,20 @@ impl<'ll, 'tcx, 'a> BuilderMethods<'a, 'tcx> for Builder<'a, 'll, 'tcx> {
order: AtomicOrdering,
_size: Size,
) -> &'ll Value {
// Since for any A, A | 0 = A, and performing atomics on constant memory is UB in Rust, we
// can abuse bitwise-or to perform atomic reads.
//
// njn: is `ty` the type of the loaded value, or the type of the
// pointer to the loaded-from address? i.e. `T` or `*const T`? I'm
// assuming `T`
let ret_ptr = unsafe { llvm::LLVMRustGetTypeKind(ty) == llvm::TypeKind::Pointer };
self.atomic_rmw(
AtomicRmwBinOp::AtomicOr,
ptr,
self.const_int(ty, 0),
order,
ret_ptr,
)
}
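
The identity the comment above relies on can be checked at the Rust level. This is a standalone illustration, not part of the codegen itself; the helper name `atomic_load_via_or` is mine. It shows that a fetch-or with zero returns the current value while leaving the atomic unchanged, which is why the backend can lower an atomic load to an `AtomicOr` RMW:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Emulate an atomic load via fetch_or(0): since A | 0 == A, the RMW
// returns the current value and writes it back unchanged.
fn atomic_load_via_or(a: &AtomicU32) -> u32 {
    a.fetch_or(0, Ordering::SeqCst)
}

fn main() {
    let a = AtomicU32::new(42);
    assert_eq!(atomic_load_via_or(&a), 42);
    // The stored value is untouched by the "load".
    assert_eq!(a.load(Ordering::SeqCst), 42);
}
```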

fn load_operand(&mut self, place: PlaceRef<'tcx, &'ll Value>) -> OperandRef<'tcx, &'ll Value> {
@@ -760,7 +772,9 @@ impl<'ll, 'tcx, 'a> BuilderMethods<'a, 'tcx> for Builder<'a, 'll, 'tcx> {
_size: Size,
) {
// We can exchange *ptr with val, and then discard the result.
let ret_ptr =
unsafe { llvm::LLVMRustGetTypeKind(llvm::LLVMTypeOf(val)) == llvm::TypeKind::Pointer };
self.atomic_rmw(AtomicRmwBinOp::AtomicXchg, ptr, val, order, ret_ptr);
}
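
The exchange-and-discard trick can likewise be illustrated in plain Rust (again a standalone sketch; `atomic_store_via_swap` is a name I chose for illustration): swapping the new value in and ignoring the old value has the same observable effect as an atomic store.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Emulate an atomic store via swap: exchange the new value in and
// discard the previous value returned by the RMW.
fn atomic_store_via_swap(a: &AtomicU32, val: u32) {
    let _old = a.swap(val, Ordering::SeqCst);
}

fn main() {
    let a = AtomicU32::new(1);
    atomic_store_via_swap(&a, 7);
    assert_eq!(a.load(Ordering::SeqCst), 7);
}
```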

fn gep(&mut self, ty: &'ll Type, ptr: &'ll Value, indices: &[&'ll Value]) -> &'ll Value {
@@ -1217,17 +1231,19 @@ impl<'ll, 'tcx, 'a> BuilderMethods<'a, 'tcx> for Builder<'a, 'll, 'tcx> {
let success = self.extract_value(res, 1);
(val, success)
}

fn atomic_rmw(
&mut self,
op: AtomicRmwBinOp,
dst: &'ll Value,
src: &'ll Value,
order: AtomicOrdering,
ret_ptr: bool,
) -> &'ll Value {
if matches!(op, AtomicRmwBinOp::AtomicNand) {
self.fatal("Atomic NAND not supported yet!")
}
let mut res = self.atomic_op(
dst,
|builder, dst| {
// We are in a supported address space - just use ordinary atomics
@@ -1243,8 +1259,8 @@ impl<'ll, 'tcx, 'a> BuilderMethods<'a, 'tcx> for Builder<'a, 'll, 'tcx> {
}
},
|builder, dst| {
// Local space is only accessible to the current thread. So, there are no
// synchronization issues, and we can emulate it using a simple load/compare/store.
let load: &'ll Value =
unsafe { llvm::LLVMBuildLoad(builder.llbuilder, dst, UNNAMED) };
let next_val = match op {
@@ -1278,7 +1294,17 @@ impl<'ll, 'tcx, 'a> BuilderMethods<'a, 'tcx> for Builder<'a, 'll, 'tcx> {
unsafe { llvm::LLVMBuildStore(builder.llbuilder, next_val, dst) };
load
},
);

// njn:
// - copied from rustc_codegen_llvm
// - but Fractal said: Here, if ret_ptr is true, we should cast dst to *usize, src to
// usize, and then cast the return value back to a *T(by checking the original type of
// src).
if ret_ptr && self.val_ty(res) != self.type_ptr() {
res = self.inttoptr(res, self.type_ptr());
}
res
}

fn atomic_fence(
4 changes: 0 additions & 4 deletions crates/rustc_codegen_nvvm/src/context.rs
@@ -816,10 +816,6 @@ impl<'tcx> FnAbiOfHelpers<'tcx> for CodegenCx<'_, 'tcx> {
}

impl<'tcx> CoverageInfoBuilderMethods<'tcx> for CodegenCx<'_, 'tcx> {
fn init_coverage(&mut self, _instance: Instance<'tcx>) {
todo!()
}

fn add_coverage(
&mut self,
_instance: Instance<'tcx>,
3 changes: 2 additions & 1 deletion crates/rustc_codegen_nvvm/src/intrinsic.rs
@@ -612,7 +612,8 @@ impl<'ll, 'tcx> IntrinsicCallBuilderMethods<'tcx> for Builder<'_, 'll, 'tcx> {
// This piece of code was adapted from `rustc_codegen_cranelift`.
let intrinsic = self.tcx.intrinsic(instance.def_id()).unwrap();
if intrinsic.must_be_overridden {
span_bug!(
span,
"intrinsic {} must be overridden by codegen_nvvm, but isn't",
intrinsic.name,
);