feat(core): add host, pinned, and device memory management utilities

Eamon2009 · Eamon2009 · commit 184b7ffde73e · 2026-05-21T18:46:53.000+05:30
- Implement memory wrappers for host (`malloc`), pinned host (`cudaMallocHost`), and aligned device allocations (`cudaMalloc`).
- Enforce strict memory layout by rounding device allocation sizes up to a multiple of `QX_MEM_ALIGN`.
- Add `tensor_alloc_device` and `tensor_alloc_host` factory allocators with automatic initialization.
- Implement a unified `tensor_free` that safely deallocates tensors in any memory space.
- Add an asynchronous host-to-device copy routine (`tensor_h2d`).
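
The round-up to `QX_MEM_ALIGN` mentioned above can be sketched as follows. This is a minimal host-only illustration, not the code from `memory.cuh`: the helper name `align_up` and the value 256 are assumptions, since the actual definition of `QX_MEM_ALIGN` lives in `common.h` and is not shown in this diff. The bit-mask form also assumes the alignment is a power of two.

```cpp
#include <cassert>
#include <cstddef>

// Assumed value for illustration only; the real QX_MEM_ALIGN is defined in common.h.
constexpr std::size_t kAssumedMemAlign = 256;

// Hypothetical helper: round `bytes` up to the next multiple of `align`.
// `align` must be a non-zero power of two for the mask trick to be valid.
constexpr std::size_t align_up(std::size_t bytes,
                               std::size_t align = kAssumedMemAlign) {
    return (bytes + align - 1) & ~(align - 1);
}

// Compile-time checks of the rounding behaviour.
static_assert(align_up(0) == 0, "zero bytes stays zero");
static_assert(align_up(1) == 256, "a partial block rounds up to one block");
static_assert(align_up(256) == 256, "an exact multiple is unchanged");
static_assert(align_up(257) == 512, "one byte over rounds up to the next block");
```

A device allocator would pass the rounded size to `cudaMalloc` rather than the requested size, so every buffer ends on an alignment boundary.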
diff --git a/cuda/includes/utils.cuh b/cuda/includes/utils.cuh
@@ -0,0 +1,9 @@
+#pragma once
+
+// Aggregator — include this one header to get the full Day 1 runtime.
+// Each sub-header is small and independently loadable.
+
+#include "common.h"   // macros, enums, error checks, dtype helpers
+#include "tensor.cuh" // TensorShape, Tensor struct
+#include "memory.cuh" // allocators, tensor_alloc_*, tensor_free, transfers
+#include "reduce.cuh" // warpReduceSum/Max/Min, blockReduceSum/Max