
Commit 25c1aa5

MagicalTux and claude committed
Support multiple GPUs: one pipeline per GPU, --jobs applied per GPU
Previously every runner bound device 0, so --jobs>1 just piled onto one GPU and extra GPUs sat idle. Now:

- cuda::device_count() enumerates GPUs; Gpu::load_first takes a device ordinal and binds it on the calling thread.
- run_worker builds one independent pipeline (prefetch -> N runners -> N uploaders) per selected GPU, so --jobs is the concurrency *per GPU* (2 GPUs, --jobs 3 => 6 concurrent runs).
- New --gpus selector (e.g. "0,2"); default is every detected GPU.
- A shared per-content-hash download lock (Downloads) coordinates the now-multiple prefetch threads so two GPUs fetching the same job's blobs don't race on the same file in the shared cache; the existing shared in-flight set already dedupes a fragment handed to two GPUs.
- Per-fragment "running" log and startup banner now name the GPU.

Unit tests for --gpus parsing/validation. Single-GPU path verified end-to-end (claim -> run on GPU#0 -> next fragment prefetched under --jobs 2).

Bump version to 0.1.9.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
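The code that parses `--gpus` is not part of the files shown on this page, so the following is only a minimal sketch of how a selector such as "0,2" could be validated against the detected device count. The function name `select_gpus`, its signature, and the specific error rules (rejecting duplicates, sorting the result) are assumptions made for illustration, not the commit's implementation.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not from the commit): resolve a `--gpus` selector
/// against the number of detected devices. `None` means "no selector given",
/// which defaults to every detected GPU, as the commit message describes.
fn select_gpus(selector: Option<&str>, detected: i32) -> Result<Vec<i32>, String> {
    if detected <= 0 {
        return Err("no CUDA devices detected".to_string());
    }
    let Some(spec) = selector else {
        // Default: ordinals 0..detected.
        return Ok((0..detected).collect());
    };
    // BTreeSet rejects duplicates and yields ordinals in ascending order.
    let mut chosen = BTreeSet::new();
    for part in spec.split(',') {
        let part = part.trim();
        let ordinal: i32 = part
            .parse()
            .map_err(|_| format!("invalid GPU ordinal {part:?}"))?;
        if ordinal < 0 || ordinal >= detected {
            return Err(format!("GPU {ordinal} out of range (detected {detected})"));
        }
        // Assumption: listing the same GPU twice is treated as an error.
        if !chosen.insert(ordinal) {
            return Err(format!("GPU {ordinal} listed twice"));
        }
    }
    Ok(chosen.into_iter().collect())
}
```

In this sketch, `select_gpus(Some("0,2"), 3)` selects ordinals 0 and 2, while an out-of-range or malformed entry produces an error message rather than a silently shortened list.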
1 parent 1f38213 commit 25c1aa5
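The commit message mentions a shared per-content-hash download lock named `Downloads`, but its source is not shown on this page. As a rough sketch of the idea only, one mutex per content hash can be kept in a shared map so that two prefetch threads fetching the same blob serialize while different hashes proceed in parallel. The field layout and the `with_lock` method below are assumptions and may differ from the real type.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the commit's `Downloads` type:
/// one mutex per content hash, created on first use.
#[derive(Default)]
struct Downloads {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl Downloads {
    /// Run `fetch` while holding the lock for `hash`. Callers using the same
    /// hash run one at a time; callers using different hashes do not block
    /// each other.
    fn with_lock<T>(&self, hash: &str, fetch: impl FnOnce() -> T) -> T {
        let slot = {
            let mut map = self.locks.lock().unwrap();
            map.entry(hash.to_string()).or_default().clone()
        }; // The map lock is released here, before the download starts.
        let _guard = slot.lock().unwrap();
        fetch()
    }
}
```

A prefetch thread would wrap its "check the cache, then download if missing" step in `with_lock`, so the second thread to arrive finds the file already present. Note that this sketch never removes entries from the map, which a long-running worker would probably want to address.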

4 files changed

Lines changed: 178 additions & 50 deletions


Cargo.lock

Lines changed: 1 addition & 1 deletion

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "decryptd"
-version = "0.1.8"
+version = "0.1.9"
 edition = "2024"
 license = "Proprietary"
 authors = ["Karpeles Lab Inc"]

src/cuda.rs

Lines changed: 18 additions & 5 deletions
@@ -27,6 +27,7 @@ type CuDeviceptr = u64;
 #[allow(non_snake_case)]
 unsafe extern "C" {
     fn cuInit(flags: u32) -> CuResult;
+    fn cuDeviceGetCount(count: *mut i32) -> CuResult;
     fn cuDeviceGet(device: *mut CuDevice, ordinal: i32) -> CuResult;
     fn cuDeviceGetAttribute(pi: *mut i32, attrib: i32, dev: CuDevice) -> CuResult;
     fn cuDeviceGetName(name: *mut c_char, len: i32, dev: CuDevice) -> CuResult;
@@ -119,6 +120,16 @@ impl Drop for DeviceBuf {
     }
 }
 
+/// Number of CUDA devices visible to the driver (after `CUDA_VISIBLE_DEVICES`).
+pub fn device_count() -> Result<i32, String> {
+    unsafe {
+        check(cuInit(0), "cuInit")?;
+        let mut n: i32 = 0;
+        check(cuDeviceGetCount(&mut n), "cuDeviceGetCount")?;
+        Ok(n)
+    }
+}
+
 /// An initialized CUDA context with a module loaded.
 pub struct Gpu {
     ctx: CuContext,
@@ -127,20 +138,22 @@ pub struct Gpu {
 }
 
 impl Gpu {
-    /// Init device 0 and load the best cubin for it. Callers pass `(arch, bytes)`
-    /// pairs highest-arch-first, where arch is CC `X.Y` encoded as `X*10+Y`.
+    /// Init device `ordinal` and load the best cubin for it. Callers pass
+    /// `(arch, bytes)` pairs highest-arch-first, where arch is CC `X.Y` encoded as
+    /// `X*10+Y`. The created context is current on the *calling thread*, so each
+    /// runner thread must call this on its own GPU (see [`crate::run_loop`]).
     ///
     /// Cubins newer than the device are skipped rather than tried: an old driver
     /// (e.g. 550.x / CUDA 12.4) doesn't cleanly reject a cubin for an architecture
     /// it has never heard of — `cuModuleLoadData` faults with SIGILL *inside*
     /// libcuda. So we query the GPU's compute capability first and never hand the
     /// driver anything above it. Same-major-lower cubins that still don't load
     /// (a known arch the driver rejects) fall through to the next candidate.
-    pub fn load_first(cubins: &[(u32, Vec<u8>)]) -> Result<Gpu, String> {
+    pub fn load_first(ordinal: i32, cubins: &[(u32, Vec<u8>)]) -> Result<Gpu, String> {
         unsafe {
             check(cuInit(0), "cuInit")?;
             let mut dev: CuDevice = 0;
-            check(cuDeviceGet(&mut dev, 0), "cuDeviceGet")?;
+            check(cuDeviceGet(&mut dev, ordinal), "cuDeviceGet")?;
 
             // Device compute capability, encoded to match the `smNN` tags.
             let (mut maj, mut min) = (0i32, 0i32);
@@ -359,7 +372,7 @@ mod tests {
         // it; the real cubin must still match this GPU for cuModuleLoadData.
         let cubins = vec![(0u32, bytes)];
         for i in 0..64 {
-            let gpu = Gpu::load_first(&cubins)
+            let gpu = Gpu::load_first(0, &cubins)
                 .unwrap_or_else(|e| panic!("iteration {i}: load_first failed: {e}"));
             // Touch it so the context is really used, then drop at end of scope.
             let _ = gpu.compute_capability();