
Free curand states before the thread is destroyed #1912

Open

no1d wants to merge 1 commit into OpenNMT:master from no1d:dispose-curand

Conversation

@no1d

@no1d no1d commented Aug 16, 2025

Tried #1201 with no luck, so this should fix SYSTRAN/faster-whisper#71

@xSlikZodiac

I've had this issue too, running CUDA 12.4 on a 4080S.

0xC0000409 (3221226505) is Windows' stack buffer overrun status code. I've literally been trying to resolve this on my own for the last year.

@Purfview
Contributor

Purfview commented Nov 24, 2025

@no1d Thanks for this.

@jordimas Check it out, this one is important for us.

@a2d8a4v
Contributor

a2d8a4v commented Jan 5, 2026

Hi, @Purfview,
Can we find other people to help review and merge this patch?
It seems the author hasn't been replying in this repo recently; maybe he's too busy right now.

@no1d no1d force-pushed the dispose-curand branch from 0968c24 to aaaf528 on April 9, 2026 09:21
morganjeremiah7 pushed a commit to morganjeremiah7/hush-profanity that referenced this pull request Apr 28, 2026
…rash

Two changes that go together:

1) Stack upgrade — removes the cuDNN 8 / cuDNN 9 dual-load
   - torch 2.5.1+cu121 -> 2.8.0+cu126
   - ctranslate2 4.4.0 -> 4.7.1 (uses cuDNN 9 natively)
   - whisperx 3.4.5 -> 3.8.5
   - nvidia-cudnn-cu12==8.9.7.29 -> removed (torch's bundled cuDNN 9 is
     now the only one in the process)
   - install-windows.ps1, pyproject.toml, requirements.txt updated.

   This alone did not fix the crash: even with the cleaner stack, python
   still died on the 2nd file with the same KERNELBASE 0xe06d7363 +
   ucrtbase 0xC0000409 signature.

2) Subprocess-per-file transcription — the bulletproof workaround for
   OpenNMT/CTranslate2#1912 / faster-whisper#71/#1293. ctranslate2's CUDA
   cleanup path corrupts the heap when WhisperModel is destroyed; the
   corruption gets touched fatally after 1-3 destruct/reconstruct cycles
   in one process. The fix recommended by the upstream issue threads is
   to run each transcription in its own process and let OS-level CUDA
   context teardown bypass the buggy cleanup path.

   New module src/hush_profanity/_transcribe_worker.py:
     - JSON-in, JSON-out contract (config in, words out, both via temp files)
     - exit codes: 0 success, 1 config/IO error, 2 transcribe error, >2 unknown
     - stderr captured by parent and forwarded to the main log

   scanner.gpu_worker now spawns this worker per file via subprocess.run
   with a 30 min timeout. If the subprocess crashes (which it shouldn't,
   but if ctranslate2 gets weird) the parent catches RuntimeError and
   marks just that file as failed, then continues with the next.

Verified: 3 sequential subprocess transcriptions on CUDA all exit clean.
The in-process version of the same test crashed on the 3rd run.

Cost: ~5-10 s subprocess startup per file. Negligible vs the alternative
(crash after 1-2 files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
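
For readers arriving from the linked issues, here is a minimal sketch of the subprocess-per-file pattern the commit above describes, assuming a hypothetical `hush_profanity._transcribe_worker` module invoked with a config file and an output file; the module path, function names, and error handling follow the commit message, but the actual hush-profanity code may differ.

```python
# Minimal sketch of the subprocess-per-file workaround (hypothetical module
# and function names; the real hush-profanity code may differ in detail).
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def transcribe_in_subprocess(audio_path, config, timeout_s=1800):
    """Run one transcription in a fresh process so the CUDA context (and
    ctranslate2's cleanup path) is torn down by the OS, not by our process."""
    with tempfile.TemporaryDirectory() as tmp:
        cfg_file = Path(tmp) / "config.json"
        out_file = Path(tmp) / "words.json"
        cfg_file.write_text(json.dumps({"audio": audio_path, **config}))

        proc = subprocess.run(
            [sys.executable, "-m", "hush_profanity._transcribe_worker",
             str(cfg_file), str(out_file)],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # 30 min per file, as in the commit message
        )
        if proc.stderr:  # forward the worker's stderr to the parent's log
            print(proc.stderr, file=sys.stderr)

        # Exit-code contract from the commit message:
        # 0 success, 1 config/IO error, 2 transcribe error, >2 unknown.
        if proc.returncode != 0:
            raise RuntimeError(
                f"transcribe worker failed for {audio_path} "
                f"(exit code {proc.returncode})"
            )
        return json.loads(out_file.read_text())


def gpu_worker(files, config):
    """Parent loop: a crashed or hung worker marks only that file as failed."""
    results = {}
    for path in files:
        try:
            results[path] = transcribe_in_subprocess(path, config)
        except (RuntimeError, subprocess.TimeoutExpired) as exc:
            print(f"skipping {path}: {exc}", file=sys.stderr)
            results[path] = None
    return results
```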
morganjeremiah7 pushed a commit to morganjeremiah7/hush-profanity that referenced this pull request Apr 28, 2026
…-whisper

Why: ctranslate2 has a long-standing CUDA cleanup crash on Windows
(OpenNMT/CTranslate2#1912, faster-whisper#71/#1293) that we hit reliably
across every version (4.4.0 → 4.7.1) and every workaround we tried:
  - int8 quantization (Test 1) — VRAM dropped 22GB → 11GB but crashes
    persisted, ruling out memory exhaustion as the cause
  - alignment off (Test 3) — removed PyTorch from the GPU entirely so
    only ctranslate2 was a CUDA library; still crashed, ruling out the
    dual-allocator theory
  - stack rollback to ct2 4.4.0 / cu121 / cuDNN 8 (Test 2) — exactly the
    version that did 49 files in a row originally; still crashed, so the
    bug is in ctranslate2 itself regardless of version
  - subprocess isolation — kept the parent alive when workers crashed
    but still lost ~30% of files per scan

The cure was replacing the engine. openai-whisper is the reference
PyTorch implementation. Slower (~3-4× per file) but rock-solid: same
PyTorch CUDA stack as the wav2vec2 alignment in WhisperX, so only one
CUDA allocator in the process.

Verified with sequential subprocess test (alignment ON, real CUDA) —
3/3 clean exits where the in-process ctranslate2 version crashed every
time on the 3rd run. Then 7/7 successful overnight scan on the 8 files
that had previously failed.

Other changes:
  - transcribe.py: full rewrite around openai-whisper API. Same Word
    dataclass output. Subprocess pattern preserved for belt-and-suspenders.
  - verbose=None instead of False to suppress the tqdm progress bar
    that was polluting the worker stderr → main log.
  - install-windows.ps1: drops nvidia-cudnn-cu12==8.9.7.29 (no longer
    needed — torch's bundled cuDNN is sufficient). Adds triton-windows
    so openai-whisper's word-timestamp DTW kernels run on GPU instead
    of falling back to a much slower pure-PyTorch path.
  - pyproject.toml + requirements.txt: pin openai-whisper, whisperx<3.5
    (3.5+ pulls ctranslate2 back in transitively), torch 2.5.1+cu121,
    triton-windows 3.1.0.post17 (windows-only).
  - settings.example.toml: clarify that compute_type, vad_filter, and
    whisper_batch_size are now ignored / mapped because openai-whisper
    has no equivalent to faster-whisper's batched pipeline.
  - .gitignore: add .claude/ for agent-tool local state.

Speed cost on a 3090: each file takes ~6-8 min instead of ~3 min, so
an 83-file overnight scan goes from ~4-5 hr to ~6-10 hr. Acceptable
for a stack that doesn't crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
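
And a hedged sketch of a transcribe step built directly on the openai-whisper API, matching the word-timestamp and verbose=None details described in the commit above; the Word dataclass and the helper name are illustrative stand-ins rather than the project's actual definitions.

```python
# Illustrative openai-whisper transcribe step (Word dataclass and helper name
# are hypothetical; only the whisper API calls are real).
from dataclasses import dataclass

import whisper  # openai-whisper


@dataclass
class Word:
    text: str
    start: float
    end: float
    probability: float


def transcribe_words(audio_path, model_name="large-v3"):
    model = whisper.load_model(model_name, device="cuda")
    # verbose=None keeps openai-whisper from printing decoded text and from
    # drawing the tqdm progress bar that would pollute a captured stderr.
    result = model.transcribe(audio_path, word_timestamps=True, verbose=None)
    words = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            words.append(Word(w["word"].strip(), w["start"], w["end"], w["probability"]))
    return words
```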

Development

Successfully merging this pull request may close these issues.

Windows process crashes when the GPU model is unloaded

4 participants