
Free curand states before the thread is destroyed #1912

Open

no1d wants to merge 1 commit into OpenNMT:master from no1d:dispose-curand

Conversation

@no1d

@no1d no1d commented Aug 16, 2025

Tried #1201 with no luck, so this should fix SYSTRAN/faster-whisper#71

@xSlikZodiac

I've had this issue too, running CUDA 12.4 on a 4080S.

0xC0000409 (3221226505) is Windows' stack buffer overrun status code. I've literally been trying to resolve this on my own for the last year.

@Purfview
Contributor

Purfview commented Nov 24, 2025

@no1d Thanks for this.

@jordimas Check it out, this one is important for us.

@a2d8a4v
Contributor

a2d8a4v commented Jan 5, 2026

Hi, @Purfview,
Can we find other people to help review and merge this patch?
It seems the author hasn't been replying in this repo recently; maybe he's too busy right now.

@no1d no1d force-pushed the dispose-curand branch from 0968c24 to aaaf528 on April 9, 2026 09:21
morganjeremiah7 pushed a commit to morganjeremiah7/hush-profanity that referenced this pull request Apr 28, 2026
…rash

Two changes that go together:

1) Stack upgrade — removes the cuDNN 8 / cuDNN 9 dual-load
   - torch 2.5.1+cu121 -> 2.8.0+cu126
   - ctranslate2 4.4.0 -> 4.7.1 (uses cuDNN 9 natively)
   - whisperx 3.4.5 -> 3.8.5
   - nvidia-cudnn-cu12==8.9.7.29 -> removed (torch's bundled cuDNN 9 is
     now the only one in the process)
   - install-windows.ps1, pyproject.toml, requirements.txt updated.

   This alone did not fix the crash: even with the cleaner stack, python
   still died on the 2nd file with the same KERNELBASE 0xe06d7363 +
   ucrtbase 0xC0000409 signature.

2) Subprocess-per-file transcription — the bulletproof workaround for
   OpenNMT/CTranslate2#1912 / faster-whisper#71/#1293. ctranslate2's CUDA
   cleanup path corrupts the heap when WhisperModel is destroyed; the
   corruption gets touched fatally after 1-3 destruct/reconstruct cycles
   in one process. The fix recommended by the upstream issue threads is
   to run each transcription in its own process and let OS-level CUDA
   context teardown bypass the buggy cleanup path.

   New module src/hush_profanity/_transcribe_worker.py:
     - JSON-in, JSON-out contract (config in, words out, both via temp files)
     - exit codes: 0 success, 1 config/IO error, 2 transcribe error, >2 unknown
     - stderr captured by parent and forwarded to the main log

   scanner.gpu_worker now spawns this worker per file via subprocess.run
   with a 30 min timeout. If the subprocess crashes (which it shouldn't,
   but if ctranslate2 gets weird) the parent catches RuntimeError and
   marks just that file as failed, then continues with the next.

Verified: 3 sequential subprocess transcriptions on CUDA all exit clean.
The in-process version of the same test crashed on the 3rd run.

Cost: ~5-10 s subprocess startup per file. Negligible vs the alternative
(crash after 1-2 files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
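
For readers arriving from the linked issues, here is a minimal sketch of the subprocess-per-file pattern the commit above describes, assuming a hypothetical `hush_profanity._transcribe_worker` module invoked with a config file and an output file; the module path, function names, and error handling follow the commit message, but the actual hush-profanity code may differ.

```python
# Minimal sketch of the subprocess-per-file workaround (hypothetical module
# and function names; the real hush-profanity code may differ in detail).
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def transcribe_in_subprocess(audio_path, config, timeout_s=1800):
    """Run one transcription in a fresh process so the CUDA context (and
    ctranslate2's cleanup path) is torn down by the OS, not by our process."""
    with tempfile.TemporaryDirectory() as tmp:
        cfg_file = Path(tmp) / "config.json"
        out_file = Path(tmp) / "words.json"
        cfg_file.write_text(json.dumps({"audio": audio_path, **config}))

        proc = subprocess.run(
            [sys.executable, "-m", "hush_profanity._transcribe_worker",
             str(cfg_file), str(out_file)],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # 30 min per file, as in the commit message
        )
        if proc.stderr:  # forward the worker's stderr to the parent's log
            print(proc.stderr, file=sys.stderr)

        # Exit-code contract from the commit message:
        # 0 success, 1 config/IO error, 2 transcribe error, >2 unknown.
        if proc.returncode != 0:
            raise RuntimeError(
                f"transcribe worker failed for {audio_path} "
                f"(exit code {proc.returncode})"
            )
        return json.loads(out_file.read_text())


def gpu_worker(files, config):
    """Parent loop: a crashed or hung worker marks only that file as failed."""
    results = {}
    for path in files:
        try:
            results[path] = transcribe_in_subprocess(path, config)
        except (RuntimeError, subprocess.TimeoutExpired) as exc:
            print(f"skipping {path}: {exc}", file=sys.stderr)
            results[path] = None
    return results
```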
morganjeremiah7 pushed a commit to morganjeremiah7/hush-profanity that referenced this pull request Apr 28, 2026
…-whisper

Why: ctranslate2 has a long-standing CUDA cleanup crash on Windows
(OpenNMT/CTranslate2#1912, faster-whisper#71/#1293) that we hit reliably
across every version (4.4.0 → 4.7.1) and every workaround we tried:
  - int8 quantization (Test 1) — VRAM dropped 22GB → 11GB but crashes
    persisted, ruling out memory exhaustion as the cause
  - alignment off (Test 3) — removed PyTorch from the GPU entirely so
    only ctranslate2 was a CUDA library; still crashed, ruling out the
    dual-allocator theory
  - stack rollback to ct2 4.4.0 / cu121 / cuDNN 8 (Test 2) — exactly the
    version that did 49 files in a row originally; still crashed, so the
    bug is in ctranslate2 itself regardless of version
  - subprocess isolation — kept the parent alive when workers crashed
    but still lost ~30% of files per scan

The cure was replacing the engine. openai-whisper is the reference
PyTorch implementation. Slower (~3-4× per file) but rock-solid: same
PyTorch CUDA stack as the wav2vec2 alignment in WhisperX, so only one
CUDA allocator in the process.

Verified with sequential subprocess test (alignment ON, real CUDA) —
3/3 clean exits where the in-process ctranslate2 version crashed every
time on the 3rd run. Then 7/7 successful overnight scan on the 8 files
that had previously failed.

Other changes:
  - transcribe.py: full rewrite around openai-whisper API. Same Word
    dataclass output. Subprocess pattern preserved for belt-and-suspenders.
  - verbose=None instead of False to suppress the tqdm progress bar
    that was polluting the worker stderr → main log.
  - install-windows.ps1: drops nvidia-cudnn-cu12==8.9.7.29 (no longer
    needed — torch's bundled cuDNN is sufficient). Adds triton-windows
    so openai-whisper's word-timestamp DTW kernels run on GPU instead
    of falling back to a much slower pure-PyTorch path.
  - pyproject.toml + requirements.txt: pin openai-whisper, whisperx<3.5
    (3.5+ pulls ctranslate2 back in transitively), torch 2.5.1+cu121,
    triton-windows 3.1.0.post17 (windows-only).
  - settings.example.toml: clarify that compute_type, vad_filter, and
    whisper_batch_size are now ignored / mapped because openai-whisper
    has no equivalent to faster-whisper's batched pipeline.
  - .gitignore: add .claude/ for agent-tool local state.

Speed cost on a 3090: each file takes ~6-8 min instead of ~3 min, so
an 83-file overnight scan goes from ~4-5 hr to ~6-10 hr. Acceptable
for a stack that doesn't crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
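
And a hedged sketch of a transcribe step built directly on the openai-whisper API, matching the word-timestamp and verbose=None details described in the commit above; the Word dataclass and the helper name are illustrative stand-ins rather than the project's actual definitions.

```python
# Illustrative openai-whisper transcribe step (Word dataclass and helper name
# are hypothetical; only the whisper API calls are real).
from dataclasses import dataclass

import whisper  # openai-whisper


@dataclass
class Word:
    text: str
    start: float
    end: float
    probability: float


def transcribe_words(audio_path, model_name="large-v3"):
    model = whisper.load_model(model_name, device="cuda")
    # verbose=None keeps openai-whisper from printing decoded text and from
    # drawing the tqdm progress bar that would pollute a captured stderr.
    result = model.transcribe(audio_path, word_timestamps=True, verbose=None)
    words = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            words.append(Word(w["word"].strip(), w["start"], w["end"], w["probability"]))
    return words
```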

Development

Successfully merging this pull request may close these issues.

Windows process crashes when the GPU model is unloaded

4 participants