Commit 4e16400

sungsooha and claude committed
fix: add torch.inference_mode() to parallel worker threads
torch.inference_mode() is thread-local, so ThreadPoolExecutor workers do not inherit the parent thread's context. Without it, the parallel workers run with autograd enabled, which costs extra memory and gives different semantics than the sequential path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
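The thread-locality the commit message describes can be observed directly: a worker submitted to a `ThreadPoolExecutor` from inside an `inference_mode()` block does not see the guard. A minimal sketch (the helper name `check_worker_mode` is illustrative, not from the patched module):

```python
import concurrent.futures

import torch


def check_worker_mode() -> bool:
    # Runs on a ThreadPoolExecutor worker thread; the inference-mode
    # guard entered by the submitting thread is NOT active here.
    return torch.is_inference_mode_enabled()


with torch.inference_mode():
    parent_enabled = torch.is_inference_mode_enabled()  # True in this thread
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        worker_enabled = pool.submit(check_worker_mode).result()

print(parent_enabled, worker_enabled)  # True False
```

Because the guard is a thread-local RAII state, each worker must enter `torch.inference_mode()` itself, which is exactly what this commit adds.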
1 parent f412e29 commit 4e16400

File tree: 1 file changed (+1, −1 lines)


modelopt/torch/export/plugins/vllm_fakequant_hf.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -114,7 +114,7 @@ def _process_weight(item: _WeightQuantWork) -> tuple[str, torch.Tensor, str | No
 
 def _process_device_batch(items: list[_WeightQuantWork], device: torch.device):
     """Process all weight items on a single GPU. Runs in a dedicated thread."""
-    with torch.cuda.device(device):
+    with torch.inference_mode(), torch.cuda.device(device):
         results = [_process_weight(item) for item in items]
         torch.cuda.synchronize(device)
         return results
```
