
chore(deps): update dependency torch to v2.12.0 #245

Open
dreadnode-renovate-bot[bot] wants to merge 1 commit into main from renovate/loader-deps

Conversation


dreadnode-renovate-bot (bot) commented on May 17, 2026

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

| Package | Change | Age | Confidence |
| ------- | ------ | --- | ---------- |
| torch | `==2.11.0` -> `==2.12.0` | age | confidence |

Generated Summary:

  • Updated PyTorch version from 2.11.0 to 2.12.0 across multiple requirements files.
  • Files modified include:
    • dyana/loaders/automodel/requirements.txt
    • dyana/loaders/base/dyana-requirements-gpu.txt
    • dyana/loaders/lora/requirements.txt
  • This update may bring performance improvements and compatibility with new features and bug fixes introduced in PyTorch 2.12.0.

This summary was generated with ❤️ by rigging



Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


Release Notes

pytorch/pytorch (torch)

v2.12.0: PyTorch 2.12.0 Release

Compare Source

PyTorch 2.12.0 Release Notes
Highlights
Batched linalg.eigh on CUDA is up to 100x faster due to updated cuSolver backend selection.
New torch.accelerator.Graph API unifies graph capture and replay across CUDA, XPU, and out-of-tree backends.
torch.export.save now supports Microscaling (MX) quantization formats, enabling full export of aggressively compressed models.
Adagrad now supports fused=True, joining Adam, AdamW, and SGD with a single-kernel optimizer implementation (see the sketch after this list).
torch.cond control flow can now be captured and replayed inside CUDA Graphs.
ROCm users gain expandable memory segments, rocSHMEM symmetric memory collectives, and FlexAttention pipelining.
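
As a minimal sketch of the fused Adagrad highlight above, assuming the flag follows the same fused=True keyword pattern already used by Adam, AdamW, and SGD (the model, learning rate, and shapes are placeholders):

    import torch

    model = torch.nn.Linear(16, 4).cuda()
    # fused=True selects the single-kernel optimizer implementation; it
    # expects parameters on a supported accelerator (CUDA here).
    opt = torch.optim.Adagrad(model.parameters(), lr=0.01, fused=True)

    loss = model(torch.randn(8, 16, device="cuda")).sum()
    loss.backward()
    opt.step()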

For more details about these highlighted features, see the release blog post. The full release notes follow below.

Backwards Incompatible Changes
Build Frontend
  • Strengthened SVE compile checks in FindARM.cmake, which may reject previously accepted but incorrect SVE configurations (#​176646)

    Source builds that enable SVE now validate the compiler configuration more strictly. If a build previously passed with an incomplete or mismatched SVE setup, it may now fail during CMake configuration instead of later in compilation. Update the compiler/toolchain flags so they accurately describe the target SVE support, or disable SVE for that build.

  • Updated the minimum CUDA version required to build PyTorch from source to CUDA 12.6 (#​178925)

    Building PyTorch from source with CUDA versions older than 12.6 is no longer supported. Users building custom binaries should install CUDA 12.6 or newer and make sure CUDA_HOME points to that installation.

    Version 2.11:

    CUDA_HOME=/usr/local/cuda-12.4 python setup.py develop

    Version 2.12:

    CUDA_HOME=/usr/local/cuda-12.6 python setup.py develop
  • Enforced a C++20 minimum in CMake build files (#​178662)

    Source builds now require a compiler and build configuration that support C++20. If you maintain custom build scripts or downstream extensions that build PyTorch from source, update the compiler and remove assumptions that PyTorch can be built as C++17.
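
    For downstream extensions, a minimal sketch of passing an explicit C++20 flag through torch.utils.cpp_extension (the extension name and source file are hypothetical; recent cpp_extension versions may already select an appropriate standard on their own):

    from torch.utils.cpp_extension import load

    # JIT-build a C++ extension against the installed PyTorch headers,
    # explicitly requesting the C++20 standard.
    ext = load(
        name="my_ext",                  # hypothetical extension name
        sources=["my_ext.cpp"],         # hypothetical source file
        extra_cflags=["-std=c++20"],
    )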

Distributed
  • torch.distributed.nn.functional ops now raise RuntimeError under torch.compile (#​177342)

    All ops in torch.distributed.nn.functional (e.g., broadcast, all_reduce, all_gather, reduce_scatter, all_to_all_single) now raise RuntimeError when called inside torch.compile. Users should migrate to the functional collectives API in torch.distributed._functional_collectives.

    Version 2.11:

    @torch.compile
    def my_func(x):
        return torch.distributed.nn.functional.all_reduce(x, op=ReduceOp.SUM)

    Version 2.12:

    @torch.compile
    def my_func(x):
        return torch.distributed._functional_collectives.all_reduce(x, reduceOp="sum", group=group)
TorchElastic
  • torchrun now defaults to an OS-assigned free port for single-node training instead of port 29500 (#​175699)

    When running torchrun --nproc-per-node=N script.py without specifying --master-port or --standalone, the default behavior now automatically uses an OS-assigned free port via the c10d rendezvous backend. This eliminates "Address already in use" errors when running multiple training jobs concurrently. Multi-node training, explicit --master-port, PET_MASTER_PORT env var, and --standalone are unchanged.

    Version 2.11:

    # Used static rendezvous on port 29500 by default
    torchrun --nproc-per-node=4 train.py

    Version 2.12:

    # Uses OS-assigned free port by default
    torchrun --nproc-per-node=4 train.py
    
    # To explicitly use a fixed port:
    torchrun --nproc-per-node=4 --master-port=29500 train.py
MPS
  • All MPS tensors are now allocated in unified memory (#​175818)

    Previously, MPS tensors could be allocated in either device-only or unified memory. Now all MPS tensors use unified memory unconditionally. This simplifies memory management and enables CPU access to MPS tensor data without explicit copies. Code that relied on device-only memory placement may observe different performance characteristics.

Inductor
  • The max_autotune layout-constraint deferral introduced in 2.11 is now opt-in (#​175330)

    In 2.11, Inductor deferred layout freezing for max_autotune templates to expose more fusion opportunities. This caused a regional-inductor failure mode, so the default in 2.12 reverts to immediate layout freezing. Users who relied on the deferred behavior for fusion opportunities should opt in explicitly via torch._inductor.config.max_autotune_defer_layout_freezing or TORCHINDUCTOR_MAX_AUTOTUNE_DEFER_LAYOUT_FREEZING=1.

    Version 2.11:

    # Deferred layout freezing was the default
    torch.compile(model, mode="max-autotune")

    Version 2.12:

    import torch._inductor.config as cfg
    cfg.max_autotune_defer_layout_freezing = True
    # or set TORCHINDUCTOR_MAX_AUTOTUNE_DEFER_LAYOUT_FREEZING=1
    torch.compile(model, mode="max-autotune")
Deprecations
Release Engineering
  • Deprecate CUDA 12.8 builds in favor of CUDA 13.0 (#​179072)

    CUDA 12.8 binaries have been removed from the PyTorch binary build matrix. CUDA 13.0 is now the stable default and CUDA 12.6 remains available for users on older drivers. Users explicitly pinning the cu128 index URL will need to switch to cu130 (recommended) or cu126.

    Version 2.11:

    pip install torch --index-url https://download.pytorch.org/whl/cu128

    Version 2.12:

    # Use CUDA 13.0 (default on PyPI):
    pip install torch
    # Or explicitly:
    pip install torch --index-url https://download.pytorch.org/whl/cu130
    # Older driver fallback:
    pip install torch --index-url https://download.pytorch.org/whl/cu126
  • Compatibility with CMake < 3.10 will be removed in a future release (#​166259)

    Source builds against CMake versions older than 3.10 now emit a deprecation warning. A future release will require CMake 3.10 or newer; please upgrade CMake before then.

Linear Algebra
  • Several CUDA linear algebra operators no longer use the MAGMA backend and now dispatch to cuSolver or cuBLAS unconditionally:

    • torch.linalg.eigh now dispatches to cuSolver (#​174619)
    • torch.linalg.lu_solve now dispatches to cuSolver/cuBLAS (#​174248)
    • torch.linalg.cholesky_inverse now dispatches to cuSolver (#​174681)
    • torch.linalg.cholesky_solve now dispatches to cuSolver (#​174769)

    User code calling these APIs does not need to change. The practical impact is for users who depended on MAGMA-specific numerical behavior, performance characteristics, or debugging. Those calls now use the cuSolver/cuBLAS implementations on CUDA.
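
    For example, the following call is unchanged in user code and now dispatches to cuSolver on CUDA (a minimal sketch; the batch and matrix sizes are arbitrary):

    import torch

    # Batched Hermitian eigendecomposition on CUDA; in 2.12 this routes to
    # cuSolver rather than MAGMA, with no change to the calling code.
    A = torch.randn(32, 64, 64, device="cuda")
    A = A + A.transpose(-1, -2)  # symmetrize each matrix in the batch
    eigenvalues, eigenvectors = torch.linalg.eigh(A)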

FullyShardedDataParallel2 (FSDP2)
  • Compiling through FSDP2 hooks without graph breaks is no longer supported (#​174863, #​174906). If you use compiled autograd with FSDP2, update your code to allow graph breaks around FSDP2 hooks or disable compiled autograd for the FSDP2 training step.

    Version 2.11:

    with torch._dynamo.config.patch(compiled_autograd=True):
        compiled_model = torch.compile(fsdp_model, fullgraph=True)
        loss = compiled_model(input).sum()
        loss.backward()

    Version 2.12:

    # Either run FSDP2 backward without fullgraph.
    compiled_model = torch.compile(fsdp_model, fullgraph=False)
    loss = compiled_model(input).sum()
    loss.backward()
    
    # Or apply compile before applying FSDP.
    compiled_model_pre_fsdp = torch.compile(model, fullgraph=True)
    compiled_model = fully_shard(compiled_model_pre_fsdp, ...)
    loss = compiled_model(input).sum()
    loss.backward()
Profiler
  • Profiler's metadata_json field is now deprecated; use event_metadata instead (#​179417)

    Version 2.11:

    metadata = event.metadata_json

    Version 2.12:

    metadata = event.event_metadata
Dynamo
  • torch.compile(fullgraph=True) now warns when a call runs no compiled code; will error in 2.13 (#​181940)

    Previously fullgraph=True was only validated once Dynamo actually compiled and ran the function. If Dynamo was bypassed at call time (e.g. under a user-defined TorchDispatchMode), the annotation silently had no effect. 2.12 emits a warning; 2.13 will raise. For graph-break errors without fullgraph's stronger guarantees, use torch._dynamo.error_on_graph_break.

    Version 2.12:

    from torch.utils._python_dispatch import TorchDispatchMode
    
    class LoggingMode(TorchDispatchMode):
        def __torch_dispatch__(self, func, types, args=(), kwargs=None):
            return func(*args, **(kwargs or {}))
    
    @torch.compile(fullgraph=True)
    def model(x):
        return x.sin() + 1
    
    # A user-defined TorchDispatchMode is active, so Dynamo skips the frame
    # and no compiled code runs — emits a warning in 2.12, will raise in 2.13.
    with LoggingMode(): # Remove this to fix warning
        model(torch.randn(3, 4))
  • The inline_inbuilt_nn_modules Dynamo config is deprecated (#​177489, #​178205)

    Inlining of in-built nn.Module instances is now the default; setting the flag emits a deprecation warning and it will be removed in a future release.

    Version 2.11:

    import torch._dynamo.config as cfg
    cfg.inline_inbuilt_nn_modules = True  # was a tunable knob

    Version 2.12:

    # No action needed — inlining is on by default.
    # Remove any explicit references to torch._dynamo.config.inline_inbuilt_nn_modules.
  • Added a deprecation framework to the torch.compile config module so individual options can be marked deprecated (#​169837)

New Features
Release Engineering
Python Frontend
  • Introduced torch.accelerator.Graph as a unified frontend Graph interface (#​171285)
Foreach
  • Add _foreach_clone operator, with a fast path for CUDA utilizing _foreach_copy_ (#​177421)
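
    A minimal sketch, assuming _foreach_clone follows the usual _foreach_* convention of taking a list of tensors and returning a list of new tensors:

    import torch

    params = [torch.randn(1024, device="cuda") for _ in range(4)]
    # One fused call instead of a Python loop of per-tensor clone() calls;
    # on CUDA this uses the _foreach_copy_ fast path mentioned above.
    snapshots = torch._foreach_clone(params)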
Distributed
  • Add Store::barrier API and TCPStore client BARRIER support, reducing synchronization round trips compared to the existing ADD+WAIT pattern (#​174920)
  • Add NCCL communicator suspend(), resume(), and memory_stats() APIs for managing communicator memory lifecycle (#​176300)
  • Add all_to_all support in the Gloo backend (#​165435)
  • Add reduce_scatter_offset to symmetric memory, supporting variable-sized block reductions with NVLink multicast or LSA fallback (#​177791)
  • Enable batch_isend_irecv to work under torch.compile (#​161213)
  • Add torch.distributed.symmetric_memory.is_symm_mem_tensor() API to check if a tensor is a symmetric memory tensor (#​178947)
  • Convert NanCheck to a standalone op (torch.ops.c10d.check_for_nan) usable outside of ProcessGroupNCCL (#​174990)
DTensor
  • Add support for twice-differentiable DTensor redistribution (#​160509)
  • DeviceMesh is now traceable by torch.compile. Make DeviceMesh opaque (#​176661), Make placements opaque (#​171482).
  • Add grad_placements parameter to DTensor.from_local(), allowing explicit control over gradient placements in the backward pass (#​175867)
FullyShardedDataParallel2 (FSDP2)
  • Support per-parameter meshes in FSDP2, enabling different parameter groups to shard over different meshes (#​173509)
  • Support fully_shard with DTensors on a full SPMD mesh via DataParallelMeshDims (#​176334)
  • Add FSDP2 support for non-floating-point parameters by excluding non-float parameters from reduce-scatter while still sharding and all-gathering them as needed (#​177948)
TorchElastic
  • Add configurable --shutdown-timeout to torchrun for controlling the SIGTERM-to-SIGKILL timeout during worker shutdown (#​172596)
CPU x86
  • Expose a CPUBlas brgemm API for fp8 (e4m3 & e5m2) GEMM, backed by oneDNN (#​172548)
CUDA
  • Added support for torch.cond with CUDA graphs, using conditional graph nodes (CUDA 12.4+) so data-dependent control flow can be captured entirely inside a single CUDA graph. Works with the eager and cudagraphs torch.compile backends (no Inductor support yet). (#​168912)
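
    A minimal sketch under the constraints stated above (eager or cudagraphs backend, CUDA 12.4+); the branch functions and shapes are illustrative:

    import torch

    def true_fn(x):
        return x.sin()

    def false_fn(x):
        return x.cos()

    @torch.compile(backend="cudagraphs")
    def model(x):
        # The data-dependent branch is captured as a conditional node
        # inside the CUDA graph rather than forcing a graph break.
        return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

    out = model(torch.randn(16, device="cuda"))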
MPS
  • Implemented linalg_qr for MPS (#​172536)
  • Added cholesky_solve support on MPS (#​176703)
  • Added index_reduce on MPS (#​174936)
  • Implemented torch.distributions.Gamma (forward + backward) on MPS (#​179228)
  • Enabled mvlgamma on MPS (#​178914)
  • Added nonzero_static implementation on MPS (#179589)
ROCm
XPU
  • Support torch.accelerator.Graph on XPU (#​176421)
  • Added memory_clock_rate and memory_bus_width to XPU device properties (#​171967)
  • Enable split_group API when TorchComms is used as a backend for TorchTitan on XPU (#​178236)
Profiler
  • The profiler now supports fine-grained selection of activity types (#176351)
  • The memory visualizer has a new tab showing a private-pool memory view (#177289)
Dynamo
  • Made torch._dynamo.aot_compile public, with aot_eager and inductor backend support and docs (#​179917, #​180008)
  • Added a recompile_limit keyword argument to torch.compile to override the per-function recompile cap without touching global config (#177936); a sketch follows this list
  • Added min/max bounds to torch._dynamo.mark_unbacked for communicating value ranges to the symbolic shape system (#​176313)
  • Added bdb, a pdb-style debugger for stepping through nested frames during Dynamo tracing (n, u, d, r, bt), plus a user-callable breakpoint() that auto-starts it (#​174626, #​174746, #​175200)
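
A minimal sketch of the recompile_limit keyword from the list above; the limit value and function are illustrative:

    import torch

    # Allow up to 16 recompilations for this function only, without
    # changing the global Dynamo recompile cap.
    @torch.compile(recompile_limit=16)
    def scale(x, factor: float):
        return x * factor

    y = scale(torch.randn(8), 2.0)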
Inductor
  • Added user-defined stream support to torch.compile. Inductor now codegens stream context managers (enter/exit) and record_stream calls in the wrapper, enabling user streams to flow through compiled regions with proper synchronization, scheduler integration, and cross-stream dependency tracking (#165390, #165391, #165504, #165505, #174223, #176700, #177694); a stream sketch follows this list
  • Added ao::offload, ao::reload, and ao::wait ops for asynchronous activation offloading. These ops encapsulate async CPU offloading stream management following the same async 2-op pattern as c10d functional collectives, reducing IR size from 7 nodes (offload) and 5 nodes (reload) down to 2 nodes each (#​177621)
  • Added user-defined Triton kernel unary epilogue fusion. Inductor can now fuse user Triton kernels with downstream pointwise epilogues (e.g. relu()), parsing the user kernel source via AST and inlining the epilogue into the tl.store expression (#​173662)
  • Added out-variant discovery and lowering for custom ops. When a custom op registers both functional and .out overloads, Inductor automatically lowers single-output and multi-output functional ops to their .out variants as ExternKernelOut, enabling memory planner buffer reuse (#​175116, #​176117)
  • max_autotune now extends to combo kernels. The autotuning pipeline generates and benchmarks per-sub-kernel block-size phase configs, with chained sequential autotuning and per-sub-kernel reduction hints (#​177715, #​178936, #​179317)
  • Added non-TMA persistent Triton templates for mm and addmm for max-autotune, enabling persistent kernels on hardware without TMA (#​177781, #​179095)
  • Added CUTLASS backend support for torch.float8_e5m2 dtype, including registration for FP8 GEMM autotuning (#​171176)
  • Added XPU CUTLASS GEMM kernel codegen and codecache to max-autotune-gemm, allowing CUTLASS-style GEMM templates to target Intel GPUs (#​161938, #​161939)
  • Added MTIA Triton codegen for sort, median, and mode operations (#​178525)
  • Added a Triton template for depthwise conv1d (#​175280)
  • Added AVX10.2 fp32↔fp8 intrinsics in at::vec::convert for the Inductor C++ x86 backend (#​172309)
  • Pallas backend: added scalar prefetch and indirect access support (#​177212)
  • Added a disable_welford_reduction config flag to opt out of Welford reduction in codegen (#​175778)
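
A minimal sketch of a user-defined stream flowing through a compiled region, per the stream-support item above; the kernel body is illustrative:

    import torch

    side_stream = torch.cuda.Stream()

    @torch.compile
    def f(x):
        # Work enqueued on a user-created stream inside the compiled region;
        # Inductor now emits the stream context and synchronization for it.
        with torch.cuda.stream(side_stream):
            y = x * 2.0
        return y + 1.0

    out = f(torch.randn(1024, device="cuda"))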
Ahead-Of-Time Inductor (AOTI)
  • Add MXFP4 dtype support (float8_e8m0fnu and float4_e2m1fn_x2) to the AOTInductor C shim layer, enabling MXFP4 quantization (e.g., for AMD MI350) (#​176496)
  • Add a compile backend registry and custom device support for AOTI Eager, letting out-of-tree device backends plug into the AOTI eager compile/load flow without modifying upstream code (#​175605)
torch.fx
  • Add tuple_return option to split_module that wraps submodule outputs in a tuple (#​179007)
  • Add ignore_raw_node option to GraphPickler (#​176939)
  • Add _merge_overlapping_fusions() method to FxNetSplitter which detects and merges overlapping fusion groups (#​177099)
torch.export
  • Add serialization support for float8_e8m0fnu dtype (#​176270)
  • Add serialization support for torch.uint32 and torch.uint64 dtypes (#​179434)
  • Add serialization support for nested float lists (List[List[float]]) (#​178081)
JIT
  • Added input-independent graph optimization API (#​179393)
Improvements
Release Engineering
Python Frontend
  • Used compiler wrapper when building C++ extensions (#​175696)
  • Updated uniform and normal sampling on CPU to improve fp16/bf16 results (#​175988)
  • Changed requires_grad to Optional[bool] in torch.asarray (#​170897)
Autograd
  • Implemented narrow_copy derivative (#​175609)
  • Implemented higher-order derivatives for grid_sample (#​177487)
  • Implemented backward and forward AD for torch.aminmax (#​175215)
  • Exposed num_splits in varlen attention to allow disabling split_kv (#​176905)
  • Added user AutoNamingMode support in Selective Activation Checkpointing (#​175348)
  • Refactored torch.utils.checkpoint to no longer use autograd.Function for saving inputs (#​174327)
Dataloader
  • Added a thread-safe RNG utility function (#​175375)
Linear Algebra
  • Added _int_mm unsigned int8 × signed int8 (u8s8) support on CPU (#​168226)
  • Added FP64 support for TunableOp on ROCm via hipBLASLt (#​178195)
Nested Tensor (NJT)
  • Added nested tensor deserialization support (#​174843)
torch.nn
  • Added bias argument to nn normalization methods (LayerNorm, GroupNorm, RMSNorm, etc.) (#176573); a sketch follows this list
  • Improved MultiMarginLoss error message for inconsistent target size (#​174072)
  • Added enable_gqa flag to varlen_attn (#​179468)
  • Allowed eps=0 in batch_norm during eval mode (#​175508)
  • Added meta device support in trunc_normal_ initialization (#​176240)
  • Split onehot checks for CPU and accelerators (#​181211)
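
A minimal sketch of the bias argument from the torch.nn list above, assuming it follows the familiar bias=False keyword pattern; shapes are illustrative:

    import torch

    # Normalization layers without the additive bias term; the learnable
    # scale (weight) is kept.
    ln = torch.nn.LayerNorm(256, bias=False)
    gn = torch.nn.GroupNorm(8, 256, bias=False)

    x = torch.randn(4, 256)
    x_img = torch.randn(4, 256, 16, 16)
    y = ln(x)
    y_img = gn(x_img)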
Sparse
  • Implemented clone operator for semi-structured sparse tensors (#​174991)
  • Allowed semi-structured sparse tensors to be instantiated with alg_id (#​178659)
  • Enabled FP8 semi-structured sparsity on ROCm via hipSPARSELt (#​179310)
Build Frontend
C++ Frontend
  • Upgraded cpp_extension and cpp_builder to C++20 (#​176659)
  • Reland at::Tag header-only changes and add a library.def override for tags (#​181608)
Distributed
  • Add configurable worker timeout and partial data support to the distributed debug server (#​176058)
  • Add timeout parameter to torch.distributed.barrier() (#​174974)
  • Add reduce_scatter_tensor_coalesced support to ProcessGroupWrapper (#​168961)
  • Functional collectives API now automatically handles non-contiguous inputs instead of asserting (#​177965)
  • DDP: Add batched_grad_copy option to reduce per-parameter kernel launches to 2 kernels per bucket (#​176638)
  • DDP: Refactor bucket capacity config into BucketCapacityConfig dataclass (#​175217)
  • Add signal name to ChildFailedError exitcode output for better debugging (#​175254)
  • Add CUDA-aware detection for Cray MPICH (#​178323)
  • Support dist.broadcast for FP8 tensors on GPUs older than SM90 (#​175884)
  • Add __torch_function__ handlers for distributed functions (#​176376)
  • Enable split_group API for TorchComms on XPU (#​178236)
  • Make py-spy dumps nonblocking by default (#​178312)
  • Add ncclx and gloo to FlightRecorder trace analyzer backend allowlist (#​180268)
  • Improve error message on symmetric memory handle exchange (#​178989)
  • SymmMem: Add thread safety to NCCL and NVSHMEM backends (#​176551)
  • Check NCCL terminate signal more frequently when exiting from heartbeat monitor (#​170000)
  • Implement missing methods in ProcessGroupWrapper (#​178779)
  • Add compute_estimator option for overlap scheduling (#​175204)
  • [local_tensor] Add standalone rank_map/tensor_map functions (#​174795)
Distributed Checkpoint (DCP)
  • DCP: Improve save plan validation error messaging (#​176728)
  • DCP: Preserve original exception in metadata read failure for better debuggability (#​177739)
DTensor
  • Support DTensor view ops (flatten/unflatten) with _StridedSharding for full nn.Linear(DTensor) compatibility (#​166483)
  • Add Dijkstra-based single-dim strategy search for DTensor sharding propagation, avoiding exponential enumeration of strategy combinations (#​169438)
  • DTensor: Add is_pinned() support (#​177235)
  • DTensor: Add print() HOP support (#​175222)
  • DTensor: Emit zero paddings for uneven shardings to enable SPMD compilation (#​177758)
  • DTensor: Make run_dtensor_rng_op compatible with compile_on_one_rank (#​177447)
  • DTensor: Lenient handling of view redistributes in decomposition flow (#​175194)
  • DTensor: Redistribute from/to _StridedShard through Replicate (#​179059)
  • DTensor: Raise clearer error for unsupported Split(Flatten) sharding propagation (#​179632)
  • DTensor: Unbacked-safe view_groups (#​174629)
  • DTensor: Expanded sharding strategy coverage for index_select, index, index_fill, index_reduce, roll, fft, constant_pad_nd, squeeze.dims, interpolate, linalg ops, LayerNorm/RMSNorm FW/BW, foreach/fused ops, and einsum linearity (#​176037, #​176038, #​178456, #​175463, #​175656, #​173563, #​176991, #​176955, #​179173, #​177186, #​177187, #​176150, #​174830)
FullyShardedDataParallel (FSDP)
  • Remove mixed-dtype rejection from FSDP clip_grad_norm to match the documented behavior (#​173641)
FullyShardedDataParallel2 (FSDP2)
  • Allow ModuleList/ModuleDict subclasses that implement forward() (#​175033)
  • FSDP2: Support dataclass args/kwargs output without memory leakage (#​174692)
  • Share more implementation code between replicate and FSDP2 fully_shard (#​173580)
  • Consolidate FSDP2 shard_mesh and shard_mesh_from_root handling (#​174107)
Distributed Pipeline
  • DTensor metadata foundation for Pipeline Parallelism with DTensor-aware stage and schedule refactoring (#​177727, #​177728)
  • Pipeline Parallel: Dispatch homogeneous P2P ops individually to avoid stream serialization (#​175712)
TorchElastic
  • [elastic] Add Windows support for stdout/stderr redirects (#​176789)
  • torchelastic: Keep health check alive during exit barrier (#​178197)
CUDA
  • [CUDA] Fix offset_t operators to be __host__ __device__ in SortStable.cu (#​175997)
  • [CUDA] [Green Context] Add support for workqueue limit (#​177242)
  • Remove dead avg_pool3d backward shape-check variables in CUDA (#​178893)
  • [BE] add missing assert on cuda device synchronize in ATen tests (#​174966)
  • [reland 2][pytorch] Preemptive OOM rejection using per_process_memory_fraction + throw_on_cudamalloc_oom (#179473)
  • [cuda graphs] Add enable_annotations kwarg to torch.cuda.graph (#179867)
  • [CUDA/ROCm] avoid double casting in ReduceLogicKernel (#​176132)
  • Nit fix: Align state_step tensor max to param tensor max (#​178913)
cuDNN
  • Upgrade CUDA 12.8, 12.9, and 13.0 wheels to cuDNN 9.20.0.48 (#177321)
MPS
ROCm
  • CPP extensions only compile for user's detected arch (#​168998)
  • Remove obsolete HIP NaN handling workarounds; remove technical debt (#​171104)
XPU
Profiler
Dynamo
Inductor
  • Unified OUT_DTYPE, ACC_TYPE, and INDEX_DTYPE codegen flow in Triton templates (#​179453)
  • Enabled cudagraph w/o partition for cpp-wrapper (#​179249)
  • Added FMA-based addcdiv lowering for CUDA parity with eager and matching _foreach_addcdiv to _foreach_addcmul (#​174912, #​175309, #​175310, #​175839, #​176237)
  • Added lerp decompositions for bitwise parity with eager (#​176804)
  • Added outer-product decomposition (#​176552)
  • Enabled padding fusion with torch.cat and avoided duplicate computation in cat/pad when inputs have multiple consumers (#​175729)
  • Lowered functional symmetric memory ops to ExternKernelOut for output buffer reuse, and added symm_mem planning for graph inputs and fallback regions (#​174856, #​175449)
  • Modified addmm template call to support hipblaslt bias-fused kernels on ROCm (#​177130)
  • Newly trained PadMM AutoHeuristics for A100 and H200, plus support for pad_mm AutoHeuristics in deterministic mode (#​176186, #​179826)
  • Propagate metadata in pattern matcher and add validation (#​179113)
  • FlexAttention: raise a clear NotImplementedError when return_aux=AuxRequest(max_scores=True) is requested with BACKEND='FLASH' instead of failing later with an opaque error (#​177434)
  • Migrated Inductor internals from legacy allow_tf32 to fp32_precision to avoid divergence with the new TF32 API (#​176098)
  • Pallas backend: enabled element-wise ops, native TPU OOB DMA masking via aligned block specs, and generalized N-D transpose permutation detection (#​174743, #​175458, #​176952)
  • Registered lowerings for prims.scalar_tensor and aten.arange.start_step (#​179017, #​179028)
  • Added SDPA pattern matching support for visformer (#​177826)
  • Relaxed concat-linear fusion to support GQA QKV (#​178523)
  • Allowed subgraphs to be benchmarked with async pipelined autotuning (#​175455)
  • Added convert_element_type lowering to emulate PyTorch eager numerics (#​176781)
  • Added GEMM configs to XPU autotuning heuristic (#​177647)
  • Added kpack Triton compile options on ROCm (#​173179)
  • ROCm: enabled exhaustive autotuning for FP8 (#​177797)
  • Override decomposition for aten.index_add (#​179486)
  • Drop tile_k from nvMatmulHeuristics matching (#​176845)
Ahead-Of-Time Inductor (AOTI)
  • Add aten._grouped_mm to AOTInductor fallback ops, enabling cpp-wrapper mode for grouped_mm (#​177307)
  • Support lazy Triton kernel compilation for cpp-wrapper on XPU (#​179239)
  • Add dynamic shapes support to AOTI Eager via AOTIPythonKernelHolder, allowing a single compiled kernel to serve multiple input shapes (#​176018)
  • Support multi-return ops in AOTI Eager (e.g., native_layer_norm, aminmax) (#​176019)
  • Allow custom ops with Optional[List[T]] arguments in cpp wrapper (#​174460)
  • Add lazy Triton kernel compilation for cpp-wrapper (#​175416)
  • Add TMA support for lazy Triton kernel compilation (#​175548)
  • Call latest c_shim version for versioned fallback ops (#​181548)
  • Add BC-safe c_shim v2 for _scaled_dot_product_attention_math_for_mps enable_gqa (#​181549)
torch.fx
  • Update get_source_partitioner to parse nn_module_stack metadata for improved source-based graph partitioning (#​175788)
  • split_module now uses _make_graph_module to support lazy recompile (#​177907)
  • Fix fuser_utils.topo_sort to produce a stable ordering (#​175378)
  • Fix GraphPickler to support nodes with slice() arguments (#​175996)
Composability
  • Added DynamicInt __pow__ and __rpow__ methods (#​179868)
  • Added scaled_mm_v2 CPU implementation (#​176266)
Bug fixes
Release Engineering
  • Fix periodic inductor CI silently skipping all tests (#​177695)
  • Fix python docs build hanging in CI (#​180177)
  • Avoid installing test dll into

Note

PR body was truncated to here.


Configuration

📅 Schedule: (UTC)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Enabled.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Mend Renovate.

| datasource | package | from   | to     |
| ---------- | ------- | ------ | ------ |
| pypi       | torch   | 2.11.0 | 2.12.0 |
dreadnode-renovate-bot (bot) added the type/digest (Dependency digest updates) label on May 17, 2026