
chore(deps): update dependency torch to v2.12.0 #245

Open
dreadnode-renovate-bot[bot] wants to merge 1 commit into main from renovate/loader-deps

Conversation


dreadnode-renovate-bot (bot) commented on May 17, 2026

ℹ️ Note

This PR body was truncated due to platform limits.

This PR contains the following updates:

| Package | Change | Age | Confidence |
| ------- | ------ | --- | ---------- |
| torch | `==2.11.0` -> `==2.12.0` | age | confidence |

Generated Summary:

  • Updated PyTorch version from 2.11.0 to 2.12.0 across multiple requirements files.
  • Files modified include:
    • dyana/loaders/automodel/requirements.txt
    • dyana/loaders/base/dyana-requirements-gpu.txt
    • dyana/loaders/lora/requirements.txt
  • This update may bring performance improvements and compatibility with new features and bug fixes introduced in PyTorch 2.12.0.

This summary was generated with ❤️ by rigging



Warning

Some dependencies could not be looked up. Check the Dependency Dashboard for more information.


Release Notes

pytorch/pytorch (torch)

v2.12.0: PyTorch 2.12.0 Release

Compare Source

PyTorch 2.12.0 Release Notes
Highlights
Batched linalg.eigh on CUDA is up to 100x faster due to updated cuSolver backend selection.
New torch.accelerator.Graph API unifies graph capture and replay across CUDA, XPU, and out-of-tree backends.
torch.export.save now supports Microscaling (MX) quantization formats, enabling full export of aggressively compressed models.
Adagrad now supports fused=True, joining Adam, AdamW, and SGD with a single-kernel optimizer implementation (see the sketch after this list).
torch.cond control flow can now be captured and replayed inside CUDA Graphs.
ROCm users gain expandable memory segments, rocSHMEM symmetric memory collectives, and FlexAttention pipelining.
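
As a minimal sketch of the fused Adagrad highlight above, assuming the flag follows the same fused=True keyword pattern already used by Adam, AdamW, and SGD (the model, learning rate, and shapes are placeholders):

    import torch

    model = torch.nn.Linear(16, 4).cuda()
    # fused=True selects the single-kernel optimizer implementation; it
    # expects parameters on a supported accelerator (CUDA here).
    opt = torch.optim.Adagrad(model.parameters(), lr=0.01, fused=True)

    loss = model(torch.randn(8, 16, device="cuda")).sum()
    loss.backward()
    opt.step()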

For more details about these highlighted features, see the release blog post. The full release notes follow below.

Backwards Incompatible Changes
Build Frontend
  • Strengthened SVE compile checks in FindARM.cmake, which may reject previously accepted but incorrect SVE configurations (#​176646)

    Source builds that enable SVE now validate the compiler configuration more strictly. If a build previously passed with an incomplete or mismatched SVE setup, it may now fail during CMake configuration instead of later in compilation. Update the compiler/toolchain flags so they accurately describe the target SVE support, or disable SVE for that build.

  • Updated the minimum CUDA version required to build PyTorch from source to CUDA 12.6 (#​178925)

    Building PyTorch from source with CUDA versions older than 12.6 is no longer supported. Users building custom binaries should install CUDA 12.6 or newer and make sure CUDA_HOME points to that installation.

    Version 2.11:

    CUDA_HOME=/usr/local/cuda-12.4 python setup.py develop

    Version 2.12:

    CUDA_HOME=/usr/local/cuda-12.6 python setup.py develop
  • Enforced a C++20 minimum in CMake build files (#​178662)

    Source builds now require a compiler and build configuration that support C++20. If you maintain custom build scripts or downstream extensions that build PyTorch from source, update the compiler and remove assumptions that PyTorch can be built as C++17.
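
    For downstream extensions, a minimal sketch of passing an explicit C++20 flag through torch.utils.cpp_extension (the extension name and source file are hypothetical; recent cpp_extension versions may already select an appropriate standard on their own):

    from torch.utils.cpp_extension import load

    # JIT-build a C++ extension against the installed PyTorch headers,
    # explicitly requesting the C++20 standard.
    ext = load(
        name="my_ext",                  # hypothetical extension name
        sources=["my_ext.cpp"],         # hypothetical source file
        extra_cflags=["-std=c++20"],
    )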

Distributed
  • torch.distributed.nn.functional ops now raise RuntimeError under torch.compile (#​177342)

    All ops in torch.distributed.nn.functional (e.g., broadcast, all_reduce, all_gather, reduce_scatter, all_to_all_single) now raise RuntimeError when called inside torch.compile. Users should migrate to the functional collectives API in torch.distributed._functional_collectives.

    Version 2.11:

    @torch.compile
    def my_func(x):
        return torch.distributed.nn.functional.all_reduce(x, op=ReduceOp.SUM)

    Version 2.12:

    @torch.compile
    def my_func(x):
        return torch.distributed._functional_collectives.all_reduce(x, reduceOp="sum", group=group)
TorchElastic
  • torchrun now defaults to an OS-assigned free port for single-node training instead of port 29500 (#​175699)

    When running torchrun --nproc-per-node=N script.py without specifying --master-port or --standalone, the default behavior now automatically uses an OS-assigned free port via the c10d rendezvous backend. This eliminates "Address already in use" errors when running multiple training jobs concurrently. Multi-node training, explicit --master-port, PET_MASTER_PORT env var, and --standalone are unchanged.

    Version 2.11:

    # Used static rendezvous on port 29500 by default
    torchrun --nproc-per-node=4 train.py

    Version 2.12:

    # Uses OS-assigned free port by default
    torchrun --nproc-per-node=4 train.py
    
    # To explicitly use a fixed port:
    torchrun --nproc-per-node=4 --master-port=29500 train.py
MPS
  • All MPS tensors are now allocated in unified memory (#​175818)

    Previously, MPS tensors could be allocated in either device-only or unified memory. Now all MPS tensors use unified memory unconditionally. This simplifies memory management and enables CPU access to MPS tensor data without explicit copies. Code that relied on device-only memory placement may observe different performance characteristics.

Inductor
  • The max_autotune layout-constraint deferral introduced in 2.11 is now opt-in (#​175330)

    In 2.11, Inductor deferred layout freezing for max_autotune templates to expose more fusion opportunities. This caused a regional-inductor failure mode, so the default in 2.12 reverts to immediate layout freezing. Users who relied on the deferred behavior for fusion opportunities should opt in explicitly via torch._inductor.config.max_autotune_defer_layout_freezing or TORCHINDUCTOR_MAX_AUTOTUNE_DEFER_LAYOUT_FREEZING=1.

    Version 2.11:

    # Deferred layout freezing was the default
    torch.compile(model, mode="max-autotune")

    Version 2.12:

    import torch._inductor.config as cfg
    cfg.max_autotune_defer_layout_freezing = True
    # or set TORCHINDUCTOR_MAX_AUTOTUNE_DEFER_LAYOUT_FREEZING=1
    torch.compile(model, mode="max-autotune")
Deprecations
Release Engineering
  • Deprecate CUDA 12.8 builds in favor of CUDA 13.0 (#​179072)

    CUDA 12.8 binaries have been removed from the PyTorch binary build matrix. CUDA 13.0 is now the stable default and CUDA 12.6 remains available for users on older drivers. Users explicitly pinning the cu128 index URL will need to switch to cu130 (recommended) or cu126.

    Version 2.11:

    pip install torch --index-url https://download.pytorch.org/whl/cu128

    Version 2.12:

    # Use CUDA 13.0 (default on PyPI):
    pip install torch
    # Or explicitly:
    pip install torch --index-url https://download.pytorch.org/whl/cu130
    # Older driver fallback:
    pip install torch --index-url https://download.pytorch.org/whl/cu126
  • Compatibility with CMake < 3.10 will be removed in a future release (#​166259)

    Source builds against CMake versions older than 3.10 now emit a deprecation warning. A future release will require CMake 3.10 or newer; please upgrade CMake before then.

Linear Algebra
  • Several CUDA linear algebra operators no longer use the MAGMA backend and now dispatch to cuSolver or cuBLAS unconditionally:

    • torch.linalg.eigh now dispatches to cuSolver (#​174619)
    • torch.linalg.lu_solve now dispatches to cuSolver/cuBLAS (#​174248)
    • torch.linalg.cholesky_inverse now dispatches to cuSolver (#​174681)
    • torch.linalg.cholesky_solve now dispatches to cuSolver (#​174769)

    User code calling these APIs does not need to change. The practical impact is for users who depended on MAGMA-specific numerical behavior, performance characteristics, or debugging. Those calls now use the cuSolver/cuBLAS implementations on CUDA.
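
    For example, the following call is unchanged in user code and now dispatches to cuSolver on CUDA (a minimal sketch; the batch and matrix sizes are arbitrary):

    import torch

    # Batched Hermitian eigendecomposition on CUDA; in 2.12 this routes to
    # cuSolver rather than MAGMA, with no change to the calling code.
    A = torch.randn(32, 64, 64, device="cuda")
    A = A + A.transpose(-1, -2)  # symmetrize each matrix in the batch
    eigenvalues, eigenvectors = torch.linalg.eigh(A)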

FullyShardedDataParallel2 (FSDP2)
  • Compiling through FSDP2 hooks without graph breaks is no longer supported (#​174863, #​174906). If you use compiled autograd with FSDP2, update your code to allow graph breaks around FSDP2 hooks or disable compiled autograd for the FSDP2 training step.

    Version 2.11:

    with torch._dynamo.config.patch(compiled_autograd=True):
        compiled_model = torch.compile(fsdp_model, fullgraph=True)
        loss = compiled_model(input).sum()
        loss.backward()

    Version 2.12:

    # Either run FSDP2 backward without fullgraph.
    compiled_model = torch.compile(fsdp_model, fullgraph=False)
    loss = compiled_model(input).sum()
    loss.backward()
    
    # Or apply compile before applying FSDP.
    compiled_model_pre_fsdp = torch.compile(model, fullgraph=True)
    compiled_model = fully_shard(compiled_model_pre_fsdp, ...)
    loss = compiled_model(input).sum()
    loss.backward()
Profiler
  • Profiler's metadata_json field is now deprecated; use event_metadata instead (#​179417)

    Version 2.11:

    metadata = event.metadata_json

    Version 2.12:

    metadata = event.event_metadata
Dynamo
  • torch.compile(fullgraph=True) now warns when a call runs no compiled code; will error in 2.13 (#​181940)

    Previously fullgraph=True was only validated once Dynamo actually compiled and ran the function. If Dynamo was bypassed at call time (e.g. under a user-defined TorchDispatchMode), the annotation silently had no effect. 2.12 emits a warning; 2.13 will raise. For graph-break errors without fullgraph's stronger guarantees, use torch._dynamo.error_on_graph_break.

    Version 2.12:

    from torch.utils._python_dispatch import TorchDispatchMode
    
    class LoggingMode(TorchDispatchMode):
        def __torch_dispatch__(self, func, types, args=(), kwargs=None):
            return func(*args, **(kwargs or {}))
    
    @torch.compile(fullgraph=True)
    def model(x):
        return x.sin() + 1
    
    # A user-defined TorchDispatchMode is active, so Dynamo skips the frame
    # and no compiled code runs — emits a warning in 2.12, will raise in 2.13.
    with LoggingMode(): # Remove this to fix warning
        model(torch.randn(3, 4))
  • The inline_inbuilt_nn_modules Dynamo config is deprecated (#​177489, #​178205)

    Inlining of in-built nn.Module instances is now the default; setting the flag emits a deprecation warning and it will be removed in a future release.

    Version 2.11:

    import torch._dynamo.config as cfg
    cfg.inline_inbuilt_nn_modules = True  # was a tunable knob

    Version 2.12:

    # No action needed — inlining is on by default.
    # Remove any explicit references to torch._dynamo.config.inline_inbuilt_nn_modules.
  • Added a deprecation framework to the torch.compile config module so individual options can be marked deprecated (#​169837)

New Features
Release Engineering
Python Frontend
  • Introduced torch.accelerator.Graph as a unified frontend Graph interface (#​171285)
Foreach
  • Add _foreach_clone operator, with a fast path for CUDA utilizing _foreach_copy_ (#​177421)
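
    A minimal sketch, assuming _foreach_clone follows the usual _foreach_* convention of taking a list of tensors and returning a list of new tensors:

    import torch

    params = [torch.randn(1024, device="cuda") for _ in range(4)]
    # One fused call instead of a Python loop of per-tensor clone() calls;
    # on CUDA this uses the _foreach_copy_ fast path mentioned above.
    snapshots = torch._foreach_clone(params)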
Distributed
  • Add Store::barrier API and TCPStore client BARRIER support, reducing synchronization round trips compared to the existing ADD+WAIT pattern (#​174920)
  • Add NCCL communicator suspend(), resume(), and memory_stats() APIs for managing communicator memory lifecycle (#​176300)
  • Add all_to_all support in the Gloo backend (#​165435)
  • Add reduce_scatter_offset to symmetric memory, supporting variable-sized block reductions with NVLink multicast or LSA fallback (#​177791)
  • Enable batch_isend_irecv to work under torch.compile (#​161213)
  • Add torch.distributed.symmetric_memory.is_symm_mem_tensor() API to check if a tensor is a symmetric memory tensor (#​178947)
  • Convert NanCheck to a standalone op (torch.ops.c10d.check_for_nan) usable outside of ProcessGroupNCCL (#​174990)
DTensor
  • Add support for twice-differentiable DTensor redistribution (#​160509)
  • DeviceMesh is now traceable by torch.compile. Make DeviceMesh opaque (#​176661), Make placements opaque (#​171482).
  • Add grad_placements parameter to DTensor.from_local(), allowing explicit control over gradient placements in the backward pass (#​175867)
FullyShardedDataParallel2 (FSDP2)
  • Support per-parameter meshes in FSDP2, enabling different parameter groups to shard over different meshes (#​173509)
  • Support fully_shard with DTensors on a full SPMD mesh via DataParallelMeshDims (#​176334)
  • Add FSDP2 support for non-floating-point parameters by excluding non-float parameters from reduce-scatter while still sharding and all-gathering them as needed (#​177948)
TorchElastic
  • Add configurable --shutdown-timeout to torchrun for controlling the SIGTERM-to-SIGKILL timeout during worker shutdown (#​172596)
CPU x86
  • Expose a CPUBlas brgemm API for fp8 (e4m3 & e5m2) GEMM, backed by oneDNN (#​172548)
CUDA
  • Added support for torch.cond with CUDA graphs, using conditional graph nodes (CUDA 12.4+) so data-dependent control flow can be captured entirely inside a single CUDA graph. Works with the eager and cudagraphs torch.compile backends (no Inductor support yet). (#​168912)
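
    A minimal sketch under the constraints stated above (eager or cudagraphs backend, CUDA 12.4+); the branch functions and shapes are illustrative:

    import torch

    def true_fn(x):
        return x.sin()

    def false_fn(x):
        return x.cos()

    @torch.compile(backend="cudagraphs")
    def model(x):
        # The data-dependent branch is captured as a conditional node
        # inside the CUDA graph rather than forcing a graph break.
        return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

    out = model(torch.randn(16, device="cuda"))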
MPS
  • Implemented linalg_qr for MPS (#​172536)
  • Added cholesky_solve support on MPS (#​176703)
  • Added index_reduce on MPS (#​174936)
  • Implemented torch.distributions.Gamma (forward + backward) on MPS (#​179228)
  • Enabled mvlgamma on MPS (#​178914)
  • Added nonzero_static implementation on MPS (#179589)
ROCm
XPU
  • Support torch.accelerator.Graph on XPU (#​176421)
  • Added memory_clock_rate and memory_bus_width to XPU device properties (#​171967)
  • Enable split_group API when TorchComms is used as a backend for TorchTitan on XPU (#​178236)
Profiler
  • The profiler now supports fine-grained selection of activity types (#176351)
  • The memory visualizer has a new tab showing a private-pool memory view (#177289)
Dynamo
  • Made torch._dynamo.aot_compile public, with aot_eager and inductor backend support and docs (#​179917, #​180008)
  • Added a recompile_limit keyword argument to torch.compile to override the per-function recompile cap without touching global config (#177936); a sketch follows this list
  • Added min/max bounds to torch._dynamo.mark_unbacked for communicating value ranges to the symbolic shape system (#​176313)
  • Added bdb, a pdb-style debugger for stepping through nested frames during Dynamo tracing (n, u, d, r, bt), plus a user-callable breakpoint() that auto-starts it (#​174626, #​174746, #​175200)
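
A minimal sketch of the recompile_limit keyword from the list above; the limit value and function are illustrative:

    import torch

    # Allow up to 16 recompilations for this function only, without
    # changing the global Dynamo recompile cap.
    @torch.compile(recompile_limit=16)
    def scale(x, factor: float):
        return x * factor

    y = scale(torch.randn(8), 2.0)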
Inductor
  • Added user-defined stream support to torch.compile. Inductor now codegens stream context managers (enter/exit) and record_stream calls in the wrapper, enabling user streams to flow through compiled regions with proper synchronization, scheduler integration, and cross-stream dependency tracking (#165390, #165391, #165504, #165505, #174223, #176700, #177694); a stream sketch follows this list
  • Added ao::offload, ao::reload, and ao::wait ops for asynchronous activation offloading. These ops encapsulate async CPU offloading stream management following the same async 2-op pattern as c10d functional collectives, reducing IR size from 7 nodes (offload) and 5 nodes (reload) down to 2 nodes each (#​177621)
  • Added user-defined Triton kernel unary epilogue fusion. Inductor can now fuse user Triton kernels with downstream pointwise epilogues (e.g. relu()), parsing the user kernel source via AST and inlining the epilogue into the tl.store expression (#​173662)
  • Added out-variant discovery and lowering for custom ops. When a custom op registers both functional and .out overloads, Inductor automatically lowers single-output and multi-output functional ops to their .out variants as ExternKernelOut, enabling memory planner buffer reuse (#​175116, #​176117)
  • max_autotune now extends to combo kernels. The autotuning pipeline generates and benchmarks per-sub-kernel block-size phase configs, with chained sequential autotuning and per-sub-kernel reduction hints (#​177715, #​178936, #​179317)
  • Added non-TMA persistent Triton templates for mm and addmm for max-autotune, enabling persistent kernels on hardware without TMA (#​177781, #​179095)
  • Added CUTLASS backend support for torch.float8_e5m2 dtype, including registration for FP8 GEMM autotuning (#​171176)
  • Added XPU CUTLASS GEMM kernel codegen and codecache to max-autotune-gemm, allowing CUTLASS-style GEMM templates to target Intel GPUs (#​161938, #​161939)
  • Added MTIA Triton codegen for sort, median, and mode operations (#​178525)
  • Added a Triton template for depthwise conv1d (#​175280)
  • Added AVX10.2 fp32↔fp8 intrinsics in at::vec::convert for the Inductor C++ x86 backend (#​172309)
  • Pallas backend: added scalar prefetch and indirect access support (#​177212)
  • Added a disable_welford_reduction config flag to opt out of Welford reduction in codegen (#​175778)
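
A minimal sketch of a user-defined stream flowing through a compiled region, per the stream-support item above; the kernel body is illustrative:

    import torch

    side_stream = torch.cuda.Stream()

    @torch.compile
    def f(x):
        # Work enqueued on a user-created stream inside the compiled region;
        # Inductor now emits the stream context and synchronization for it.
        with torch.cuda.stream(side_stream):
            y = x * 2.0
        return y + 1.0

    out = f(torch.randn(1024, device="cuda"))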
Ahead-Of-Time Inductor (AOTI)
  • Add MXFP4 dtype support (float8_e8m0fnu and float4_e2m1fn_x2) to the AOTInductor C shim layer, enabling MXFP4 quantization (e.g., for AMD MI350) (#​176496)
  • Add a compile backend registry and custom device support for AOTI Eager, letting out-of-tree device backends plug into the AOTI eager compile/load flow without modifying upstream code (#​175605)
torch.fx
  • Add tuple_return option to split_module that wraps submodule outputs in a tuple (#​179007)
  • Add ignore_raw_node option to GraphPickler (#​176939)
  • Add _merge_overlapping_fusions() method to FxNetSplitter which detects and merges overlapping fusion groups (#​177099)
torch.export
  • Add serialization support for float8_e8m0fnu dtype (#​176270)
  • Add serialization support for torch.uint32 and torch.uint64 dtypes (#​179434)
  • Add serialization support for nested float lists (List[List[float]]) (#​178081)
JIT
  • Added input-independent graph optimization API (#​179393)
Improvements
Release Engineering
Python Frontend
  • Used compiler wrapper when building C++ extensions (#​175696)
  • Updated uniform and normal sampling on CPU to improve fp16/bf16 results (#​175988)
  • Changed requires_grad to Optional[bool] in torch.asarray (#​170897)
Autograd
  • Implemented narrow_copy derivative (#​175609)
  • Implemented higher-order derivatives for grid_sample (#​177487)
  • Implemented backward and forward AD for torch.aminmax (#​175215)
  • Exposed num_splits in varlen attention to allow disabling split_kv (#​176905)
  • Added user AutoNamingMode support in Selective Activation Checkpointing (#​175348)
  • Refactored torch.utils.checkpoint to no longer use autograd.Function for saving inputs (#​174327)
Dataloader
  • Added a thread-safe RNG utility function (#​175375)
Linear Algebra
  • Added _int_mm unsigned int8 × signed int8 (u8s8) support on CPU (#​168226)
  • Added FP64 support for TunableOp on ROCm via hipBLASLt (#​178195)
Nested Tensor (NJT)
  • Added nested tensor deserialization support (#​174843)
torch.nn
  • Added bias argument to nn normalization methods (LayerNorm, GroupNorm, RMSNorm, etc.) (#176573); a sketch follows this list
  • Improved MultiMarginLoss error message for inconsistent target size (#​174072)
  • Added enable_gqa flag to varlen_attn (#​179468)
  • Allowed eps=0 in batch_norm during eval mode (#​175508)
  • Added meta device support in trunc_normal_ initialization (#​176240)
  • Split onehot checks for CPU and accelerators (#​181211)
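
A minimal sketch of the bias argument from the torch.nn list above, assuming it follows the familiar bias=False keyword pattern; shapes are illustrative:

    import torch

    # Normalization layers without the additive bias term; the learnable
    # scale (weight) is kept.
    ln = torch.nn.LayerNorm(256, bias=False)
    gn = torch.nn.GroupNorm(8, 256, bias=False)

    x = torch.randn(4, 256)
    x_img = torch.randn(4, 256, 16, 16)
    y = ln(x)
    y_img = gn(x_img)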
Sparse
  • Implemented clone operator for semi-structured sparse tensors (#​174991)
  • Allowed semi-structured sparse tensors to be instantiated with alg_id (#​178659)
  • Enabled FP8 semi-structured sparsity on ROCm via hipSPARSELt (#​179310)
Build Frontend
C++ Frontend
  • Upgraded cpp_extension and cpp_builder to C++20 (#​176659)
  • Reland at::Tag header-only changes and add a library.def override for tags (#​181608)
Distributed
  • Add configurable worker timeout and partial data support to the distributed debug server (#​176058)
  • Add timeout parameter to torch.distributed.barrier() (#​174974)
  • Add reduce_scatter_tensor_coalesced support to ProcessGroupWrapper (#​168961)
  • Functional collectives API now automatically handles non-contiguous inputs instead of asserting (#​177965)
  • DDP: Add batched_grad_copy option to reduce per-parameter kernel launches to 2 kernels per bucket (#​176638)
  • DDP: Refactor bucket capacity config into BucketCapacityConfig dataclass (#​175217)
  • Add signal name to ChildFailedError exitcode output for better debugging (#​175254)
  • Add CUDA-aware detection for Cray MPICH (#​178323)
  • Support dist.broadcast for FP8 tensors on GPUs older than SM90 (#​175884)
  • Add __torch_function__ handlers for distributed functions (#​176376)
  • Enable split_group API for TorchComms on XPU (#​178236)
  • Make py-spy dumps nonblocking by default (#​178312)
  • Add ncclx and gloo to FlightRecorder trace analyzer backend allowlist (#​180268)
  • Improve error message on symmetric memory handle exchange (#​178989)
  • SymmMem: Add thread safety to NCCL and NVSHMEM backends (#​176551)
  • Check NCCL terminate signal more frequently when exiting from heartbeat monitor (#​170000)
  • Implement missing methods in ProcessGroupWrapper (#​178779)
  • Add compute_estimator option for overlap scheduling (#​175204)
  • [local_tensor] Add standalone rank_map/tensor_map functions (#​174795)
Distributed Checkpoint (DCP)
  • DCP: Improve save plan validation error messaging (#​176728)
  • DCP: Preserve original exception in metadata read failure for better debuggability (#​177739)
DTensor
  • Support DTensor view ops (flatten/unflatten) with _StridedSharding for full nn.Linear(DTensor) compatibility (#​166483)
  • Add Dijkstra-based single-dim strategy search for DTensor sharding propagation, avoiding exponential enumeration of strategy combinations (#​169438)
  • DTensor: Add is_pinned() support (#​177235)
  • DTensor: Add print() HOP support (#​175222)
  • DTensor: Emit zero paddings for uneven shardings to enable SPMD compilation (#​177758)
  • DTensor: Make run_dtensor_rng_op compatible with compile_on_one_rank (#​177447)
  • DTensor: Lenient handling of view redistributes in decomposition flow (#​175194)
  • DTensor: Redistribute from/to _StridedShard through Replicate (#​179059)
  • DTensor: Raise clearer error for unsupported Split(Flatten) sharding propagation (#​179632)
  • DTensor: Unbacked-safe view_groups (#​174629)
  • DTensor: Expanded sharding strategy coverage for index_select, index, index_fill, index_reduce, roll, fft, constant_pad_nd, squeeze.dims, interpolate, linalg ops, LayerNorm/RMSNorm FW/BW, foreach/fused ops, and einsum linearity (#​176037, #​176038, #​178456, #​175463, #​175656, #​173563, #​176991, #​176955, #​179173, #​177186, #​177187, #​176150, #​174830)
FullyShardedDataParallel (FSDP)
  • Remove mixed-dtype rejection from FSDP clip_grad_norm to match the documented behavior (#​173641)
FullyShardedDataParallel2 (FSDP2)
  • Allow ModuleList/ModuleDict subclasses that implement forward() (#​175033)
  • FSDP2: Support dataclass args/kwargs output without memory leakage (#​174692)
  • Share more implementation code between replicate and FSDP2 fully_shard (#​173580)
  • Consolidate FSDP2 shard_mesh and shard_mesh_from_root handling (#​174107)
Distributed Pipeline
  • DTensor metadata foundation for Pipeline Parallelism with DTensor-aware stage and schedule refactoring (#​177727, #​177728)
  • Pipeline Parallel: Dispatch homogeneous P2P ops individually to avoid stream serialization (#​175712)
TorchElastic
  • [elastic] Add Windows support for stdout/stderr redirects (#​176789)
  • torchelastic: Keep health check alive during exit barrier (#​178197)
CUDA
  • [CUDA] Fix offset_t operators to be __host__ __device__ in SortStable.cu (#​175997)
  • [CUDA] [Green Context] Add support for workqueue limit (#​177242)
  • Remove dead avg_pool3d backward shape-check variables in CUDA (#​178893)
  • [BE] add missing assert on cuda device synchronize in ATen tests (#​174966)
  • [reland 2][pytorch] Preemptive OOM rejection using per_process_memory_fraction + throw_on_cudamalloc_oom (#179473)
  • [cuda graphs] Add enable_annotations kwarg to torch.cuda.graph (#179867)
  • [CUDA/ROCm] avoid double casting in ReduceLogicKernel (#​176132)
  • Nit fix: Align state_step tensor max to param tensor max (#​178913)
cuDNN
  • Upgrade CUDA 12.8, 12.9, and 13.0 wheels to cuDNN 9.20.0.48 (#177321)
MPS
ROCm
  • CPP extensions only compile for user's detected arch (#​168998)
  • Remove obsolete HIP NaN handling workarounds; remove technical debt (#​171104)
XPU
Profiler
Dynamo
Inductor
  • Unified OUT_DTYPE, ACC_TYPE, and INDEX_DTYPE codegen flow in Triton templates (#​179453)
  • Enabled cudagraph w/o partition for cpp-wrapper (#​179249)
  • Added FMA-based addcdiv lowering for CUDA parity with eager and matching _foreach_addcdiv to _foreach_addcmul (#​174912, #​175309, #​175310, #​175839, #​176237)
  • Added lerp decompositions for bitwise parity with eager (#​176804)
  • Added outer-product decomposition (#​176552)
  • Enabled padding fusion with torch.cat and avoided duplicate computation in cat/pad when inputs have multiple consumers (#​175729)
  • Lowered functional symmetric memory ops to ExternKernelOut for output buffer reuse, and added symm_mem planning for graph inputs and fallback regions (#​174856, #​175449)
  • Modified addmm template call to support hipblaslt bias-fused kernels on ROCm (#​177130)
  • Newly trained PadMM AutoHeuristics for A100 and H200, plus support for pad_mm AutoHeuristics in deterministic mode (#​176186, #​179826)
  • Propagate metadata in pattern matcher and add validation (#​179113)
  • FlexAttention: raise a clear NotImplementedError when return_aux=AuxRequest(max_scores=True) is requested with BACKEND='FLASH' instead of failing later with an opaque error (#​177434)
  • Migrated Inductor internals from legacy allow_tf32 to fp32_precision to avoid divergence with the new TF32 API (#​176098)
  • Pallas backend: enabled element-wise ops, native TPU OOB DMA masking via aligned block specs, and generalized N-D transpose permutation detection (#​174743, #​175458, #​176952)
  • Registered lowerings for prims.scalar_tensor and aten.arange.start_step (#​179017, #​179028)
  • Added SDPA pattern matching support for visformer (#​177826)
  • Relaxed concat-linear fusion to support GQA QKV (#​178523)
  • Allowed subgraphs to be benchmarked with async pipelined autotuning (#​175455)
  • Added convert_element_type lowering to emulate PyTorch eager numerics (#​176781)
  • Added GEMM configs to XPU autotuning heuristic (#​177647)
  • Added kpack Triton compile options on ROCm (#​173179)
  • ROCm: enabled exhaustive autotuning for FP8 (#​177797)
  • Override decomposition for aten.index_add (#​179486)
  • Drop tile_k from nvMatmulHeuristics matching (#​176845)
Ahead-Of-Time Inductor (AOTI)
  • Add aten._grouped_mm to AOTInductor fallback ops, enabling cpp-wrapper mode for grouped_mm (#​177307)
  • Support lazy Triton kernel compilation for cpp-wrapper on XPU (#​179239)
  • Add dynamic shapes support to AOTI Eager via AOTIPythonKernelHolder, allowing a single compiled kernel to serve multiple input shapes (#​176018)
  • Support multi-return ops in AOTI Eager (e.g., native_layer_norm, aminmax) (#​176019)
  • Allow custom ops with Optional[List[T]] arguments in cpp wrapper (#​174460)
  • Add lazy Triton kernel compilation for cpp-wrapper (#​175416)
  • Add TMA support for lazy Triton kernel compilation (#​175548)
  • Call latest c_shim version for versioned fallback ops (#​181548)
  • Add BC-safe c_shim v2 for _scaled_dot_product_attention_math_for_mps enable_gqa (#​181549)
torch.fx
  • Update get_source_partitioner to parse nn_module_stack metadata for improved source-based graph partitioning (#​175788)
  • split_module now uses _make_graph_module to support lazy recompile (#​177907)
  • Fix fuser_utils.topo_sort to produce a stable ordering (#​175378)
  • Fix GraphPickler to support nodes with slice() arguments (#​175996)
Composability
  • Added DynamicInt __pow__ and __rpow__ methods (#​179868)
  • Added scaled_mm_v2 CPU implementation (#​176266)
Bug fixes
Release Engineering
  • Fix periodic inductor CI silently skipping all tests (#​177695)
  • Fix python docs build hanging in CI (#​180177)
  • Avoid installing test dll into

Note

PR body was truncated to here.


Configuration

📅 Schedule: (UTC)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Enabled.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR has been generated by Mend Renovate.

| datasource | package | from   | to     |
| ---------- | ------- | ------ | ------ |
| pypi       | torch   | 2.11.0 | 2.12.0 |
dreadnode-renovate-bot (bot) added the type/digest (Dependency digest updates) label on May 17, 2026