
Support PyTorch 2.9 #2743

Open
hemanth1999k wants to merge 6 commits into apple:main from hemanth1999k:support-torch-2.9-training-dialect

@hemanth1999k

Contributor

Summary

Starting with torch 2.9, torch.export.export() returns an ExportedProgram in the new TRAINING IR dialect by default (it used to be ATEN). The PyTorch frontend only accepts ATEN/EDGE, so _validate_conversion_arguments rejects every torch.export-based model on torch 2.9 before conversion even starts:

NotImplementedError: Conversion for models with only ATEN or EDGE dialect is supported/tested.
Provided Dialect: TRAINING. Run '.run_decompositions({})' on your exported PyTorch Model prior to conversion.

This isn't one broken op — it breaks essentially every ct.convert(exported_program, ...) call the moment you upgrade to torch 2.9.

Part of #2615.

The error message already tells the user the remedy (run_decompositions({})), and the converter's own testing_utils runs exactly that after every torch.export.export(...). This PR just moves that one step inside convert() so existing user code keeps working without changes.

Fix

  • In convert(), before the argument validation, if the model is an ExportedProgram whose dialect is not ATEN/EDGE, lower it with model.run_decompositions({}).
  • The lowered (ATEN) program then flows through validation and into mil_convert unchanged.
  • No-op for torch <= 2.8 (already ATEN) and for EDGE (ExecuTorch), so those paths are untouched.
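The lowering step described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual diff: `lower_to_supported_dialect` and `FakeExportedProgram` are hypothetical names, and the only behavior assumed of a real `ExportedProgram` is the `dialect` attribute and the `run_decompositions({})` call mentioned in this PR.

```python
SUPPORTED_DIALECTS = ("ATEN", "EDGE")


def lower_to_supported_dialect(model):
    """Return `model` unchanged if it is ATEN/EDGE, else its decomposed form.

    Hypothetical helper: it duck-types on `dialect` and `run_decompositions`
    instead of importing torch, so objects without a dialect pass through.
    """
    dialect = getattr(model, "dialect", None)
    if dialect is None or dialect in SUPPORTED_DIALECTS:
        return model  # no-op path (e.g. torch <= 2.8 ATEN, or ExecuTorch EDGE)
    return model.run_decompositions({})


class FakeExportedProgram:
    """Stand-in for torch.export.ExportedProgram, for illustration only."""

    def __init__(self, dialect):
        self.dialect = dialect

    def run_decompositions(self, decomp_table=None):
        # The real method returns a new, lowered program; mimic that here.
        return FakeExportedProgram("ATEN")


training = FakeExportedProgram("TRAINING")
edge = FakeExportedProgram("EDGE")
lowered = lower_to_supported_dialect(training)
untouched = lower_to_supported_dialect(edge)
```

Returning the input object itself on the no-op path is what keeps the ATEN and EDGE paths untouched in this sketch.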

Test

Adds TestPyTorchConverterExamples.test_convert_exported_program_training_dialect: it exports a small Linear+ReLU model and calls ct.convert(...) directly, with no manual run_decompositions(). On torch 2.9 the exported program is in the TRAINING dialect (so this is the regression guard); on older torch it's ATEN and the test still passes.
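A rough sketch of what such a regression test might exercise, assuming `torch` and `coremltools` are installed. The function and class names are hypothetical and this is not the test added by the PR; the imports are kept inside the function so the sketch can be loaded without either package.

```python
def convert_linear_relu_without_manual_lowering():
    """Export a small Linear+ReLU model and convert it directly."""
    # Imported lazily so that defining this function needs neither package.
    import torch
    import coremltools as ct

    class LinearReLU(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(4, 2)

        def forward(self, x):
            return torch.relu(self.linear(x))

    model = LinearReLU().eval()
    example_inputs = (torch.rand(1, 4),)
    exported_program = torch.export.export(model, example_inputs)

    # Deliberately no exported_program.run_decompositions({}) here: with this
    # PR, convert() is expected to lower a TRAINING-dialect program itself.
    return ct.convert(exported_program)
```

Without the change described in this PR, the final call is where the `Provided Dialect: TRAINING` error quoted above would be expected on torch 2.9.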

Verification

Built against coremltools 9.0 + torch 2.9.0 on macOS (arm64):

  • Without this change: every torch.export conversion I tried — linear/relu, layer_norm, conv1d, sdpa, where, pow, floor_divide, instance_norm3d — fails with the Provided Dialect: TRAINING error above.
  • With this change: the same models convert, and predictions match PyTorch within fp16 tolerance (e.g. layer_norm + linear: max abs diff ~3.5e-4).
  • torch <= 2.8 is unaffected; the new branch only fires for a non-ATEN/EDGE dialect.

One thing this PR deliberately leaves alone: _TORCH_MAX_VERSION and reqs/pytorch.pip. A few op-level signature changes in 2.9 still need their own fixes (e.g. hann_window now reports a different arg count, which breaks stft), so I didn't want to claim 2.9 is fully tested. This is just the dialect-level unblock that everything else on 2.9 sits behind.

Starting with torch 2.9, torch.export.export() returns an ExportedProgram
in the new TRAINING IR dialect by default instead of the ATEN dialect. The
converter only accepts ATEN/EDGE, so every torch.export-based conversion
failed on torch 2.9 with a NotImplementedError telling users to run
run_decompositions() themselves.

convert() now lowers any non-ATEN/EDGE ExportedProgram to ATEN via
run_decompositions() automatically, so existing convert() calls keep working
on torch 2.9 with no source changes. No-op for torch <= 2.8 (ATEN default)
and for EDGE (ExecuTorch). Adds a regression test.

Part of apple#2615.

@TobyRoseman left a comment

Collaborator

I'm a bit hesitant to merge any change that gives only partial PyTorch 2.9 support, as we will not be able to properly test those changes in CI without bumping the PyTorch version it uses.

Any chance you could look into making us fully support 2.9?


@staticmethod
@pytest.mark.skipif(not _HAS_TORCH_EXPORT_API, reason="torch.export API not available.")
def test_convert_exported_program_training_dialect():

Collaborator

We can't properly test this in CI until we update the version of PyTorch that CI uses.

Contributor Author

Bumped the CI torch pin to 2.9.0 (and _TORCH_MAX_VERSION), so this now runs against a torch where export() defaults to the TRAINING dialect and actually covers the lowering path.

exact_source == "pytorch"
and _HAS_TORCH_EXPORT_API
and isinstance(model, ExportedProgram)
and model.dialect not in ("ATEN", "EDGE")

Collaborator

Wouldn't we also want to test the version of PyTorch installed?

Contributor Author

Good call — added assert exported_program.dialect not in ("ATEN", "EDGE") (guarded on torch >= 2.9) so the test provably drives this path on the installed torch instead of passing as a no-op.
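A version-guarded dialect check of the kind described in this reply might look like the sketch below. The helper names are hypothetical, and the version parsing is a simplification that reads only the major and minor components.

```python
def parse_major_minor(version):
    """Parse strings such as '2.9.0' or '2.9.0+cpu' into (major, minor)."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def check_export_dialect(dialect, torch_version):
    """Fail if torch >= 2.9 produced a program that needs no lowering.

    On older torch the export is already ATEN, so nothing is asserted there.
    """
    if parse_major_minor(torch_version) < (2, 9):
        return
    if dialect in ("ATEN", "EDGE"):
        raise AssertionError(
            f"expected a non-ATEN/EDGE dialect on torch {torch_version}, "
            f"got {dialect}; the lowering path would not be exercised"
        )


check_export_dialect("TRAINING", "2.9.0")  # passes: lowering path is driven
check_export_dialect("ATEN", "2.8.0+cpu")  # passes: guard is inactive
```

The guard keeps the test meaningful on both sides of the version boundary: on torch 2.9 it cannot pass silently as a no-op, and on older torch it does not fail for a dialect that was never expected to change.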

@hemanth1999k

Contributor Author

Makes sense. I'll bump the torch pin to 2.9 and fix the remaining op breakages so CI can test it properly, then update this PR.

…er hann_window.periodic

Bumps _TORCH_MAX_VERSION and the arm64 torch pin to 2.9.0 so CI exercises 2.9.

Fixes the op-level breakages that 2.9's torch.export path surfaces:
- hann_window: the handler required 5/6 positional inputs (TorchScript shape);
  torch.export/ExecuTorch pass only window_length (+ periodic). Use per-frontend
  expected/min_expected and detect 'periodic' by input count + frontend.
- hann_window.periodic overload was unregistered (sanitize_op_kind doesn't strip
  the 'periodic' suffix) -> register it as a torch_alias.
- rms_norm: required exactly 4 inputs; export omits the optional weight/eps when
  defaulted. Relax to min 2 and index weight/eps defensively.

Adds frontend coverage to test_hann_window and a new TestRMSNorm, so CI validates
the export path. Verified locally against torch 2.9.0: convert + predict match
PyTorch within fp16 tolerance for both periodic variants and weight/no-weight.
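To make the hann_window input-count issue concrete, here is a small pure-Python sketch. The frontend strings and function names are hypothetical, the input layouts are simplified from the commit message above, and the window formula is the standard Hann definition rather than code taken from the converter.

```python
import math


def parse_hann_window_inputs(inputs, frontend):
    """Return (window_length, periodic) from a handler's positional inputs.

    Assumed layouts (simplified):
      "torchscript": 5 inputs without `periodic`, 6 with it at index 1.
      anything else: 1 input, or 2 with `periodic` at index 1.
    """
    has_periodic = len(inputs) == (6 if frontend == "torchscript" else 2)
    periodic = bool(inputs[1]) if has_periodic else True  # torch defaults to periodic
    return int(inputs[0]), periodic


def hann_window(window_length, periodic=True):
    """Reference Hann window as a list of floats."""
    if window_length == 1:
        return [1.0]
    # Periodic windows divide by N, symmetric windows by N - 1.
    denom = window_length if periodic else window_length - 1
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / denom)
        for n in range(window_length)
    ]


length, periodic = parse_hann_window_inputs([4, False], "torchexport")
window = hann_window(length, periodic)
```

The point of the per-frontend count is that the same `periodic` flag sits at index 1 in both layouts, but whether it is present can only be told apart by knowing which frontend produced the inputs.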
@hemanth1999k changed the title from "Auto-lower torch 2.9 TRAINING export dialect in convert()" to "Support PyTorch 2.9" Jun 18, 2026
@hemanth1999k

Contributor Author

Done — pushed full 2.9 support on top of the dialect fix:

  • Bumped _TORCH_MAX_VERSION and the arm64 torch pin to 2.9.0 so CI exercises 2.9.
  • Fixed the op breakages 2.9's torch.export path surfaces:
    • hann_window: handler required 5/6 positional inputs (TorchScript shape); export passes only window_length (+ periodic). Made it per-frontend, and registered the hann_window.periodic overload (sanitize_op_kind doesn't strip periodic) — this also unblocks stft.
    • rms_norm: required exactly 4 inputs; export omits optional weight/eps when defaulted. Relaxed to min 2.
  • Added frontend coverage to test_hann_window + a new TestRMSNorm so CI validates the export path.
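The rms_norm relaxation in the list above amounts to treating the trailing inputs as optional. A minimal sketch, assuming the positional order (input, normalized_shape, weight, eps); the function names are hypothetical and the reference computation works on a flat list rather than a tensor.

```python
import math
import sys


def parse_rms_norm_inputs(inputs):
    """Accept 2 to 4 positional inputs, filling missing optionals with None."""
    if len(inputs) < 2:
        raise ValueError(f"rms_norm expects at least 2 inputs, got {len(inputs)}")
    x, normalized_shape = inputs[0], inputs[1]
    weight = inputs[2] if len(inputs) > 2 else None
    eps = inputs[3] if len(inputs) > 3 else None
    return x, normalized_shape, weight, eps


def rms_norm_1d(x, weight=None, eps=None):
    """Reference RMS norm over a flat list of floats."""
    if eps is None:
        # Stand-in default for this sketch; torch uses the input dtype's epsilon.
        eps = sys.float_info.epsilon
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [v / rms for v in x]
    if weight is not None:
        out = [o * w for o, w in zip(out, weight)]
    return out


x, shape, weight, eps = parse_rms_norm_inputs([[3.0, 4.0], [2]])
normalized = rms_norm_1d(x, weight, eps)
```

Indexing by length rather than unpacking a fixed number of values is what lets the same handler serve both a four-input call and an export that drops defaulted arguments.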

Ran a ~40-op probe against torch 2.9.0 locally: everything converts and matches PyTorch within fp16 tolerance, except hamming/blackman/bartlett/kaiser_window — those were never implemented (not 2.9 regressions). Should be safe to bump CI now.

@TobyRoseman

Collaborator

executorch>=0.7.0 resolved to the latest (1.3.1, which needs torch>=2.12),
making the install ResolutionImpossible against torch==2.9.0. executorch 1.0.x
is the release built for torch 2.9 (requires torch>=2.9,<2.10 and torchao==0.14.0),
so pin to it and bump torchao to 0.14.0 to match (also fixes the
test_coreml_quantizer collection error under torch 2.9).
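The resolution conflict described in this commit message can be checked with a few lines of arithmetic on version tuples. This is a simplified sketch with hypothetical helper names; it handles only plain dotted numeric versions and is not a substitute for real PEP 440 specifier matching.

```python
def version_tuple(version):
    """Turn '2.9' or '2.9.0' into a 3-component tuple of ints."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def satisfies(version, lower=None, upper=None):
    """True if lower <= version < upper, where either bound may be omitted."""
    v = version_tuple(version)
    if lower is not None and v < version_tuple(lower):
        return False
    if upper is not None and v >= version_tuple(upper):
        return False
    return True


TORCH_PIN = "2.9.0"

# executorch 1.3.1 is described above as needing torch>=2.12.
latest_ok = satisfies(TORCH_PIN, lower="2.12")
# executorch 1.0.x is described above as needing torch>=2.9,<2.10.
pinned_ok = satisfies(TORCH_PIN, lower="2.9", upper="2.10")
print(latest_ok, pinned_ok)  # False True
```

With an open-ended `executorch>=0.7.0` requirement the resolver is free to pick the newest release, so the upper bound on executorch is what keeps it compatible with the torch pin.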
…ompositions bug)

torch 2.9's ExportedProgram.run_decompositions({}) raises "NameError: name 'L'
is not defined" while interpreting the _guards_fn submodule it generates for
dynamic-shape exports that carry shape guards (e.g. unfold's H/W >= f(kernel,
dilation, padding, stride) constraint). This is an upstream torch regression,
not a converter bug: static-shape unfold and every other export op are
unaffected (verified: 240 passed / 240 skipped / 0 failed for TestUnfold on
the export frontend). Skip the guarded dynamic-shape cases on torch>=2.9 until
the torch issue is resolved.
@hemanth1999k

Contributor Author

Thanks for running CI — went through the 3 failures:

1 & 2 (test_executorch, coremltools_test) — dependency resolution. executorch>=0.7.0 resolved to 1.3.1, which requires torch>=2.12, so the install was ResolutionImpossible against torch 2.9. ExecuTorch 1.0.x is the release built for torch 2.9 (needs torch>=2.9,<2.10 + torchao==0.14.0), so I pinned executorch>=1.0.0,<1.1.0 and bumped torchao 0.12.0 → 0.14.0 (which also clears the test_coreml_quantizer collection error).

3 (test_pytorch_export) — all 208 failures were test_unfold[is_dynamic_hw=True], and it's an upstream torch 2.9 bug. torch 2.9's ExportedProgram.run_decompositions({}) raises NameError: name 'L' is not defined while interpreting the _guards_fn submodule it generates for dynamic-shape exports that carry shape guards (unfold constrains H/W ≥ f(kernel, dilation, padding, stride)). I reproduced it minimally — it fails on both strict=True and strict=False, and it's not specific to the converter (static-shape unfold and every other export op convert fine). Locally, TestUnfold on the export frontend is now 240 passed / 240 skipped / 0 failed.

Since it's a torch regression rather than something we can fix here, I skipped the guarded dynamic-shape unfold cases on torch>=2.9 with a comment to re-enable once torch fixes it. Happy to file/track the torch issue if you'd like it referenced by number instead.

# a converter bug; static-shape unfold and all other export ops are
# unaffected. Re-enable once the torch regression is resolved.
pytest.skip(
"rdar://torch-2.9 run_decompositions() NameError on _guards_fn "

Collaborator

Remove the "rdar://torch-2.9". That doesn't make any sense.

Contributor Author

Done — dropped the rdar:// prefix; the skip reason is now just the torch 2.9 run_decompositions() / _guards_fn explanation.

@TobyRoseman

Collaborator

@hemanth1999k

Contributor Author

Pushed updates addressing all review comments: removed the rdar reference, and added a dialect assertion so the training-dialect test provably exercises the auto-lowering path on the installed torch (the CI pin is now 2.9.0). Ready for another look whenever you can re-run CI — thanks!

@TobyRoseman

Collaborator


2 participants