[BUG] Running sparse_mla_bwd.py in tilelang 0.1.9 produces NaNs #2199

@iclementine

Required prerequisites

What version of TileLang are you using?

0.1.9

System information

3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] linux
0.1.9
2.7.0a0+79aa17489c.nv25.04

Problem description

  • Install tilelang 0.1.9
  • Clone tilelang's repository and check out tag v0.1.9
  • Run examples/deepseek_v32/sparse_mla_bwd.py

The script fails the tensor similarity test: almost all values in tl_dq and tl_dkv are NaN.

Changing

tilelang.PassConfigKey.TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True,

to

tilelang.PassConfigKey.TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: False,

in the tl.jit arguments of bwd fixes the problem, at the cost of higher shared memory usage. I would like to know what the aggressive merge breaks.

Reproducible example code

Traceback

ERROR: dq similarity check failed, diff=nan (threshold=1.00e-04)
Traceback (most recent call last):
  File "/workspace/inference/chenfeiyu/repos/github/tilelang/examples/deepseek_v32/sparse_mla_bwd.py", line 373, in <module>
    test_sparse_mla_bwd(B=1, S=4096, SKV=8192, H=64, HKV=1, DQKV=576, DV=512, topk=2048, dtype=torch.bfloat16, check_correctness=True)
  File "/workspace/inference/chenfeiyu/repos/github/tilelang/examples/deepseek_v32/sparse_mla_bwd.py", line 313, in test_sparse_mla_bwd
    assert_tensors_similar(tl_dq, ref_dq, eps=1e-4, name="dq")
  File "/workspace/inference/chenfeiyu/repos/github/tilelang/examples/deepseek_v32/utils.py", line 310, in assert_tensors_similar
    assert False  # noqa: B011
           ^^^^^
AssertionError
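
For readers wondering why the log reports diff=nan instead of a large number, here is a minimal pure-Python sketch. It assumes the similarity metric has the form 1 - 2*sum(x*y)/sum(x*x + y*y); the real assert_tensors_similar in utils.py may differ, and the names similarity_diff and passes are hypothetical helpers made up for this illustration.

```python
import math


def similarity_diff(xs, ys):
    """Return 1 - 2*sum(x*y) / sum(x*x + y*y); 0.0 means identical inputs.

    Assumed form of the metric, not copied from utils.py.
    """
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x + y * y for x, y in zip(xs, ys))
    if den == 0:
        return 0.0
    return 1.0 - 2.0 * num / den


def passes(diff, eps=1e-4):
    # NaN compares False against every number, so a NaN diff can never pass.
    return diff <= eps


ref = [1.0, 2.0, 3.0]
good = [1.0, 2.0, 3.0]
bad = [1.0, float("nan"), 3.0]  # a single NaN element in the output

print(similarity_diff(ref, good), passes(similarity_diff(ref, good)))
print(similarity_diff(ref, bad), passes(similarity_diff(ref, bad)))
```

Under this assumption a single NaN element is enough to turn both sums, and therefore the whole diff, into NaN, so the reported value says nothing about how many elements are affected. Counting NaN entries in tl_dq and tl_dkv directly is the more informative check.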

Expected behavior

The script passes the tensor similarity test with the default configuration.

Additional context

No response
