Draft
259 commits
8952bc4
[Core] Add function to convert container to string (#1342)
timmoon10 Nov 22, 2024
ae393e8
Support CUDA Graph for MoE models (#1233)
buptzyb Nov 25, 2024
60ce21f
[Common] Moved framework agnostic THD kernels to common. (#1339)
mgoldfarb-nvidia Nov 25, 2024
a132ac4
Fix cuda graph capture for grouped gemm (#1345)
xrennvidia Nov 27, 2024
0951971
Update list of CI users (#1340)
timmoon10 Dec 2, 2024
64126aa
Improving communication overlap for the case of multi kernel queue us…
youngeunkwon0405 Dec 2, 2024
44f6ff2
add paged attention; test_kv_cache_accuray and test_paged_attn pass
cyanguwa Dec 4, 2024
06605e5
remove unnecessary change from last commit
cyanguwa Dec 4, 2024
0b2eb88
test_fused_attn pass
cyanguwa Dec 4, 2024
d243b79
Merge branch 'main' into paged_attention
cyanguwa Dec 4, 2024
b0a5da4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 4, 2024
b4efd71
remove unnecessary import in test_numerics
cyanguwa Dec 4, 2024
e637a07
add license for test
cyanguwa Dec 4, 2024
767c8f5
fix lint
cyanguwa Dec 4, 2024
a3bb14f
add to L0 test
cyanguwa Dec 4, 2024
d65933c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 4, 2024
d3cbccd
[JAX] Scale sequence length in CP tests to avoid tiny sizes. (#1347)
mgoldfarb-nvidia Dec 4, 2024
71ada55
Debug nightly docs (#1338)
timmoon10 Dec 5, 2024
8c00424
[PyTorch] Store module extra state in tensor (#1335)
timmoon10 Dec 5, 2024
d978e80
Fix attention mask type for Flash Attention + CP + THD (#1354)
xrennvidia Dec 5, 2024
d8b13cb
Disable FP8 in Mcore integration test on older GPUs (#1357)
timmoon10 Dec 6, 2024
3102fdd
[C] Normalization Refactor + Adding CUDNN backend (#1315)
phu0ngng Dec 6, 2024
e4c99b0
[JAX] Use default factory for not sharing mutable default values (#1364)
zlsh80826 Dec 10, 2024
0e1d9fa
[JAX] Bug fix for distributed normalization (#1366)
phu0ngng Dec 12, 2024
e7bfc0c
Add user to CI (#1371)
ksivaman Dec 12, 2024
1ae8190
Fix an invalid reference in the doc (#1362)
wujingyue Dec 14, 2024
1975ace
[JAX] Bug Fix: Softmax FFIs with correct Encapsulates (#1375)
phu0ngng Dec 14, 2024
0196ed4
Enabling FP8 all-gather for TE Float8Tensor when using Torch FSDP2 (#…
youngeunkwon0405 Dec 16, 2024
f4f35c2
[common] Add max_t support for KV in THD (#1370)
cyanguwa Dec 17, 2024
7f5c784
[JAX] Fused attention unit tests fixes and refinements (#1352)
zlsh80826 Dec 17, 2024
83dac8c
[PyTorch] Add weights_only=False for torch.load (#1374)
cyanguwa Dec 18, 2024
f033498
[PyTorch] Fix get_swa_mask() for padding masks (#1281)
cyanguwa Dec 18, 2024
a3b32ec
[JAX] Move parallel encoder tests to L0 distributed test set. (#1356)
phu0ngng Dec 18, 2024
838345e
[common/PyTorch] Add cuDNN SWA (left, 0) + padding + bottom right cau…
cyanguwa Dec 20, 2024
c9ea6be
Update copyright to include 2025 (#1388)
ksivaman Jan 2, 2025
cd626b8
Merge branch 'main' into paged_attention
cyanguwa Jan 6, 2025
7c23b96
update license for test_paged_attn
cyanguwa Jan 6, 2025
2dbf2e1
update kv_cache_manager license
cyanguwa Jan 6, 2025
d2f1549
fix build issue from previous merge
cyanguwa Jan 7, 2025
b898cbe
[JAX] Add THD + SWA unit tests (#1390)
zlsh80826 Jan 8, 2025
61cf102
bug fix for using `return_layernorm_output=True` (#1382)
LiyuanLucasLiu Jan 8, 2025
a4cb1d1
[JAX] Correct fused attention output after each step of ring attentio…
mgoldfarb-nvidia Jan 8, 2025
560bccf
clean CP implementation for flash attention and cuDNN 9.6 (#1387)
xrennvidia Jan 8, 2025
7b861e7
Take token count quantization of fused attention into consideration f…
xrennvidia Jan 10, 2025
a65ad37
[JAX] Test_multiprocessing_encoder with process spawn in bash (#1394)
phu0ngng Jan 11, 2025
cbc4653
Fix "refractor" typo in the PR template (#1402)
kit1980 Jan 13, 2025
2402406
[PyTorch] Adding TP overlap support for `te.Linear` with `parallel_mo…
denera Jan 13, 2025
3d63cbb
Make it an option to compile activation functions with fast math (#1410)
guyueh1 Jan 15, 2025
81a07e0
Merge branch 'main' into paged_attention
cyanguwa Jan 15, 2025
c2937c5
[PyTorch] `te.Linear` FP8 DGRAD+RS output bugfix (#1412)
denera Jan 16, 2025
6e84892
[JAX] Consolidate the distributed fused attention test code (#1405)
mgoldfarb-nvidia Jan 17, 2025
7aa8118
[PyTorch] Fix AttentionParams comparison logic (#1397)
cyanguwa Jan 21, 2025
3d7ff1c
[PyTorch] Avoid `parameters` function in op backward pass (#1403)
timmoon10 Jan 22, 2025
c2c3d54
[JAX] Support segment_ids/pos as FA inputs (#1406)
zlsh80826 Jan 24, 2025
2fce82b
[MoE][PyTorch] Add mask-based MoE permutation (#1373)
hxbai Jan 27, 2025
199e612
Use log1p(x) instead of log(1+x) (#1401)
kit1980 Jan 28, 2025
76282cf
Merge branch 'main' into paged_attention
cyanguwa Jan 28, 2025
96534aa
Update neox to completed (#1439)
Quentin-Anthony Jan 30, 2025
366fa65
Merge branch 'NVIDIA:main' into paged_attention
cyanguwa Jan 30, 2025
e536954
Support `store_param_remainders` feature from Apex in TE Fused Adam (…
sanandaraj5597 Jan 31, 2025
544dd14
Update main branch with TE 2.0 code, update version to 2.1.0.dev0
ptrendx Feb 7, 2025
9f31f09
Merge branch 'main' into paged_attention
cyanguwa Feb 8, 2025
8dc06e0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 8, 2025
59dcf48
WIP: minor fix/preparation for inference/cuda graph
cyanguwa Feb 10, 2025
b87e539
[JAX] Flax module init with a given dtype (#1472)
phu0ngng Feb 11, 2025
09448a9
WIP: non-paged
cyanguwa Feb 12, 2025
49a4535
Add NVTX ranges to categorize execution (#1447)
minitu Feb 12, 2025
612637c
WIP: non-paged, bshd/sbhd
cyanguwa Feb 12, 2025
ee4a17d
Update documentation for 2.0 release (#1479)
ptrendx Feb 12, 2025
f9bd83c
WIP: non-paged, thd, no CG
cyanguwa Feb 12, 2025
f0d22ca
Fix a bug for D being nullptr in grouped gemm (#1475)
yaox12 Feb 13, 2025
54ae0c7
WIP: non-paged, thd, CG
cyanguwa Feb 14, 2025
24e4f95
[JAX] Flax params initialization with weight_dtype (#1481)
phu0ngng Feb 14, 2025
e19b828
[JAX] Fixes for CI failures with the latest JAX (#1469)
phu0ngng Feb 14, 2025
654c929
WIP: non-paged, CG
cyanguwa Feb 14, 2025
737f45a
WIP: non-paged, using paged kernel
cyanguwa Feb 14, 2025
ac015ef
WIP: restructure kernels
cyanguwa Feb 14, 2025
45e9d8b
[JAX] Lint Fix (#1484)
phu0ngng Feb 14, 2025
dfbf4dd
[JAX] Fix issues when mask/sequence_descriptor is None (#1477)
zlsh80826 Feb 14, 2025
af7b2b4
[JAX] Expose THD format to the flax module (#1480)
zlsh80826 Feb 14, 2025
b39397c
Changed VERSION to 2.2.0.dev0
ptrendx Feb 15, 2025
e52868b
WIP: paged, CG
cyanguwa Feb 15, 2025
ba5e333
WIP: padding + BRCM
cyanguwa Feb 15, 2025
bcef6b3
WIP: restructure IP, clean up
cyanguwa Feb 15, 2025
f3975f0
WIP: fix non-CG, fused
cyanguwa Feb 15, 2025
125548c
WIP: fix last commit
cyanguwa Feb 15, 2025
9bf3204
WIP: unfused, non-CG
cyanguwa Feb 15, 2025
3060892
WIP: flash-attn, non-CG
cyanguwa Feb 16, 2025
eb9857d
[MoE][PyTorch] Add prob permutation to mask-based MoE permutation; Fi…
hxbai Feb 18, 2025
6673f16
[JAX] Flax with compute dtype inferred from input dtype. (#1485)
phu0ngng Feb 18, 2025
978f1d7
Fix issues for MCore DDP. (#1474)
Victarry Feb 19, 2025
11f15fd
WIP: flash_attn_with_kvcache
cyanguwa Feb 19, 2025
56c0c07
[PyTorch] Fix typo (#1495)
timmoon10 Feb 19, 2025
fceff07
[PyTorch] Fix fuse_wgrad_accumulation for GroupedLinear (#1488)
yaox12 Feb 19, 2025
33b430f
commit two files missed by bcef6b34
cyanguwa Feb 19, 2025
b612cde
Fix TE ops API compatibility with PyTorch versions < 2.4.3 (#1494)
ksivaman Feb 20, 2025
257345a
[PyTorch] Fix CP implementation with FP8 (#1483)
xrennvidia Feb 20, 2025
1c31b68
WIP: thd_bshd_bshd
cyanguwa Feb 21, 2025
7331a4c
WIP: fix last commit
cyanguwa Feb 22, 2025
0341de7
WIP: fix 1c31b68d
cyanguwa Feb 22, 2025
6bd61a7
WIP: add bshd_2sbhd, sbhd_2bshd
cyanguwa Feb 22, 2025
2d30bb1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 22, 2025
b4fbc2b
[PyTorch] Use same API in optimizer `zero_grad` as PyTorch optimizers…
timmoon10 Feb 22, 2025
7f2dcf9
[Pytorch] Decoupling framework extensions from common module (#1498)
KshitijLakhani Feb 22, 2025
a391a49
Merge branch 'main' into paged_attention
cyanguwa Feb 22, 2025
9ec3649
WIP: some cleanup
cyanguwa Feb 23, 2025
93235dd
WIP: all qkv_format combinations and merge CM files
cyanguwa Feb 23, 2025
b476244
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 23, 2025
3cb001d
WIP: some lint fixes
cyanguwa Feb 24, 2025
583b76f
WIP: add docstring for IP
cyanguwa Feb 24, 2025
d668f18
[Pytorch] Added missing assert_dim_for_fp8_exec for Linear
pggPL Feb 24, 2025
229dd04
[PyTorch] Run all Python tests, even if one of them fails
pggPL Feb 24, 2025
f13b861
fix sequences_pre
cyanguwa Feb 25, 2025
62cffc8
Merge branch 'main' into paged_attention
cyanguwa Feb 25, 2025
a06d72c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 25, 2025
8744188
Minor fixes for attention (#1504)
cyanguwa Feb 25, 2025
9351a17
Fix a crash in NeMo 2.0 during module._apply(lambda t: t.cpu()) (#1502)
guyueh1 Feb 25, 2025
94c9291
Adding remove_caches API to Float8Tensor class (#1425)
youngeunkwon0405 Feb 25, 2025
8ca2caf
Parallel Cross Entropy using online softmax (#1456)
sanandaraj5597 Feb 26, 2025
5d85857
Added memory alignment check to cast_fp8_1D (#1507)
Oleg-Goncharov Feb 26, 2025
2834e4a
[PyTorch] Skip context parallelism tests if not enough GPUs (#1508)
timmoon10 Feb 26, 2025
9b33071
WIP: minor fixes for multi-layer
cyanguwa Feb 27, 2025
e3de9bc
WIP: initial multi-layer test
cyanguwa Feb 27, 2025
0c72a61
WIP: minor clean up
cyanguwa Feb 27, 2025
203fbb4
Merge branch 'NVIDIA:main' into paged_attention
cyanguwa Feb 27, 2025
e06f4a1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 27, 2025
fb77772
WIP: clean up
cyanguwa Feb 27, 2025
eb28c65
Update build test CUDA version to 12.1 (#1517)
timmoon10 Feb 27, 2025
9654931
Support vectorized local reduction for p2p-based ReduceScatter overla…
erhoo82 Feb 28, 2025
97344d6
TP-RS local reduction: fix lint err (#1520)
erhoo82 Feb 28, 2025
9588109
Fix shape of new quantized tensor in `make_like` (#1515)
ksivaman Feb 28, 2025
303c6d1
Enforce PyTorch version 2.1 and run attention tests with torch.compil…
ksivaman Feb 28, 2025
d3efaeb
Delete extra tensor objects after restoring float8 tensors (#1500)
sudhakarsingh27 Feb 28, 2025
4b523d2
[PyTorch] Set flags in norm modules for Mcore sequence-parallel suppo…
timmoon10 Mar 1, 2025
b823fe1
Merge branch 'main' into paged_attention
cyanguwa Mar 1, 2025
d04d805
WIP: switch to flash_attn_varlen_func
cyanguwa Mar 1, 2025
0cbe998
WIP: fix unfused for separate q/kv format
cyanguwa Mar 1, 2025
6091196
WIP: fix fused for separate q/kv formats
cyanguwa Mar 1, 2025
8585bc3
WIP: flash attn + TELayer + 2 layers
cyanguwa Mar 1, 2025
e05ba53
WIP: unfused + TL + 2layers
cyanguwa Mar 1, 2025
490e57a
WIP: all modules/backend
cyanguwa Mar 1, 2025
339bfa9
WIP: minor cleanup
cyanguwa Mar 1, 2025
8d13f12
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 1, 2025
6bd1142
WIP: FlashAttention on Hopper with 2.7.3
cyanguwa Mar 1, 2025
ee006c4
WIP: FlashAttention + v3 from 39e7179
cyanguwa Mar 2, 2025
bd93082
WIP: FlashAttention + v3 + FP8 + WIP
cyanguwa Mar 2, 2025
2428793
WIP: add backend support table
cyanguwa Mar 2, 2025
91fa902
WIP: clean up
cyanguwa Mar 2, 2025
eeb0dc7
WIP: separate use_flash_attention_2 and _3
cyanguwa Mar 3, 2025
38110a7
WIP: tweaks to paged attn script
cyanguwa Mar 3, 2025
07021c2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2025
c5d6a06
[JAX] THD ring attention (#1454)
zlsh80826 Mar 3, 2025
a3fdb28
WIP: enable/disable certain cases for fused attn
cyanguwa Mar 3, 2025
fc5a7e9
WIP: small fixes for lint and cg
cyanguwa Mar 3, 2025
e1898e1
WIP: minor fixes for attn/infer
cyanguwa Mar 3, 2025
96d7d79
WIP: fix CP
cyanguwa Mar 3, 2025
90dcc68
WIP: readd page info to FADescriptor_v1
cyanguwa Mar 3, 2025
53fe2c9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2025
4a45656
Merge branch 'main' into paged_attention
cyanguwa Mar 3, 2025
f250dce
minor tweak to test_numerics.py
cyanguwa Mar 3, 2025
e5c0e40
fix 9.5/9.7 sq/skv + mask logic
cyanguwa Mar 3, 2025
fc1b91c
Launch GEMM on compute_stream which has low priority. (#1522)
vasunvidia Mar 3, 2025
bca1f58
clean up
cyanguwa Mar 3, 2025
3618785
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2025
bc4c452
[common] Removed tensor boundary checks in MXFP8 kernels (#1519)
Oleg-Goncharov Mar 3, 2025
90d5d45
Add sanity test for lightning-thunder integration (#1531)
ksivaman Mar 4, 2025
cbb96f2
Export only necessary symbols from libtransformer_engine.so (#1511)
KshitijLakhani Mar 4, 2025
e85d180
Update list of CI users (#1535)
timmoon10 Mar 4, 2025
1643584
Merge branch 'main' into paged_attention
cyanguwa Mar 5, 2025
f8eddcf
Add support for UB MNNVL (#1470)
nvcastet Mar 5, 2025
45553c4
minor fix for FA3
cyanguwa Mar 5, 2025
547d8dd
Don't touch nor send messages to the root logger. (#1380)
sagostinho-nvidia Mar 5, 2025
a3e6ed8
Fix installation from PyPI wheels after a source install (#1526)
ksivaman Mar 5, 2025
6ff7b70
[PyTorch] Move Lightning-Thunder integration test to L1 (#1536)
timmoon10 Mar 5, 2025
bd278ff
[PyTorch] Enable MXFP8 LayerNorm and RMSNorm (#1487)
timmoon10 Mar 6, 2025
74983b3
Fix UB with MPI init (#1538)
nvcastet Mar 6, 2025
e1c4f51
make sure dout is contiguous (#1539)
xrennvidia Mar 6, 2025
9710013
[PyTorch] Fix issue when last input in GroupedLinear is empty.
pggPL Mar 6, 2025
3db46d6
more minor fixes for FA3
cyanguwa Mar 6, 2025
831866a
test page_size=1 for FA3
cyanguwa Mar 6, 2025
63241ad
fix t3hd/th3d strides
cyanguwa Mar 6, 2025
13bd745
Remove cudaStreamSync. call from transformer_engine.cpp (#1518)
vasunvidia Mar 6, 2025
de06a34
Add NVTX ranges to FP8 amax AR and grad output preprocessing (#1530)
minitu Mar 6, 2025
52ae4f5
Merge branch 'main' into paged_attention
cyanguwa Mar 7, 2025
a37058a
fix ckpt recompute and fa3 k_scale
cyanguwa Mar 7, 2025
09c2f39
raise dynamo recompile limit for test
cyanguwa Mar 7, 2025
5e2f2a9
remove thunder test from L0
cyanguwa Mar 7, 2025
d4d82dd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 7, 2025
2d9a882
fix FA selection logic
cyanguwa Mar 7, 2025
bb5613c
fix FA3 q_descale shape
cyanguwa Mar 7, 2025
48b8eea
[PyTorch] Don't set FP8 data to `None` when saving base tensors (#1548)
ksivaman Mar 7, 2025
44c8fd0
Add user to TE CI (#1547)
ksivaman Mar 7, 2025
2ad5da9
[PyTorch] Fix incorrect docstrings in tensor saving functions (#1549)
timmoon10 Mar 7, 2025
2a95efd
CP implementation refinement for BSHD/SBHD format (#1523)
xrennvidia Mar 7, 2025
77fa1e5
[PyTorch] Enabling Per-Tensor Current Scaling Recipe (#1471)
zhongbozhu Mar 8, 2025
5bb771e
Verified TE2.0 with offloading (#1514)
sanandaraj5597 Mar 9, 2025
b3e7035
Use internal quantizer for input to the modules (#1551)
ptrendx Mar 10, 2025
f090551
[PyTorch] Remove Megatron-LM convergence test (#1521)
timmoon10 Mar 10, 2025
314ab9a
Disable parallelism in core build test (#1550)
timmoon10 Mar 10, 2025
f3a009d
Revert "Use internal quantizer for input to the modules" (#1555)
ptrendx Mar 10, 2025
ab4fd3c
Remove xla_ignore_channel_id check and ignore Scan loop warning in un…
zlsh80826 Mar 12, 2025
8487e50
[PyTorch] Fix fused attention backward's FP8 dtypes (#1566)
cyanguwa Mar 12, 2025
31f32b3
Explicitly use `python3` and `pip3` executables (#1486)
timmoon10 Mar 13, 2025
0e13788
[JAX] FFI API compatibility with both 0.4 and 0.5 (#1562)
zlsh80826 Mar 13, 2025
8a20d66
Support tensors with only column-wise data (#1505)
timmoon10 Mar 13, 2025
09ffb5d
[PyTorch] Support Bgrad Cast FP8 Fusion for FP8 Current Scaling Recip…
zhongbozhu Mar 13, 2025
0664220
Merge branch 'main' into paged_attention
cyanguwa Mar 13, 2025
6239735
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 13, 2025
b9ffe65
remove page_table from IP.step() returns
cyanguwa Mar 14, 2025
430dd56
fix FP8 FlashAttn DPA fp8_dpa tests
cyanguwa Mar 14, 2025
10e50f5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 14, 2025
7a8c0c5
fix CP
cyanguwa Mar 14, 2025
a3bc1b4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 14, 2025
0fd197f
minor tweaks
cyanguwa Mar 14, 2025
e284346
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 14, 2025
7a9f357
update FA3 note and L3 test
cyanguwa Mar 14, 2025
2495c80
fix lint
cyanguwa Mar 14, 2025
28d9983
remove redundant import in test
cyanguwa Mar 14, 2025
dc40f9f
Update FE to 1.11 (#1580)
cyanguwa Mar 14, 2025
674535d
Merge branch 'main' into paged_attention
cyanguwa Mar 14, 2025
12c3e32
Fix import error on CPU only devices (#1578)
hxbai Mar 14, 2025
de48ef6
Merge branch 'main' into paged_attention
cyanguwa Mar 14, 2025
c257bf3
Blackwell devel commoverlap mlperftests (#1529)
vasunvidia Mar 14, 2025
496776b
adopt new FA3 APIs from FA2.7.3+/hopper for CP and non-CP
cyanguwa Mar 14, 2025
7f1c765
fix lint
cyanguwa Mar 14, 2025
de5a2f6
Merge branch 'main' into paged_attention
cyanguwa Mar 14, 2025
0cf5c0d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 14, 2025
5578b69
relax tols for TransformerLayers
cyanguwa Mar 14, 2025
3733947
Refactoring attention.py part 1 (#1542)
KshitijLakhani Mar 14, 2025
6a26e0e
Merge branch 'main' into paged_attention
cyanguwa Mar 15, 2025
2b1b72f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 15, 2025
a7eeb28
[PyTorch] Support TP Overlap in Per-Tensor Current Scaling Recipe (#1…
BestJuly Mar 15, 2025
a6c8455
fix merge
cyanguwa Mar 15, 2025
b598cb9
fix merge 2
cyanguwa Mar 15, 2025
5e45442
fix FA import comments
cyanguwa Mar 15, 2025
d770116
relax tols for Ampere
cyanguwa Mar 15, 2025
0025478
fix fa3 version and reduce messaging
cyanguwa Mar 15, 2025
bec87e7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 15, 2025
5475163
Merge branch 'main' into paged_attention
cyanguwa Mar 15, 2025
4a74ef8
Add issue template (#1584)
ksivaman Mar 17, 2025
7ddc593
Better cuBLAS handle management (#1389)
ptrendx Mar 17, 2025
cb2d56e
update FA3 to its latest commit on main
cyanguwa Mar 17, 2025
6a85596
Distopt with offload (#1573)
sanandaraj5597 Mar 17, 2025
c571c2f
[QA] Add error handling (#1570)
linxiddd Mar 17, 2025
80374da
Merge branch 'main' into paged_attention
cyanguwa Mar 17, 2025
d35d00c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 17, 2025
5da6e91
add default values to IP and assertion to graph.py
cyanguwa Mar 17, 2025
666f771
add more comments in attention
cyanguwa Mar 17, 2025
22f79f8
use custom_cache_manager instead of cache_manager
cyanguwa Mar 17, 2025
cfd30cf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 17, 2025
47 changes: 47 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,47 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: bug
assignees: ''

---

**Describe the bug**

A clear and concise description of what the bug is.

**Steps/Code to reproduce bug**

Please list *minimal* steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.


**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment overview (please complete the following information)**

- Environment location: [Bare-metal, Docker, Cloud (specify cloud provider: AWS, Azure, GCP, Colab)]
- Method of Transformer Engine install: [pip install or from source]. Please specify exact commands you used to install.
- If method of install is [Docker], provide `docker pull` & `docker run` commands used

**Environment details**

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
- Transformer Engine version
- CUDA version
- CUDNN version

**Device details**
- GPU model

**Additional context**

Add any other context about the problem here.
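The "Environment details" and "Device details" fields requested by the bug report template above could be gathered with a short script. The following is a minimal sketch, not part of this PR: the helper names `package_version` and `collect_environment` are hypothetical, the distribution name `transformer_engine` is an assumption about how the package was installed, and the PyTorch-dependent fields are only filled in when `torch` can be imported.

```python
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_environment() -> dict:
    """Collect the fields listed in the bug report template (hypothetical helper)."""
    info = {
        "OS version": platform.platform(),
        "Python version": sys.version.split()[0],
        "PyTorch version": package_version("torch"),
        # Assumed distribution name; adjust if the package was installed under another name.
        "Transformer Engine version": package_version("transformer_engine"),
        "CUDA version": "unknown",
        "CUDNN version": "unknown",
        "GPU model": "unknown",
    }
    try:
        import torch  # optional: only needed for the CUDA, cuDNN, and GPU fields
    except (ImportError, OSError):
        return info
    info["CUDA version"] = str(torch.version.cuda)
    info["CUDNN version"] = str(torch.backends.cudnn.version())
    if torch.cuda.is_available():
        info["GPU model"] = torch.cuda.get_device_name(0)
    else:
        info["GPU model"] = "no GPU visible"
    return info


if __name__ == "__main__":
    # Print one "- field: value" line per entry, ready to paste into the issue.
    for field, value in collect_environment().items():
        print(f"- {field}: {value}")
```

Running the script prints a bullet list that can be pasted under the template's headings; fields that cannot be determined are reported as "unknown" or "not installed" rather than guessed.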
25 changes: 25 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,25 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: feature request
assignees: ''

---

**Is your feature request related to a problem? Please describe.**

A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**

A clear and concise description of what you want to happen.
Provide a code snippet on how new APIs/changes would be used by others.

**Describe alternatives you've considered**

A clear and concise description of any alternative solutions or features you've considered.

**Additional context**

Add any other context or screenshots about the feature request here.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -11,7 +11,7 @@ Fixes # (issue)
 - [ ] New feature (non-breaking change which adds functionality)
 - [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
 - [ ] Infra/Build change
-- [ ] Code refractor
+- [ ] Code refactoring

 ## Changes

2 changes: 1 addition & 1 deletion .github/workflows/blossom-ci.yml
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.

28 changes: 5 additions & 23 deletions .github/workflows/build.yml
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.

@@ -12,7 +12,7 @@ jobs:
 name: 'Core'
 runs-on: ubuntu-latest
 container:
-image: nvcr.io/nvidia/cuda:12.0.0-devel-ubuntu22.04
+image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
 options: --user root
 steps:
 - name: 'Dependencies'
@@ -28,14 +28,15 @@ jobs:
 run: pip install . -v
 env:
 NVTE_FRAMEWORK: none
+MAX_JOBS: 1
 - name: 'Sanity check'
 run: python3 -c "import transformer_engine"
 working-directory: /
 pytorch:
 name: 'PyTorch'
 runs-on: ubuntu-latest
 container:
-image: nvcr.io/nvidia/cuda:12.5.0-devel-ubuntu22.04
+image: nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04
 options: --user root
 steps:
 - name: 'Dependencies'
@@ -70,25 +71,6 @@ jobs:
 run: pip install . -v
 env:
 NVTE_FRAMEWORK: jax
+MAX_JOBS: 1
 - name: 'Sanity check'
 run: python tests/jax/test_sanity_import.py
-paddle:
-name: 'PaddlePaddle'
-runs-on: ubuntu-latest
-container:
-image: nvcr.io/nvidia/paddlepaddle:24.10-py3
-options: --user root
-steps:
-- name: 'Checkout'
-uses: actions/checkout@v3
-with:
-submodules: recursive
-- name: 'Build'
-run: |
-apt-get update
-apt-get install -y libgoogle-glog-dev
-pip install . -v
-env:
-NVTE_FRAMEWORK: paddle
-- name: 'Sanity check'
-run: python tests/paddle/test_sanity_import.py
5 changes: 3 additions & 2 deletions .github/workflows/deploy_nightly_docs.yml
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.

@@ -16,13 +16,14 @@ jobs:
 runs-on: ubuntu-latest
 steps:
 - name: Download artifact
-uses: actions/download-artifact@v4.1.7
+uses: actions/download-artifact@v4
 with:
 name: "te_docs"
 path: "html"
 - name: Prepare for pages
 uses: actions/upload-pages-artifact@v1.0.7
 with:
+name: github-pages
 path: "html"
 deploy:
 needs: prepare
4 changes: 2 additions & 2 deletions .github/workflows/docs.yml
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.

@@ -27,7 +27,7 @@ jobs:
 cd docs
 make html
 - name: 'Upload docs'
-uses: actions/upload-artifact@v3
+uses: actions/upload-artifact@v4
 with:
 name: te_docs
 path: docs/_build/html
2 changes: 1 addition & 1 deletion .github/workflows/license.yml
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.

29 changes: 1 addition & 28 deletions .github/workflows/lint.yml
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.

@@ -61,30 +61,3 @@ jobs:
 export PYTHON_ONLY=1
 export TE_PATH=.
 bash ./qa/L0_jax_lint/test.sh
-paddle_cpplint:
-name: 'PaddlePaddle C++'
-runs-on: ubuntu-latest
-steps:
-- name: Checkout
-uses: actions/checkout@v3
-- name: 'Lint'
-run: |
-sudo apt-get update
-sudo apt-get install pip -y
-export CPP_ONLY=1
-export TE_PATH=.
-bash ./qa/L0_paddle_lint/test.sh
-paddle_pylint:
-name: 'PaddlePaddle Python'
-runs-on: ubuntu-latest
-steps:
-- name: 'Checkout'
-uses: actions/checkout@v3
-- name: 'Lint'
-run: |
-sudo apt-get update
-sudo apt-get install pip -y
-pip install paddlepaddle-gpu
-export PYTHON_ONLY=1
-export TE_PATH=.
-bash ./qa/L0_paddle_lint/test.sh
7 changes: 6 additions & 1 deletion .github/workflows/trigger-ci.yml
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.

@@ -40,6 +40,11 @@ jobs:
 || github.actor == 'vasunvidia'
 || github.actor == 'erhoo82'
 || github.actor == 'kocchop'
+|| github.actor == 'youngeunkwon0405'
+|| github.actor == 'KshitijLakhani'
+|| github.actor == 'jberchtold-nvidia'
+|| github.actor == 'sanandaraj5597'
+|| github.actor == 'negvet'
 )
 steps:
 - name: Check if comment is issued by authorized person
2 changes: 1 addition & 1 deletion .github/workflows/upload-ci-logs.yml
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.

2 changes: 1 addition & 1 deletion .gitignore
@@ -8,7 +8,6 @@
 *.nsys-rep
 *.ncu-rep
 *.sqlite
-*.onnx
 *.eggs
 build/
 *.so
@@ -39,3 +38,4 @@ downloads/
 .pytest_cache/
 compile_commands.json
 .nfs
+tensor_dumps/
2 changes: 1 addition & 1 deletion 3rdparty/cudnn-frontend
2 changes: 1 addition & 1 deletion CONTRIBUTING.rst
@@ -1,5 +1,5 @@
 ..
-Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

 See LICENSE for license information.

2 changes: 1 addition & 1 deletion CPPLINT.cfg
@@ -1,4 +1,4 @@
-# Copyright (c) 2022-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.
