Commit 23b4344
committed
add flight recorder tutorial
Summary:
Add a tutorial for debugging single-rank hangs in distributed PyTorch jobs using the TorchComms Flight Recorder, covering both aggregated text dump analysis and per-rank pickle-based CLI detection workflows.1 parent 05c08f2 commit 23b4344
File tree
3 files changed
+467
-0
lines changed- intermediate_source
3 files changed
+467
-0
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
211 | 211 | | |
212 | 212 | | |
213 | 213 | | |
| 214 | + | |
214 | 215 | | |
215 | 216 | | |
216 | 217 | | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
723 | 723 | | |
724 | 724 | | |
725 | 725 | | |
| 726 | + | |
| 727 | + | |
| 728 | + | |
| 729 | + | |
| 730 | + | |
| 731 | + | |
| 732 | + | |
726 | 733 | | |
727 | 734 | | |
728 | 735 | | |
| |||
0 commit comments