Commit 23b4344

add flight recorder tutorial
Summary: Add a tutorial for debugging single-rank hangs in distributed PyTorch jobs using the TorchComms Flight Recorder, covering both aggregated text dump analysis and per-rank pickle-based CLI detection workflows.
1 parent 05c08f2 commit 23b4344

3 files changed: 467 additions and 0 deletions

distributed.rst

Lines changed: 1 addition & 0 deletions
@@ -211,6 +211,7 @@ Custom Extensions
    intermediate/rpc_param_server_tutorial
    intermediate/rpc_async_execution
    intermediate/monarch_distributed_tutorial
+   intermediate/debug_hangs_with_flight_recorder
    advanced/rpc_ddp_tutorial
    advanced/generic_join
    beginner/distributed_training_with_ray_tutorial

index.rst

Lines changed: 7 additions & 0 deletions
@@ -723,6 +723,13 @@ Welcome to PyTorch Tutorials
    :link: intermediate/monarch_distributed_tutorial.html
    :tags: Parallel-and-Distributed-Training
 
+.. customcarditem::
+   :header: Debugging Hangs with Flight Recorder Using TorchComms and Debug Server
+   :card_description: Diagnose hangs using the TorchComms Flight Recorder and Debug Server periodic dumps.
+   :image: _static/img/thumbnails/cropped/generic-pytorch-logo.png
+   :link: intermediate/debug_hangs_with_flight_recorder.html
+   :tags: Parallel-and-Distributed-Training,Debugging
+
 .. Edge
 
 .. customcarditem::