[BUG] DeepCompile: Training hang on random-sized inputs #7611

@eternalNight

Description

Real-life training data may not be the same size on every rank or at every iteration. When DeepCompile is active, training with variable-length data can hang because DeepCompile requires communication among the ranks during profiling, but:

  1. The compute graph may not be exactly the same across ranks (e.g., some ranks apply padding while others don't).
  2. Guard failures (due to tensor size changes) may occur at different iterations on different ranks.
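The second point can be illustrated with a small single-process simulation of size-based guards (the helper name and the per-rank sequence lengths below are hypothetical, chosen only for illustration): whenever an input's size differs from the size the current graph was specialized for, a recompile (and hence DeepCompile's profiling, with its collectives) is triggered.

```python
def guard_failure_iterations(seq_lens, compiled_size=None):
    """Simulate torch.compile-style size guards: record the iterations
    at which an incoming size differs from the size the current graph
    was specialized for, forcing a recompile."""
    failures = []
    for i, n in enumerate(seq_lens):
        if compiled_size != n:
            failures.append(i)
            compiled_size = n  # graph is now specialized for this size
    return failures

# Hypothetical per-rank sequence lengths across four iterations.
rank0_lens = [128, 128, 256, 256]
rank1_lens = [128, 256, 256, 128]

print(guard_failure_iterations(rank0_lens))  # rank 0 recompiles at iterations 0 and 2
print(guard_failure_iterations(rank1_lens))  # rank 1 recompiles at iterations 0, 1 and 3
```

Since recompiles land on different iterations per rank, any rank-synchronous communication done during profiling is issued by some ranks but not others at the same step, and the collective never completes.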

To Reproduce

  1. Download https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3
  2. Execute `deepspeed --num_gpus=N openvla-like.py -c -r`.

Expected behavior

DeepCompile works for variable-length training data.

Metadata

Labels: bug (Something isn't working), training
