This repository was archived by the owner on Mar 3, 2026. It is now read-only.

different SyncTensorsGraph for different training steps #360

@jialei777

Description


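Log here. The profile lists three distinct SyncTensorsGraph modules (13208, 13214, and 13226) across the profiled training steps: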
https://github.com/AI-Hypercomputer/torchprime/actions/runs/16889868996/job/47847279717#step:7:1103

[2025-08-11 19:36:26,339][torchprime.torch_xla_models.trainer.base_trainer][INFO] - Finished training run
[2025-08-11 19:36:28,814][root][INFO] - Found newest profile: /tmp/gcs-mount/tp-run/16889868996-1/ds-v3-shallow-7c6cczci/profile/0-0/plugins/profile/2025_08_11_19_34_49/gke-tpu-7059ce52-gxns.xplane.pb
[2025-08-11 19:36:28,814][torchprime.metrics.step_duration][INFO] - Loading /tmp/gcs-mount/tp-run/16889868996-1/ds-v3-shallow-7c6cczci/profile/0-0/plugins/profile/2025_08_11_19_34_49/gke-tpu-7059ce52-gxns.xplane.pb
Plane ID: 2, Name: /device:TPU:0
  Line ID: 2, Name: XLA Modules
    Event Metadata Name: SyncTensorsGraph.13208(16646631473018641282), ID: 5703, Offset: 0.260 s, Duration: 0.288 s
    Event Metadata Name: SyncTensorsGraph.13214(15300851610953471503), ID: 11532, Offset: 71.466 s, Duration: 0.289 s
    Event Metadata Name: SyncTensorsGraph.13214(15300851610953471503), ID: 11532, Offset: 71.755 s, Duration: 0.290 s
    Event Metadata Name: SyncTensorsGraph.13226(1256450336661157904), ID: 17517, Offset: 142.631 s, Duration: 0.315 s
    Event Metadata Name: SyncTensorsGraph.13214(15300851610953471503), ID: 11532, Offset: 142.946 s, Duration: 0.290 s
    Event Metadata Name: SyncTensorsGraph.13226(1256450336661157904), ID: 17517, Offset: 143.236 s, Duration: 0.291 s
    Event Metadata Name: SyncTensorsGraph.13214(15300851610953471503), ID: 11532, Offset: 143.528 s, Duration: 0.290 s
Error executing job with overrides: ['model=deepseek-v3-shallow', 'dataset=wikitext', 'dataset.block_size=512', 'task=train', 'task.lr_scheduler.type=constant', 'task.global_batch_size=4', 'task.max_steps=15', 'ici_mesh.fsdp=4', 'profile_start_step=3', 'profile_dir=/tmp/gcs-mount/tp-run/16889868996-1/ds-v3-shallow-7c6cczci/profile/0-0', 'output_dir=/tmp/gcs-mount/tp-run/16889868996-1/ds-v3-shallow-7c6cczci/outputs/0-0']
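
To make the pattern in the log above easier to see, here is a minimal sketch that groups the profiled `XLA Modules` events by graph name. The helper `summarize_modules` and the regular expression are hypothetical and written only for this illustration; they are not part of torchprime, and they assume the event lines keep exactly the format shown in the log.

```python
import re
from collections import OrderedDict

# Matches lines such as:
#   Event Metadata Name: SyncTensorsGraph.13208(1664...), ID: 5703, Offset: 0.260 s, Duration: 0.288 s
# The pattern is an assumption based on the log excerpt in this issue.
EVENT_RE = re.compile(
    r"Event Metadata Name: (?P<name>SyncTensorsGraph\.\d+)\((?P<hash>\d+)\), "
    r"ID: (?P<id>\d+), Offset: (?P<offset>[\d.]+) s, Duration: (?P<duration>[\d.]+) s"
)


def summarize_modules(log_text):
    """Group profiled module events by graph name, in first-seen order."""
    summary = OrderedDict()
    for match in EVENT_RE.finditer(log_text):
        entry = summary.setdefault(
            match["name"], {"hash": match["hash"], "offsets": []}
        )
        entry["offsets"].append(float(match["offset"]))
    return summary


# Event lines copied from the log in this issue.
LOG = """
    Event Metadata Name: SyncTensorsGraph.13208(16646631473018641282), ID: 5703, Offset: 0.260 s, Duration: 0.288 s
    Event Metadata Name: SyncTensorsGraph.13214(15300851610953471503), ID: 11532, Offset: 71.466 s, Duration: 0.289 s
    Event Metadata Name: SyncTensorsGraph.13214(15300851610953471503), ID: 11532, Offset: 71.755 s, Duration: 0.290 s
    Event Metadata Name: SyncTensorsGraph.13226(1256450336661157904), ID: 17517, Offset: 142.631 s, Duration: 0.315 s
    Event Metadata Name: SyncTensorsGraph.13214(15300851610953471503), ID: 11532, Offset: 142.946 s, Duration: 0.290 s
    Event Metadata Name: SyncTensorsGraph.13226(1256450336661157904), ID: 17517, Offset: 143.236 s, Duration: 0.291 s
    Event Metadata Name: SyncTensorsGraph.13214(15300851610953471503), ID: 11532, Offset: 143.528 s, Duration: 0.290 s
"""

summary = summarize_modules(LOG)
for name, info in summary.items():
    # One row per distinct graph: name, hash, event count, event offsets.
    print(name, info["hash"], len(info["offsets"]), info["offsets"])
```

On this excerpt the grouping yields three distinct graph names, with `SyncTensorsGraph.13214` and `SyncTensorsGraph.13226` alternating near the end. That may point to recompilation or to two graphs being traced in turn, but the log alone does not confirm the cause.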
