Multi-Node Sparse Training Error  #2

@gaow0007

Thanks for releasing Ok-Topk. It is interesting work, and I am developing some features based on this repo.
Single-node training works for me. However, when I run Ok-Topk across 2 nodes with a total of 8 GPUs, I find that some values in all_indexes are negative.
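
A check along the following lines, run on each rank right after the indexes are gathered, could help pin down where the invalid entries appear. This is only a sketch under assumptions: `find_invalid_indexes` and `num_elements` are hypothetical names that are not part of the repo, and it assumes all_indexes holds integer positions into a flattened gradient of length `num_elements` (a tensor or array can be converted with `.tolist()` first).

```python
def find_invalid_indexes(all_indexes, num_elements):
    """Return (position, value) pairs for entries outside [0, num_elements).

    Hypothetical helper: `all_indexes` is any iterable of integers and
    `num_elements` is the assumed length of the flattened gradient.
    """
    return [
        (pos, int(idx))
        for pos, idx in enumerate(all_indexes)
        if idx < 0 or idx >= num_elements
    ]


# Toy data standing in for one rank's gathered indexes (not real Ok-Topk output).
sample = [3, 17, -1, 42, 100]
bad = find_invalid_indexes(sample, num_elements=100)
print(bad)  # [(2, -1), (4, 100)]
```

Printing the rank together with the reported positions would show whether the negative values are confined to data received from the other node or already present before communication.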

Could you suggest how I might debug this?

Thanks.
