Strange behavior on the size of knowledge_graph.entity_to_idx in distributed training #331

@Jean-KOUAGOU

When running this command:

`dicee --sparql_endpoint "https://dbpedia.data.dice-research.org/sparql" --trainer PL --model "DeCaL" --num_epochs 10 --batch_size 32 --p 1 --q 1 --r 1 --embedding_dim 16 --scoring_technique KvsAll --eval_model None --optim Adam --lr 0.01 --num_core 32 --backend polars --path_to_store_single_run "DBpedia-Embs" --save_embeddings_as_csv`

the size of `self.trainer.dataset.entity_to_idx` decreases when it is initialized a second time, which happens because 2 GPUs are used. This changes `target_dim` in KvsAll and leads to a size mismatch error in the loss function: the model outputs have the same size as the initial value of `len(self.trainer.dataset.entity_to_idx)`, while the targets take the new value of `len(self.trainer.dataset.entity_to_idx)` obtained in the second GPU process.

Is there a way to make sure that every initialization involving datasets is done only once, independently of the number of GPUs?
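
One possible direction, as a minimal sketch only: build the index once, write it to disk, and let every later initialization load that file instead of rebuilding it. The helper `load_or_build_entity_index`, the file name `entity_to_idx.json`, and the `flaky_build` function that simulates an endpoint returning fewer entities on the second query are all hypothetical and not part of dicee. I have not checked how dicee actually constructs `entity_to_idx`, so this only illustrates the idea.

```python
import json
import os
import tempfile
from typing import Callable, Dict


def load_or_build_entity_index(
    cache_path: str, build_fn: Callable[[], Dict[str, int]]
) -> Dict[str, int]:
    """Return the index stored at cache_path, building it only if the file is absent.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)

    index = build_fn()

    # Write to a temporary file in the same directory, then rename it, so that
    # a concurrent reader never sees a partially written file.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(index, f)
    os.replace(tmp_path, cache_path)
    return index


def demo():
    """Simulate two initializations whose raw builds would differ in size."""
    calls = []

    def flaky_build() -> Dict[str, int]:
        # Stand-in for a remote query: 5 entities on the first call, 3 afterwards.
        calls.append(1)
        n = 5 if len(calls) == 1 else 3
        return {f"e{i}": i for i in range(n)}

    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "entity_to_idx.json")
        first = load_or_build_entity_index(path, flaky_build)
        second = load_or_build_entity_index(path, flaky_build)

    # Both initializations see the same index; the build ran only once.
    return len(first), len(second), len(calls)
```

In the sketch the second initialization reuses the cached index, so both sizes agree. In a real multi-GPU run the processes start at nearly the same time, so this would presumably also need a rank check plus a barrier (only one process builds, the others wait and then load); that part is not shown here.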

Note: I was trying to fix the issue, but I am not sure I will have enough time. Debugging already took me 5h :)
