upload_benchmark_results workflow hangs on get_dataset_details due to full dataset downloads #601

@R-Palazzo

Description

Environment Details

  • SDGym version: 0.14.3

Error Description

When running the benchmark upload script for the multi_table modality, the process silently hangs while building the Dataset_Details.xlsx file and never completes. On GitHub Actions this causes:
Error: Process completed with exit code 143.

The root cause lies in get_dataset_details: it calls explorer.summarize_datasets(modality=modality) on both the public and private buckets, which ultimately downloads the data ZIP for every dataset in each bucket (not just those in the current benchmark run) in order to compute the Total_Num_Rows column. Now that the rel-bench datasets have been added to the private bucket, the process hangs while downloading them.

Expected behavior

get_dataset_details should only fetch information for datasets that were part of the benchmark run. To do this:

  • Add a datasets parameter to DatasetExplorer._load_and_summarize_datasets(): a list of dataset names. When None (the default), all datasets in the bucket are summarized. When provided, only the datasets in the list are summarized.
  • Update get_dataset_details to call _load_and_summarize_datasets passing only the datasets present in the benchmark results.
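The None-vs-list behavior proposed for the new datasets parameter could be sketched as follows. This is a minimal illustration, not SDGym's actual implementation; filter_datasets and the dataset names are hypothetical stand-ins for the filtering step inside _load_and_summarize_datasets.

```python
def filter_datasets(bucket_datasets, datasets=None):
    """Return the dataset names that should be summarized.

    When ``datasets`` is None (the default), every name in the bucket
    is kept, preserving the current behavior. When a list is provided,
    only the requested names are kept, so nothing else gets downloaded.
    """
    if datasets is None:
        return list(bucket_datasets)

    requested = set(datasets)
    return [name for name in bucket_datasets if name in requested]
```

get_dataset_details would then build the datasets list from the names present in the benchmark results and pass it down, so only those data ZIPs are ever fetched.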

Additional Context

Total_Num_Rows is currently the only field in the dataset details table that requires downloading the data ZIP. This is out of scope for this issue, but we might consider storing this information elsewhere (e.g., in the dataset's metainfo.yaml) so we don't need to download the data to get it.
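As a rough illustration of that idea: if per-table row counts were recorded in the metainfo, Total_Num_Rows could be summed without touching the data ZIP. The structure below is hypothetical; SDGym's actual metainfo.yaml schema and the row counts shown are made up for the example.

```python
# Hypothetical metainfo contents (what metainfo.yaml could parse into);
# the real schema may differ.
metainfo = {
    'dataset-name': 'fake_hotels',
    'tables': {
        'hotels': {'num_rows': 10},
        'guests': {'num_rows': 658},
    },
}

def total_num_rows(metainfo):
    """Sum the per-table row counts recorded in the metainfo."""
    return sum(table['num_rows'] for table in metainfo['tables'].values())
```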
