Environment Details
Error Description
When running the benchmark upload script for the multi_table modality, the process silently hangs while building Dataset_Details.xlsx and never completes. On GitHub Actions, this results in:
Error: Process completed with exit code 143.
The root cause lies in get_dataset_details: it calls explorer.summarize_datasets(modality=modality) on both the public and private buckets, which ultimately downloads the data ZIP for every dataset in each bucket (not just those in the current benchmark run) to compute the Total_Num_Rows column. Now that the rel-bench datasets have been added to the private bucket, the process fails when downloading them.
Expected behavior
get_dataset_details should only fetch information for datasets that were part of the benchmark run. To do this:
- Add a datasets parameter to DatasetExplorer._load_and_summarize_datasets(): a list of dataset names. When None (the default), all datasets in the bucket are summarized. When provided, only the datasets in the list are summarized.
- Update get_dataset_details to call _load_and_summarize_datasets, passing only the datasets present in the benchmark results.
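The filtering behavior proposed above could look roughly like the sketch below. It is a standalone illustration, not the actual SDGym code: summarize_datasets, load_summary, and the dataset names are hypothetical stand-ins, and load_summary plays the role of the expensive step that downloads the data ZIP.

```python
from typing import Callable, Iterable, List, Optional


def summarize_datasets(
    available: Iterable[str],
    load_summary: Callable[[str], dict],
    datasets: Optional[Iterable[str]] = None,
) -> List[dict]:
    """Summarize all names in `available`, or only those listed in `datasets`."""
    if datasets is None:
        # Default: keep the current behavior and summarize the whole bucket.
        names = list(available)
    else:
        wanted = set(datasets)
        names = [name for name in available if name in wanted]

    # load_summary (hypothetical) stands in for the ZIP download, so it is
    # only called for the selected names.
    return [load_summary(name) for name in names]


# Usage with a fake loader that records which datasets were "downloaded".
downloaded = []


def fake_load_summary(name: str) -> dict:
    downloaded.append(name)
    return {'Dataset': name, 'Total_Num_Rows': 0}


bucket = ['fake_hotels', 'rel_bench_a', 'rel_bench_b']  # made-up names
details = summarize_datasets(bucket, fake_load_summary, datasets=['fake_hotels'])
```

With datasets=['fake_hotels'], only that one dataset goes through the loader; passing datasets=None would process the whole bucket, as it does today.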
Additional Context
Total_Num_Rows is currently the only field in the dataset details table that requires downloading the data ZIP. This is out of scope for this issue, but we might consider storing this information elsewhere (e.g., in the dataset's metainfo.yaml) so we don't need to download the data to get it.