SynapseML version
com.microsoft.azure:synapseml_2.12:0.11.4-spark3.3
System information
- Language version (e.g. python 3.8, scala 2.12): python 3.9
- Spark Version (e.g. 3.2.3): 3.3.2
- Spark Platform (e.g. Synapse, Databricks): Databricks
Describe the problem
I have a LightGBM fit job that runs in a for-loop for rolling back validation.
On a multi-node cluster the job fails with a Connection Refused error in the log. Checking the failed tasks shows that the executor failed with java.lang.ArrayIndexOutOfBoundsException, which in turn caused the Connection Refused error.
The same job runs on a single-node cluster without any issue.
The DataFrame passed to the model has around 48,000 records, partitioned as follows:
Partition 0 has 19000 records
Partition 1 has 18000 records
Partition 2 has 7000 records
Partition 3 has 4000 records
The issue is not fixed by calling df.repartition(5).
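To make the imbalance concrete, here is a small sketch that works out each partition's share of the data from the counts listed above. It is plain Python arithmetic on the reported numbers, not output from the cluster; the variable names are only illustrative.

```python
# Record counts per partition, copied from the list above.
partition_counts = {0: 19000, 1: 18000, 2: 7000, 3: 4000}

total = sum(partition_counts.values())
# Fraction of all records held by each partition.
shares = {pid: n / total for pid, n in partition_counts.items()}
# Ratio between the largest and the smallest partition.
skew_ratio = max(partition_counts.values()) / min(partition_counts.values())

for pid, n in partition_counts.items():
    print(f"Partition {pid}: {n} records ({shares[pid]:.1%})")
print(f"total={total}, largest/smallest={skew_ratio:.2f}")
```

The two largest partitions together hold roughly three quarters of the records. I do not know whether this skew is related to the failure.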
Code to reproduce issue
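# Assumed import behind the sf alias used below
from pyspark.sql import functions as sf

max_base_date = '2024-09-01'
# Keep only rows before the cutoff date and cache the filtered DataFrames
tmp_train_df = train_merged_df.where(sf.col('base_date') < max_base_date).cache()
tmp_actual_df = actual_merged_df.where(sf.col('base_date') < max_base_date).cache()
model.fit(tmp_train_df, tmp_actual_df)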
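The snippet above is a single iteration. The surrounding rolling loop can be sketched roughly as follows. The helper name `rolling_cutoffs`, the fold count, and the monthly step are assumptions made for illustration only; the actual job may space its cutoff dates differently.

```python
from datetime import date


def rolling_cutoffs(last_cutoff: str, n_folds: int) -> list[str]:
    """Return n_folds month-start cutoff dates ending at last_cutoff, oldest first.

    Hypothetical helper: the monthly spacing is an assumption.
    """
    end = date.fromisoformat(last_cutoff)
    year, month = end.year, end.month
    cutoffs = []
    for _ in range(n_folds):
        cutoffs.append(date(year, month, 1).isoformat())
        # Step back one calendar month, wrapping over the year boundary.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return cutoffs[::-1]


for max_base_date in rolling_cutoffs("2024-09-01", 3):
    # In the real job, each iteration filters both DataFrames on
    # base_date < max_base_date and calls model.fit(...) as shown above.
    print(max_base_date)
```

Each iteration refits the model on the data before that cutoff date.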
Other info / logs
No response
What component(s) does this bug affect?
What language(s) does this bug affect?
What integration(s) does this bug affect?