Skip to content
This repository was archived by the owner on May 29, 2025. It is now read-only.
This repository was archived by the owner on May 29, 2025. It is now read-only.

Albert does not run locally due to FSx data source #43

@Cpruce

Description

@Cpruce

Error:

root@6eca14cacd82:/shared/sm_bert_tf/tf2/deep-learning-models/models/nlp# ./launch.sh
wandb: WARNING W&B installed but not logged in.  Run `wandb login` or set the WANDB_API_KEY env variable.
2020-10-26 13:42:31.332706: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Parameter distribution will be renamed to {'mpi': {'enabled': True, 'processes_per_host': 8, 'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'}} in SageMaker Python SDK v2.
Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
DATA_SOURCE {'FileSystemDataSource': {'FileSystemId': 'fs-0d3f0dfc5f6428d70', 'FileSystemType': 'FSxLustre', 'DirectoryPath': '/fsx', 'FileSystemAccessMode': 'rw'}}
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/shared/sm_bert_tf/tf2/deep-learning-models/models/nlp/albert/launch_sagemaker.py", line 63, in <module>
    security_group_ids=sm_args.security_group_ids,
  File "/shared/sm_bert_tf/tf2/deep-learning-models/models/nlp/common/sagemaker_utils.py", line 97, in launch_sagemaker_job
    estimator_hvd.fit(fsx_input)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/tensorflow/estimator.py", line 483, in fit
    fit_super()
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/tensorflow/estimator.py", line 462, in fit_super
    super(TensorFlow, self).fit(inputs, wait, logs, job_name, experiment_config)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/estimator.py", line 496, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/estimator.py", line 1091, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/session.py", line 590, in train
    self.sagemaker_client.create_training_job(**train_request)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/local/local_session.py", line 102, in create_training_job
    training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
  File "/usr/local/lib/python3.7/dist-packages/sagemaker/local/entities.py", line 80, in start
    "Need channel['DataSource'] to have ['S3DataSource'] or ['FileDataSource']"
ValueError: Need channel['DataSource'] to have ['S3DataSource'] or ['FileDataSource']

This is due to the FileSystemDataSource not being supported by _LocalTrainingJob

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions