update hadoop to more recent version #109

@krisiasty

The Hadoop version currently in use (2.7.3) is way too old (released back in June 2017). One consequence is that the s3a filesystem layer does not support the "fs.s3a.path.style.access" property, which means the S3-compatible object store must be configured with virtual hosting for buckets. That, in turn, is not supported on OpenShift Container Storage 4.4 (or at least it is not enabled by default, and how to configure it is not properly documented).

That means the Spark + Object Store example in the basic tutorial won't work on the latest OpenShift Container Platform (4.4) with OpenShift Container Storage.

Despite having the following settings:

s3_endpoint_url = 'https://s3.openshift-storage.svc:443'
s3_bucket = 'odh-jupyterhub-9654ef69-1f36-48f1-b50f-4d2dbef1357d'

hadoopConf.set("fs.s3a.path.style.access", "true")

the code from the tutorial raises an exception because it tries to connect to the bucket via a virtual host (http://bucket.s3endpoint/ instead of https://s3endpoint/bucket/):

Py4JJavaError: An error occurred while calling o96.csv.
: com.amazonaws.AmazonClientException: Unable to execute HTTP request: odh-jupyterhub-9654ef69-1f36-48f1-b50f-4d2dbef1357d.s3.openshift-storage.svc: Name or service not known
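
To make the difference between the two addressing styles concrete, here is a minimal sketch in plain Python (no Spark or Hadoop involved) that builds the request URL each style would produce for the endpoint and bucket above. The helper name `s3_object_url` and the object key `data.csv` are made up for illustration; this is not how s3a builds its requests internally, only a picture of the resulting hostnames.

```python
from urllib.parse import urlsplit, urlunsplit


def s3_object_url(endpoint_url, bucket, key, path_style):
    """Return the URL a client would request for bucket/key (illustrative helper)."""
    parts = urlsplit(endpoint_url)
    if path_style:
        # Path style: the host stays as configured, the bucket is the first path segment.
        netloc = parts.netloc
        path = f"/{bucket}/{key}"
    else:
        # Virtual-hosted style: the bucket is prepended to the host name,
        # so DNS must be able to resolve <bucket>.<endpoint host>.
        netloc = f"{bucket}.{parts.netloc}"
        path = f"/{key}"
    return urlunsplit((parts.scheme, netloc, path, "", ""))


endpoint = "https://s3.openshift-storage.svc:443"
bucket = "odh-jupyterhub-9654ef69-1f36-48f1-b50f-4d2dbef1357d"

# "data.csv" is a hypothetical object key used only for this example.
print(s3_object_url(endpoint, bucket, "data.csv", path_style=True))
print(s3_object_url(endpoint, bucket, "data.csv", path_style=False))
```

The second URL uses the same `<bucket>.s3.openshift-storage.svc` host name that appears in the "Name or service not known" error, which is consistent with the path-style setting being ignored.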
