Unable to load an iceberg table from aws glue catalog #515

@arookieds

Question

PyIceberg version: 0.6.0
Python version: 3.11.1

Comments:

  • The Iceberg tables are stored in an AWS Glue catalog
  • The catalog, the list of namespaces, and the list of tables are all retrievable through the catalog API

Hi,

I am having trouble loading Iceberg tables from AWS Glue.
The code I am using is as follows:

from opensea.resources.resources import *
import pyiceberg.catalog
    
profile_name = "saml2aws_profile_name"
catalog_name = "catalog name"
table_name = "table name"
aws_region = "aws region"

catalog = pyiceberg.catalog.load_catalog(
    catalog_name, **{"type": "glue", "profile_name": profile_name}
)

print(catalog.list_namespaces())

table = catalog.load_table((catalog_name, table_name))

The code allows me to:

  • list namespaces
  • list tables

But load_table throws the following error:

Traceback (most recent call last):
  File "/path/to/the/project/testing.py", line 15, in <module>
    table = catalog.load_table((catalog_name, table_name))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 473, in load_table
    return self._convert_glue_to_iceberg(self._get_glue_table(database_name=database_name, table_name=table_name))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 296, in _convert_glue_to_iceberg
    metadata = FromInputFile.table_metadata(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/serializers.py", line 112, in table_metadata
    with input_file.open() as input_stream:
         ^^^^^^^^^^^^^^^^^
  File "/path/to/the/project/venv/lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 263, in open
    input_file = self._filesystem.open_input_file(self._path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 780, in pyarrow._fs.FileSystem.open_input_file
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When reading information for key 'path/to/s3/table/location/metadata/100000-458c8ffc-de06-4eb5-bc4a-b94c3034a548.metadata.json' in bucket 's3_bucket_name': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

I have checked that I have the proper access rights, so permissions do not appear to be the issue.
I have also tried the following, without success:

  • using load_glue, instead of load_catalog
  • providing access_key and secret_key directly in the load_catalog call
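
For reference, here is a minimal sketch of what passing credentials explicitly could look like. It assumes the property names from the PyIceberg configuration docs as I understand them: region_name, aws_access_key_id, aws_secret_access_key and aws_session_token for the Glue client, and the s3.* keys for the FileIO that reads the metadata file from S3. The helper name build_glue_catalog_properties is hypothetical, all values are placeholders, and this is not necessarily the exact call I made above.

```python
from typing import Dict, Optional


def build_glue_catalog_properties(
    access_key_id: str,
    secret_access_key: str,
    region: str,
    session_token: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: build load_catalog properties for Glue and S3.

    The Glue client and the S3 FileIO are configured separately, so the
    same credentials and region are set under both groups of keys.
    """
    props = {
        "type": "glue",
        # Assumed keys for the boto3 session behind the Glue client.
        "region_name": region,
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
        # Assumed keys for the FileIO that opens the metadata JSON on S3.
        "s3.region": region,
        "s3.access-key-id": access_key_id,
        "s3.secret-access-key": secret_access_key,
    }
    # Temporary credentials (for example from saml2aws) also carry a token.
    if session_token is not None:
        props["aws_session_token"] = session_token
        props["s3.session-token"] = session_token
    return props


# Usage (needs pyiceberg installed and real credentials):
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("catalog name", **build_glue_catalog_properties(
#     "access key", "secret key", "aws region", session_token="token"))
```

Setting the region and credentials for both the Glue client and the S3 FileIO would at least show whether the HTTP 400 on HeadObject comes from the S3 side not picking up the profile used for Glue.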

The table was created via Trino with the following definition:

create table catalog_name.table_name (
          "timestamp" timestamp,
          "type" varchar(20),
          distribution int,
          service int,
          code varchar(20),
          base_id bigint,
          counter_id bigint,
          "category" varchar(50),
          volume double)
        with (
          format = 'PARQUET',
          partitioning = ARRAY['day(timestamp)'],
          location = 's3://s3_bucket/path/to/table/folder/'
        )
