Cannot read parquet from Azure when archive file exists #406

@mavestergaard

Description

This is the setup in the blob storage:

container-name/YEAR2023/MONTH03/DAY15/file_name.parquet
container-name/YEAR2023/MONTH03/DAY15/file_name.parquet.archive_202303162004
container-name/YEAR2023/MONTH03/DAY14/file_name.parquet

Reading the DAY15 file fails:
file_name = "az://container-name/YEAR2023/MONTH03/DAY15/file_name.parquet"
with the following error:
azure.core.exceptions.HttpResponseError: The specifed resource name contains invalid characters.
The error only occurs when the following archive file exists next to it:
"az://container-name/YEAR2023/MONTH03/DAY15/file_name.parquet.archive_202303162004"

The DAY14 file, which has no archive file next to it, is read fine:
file_name = "az://container-name/YEAR2023/MONTH03/DAY14/file_name.parquet"
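
To check whether a directory has the same layout as DAY15, one can look for blobs whose names extend the target blob's name. The following is a minimal sketch that only works on a list of names that has already been fetched (for example with `fs.ls`). `find_prefix_siblings` is a hypothetical helper, not part of adlfs or fsspec, and treating a shared name prefix as the trigger is an assumption drawn from the layout above, not a confirmed root cause.

```python
def find_prefix_siblings(target, blob_names):
    """Return blob names that start with `target` but are not `target` itself.

    Hypothetical diagnostic helper: it only compares strings and does not
    talk to Azure.
    """
    return [
        name
        for name in blob_names
        if name != target and name.startswith(target)
    ]


# Blob layout taken from the report.
blobs = [
    "container-name/YEAR2023/MONTH03/DAY15/file_name.parquet",
    "container-name/YEAR2023/MONTH03/DAY15/file_name.parquet.archive_202303162004",
    "container-name/YEAR2023/MONTH03/DAY14/file_name.parquet",
]
day15 = "container-name/YEAR2023/MONTH03/DAY15/file_name.parquet"
day14 = "container-name/YEAR2023/MONTH03/DAY14/file_name.parquet"

# DAY15 has an archive sibling sharing its name as a prefix; DAY14 has none.
print(find_prefix_siblings(day15, blobs))
print(find_prefix_siblings(day14, blobs))
```

With the layout from the report, only the DAY15 path has such a sibling, which matches the path that fails to read.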

This appears to be a regression introduced after adlfs 2022.10.0. With 2022.10.0, the following works:

import pandas as pd
# con_str holds the Azure Storage connection string
df = pd.read_parquet('az://container-name/YEAR2023/MONTH03/DAY15/file_name.parquet', storage_options={'connection_string': con_str})

It works with:

adlfs == 2022.10.0
fsspec == 2023.3.0

but not with any newer version of adlfs.

With the newest version of adlfs, the read does work if I list the directory first:

import fsspec
import pandas as pd
fs = fsspec.filesystem('az', connection_string=con_str)
fs.ls("/container-name/YEAR2023/MONTH03/DAY15/")
df = pd.read_parquet(file_name, storage_options={'connection_string': con_str})

However, it does not work if I run the following command first:
df = pd.read_parquet('az://container-name/YEAR2023/MONTH03/DAY15/file_name.parquet', storage_options={'connection_string': con_str})
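
The listing workaround described above could be wrapped in a small helper so the directory is always listed before the read. This is only a sketch under assumptions: `read_parquet_after_listing` is a hypothetical name, the filesystem and the reader are passed in as arguments so the helper itself needs no Azure access, and the report does not establish why listing first helps, so this may not hold in every environment.

```python
def read_parquet_after_listing(url, fs, read_parquet, **read_kwargs):
    """List the parent directory of `url` with `fs`, then read `url`.

    Hypothetical helper mirroring the workaround from the report:
    `fs` is any object with an `ls(path)` method (for example the result of
    fsspec.filesystem('az', ...)), and `read_parquet` is a callable such as
    pandas.read_parquet.
    """
    # Drop the protocol ("az://") to get "container/dir/.../file".
    path = url.split("://", 1)[-1].lstrip("/")
    # Parent directory in the same form the report passes to fs.ls().
    parent = "/" + path.rsplit("/", 1)[0] + "/"
    fs.ls(parent)
    return read_parquet(url, **read_kwargs)


# Usage against Azure (assumes adlfs, fsspec and pandas are installed and
# con_str is the connection string from the report):
#
#   import fsspec
#   import pandas as pd
#   fs = fsspec.filesystem('az', connection_string=con_str)
#   df = read_parquet_after_listing(
#       file_name, fs, pd.read_parquet,
#       storage_options={'connection_string': con_str},
#   )
```

As noted above, the workaround was only observed to help when the failing `pd.read_parquet` call had not been run first, so the helper should be the first access to that path.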
