Avro/Parquet readers bypass FileIO::ResolvePath(), causing S3 URI scheme errors #7

@smaheshwar-pltr

Description

Problem

When reading Avro or Parquet files on S3 via ArrowFileSystemFileIO, the readers call io->fs()->OpenInputFile() directly on the underlying Arrow filesystem, bypassing ArrowFileSystemFileIO::ResolvePath(). As a result, s3:// prefixes are never stripped, even though Arrow's S3FileSystem expects bare bucket/key paths rather than full URIs.

Consumers using iceberg-cpp to scan tables with S3-backed storage hit this when the REST catalog returns manifest or data file paths with s3:// schemes: those paths flow through the Avro manifest reader untransformed.
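To make the expected transformation concrete, here is a minimal sketch of the kind of scheme stripping described above. StripUriScheme is a hypothetical helper written for illustration only; it is not the iceberg-cpp ResolvePath() implementation, and it assumes that dropping everything up to and including "://" is sufficient.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (an assumption, not iceberg-cpp code): drop a leading
// "<scheme>://" so that "s3://bucket/key" becomes "bucket/key".
// Inputs without a leading scheme are returned unchanged.
std::string StripUriScheme(std::string_view path) {
  constexpr std::string_view kSeparator = "://";
  const auto pos = path.find(kSeparator);
  // No separator, or nothing in front of it: not a URI we recognise.
  if (pos == std::string_view::npos || pos == 0) {
    return std::string(path);
  }
  // A '/' before the separator means "://" sits inside a path segment,
  // so this is treated as a plain path rather than a URI.
  if (path.substr(0, pos).find('/') != std::string_view::npos) {
    return std::string(path);
  }
  return std::string(path.substr(pos + kSeparator.size()));
}
```

With this sketch, an input such as s3://warehouse/default/test_table/metadata/file.avro would come out as warehouse/default/test_table/metadata/file.avro, which is the bucket/key form that S3FileSystem accepts.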

Error

Invalid: Expected an S3 object path of the form 'bucket/key...', got a URI:
's3://warehouse/default/test_table/metadata/snap-487842974509551922-0-dc0a55d6-5df1-4ffa-a01c-b7481e5c663c.avro'

This originates from Arrow's S3Path::FromString() in s3fs.cc:

if (internal::IsLikelyUri(s)) {
    return Status::Invalid(
        "Expected an S3 object path of the form 'bucket/key...', got a URI: '", s, "'");
}

Variant errors depending on FileIO configuration

When the FileIO falls back to a local filesystem instead of S3, the error becomes:

Invalid: The filesystem expected a URI with one of the schemes (file) but received
s3://warehouse/testing/sample/metadata/snap-4065910918800248368-0-422aac49-0c9d-43c9-8cb4-3d21931e29f2.avro

Affected code paths

  • avro_reader.cc — calls io->fs()->OpenInputFile() directly
  • parquet_reader.cc — calls io->fs()->OpenInputFile() directly
  • avro_writer.cc — calls io->fs()->OpenOutputStream() directly
  • parquet_writer.cc — calls io->fs()->OpenOutputStream() directly

All four bypass ArrowFileSystemFileIO::ResolvePath(), which handles URI scheme stripping.
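One possible shape for a fix is to route every path through the FileIO's resolution step before handing it to the filesystem. The sketch below contrasts the two call patterns using stand-in types: FakeObjectStore, FakeFileIO, OpenDirect and OpenResolved are all hypothetical names, and the real iceberg-cpp and Arrow signatures (which return Result/Status values rather than throwing) will differ.

```cpp
#include <stdexcept>
#include <string>
#include <string_view>

// Stand-in for a filesystem that, like the S3 case in this issue,
// rejects full URIs and accepts only bare "bucket/key" paths.
struct FakeObjectStore {
  std::string OpenInputFile(const std::string& path) const {
    if (path.find("://") != std::string::npos) {
      throw std::invalid_argument(
          "Expected an object path of the form 'bucket/key...', got a URI: '" +
          path + "'");
    }
    return path;  // a real filesystem would return a file handle here
  }
};

// Stand-in for the FileIO wrapper: owns the filesystem and knows how to
// turn a URI into the path form that filesystem expects.
class FakeFileIO {
 public:
  // Assumed behaviour: strip a leading "<scheme>://" if one is present.
  std::string ResolvePath(std::string_view path) const {
    const auto pos = path.find("://");
    if (pos == std::string_view::npos) return std::string(path);
    return std::string(path.substr(pos + 3));
  }
  const FakeObjectStore& fs() const { return fs_; }

 private:
  FakeObjectStore fs_;
};

// Pattern reported in this issue: the raw path goes straight to fs().
std::string OpenDirect(const FakeFileIO& io, const std::string& path) {
  return io.fs().OpenInputFile(path);
}

// Suggested pattern: resolve first, then open.
std::string OpenResolved(const FakeFileIO& io, const std::string& path) {
  return io.fs().OpenInputFile(io.ResolvePath(path));
}
```

In this sketch, OpenDirect throws for an s3:// input while OpenResolved succeeds on the same input. The same ordering would apply to the OpenOutputStream() calls in the two writers.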
