Feature Request / Improvement
Can we consolidate and standardize FileIO to the PyArrow implementation?
There are currently two different FileIO implementations, ARROW_FILE_IO and FSSPEC_FILE_IO. ARROW_FILE_IO uses Apache Arrow's Filesystem Interface, while FSSPEC_FILE_IO uses the fsspec library.
Here are a few reasons for consolidating:
PyArrow is already preferred over FsSpec for various FS implementations.
See iceberg-python/pyiceberg/io/__init__.py, lines 273 to 282 in cd7fb50.
PyIceberg is becoming more coupled with PyArrow: to_arrow() and pa.Table are widely used for reading and writing, including the new feature create_table with a PyArrow Schema (#305).
A single implementation is easier to maintain than keeping the behavior of the two FileIOs in sync. For example, FsSpec defaults a path with no scheme (/tmp/warehouse) to the file scheme, but PyArrow does not. See #301.
The two FileIO implementations are not that different from one another. FsSpec can use its underlying FS implementations, including LocalFileSystem, S3FileSystem, GCSFileSystem, and AzureBlobFileSystem, while PyArrow uses its own FS implementations, including LocalFileSystem, S3FileSystem, HadoopFileSystem, and GcsFileSystem. Going by these two lists, PyArrow is currently missing the AzureBlobFileSystem implementation, but it has support for HDFS.
FsSpec and PyArrow can be used in both directions:
PyArrow can use an fsspec-based filesystem.
FsSpec can wrap a PyArrow filesystem.
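
To make the scheme-defaulting difference from #301 concrete, here is a minimal sketch of how a consolidated FileIO could normalize locations before choosing an implementation. It uses only the standard library. The helper names `resolve_scheme` and `candidate_file_ios` and the `PREFERRED_FILE_IO` table are hypothetical and written for illustration; they are not taken from pyiceberg/io/__init__.py.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical preference table, for illustration only. The string values
# stand in for the two FileIO implementations named in this issue.
PREFERRED_FILE_IO: Dict[str, List[str]] = {
    "file": ["ARROW_FILE_IO", "FSSPEC_FILE_IO"],
    "s3": ["ARROW_FILE_IO", "FSSPEC_FILE_IO"],
}


def resolve_scheme(location: str, default: str = "file") -> str:
    """Return the URI scheme of a location, or `default` when it has none."""
    # urlparse("/tmp/warehouse").scheme is an empty string, so a bare path
    # falls back to the default scheme.
    return urlparse(location).scheme or default


def candidate_file_ios(location: str) -> List[str]:
    """Return the FileIO candidates for a location, in order of preference."""
    return PREFERRED_FILE_IO.get(resolve_scheme(location), [])


if __name__ == "__main__":
    for loc in ("/tmp/warehouse", "file:///tmp/warehouse", "s3://bucket/warehouse"):
        print(loc, resolve_scheme(loc), candidate_file_ios(loc))
```

With a single normalization step like this, a bare path such as /tmp/warehouse and file:///tmp/warehouse would resolve to the same scheme regardless of which backend ends up handling it.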