Arrow
=====

DataFusion implements the Apache Arrow PyCapsule interface for importing and exporting DataFrames with zero copy. Any Python project that implements this interface can therefore exchange data with DataFusion in either direction without copying it.

We can demonstrate this using ``pyarrow``.

Importing to DataFusion
-----------------------

Here we will create an Arrow table and import it into DataFusion.

To import an Arrow table, use :py:func:`datafusion.context.SessionContext.from_arrow`. It accepts any Python object that implements ``__arrow_c_stream__``, or that implements ``__arrow_c_array__`` and returns a ``StructArray``. Common ``pyarrow`` sources you can use are:

- Array (but it must return a Struct Array)
- Record Batch
- Record Batch Reader
- Table

.. ipython:: python

    from datafusion import SessionContext
    import pyarrow as pa

    data = {"a": [1, 2, 3], "b": [4, 5, 6]}
    table = pa.Table.from_pydict(data)

    ctx = SessionContext()
    df = ctx.from_arrow(table)
    df

Exporting from DataFusion
-------------------------

DataFusion DataFrames implement the ``__arrow_c_stream__`` PyCapsule interface, so any Python library that accepts objects implementing it can import a DataFusion DataFrame directly.

Invoking ``__arrow_c_stream__`` triggers execution of the underlying query, but batches are yielded incrementally rather than materialized all at once in memory. Downstream readers pull batches on demand and can process each one as it arrives.

.. ipython:: python

    from datafusion import col, lit

    df = df.select((col("a") * lit(1.5)).alias("c"), lit("df").alias("d"))
    pa.table(df)
