DataFusion implements the Apache Arrow PyCapsule interface for importing and exporting DataFrames with zero copy. Any Python project that implements this interface can therefore exchange data with DataFusion in both directions without copying it.
The examples below demonstrate this using pyarrow.
Here we will create an Arrow table and import it to DataFusion.
To import an Arrow table, use :py:func:`datafusion.context.SessionContext.from_arrow`.
This method accepts any Python object that implements ``__arrow_c_stream__`` or
``__arrow_c_array__`` and exports its data as a ``StructArray``. Common pyarrow
sources you can use are:
- Array (but it must return a Struct Array)
- Record Batch
- Record Batch Reader
- Table

.. ipython:: python

    from datafusion import SessionContext
    import pyarrow as pa

    data = {"a": [1, 2, 3], "b": [4, 5, 6]}
    table = pa.Table.from_pydict(data)

    ctx = SessionContext()
    df = ctx.from_arrow(table)
    df
DataFusion DataFrames implement the ``__arrow_c_stream__`` PyCapsule interface, so any
Python library that accepts this interface can import a DataFusion DataFrame directly.

.. note::
    Invoking ``__arrow_c_stream__`` still triggers execution of the underlying
    query, but batches are yielded incrementally rather than materialized all at
    once in memory. Consumers can process the stream as it arrives, avoiding the
    memory overhead of a full
    :py:func:`datafusion.dataframe.DataFrame.collect`.

    For an example of this streamed execution and its memory safety, see the
    ``test_arrow_c_stream_large_dataset`` unit test in
    :mod:`python.tests.test_io`.

.. ipython:: python

    from datafusion import col, lit

    df = df.select((col("a") * lit(1.5)).alias("c"), lit("df").alias("d"))
    pa.table(df)