
DataFrames

Overview

The DataFrame class is the core abstraction in DataFusion that represents tabular data and operations on that data. DataFrames provide a flexible API for transforming data through various operations such as filtering, projection, aggregation, joining, and more.

A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when terminal operations like collect(), show(), or to_pandas() are called.
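For example, each transformation below only extends the logical plan; nothing is read or computed until the final call (a minimal sketch, assuming a table named "users" has already been registered):

from datafusion import SessionContext, column, literal

ctx = SessionContext()

# These calls only build up the logical plan; no data is touched yet
df = ctx.table("users")
df = df.filter(column("age") > literal(25))
df = df.select("name", "age")

# Execution happens here, when results are materialized
df.show()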

Creating DataFrames

DataFrames can be created in several ways:

  • From SQL queries via a SessionContext:

    from datafusion import SessionContext
    
    ctx = SessionContext()
    df = ctx.sql("SELECT * FROM your_table")
  • From registered tables (a registration sketch follows this list):

    df = ctx.table("your_table")
  • From various data sources:

    # From CSV files (see :ref:`io_csv` for detailed options)
    df = ctx.read_csv("path/to/data.csv")
    
    # From Parquet files (see :ref:`io_parquet` for detailed options)
    df = ctx.read_parquet("path/to/data.parquet")
    
    # From JSON files (see :ref:`io_json` for detailed options)
    df = ctx.read_json("path/to/data.json")
    
    # From Avro files (see :ref:`io_avro` for detailed options)
    df = ctx.read_avro("path/to/data.avro")
    
    # From Pandas DataFrame
    import pandas as pd
    pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df = ctx.from_pandas(pandas_df)
    
    # From Arrow data
    import pyarrow as pa
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
        names=["a", "b"]
    )
    df = ctx.from_arrow(batch)
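The ctx.table(...) call above assumes a table has already been registered with the context. A minimal sketch of that registration step, using register_parquet (the other register_* methods follow the same pattern):

    # Register a file-backed table under a name, then refer to it by name
    ctx.register_parquet("your_table", "path/to/data.parquet")
    df = ctx.table("your_table")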

For detailed information about reading from different data sources, see the :doc:`I/O Guide <../io/index>`. For custom data sources, see :ref:`io_custom_table_provider`.

Common DataFrame Operations

DataFusion's DataFrame API offers a wide range of operations:

from datafusion import column, literal
from datafusion import functions as f

# Select specific columns
df = df.select("col1", "col2")

# Select with expressions
df = df.select(column("a") + column("b"), column("a") - column("b"))

# Filter rows
df = df.filter(column("age") > literal(25))

# Add computed columns (string concatenation uses the concat function)
df = df.with_column(
    "full_name",
    f.concat(column("first_name"), literal(" "), column("last_name")),
)

# Multiple column additions
df = df.with_columns(
    (column("a") + column("b")).alias("sum"),
    (column("a") * column("b")).alias("product")
)

# Sort data
df = df.sort(column("age").sort(ascending=False))

# Join DataFrames
df = df1.join(df2, on="user_id", how="inner")

# Aggregate data
df = df.aggregate(
    [],  # Group by columns (empty for global aggregation)
    [f.sum(column("amount")).alias("total_amount")]
)

# Limit rows
df = df.limit(100)

# Drop columns
df = df.drop("temporary_column")
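Because every operation returns a new DataFrame, these calls compose naturally into pipelines. A sketch combining several of the operations above (the "sales" table and its columns are illustrative):

report = (
    ctx.table("sales")
    .filter(column("amount") > literal(0))
    .aggregate([column("region")], [f.sum(column("amount")).alias("total")])
    .sort(column("total").sort(ascending=False))
    .limit(10)
)
report.show()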

Terminal Operations

To materialize the results of your DataFrame operations:

# Collect all data as PyArrow RecordBatches
result_batches = df.collect()

# Convert to various formats
pandas_df = df.to_pandas()        # Pandas DataFrame
polars_df = df.to_polars()        # Polars DataFrame
arrow_table = df.to_arrow_table() # PyArrow Table
py_dict = df.to_pydict()          # Python dictionary
py_list = df.to_pylist()          # Python list of dictionaries

# Display results
df.show()                         # Print tabular format to console

# Count rows
count = df.count()

Zero-copy streaming to Arrow-based Python libraries

DataFusion DataFrames implement the __arrow_c_stream__ protocol, enabling zero-copy, lazy streaming into Arrow-based Python libraries. Because batches are produced on demand, arbitrarily large results can be processed without materializing the entire result set in memory.

Note

The protocol is implementation-agnostic and works with any Python library that understands the Arrow C streaming interface (for example, PyArrow or other Arrow-compatible implementations). The sections below provide a short PyArrow-specific example and general guidance for other implementations.

PyArrow

import pyarrow as pa

# Create a PyArrow RecordBatchReader without materializing all batches
reader = pa.RecordBatchReader.from_stream(df)
for batch in reader:
    ...  # process each batch as it is produced
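Because the reader is lazy, it can be handed directly to other Arrow APIs. For example, pyarrow.dataset.write_dataset accepts a RecordBatchReader, so query results can be written to disk without ever being fully materialized (a sketch; the output path is illustrative):

import pyarrow as pa
import pyarrow.dataset as ds

reader = pa.RecordBatchReader.from_stream(df)
# Batches are pulled from DataFusion one at a time as they are written
ds.write_dataset(reader, "out/results", format="parquet")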

DataFrames are also iterable, yielding :class:`datafusion.RecordBatch` objects lazily so you can loop over results directly without importing PyArrow:

for batch in df:
    ...  # each batch is a datafusion.RecordBatch

Each batch exposes to_pyarrow() for conversion into PyArrow, while pa.table(df) collects the entire DataFrame eagerly into a PyArrow Table:

import pyarrow as pa

table = pa.table(df)
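A sketch of the per-batch conversion path mentioned above, which keeps only one batch in PyArrow form at a time:

for batch in df:
    arrow_batch = batch.to_pyarrow()  # a pyarrow.RecordBatch
    ...  # process with PyArrow APIs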

Asynchronous iteration is supported as well, allowing integration with asyncio event loops:

async for batch in df:
    ...  # process each batch as it is produced


Execute as Stream

For finer control over streaming execution, use :py:meth:`~datafusion.DataFrame.execute_stream` to obtain a :py:class:`datafusion.RecordBatchStream`:

stream = df.execute_stream()
for batch in stream:
    ...  # process each batch as it is produced

Tip

To get a PyArrow reader instead, call pa.RecordBatchReader.from_stream(df).

When partition boundaries are important, :py:meth:`~datafusion.DataFrame.execute_stream_partitioned` returns an iterable of :py:class:`datafusion.RecordBatchStream` objects, one per partition:

for stream in df.execute_stream_partitioned():
    for batch in stream:
        ...  # each stream yields RecordBatches

To process partitions concurrently, first collect the streams into a list and then poll each one in a separate asyncio task:

import asyncio

async def consume(stream):
    async for batch in stream:
        ...

async def main():
    streams = list(df.execute_stream_partitioned())
    await asyncio.gather(*(consume(s) for s in streams))

asyncio.run(main())

See :doc:`../io/arrow` for additional details on the Arrow interface.

HTML Rendering

When working in Jupyter notebooks or other environments that support HTML rendering, DataFrames will automatically display as formatted HTML tables. For detailed information about customizing HTML rendering, formatting options, and advanced styling, see :doc:`rendering`.

Core Classes

DataFrame

The main DataFrame class for building and executing queries.

See: :py:class:`datafusion.DataFrame`

SessionContext

The primary entry point for creating DataFrames from various data sources.

Key methods for DataFrame creation include sql(), table(), read_csv(), read_parquet(), read_json(), read_avro(), from_pandas(), and from_arrow(), as shown in Creating DataFrames above.

See: :py:class:`datafusion.SessionContext`

Expression Classes

Expr

Represents expressions that can be used in DataFrame operations.

See: :py:class:`datafusion.Expr`

Expressions are created with helper functions such as column() and literal(), used throughout this guide.
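A short sketch of composing expressions with these helpers (the column names are illustrative):

from datafusion import column, literal

expr = (column("price") * literal(1.1)).alias("price_with_tax")
ordering = column("price").sort(ascending=False)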

Built-in Functions

DataFusion provides many built-in functions for data manipulation:

For a complete list of available functions, see the :py:mod:`datafusion.functions` module documentation.
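For example, aggregation helpers from the module combine with column expressions (a sketch; the column names are illustrative):

from datafusion import column
from datafusion import functions as f

df = df.aggregate(
    [column("region")],
    [
        f.sum(column("amount")).alias("total"),
        f.avg(column("amount")).alias("average"),
        f.count(column("amount")).alias("rows"),
    ],
)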

.. toctree::
   :maxdepth: 1

   rendering