The DataFrame class is the core abstraction in DataFusion that represents tabular data and operations
on that data. DataFrames provide a flexible API for transforming data through various operations such as
filtering, projection, aggregation, joining, and more.
A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
terminal operations like collect(), show(), or to_pandas() are called.
DataFrames can be created in several ways:
From SQL queries via a SessionContext:

from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT * FROM your_table")

From registered tables:
df = ctx.table("your_table")
From various data sources:
# From CSV files (see :ref:`io_csv` for detailed options)
df = ctx.read_csv("path/to/data.csv")

# From Parquet files (see :ref:`io_parquet` for detailed options)
df = ctx.read_parquet("path/to/data.parquet")

# From JSON files (see :ref:`io_json` for detailed options)
df = ctx.read_json("path/to/data.json")

# From Avro files (see :ref:`io_avro` for detailed options)
df = ctx.read_avro("path/to/data.avro")

# From Pandas DataFrame
import pandas as pd
pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df = ctx.from_pandas(pandas_df)

# From Arrow data
import pyarrow as pa
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
    names=["a", "b"]
)
df = ctx.from_arrow(batch)

For detailed information about reading from different data sources, see the :doc:`I/O Guide <../io/index>`. For custom data sources, see :ref:`io_custom_table_provider`.
DataFusion's DataFrame API offers a wide range of operations:
from datafusion import column, literal
# Select specific columns
df = df.select("col1", "col2")
# Select with expressions
df = df.select(column("a") + column("b"), column("a") - column("b"))
# Filter rows
df = df.filter(column("age") > literal(25))
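# Add computed columns (strings are joined with functions.concat, not "+")
from datafusion import functions as f
df = df.with_column(
    "full_name",
    f.concat(column("first_name"), literal(" "), column("last_name"))
)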
# Multiple column additions
df = df.with_columns(
    (column("a") + column("b")).alias("sum"),
    (column("a") * column("b")).alias("product")
)
# Sort data
df = df.sort(column("age").sort(ascending=False))
# Join DataFrames
df = df1.join(df2, on="user_id", how="inner")
# Aggregate data
from datafusion import functions as f
df = df.aggregate(
    [],  # Group by columns (empty for global aggregation)
    [f.sum(column("amount")).alias("total_amount")]
)
# Limit rows
df = df.limit(100)
# Drop columns
df = df.drop("temporary_column")

To materialize the results of your DataFrame operations:
# Collect all data as PyArrow RecordBatches
result_batches = df.collect()
# Convert to various formats
pandas_df = df.to_pandas() # Pandas DataFrame
polars_df = df.to_polars() # Polars DataFrame
arrow_table = df.to_arrow_table() # PyArrow Table
py_dict = df.to_pydict() # Python dictionary
py_list = df.to_pylist() # Python list of dictionaries
# Display results
df.show() # Print tabular format to console
# Count rows
count = df.count()

DataFusion DataFrames implement the __arrow_c_stream__ protocol, enabling
zero-copy streaming into libraries like PyArrow.
Earlier versions eagerly converted the entire DataFrame when exporting to
PyArrow, which could exhaust memory on large datasets. With streaming, batches
are produced lazily so you can process arbitrarily large results without
out-of-memory errors.
import pyarrow as pa
# Create a PyArrow RecordBatchReader without materializing all batches
# (from_stream is PyArrow's public entry point for __arrow_c_stream__ objects)
reader = pa.RecordBatchReader.from_stream(df)
for batch in reader:
    ...  # process each batch as it is produced

DataFrames are also iterable, yielding :class:`pyarrow.RecordBatch` objects lazily so you can loop over results directly:
for batch in df:
    ...  # process each batch as it is produced

See :doc:`../io/arrow` for additional details on the Arrow interface.
When working in Jupyter notebooks or other environments that support HTML rendering, DataFrames will automatically display as formatted HTML tables. For detailed information about customizing HTML rendering, formatting options, and advanced styling, see :doc:`rendering`.
- DataFrame
The main DataFrame class for building and executing queries.
- SessionContext
The primary entry point for creating DataFrames from various data sources.
Key methods for DataFrame creation:
- :py:meth:`~datafusion.SessionContext.read_csv` - Read CSV files
- :py:meth:`~datafusion.SessionContext.read_parquet` - Read Parquet files
- :py:meth:`~datafusion.SessionContext.read_json` - Read JSON files
- :py:meth:`~datafusion.SessionContext.read_avro` - Read Avro files
- :py:meth:`~datafusion.SessionContext.table` - Access registered tables
- :py:meth:`~datafusion.SessionContext.sql` - Execute SQL queries
- :py:meth:`~datafusion.SessionContext.from_pandas` - Create from Pandas DataFrame
- :py:meth:`~datafusion.SessionContext.from_arrow` - Create from Arrow data
- Expr
Represents expressions that can be used in DataFrame operations.
Functions for creating expressions:
- :py:func:`datafusion.column` - Reference a column by name
- :py:func:`datafusion.literal` - Create a literal value expression
DataFusion provides many built-in functions for data manipulation:
- :py:mod:`datafusion.functions` - Mathematical, string, date/time, and aggregation functions
For a complete list of available functions, see the :py:mod:`datafusion.functions` module documentation.
.. toctree::
   :maxdepth: 1

   rendering