The DataFrame class is the core abstraction in DataFusion that represents tabular data and operations
on that data. DataFrames provide a flexible API for transforming data through various operations such as
filtering, projection, aggregation, joining, and more.
A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
terminal operations like collect(), show(), or to_pandas() are called.
This page follows that lifecycle: first creating a DataFrame, then building a query plan with common operations, and finally executing the plan to inspect or export results. More specialized topics live on their own pages:
- :doc:`../common-operations/index` covers filtering, joins, aggregations, functions, windows, and UDFs in more detail.
- :doc:`arrow-interface` explains Arrow streaming and the
__arrow_c_stream__interface. - :doc:`rendering` covers display behavior in notebooks and terminals.
- :doc:`execution-metrics` explains how to inspect runtime statistics after a DataFrame executes.
DataFrames can be created in several ways:
From SQL queries via a
SessionContext:from datafusion import SessionContext ctx = SessionContext() df = ctx.sql("SELECT * FROM your_table")
From registered tables:
df = ctx.table("your_table")
From various data sources:
# From CSV files (see :ref:`io_csv` for detailed options) df = ctx.read_csv("path/to/data.csv") # From Parquet files (see :ref:`io_parquet` for detailed options) df = ctx.read_parquet("path/to/data.parquet") # From JSON files (see :ref:`io_json` for detailed options) df = ctx.read_json("path/to/data.json") # From Avro files (see :ref:`io_avro` for detailed options) df = ctx.read_avro("path/to/data.avro") # From Pandas DataFrame import pandas as pd pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) df = ctx.from_pandas(pandas_df) # From Arrow data import pyarrow as pa batch = pa.RecordBatch.from_arrays( [pa.array([1, 2, 3]), pa.array([4, 5, 6])], names=["a", "b"] ) df = ctx.from_arrow(batch)
For detailed information about reading from different data sources, see the :doc:`I/O Guide <../io/index>`. For custom data sources, see :ref:`io_custom_table_provider`.
DataFusion's DataFrame API offers a wide range of operations:
from datafusion import column, literal
# Select specific columns
df = df.select("col1", "col2")
# Select with expressions
df = df.select(column("a") + column("b"), column("a") - column("b"))
# Filter rows (expressions or SQL strings)
df = df.filter(column("age") > literal(25))
df = df.filter("age > 25")
# Add computed columns
df = df.with_column("full_name", column("first_name") + literal(" ") + column("last_name"))
# Multiple column additions
df = df.with_columns(
(column("a") + column("b")).alias("sum"),
(column("a") * column("b")).alias("product")
)
# Sort data
df = df.sort(column("age").sort(ascending=False))
# Join DataFrames
df = df1.join(df2, on="user_id", how="inner")
# Aggregate data
from datafusion import functions as f
df = df.aggregate(
[], # Group by columns (empty for global aggregation)
[f.sum(column("amount")).alias("total_amount")]
)
# Limit rows
df = df.limit(100)
# Drop columns
df = df.drop("temporary_column")Some DataFrame methods accept column names when an argument refers to an
existing column. These include:
- :py:meth:`~datafusion.DataFrame.select`
- :py:meth:`~datafusion.DataFrame.sort`
- :py:meth:`~datafusion.DataFrame.drop`
- :py:meth:`~datafusion.DataFrame.join` (
onargument) - :py:meth:`~datafusion.DataFrame.aggregate` (grouping columns)
See the full function documentation for details on any specific function.
Note that :py:meth:`~datafusion.DataFrame.join_on` expects col()/column() expressions rather than plain strings.
For such methods, you can pass column names directly:
from datafusion import col, functions as f
df.sort('id')
df.aggregate('id', [f.count(col('value'))])The same operation can also be written with explicit column expressions, using either col() or column():
from datafusion import col, column, functions as f
df.sort(col('id'))
df.aggregate(column('id'), [f.count(col('value'))])Note that column() is an alias of col(), so you can use either name; the example above shows both in action.
Whenever an argument represents an expression—such as in
:py:meth:`~datafusion.DataFrame.filter` or
:py:meth:`~datafusion.DataFrame.with_column`—use col() to reference
columns. The comparison and arithmetic operators on Expr will automatically
convert any non-Expr value into a literal expression, so writing
from datafusion import col
df.filter(col("age") > 21)is equivalent to using lit(21) explicitly. Use lit() (also available
as literal()) when you need to construct a literal expression directly.
To materialize the results of your DataFrame operations:
# Collect all data as PyArrow RecordBatches
result_batches = df.collect()
# Convert to various formats
pandas_df = df.to_pandas() # Pandas DataFrame
polars_df = df.to_polars() # Polars DataFrame
arrow_table = df.to_arrow_table() # PyArrow Table
py_dict = df.to_pydict() # Python dictionary
py_list = df.to_pylist() # Python list of dictionaries
# Display results
df.show() # Print tabular format to console
# Count rows
count = df.count()
# Collect a single column of data as a PyArrow Array
arr = df.collect_column("age")DataFusion DataFrames implement the __arrow_c_stream__ protocol and can
also be displayed in notebooks, streamed to Arrow-based libraries, and inspected
after execution. These topics are covered separately so this overview can stay
focused on the DataFrame lifecycle:
- :doc:`arrow-interface` covers lazy PyArrow interop, direct
RecordBatchStreamusage, and partitioned stream processing. - :doc:`rendering` covers notebook and terminal display behavior, formatting options, and custom styling.
- :doc:`execution-metrics` covers per-operator runtime statistics after execution, including row counts and compute time.
- DataFrame
The main DataFrame class for building and executing queries.
- SessionContext
The primary entry point for creating DataFrames from various data sources.
Key methods for DataFrame creation:
- :py:meth:`~datafusion.SessionContext.read_csv` - Read CSV files
- :py:meth:`~datafusion.SessionContext.read_parquet` - Read Parquet files
- :py:meth:`~datafusion.SessionContext.read_json` - Read JSON files
- :py:meth:`~datafusion.SessionContext.read_avro` - Read Avro files
- :py:meth:`~datafusion.SessionContext.table` - Access registered tables
- :py:meth:`~datafusion.SessionContext.sql` - Execute SQL queries
- :py:meth:`~datafusion.SessionContext.from_pandas` - Create from Pandas DataFrame
- :py:meth:`~datafusion.SessionContext.from_arrow` - Create from Arrow data
- Expr
Represents expressions that can be used in DataFrame operations.
Functions for creating expressions:
- :py:func:`datafusion.column` - Reference a column by name
- :py:func:`datafusion.literal` - Create a literal value expression
DataFusion provides many built-in functions for data manipulation:
- :py:mod:`datafusion.functions` - Mathematical, string, date/time, and aggregation functions
For a complete list of available functions, see the :py:mod:`datafusion.functions` module documentation.
.. toctree:: :maxdepth: 1 arrow-interface rendering execution-metrics