DataFrames

Overview

The DataFrame class is the core abstraction in DataFusion that represents tabular data and operations on that data. DataFrames provide a flexible API for transforming data through various operations such as filtering, projection, aggregation, joining, and more.

A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when terminal operations like collect(), show(), or to_pandas() are called.

This page follows that lifecycle: first creating a DataFrame, then building a query plan with common operations, and finally executing the plan to inspect or export results. More specialized topics live on their own pages:

:doc:`../common-operations/index` covers filtering, joins, aggregations, functions, windows, and UDFs in more detail.
:doc:`arrow-interface` explains Arrow streaming and the __arrow_c_stream__ interface.
:doc:`rendering` covers display behavior in notebooks and terminals.
:doc:`execution-metrics` explains how to inspect runtime statistics after a DataFrame executes.

Creating DataFrames

DataFrames can be created in several ways:

From SQL queries via a SessionContext:

from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT * FROM your_table")

From registered tables:
```
df = ctx.table("your_table")
```

From various data sources:

# From CSV files (see :ref:`io_csv` for detailed options)
df = ctx.read_csv("path/to/data.csv")

# From Parquet files (see :ref:`io_parquet` for detailed options)
df = ctx.read_parquet("path/to/data.parquet")

# From JSON files (see :ref:`io_json` for detailed options)
df = ctx.read_json("path/to/data.json")

# From Avro files (see :ref:`io_avro` for detailed options)
df = ctx.read_avro("path/to/data.avro")

# From Pandas DataFrame
import pandas as pd
pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df = ctx.from_pandas(pandas_df)

# From Arrow data
import pyarrow as pa
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
    names=["a", "b"]
)
df = ctx.from_arrow(batch)

For detailed information about reading from different data sources, see the :doc:`I/O Guide <../io/index>`. For custom data sources, see :ref:`io_custom_table_provider`.

Common DataFrame Operations

DataFusion's DataFrame API offers a wide range of operations:

from datafusion import column, literal

# Select specific columns
df = df.select("col1", "col2")

# Select with expressions
df = df.select(column("a") + column("b"), column("a") - column("b"))

# Filter rows (expressions or SQL strings)
df = df.filter(column("age") > literal(25))
df = df.filter("age > 25")

# Add computed columns
df = df.with_column("full_name", column("first_name") + literal(" ") + column("last_name"))

# Multiple column additions
df = df.with_columns(
    (column("a") + column("b")).alias("sum"),
    (column("a") * column("b")).alias("product")
)

# Sort data
df = df.sort(column("age").sort(ascending=False))

# Join DataFrames
df = df1.join(df2, on="user_id", how="inner")

# Aggregate data
from datafusion import functions as f
df = df.aggregate(
    [],  # Group by columns (empty for global aggregation)
    [f.sum(column("amount")).alias("total_amount")]
)

# Limit rows
df = df.limit(100)

# Drop columns
df = df.drop("temporary_column")

Column Names as Function Arguments

Some DataFrame methods accept column names when an argument refers to an existing column. These include:

:py:meth:`~datafusion.DataFrame.select`
:py:meth:`~datafusion.DataFrame.sort`
:py:meth:`~datafusion.DataFrame.drop`
:py:meth:`~datafusion.DataFrame.join` (on argument)
:py:meth:`~datafusion.DataFrame.aggregate` (grouping columns)

See the full function documentation for details on any specific function.

Note that :py:meth:`~datafusion.DataFrame.join_on` expects col()/column() expressions rather than plain strings.

For such methods, you can pass column names directly:

from datafusion import col, functions as f

df.sort('id')
df.aggregate('id', [f.count(col('value'))])

The same operation can also be written with explicit column expressions, using either col() or column():

from datafusion import col, column, functions as f

df.sort(col('id'))
df.aggregate(column('id'), [f.count(col('value'))])

Note that column() is an alias of col(), so you can use either name; the example above shows both in action.

Whenever an argument represents an expression—such as in :py:meth:`~datafusion.DataFrame.filter` or :py:meth:`~datafusion.DataFrame.with_column`—use col() to reference columns. The comparison and arithmetic operators on Expr will automatically convert any non-Expr value into a literal expression, so writing

from datafusion import col
df.filter(col("age") > 21)

is equivalent to using lit(21) explicitly. Use lit() (also available as literal()) when you need to construct a literal expression directly.

Terminal Operations

To materialize the results of your DataFrame operations:

# Collect all data as PyArrow RecordBatches
result_batches = df.collect()

# Convert to various formats
pandas_df = df.to_pandas()        # Pandas DataFrame
polars_df = df.to_polars()        # Polars DataFrame
arrow_table = df.to_arrow_table() # PyArrow Table
py_dict = df.to_pydict()          # Python dictionary
py_list = df.to_pylist()          # Python list of dictionaries

# Display results
df.show()                         # Print tabular format to console

# Count rows
count = df.count()

# Collect a single column of data as a PyArrow Array
arr = df.collect_column("age")

Core Classes

DataFrame

The main DataFrame class for building and executing queries.

See: :py:class:`datafusion.DataFrame`

SessionContext

The primary entry point for creating DataFrames from various data sources.

Key methods for DataFrame creation:

:py:meth:`~datafusion.SessionContext.read_csv` - Read CSV files
:py:meth:`~datafusion.SessionContext.read_parquet` - Read Parquet files
:py:meth:`~datafusion.SessionContext.read_json` - Read JSON files
:py:meth:`~datafusion.SessionContext.read_avro` - Read Avro files
:py:meth:`~datafusion.SessionContext.table` - Access registered tables
:py:meth:`~datafusion.SessionContext.sql` - Execute SQL queries
:py:meth:`~datafusion.SessionContext.from_pandas` - Create from Pandas DataFrame
:py:meth:`~datafusion.SessionContext.from_arrow` - Create from Arrow data

See: :py:class:`datafusion.SessionContext`

Expression Classes

Expr

Represents expressions that can be used in DataFrame operations.

See: :py:class:`datafusion.Expr`

Functions for creating expressions:

:py:func:`datafusion.column` - Reference a column by name
:py:func:`datafusion.literal` - Create a literal value expression

Built-in Functions

DataFusion provides many built-in functions for data manipulation:

:py:mod:`datafusion.functions` - Mathematical, string, date/time, and aggregation functions

For a complete list of available functions, see the :py:mod:`datafusion.functions` module documentation.

.. toctree::
   :maxdepth: 1

   arrow-interface
   rendering
   execution-metrics

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrames

Overview

Creating DataFrames

Common DataFrame Operations

Column Names as Function Arguments

Terminal Operations

Related Topics

Core Classes

Expression Classes

Built-in Functions

FilesExpand file tree

index.rst

Latest commit

History

index.rst

File metadata and controls

DataFrames

Overview

Creating DataFrames

Common DataFrame Operations

Column Names as Function Arguments

Terminal Operations

Related Topics

Core Classes

Expression Classes

Built-in Functions