DataFrame API

Overview

The DataFrame class is the core abstraction in DataFusion that represents tabular data and operations on that data. DataFrames provide a flexible API for transforming data through various operations such as filtering, projection, aggregation, joining, and more.

A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when terminal operations like collect(), show(), or to_pandas() are called.

Creating DataFrames

DataFrames can be created in several ways:

From SQL queries via a SessionContext:

from datafusion import SessionContext

ctx = SessionContext()
df = ctx.sql("SELECT * FROM your_table")

From registered tables:
```
df = ctx.table("your_table")
```

From various data sources:

# From CSV files (see :ref:`io_csv` for detailed options)
df = ctx.read_csv("path/to/data.csv")

# From Parquet files (see :ref:`io_parquet` for detailed options)
df = ctx.read_parquet("path/to/data.parquet")

# From JSON files (see :ref:`io_json` for detailed options)
df = ctx.read_json("path/to/data.json")

# From Avro files (see :ref:`io_avro` for detailed options)
df = ctx.read_avro("path/to/data.avro")

# From Pandas DataFrame
import pandas as pd
pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df = ctx.from_pandas(pandas_df)

# From Arrow data
import pyarrow as pa
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
    names=["a", "b"]
)
df = ctx.from_arrow(batch)

For detailed information about reading from different data sources, see the :doc:`I/O Guide <../user-guide/io/index>`. For custom data sources, see :ref:`io_custom_table_provider`.

Common DataFrame Operations

DataFusion's DataFrame API offers a wide range of operations:

from datafusion import column, literal

# Select specific columns
df = df.select("col1", "col2")

# Select with expressions
df = df.select(column("a") + column("b"), column("a") - column("b"))

# Filter rows
df = df.filter(column("age") > literal(25))

# Add computed columns
df = df.with_column("full_name", column("first_name") + literal(" ") + column("last_name"))

# Multiple column additions
df = df.with_columns(
    (column("a") + column("b")).alias("sum"),
    (column("a") * column("b")).alias("product")
)

# Sort data
df = df.sort(column("age").sort(ascending=False))

# Join DataFrames
df = df1.join(df2, on="user_id", how="inner")

# Aggregate data
from datafusion import functions as f
df = df.aggregate(
    [],  # Group by columns (empty for global aggregation)
    [f.sum(column("amount")).alias("total_amount")]
)

# Limit rows
df = df.limit(100)

# Drop columns
df = df.drop("temporary_column")

Terminal Operations

To materialize the results of your DataFrame operations:

# Collect all data as PyArrow RecordBatches
result_batches = df.collect()

# Convert to various formats
pandas_df = df.to_pandas()        # Pandas DataFrame
polars_df = df.to_polars()        # Polars DataFrame
arrow_table = df.to_arrow_table() # PyArrow Table
py_dict = df.to_pydict()          # Python dictionary
py_list = df.to_pylist()          # Python list of dictionaries

# Display results
df.show()                         # Print tabular format to console

# Count rows
count = df.count()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrame API

Overview

Creating DataFrames

Common DataFrame Operations

Terminal Operations

FilesExpand file tree

dataframe.rst

Latest commit

History

dataframe.rst

File metadata and controls

DataFrame API

Overview

Creating DataFrames

Common DataFrame Operations

Terminal Operations