The DataFrame class is the core abstraction in DataFusion that represents tabular data and operations
on that data. DataFrames provide a flexible API for transforming data through various operations such as
filtering, projection, aggregation, joining, and more.
A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
terminal operations like collect(), show(), or to_pandas() are called.
DataFrames can be created in several ways:
From SQL queries via a
SessionContext:from datafusion import SessionContext ctx = SessionContext() df = ctx.sql("SELECT * FROM your_table")
From registered tables:
df = ctx.table("your_table")
From various data sources:
# From CSV files (see :ref:`io_csv` for detailed options) df = ctx.read_csv("path/to/data.csv") # From Parquet files (see :ref:`io_parquet` for detailed options) df = ctx.read_parquet("path/to/data.parquet") # From JSON files (see :ref:`io_json` for detailed options) df = ctx.read_json("path/to/data.json") # From Avro files (see :ref:`io_avro` for detailed options) df = ctx.read_avro("path/to/data.avro") # From Pandas DataFrame import pandas as pd pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) df = ctx.from_pandas(pandas_df) # From Arrow data import pyarrow as pa batch = pa.RecordBatch.from_arrays( [pa.array([1, 2, 3]), pa.array([4, 5, 6])], names=["a", "b"] ) df = ctx.from_arrow(batch)
For detailed information about reading from different data sources, see the :doc:`I/O Guide <../user-guide/io/index>`. For custom data sources, see :ref:`io_custom_table_provider`.
DataFusion's DataFrame API offers a wide range of operations:
from datafusion import column, literal
# Select specific columns
df = df.select("col1", "col2")
# Select with expressions
df = df.select(column("a") + column("b"), column("a") - column("b"))
# Filter rows
df = df.filter(column("age") > literal(25))
# Add computed columns
df = df.with_column("full_name", column("first_name") + literal(" ") + column("last_name"))
# Multiple column additions
df = df.with_columns(
(column("a") + column("b")).alias("sum"),
(column("a") * column("b")).alias("product")
)
# Sort data
df = df.sort(column("age").sort(ascending=False))
# Join DataFrames
df = df1.join(df2, on="user_id", how="inner")
# Aggregate data
from datafusion import functions as f
df = df.aggregate(
[], # Group by columns (empty for global aggregation)
[f.sum(column("amount")).alias("total_amount")]
)
# Limit rows
df = df.limit(100)
# Drop columns
df = df.drop("temporary_column")To materialize the results of your DataFrame operations:
# Collect all data as PyArrow RecordBatches
result_batches = df.collect()
# Convert to various formats
pandas_df = df.to_pandas() # Pandas DataFrame
polars_df = df.to_polars() # Polars DataFrame
arrow_table = df.to_arrow_table() # PyArrow Table
py_dict = df.to_pydict() # Python dictionary
py_list = df.to_pylist() # Python list of dictionaries
# Display results
df.show() # Print tabular format to console
# Count rows
count = df.count()