DataFrames

Overview

The DataFrame class is the core abstraction in DataFusion that represents tabular data and operations on that data. DataFrames provide a flexible API for transforming data through various operations such as filtering, projection, aggregation, joining, and more.

A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when terminal operations like collect(), show(), or to_pandas() are called.

This page follows that lifecycle: first creating a DataFrame, then building a query plan with common operations, and finally executing the plan to inspect or export results. More specialized topics live on their own pages, listed under Related Topics.

Creating DataFrames

DataFrames can be created in several ways:

  • From SQL queries via a SessionContext:

    from datafusion import SessionContext
    
    ctx = SessionContext()
    df = ctx.sql("SELECT * FROM your_table")
  • From registered tables:

    df = ctx.table("your_table")
  • From various data sources:

    # From CSV files (see :ref:`io_csv` for detailed options)
    df = ctx.read_csv("path/to/data.csv")
    
    # From Parquet files (see :ref:`io_parquet` for detailed options)
    df = ctx.read_parquet("path/to/data.parquet")
    
    # From JSON files (see :ref:`io_json` for detailed options)
    df = ctx.read_json("path/to/data.json")
    
    # From Avro files (see :ref:`io_avro` for detailed options)
    df = ctx.read_avro("path/to/data.avro")
    
    # From Pandas DataFrame
    import pandas as pd
    pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df = ctx.from_pandas(pandas_df)
    
    # From Arrow data
    import pyarrow as pa
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
        names=["a", "b"]
    )
    df = ctx.from_arrow(batch)

For detailed information about reading from different data sources, see the :doc:`I/O Guide <../io/index>`. For custom data sources, see :ref:`io_custom_table_provider`.

Common DataFrame Operations

DataFusion's DataFrame API offers a wide range of operations:

from datafusion import column, literal

# Select specific columns
df = df.select("col1", "col2")

# Select with expressions
df = df.select(column("a") + column("b"), column("a") - column("b"))

# Filter rows (expressions or SQL strings)
df = df.filter(column("age") > literal(25))
df = df.filter("age > 25")

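# Add computed columns (strings are joined with concat(); "+" is arithmetic)
from datafusion import functions as f
df = df.with_column("full_name", f.concat(column("first_name"), literal(" "), column("last_name")))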

# Multiple column additions
df = df.with_columns(
    (column("a") + column("b")).alias("sum"),
    (column("a") * column("b")).alias("product")
)

# Sort data
df = df.sort(column("age").sort(ascending=False))

# Join DataFrames
df = df1.join(df2, on="user_id", how="inner")

# Aggregate data
from datafusion import functions as f
df = df.aggregate(
    [],  # Group by columns (empty for global aggregation)
    [f.sum(column("amount")).alias("total_amount")]
)

# Limit rows
df = df.limit(100)

# Drop columns
df = df.drop("temporary_column")

Column Names as Function Arguments

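Some DataFrame methods accept plain column names when an argument refers to an existing column, for example :py:meth:`~datafusion.DataFrame.sort` and the grouping argument of :py:meth:`~datafusion.DataFrame.aggregate`.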

See the full function documentation for details on any specific function.

Note that :py:meth:`~datafusion.DataFrame.join_on` expects col()/column() expressions rather than plain strings.

For such methods, you can pass column names directly:

from datafusion import col, functions as f

df.sort('id')
df.aggregate('id', [f.count(col('value'))])

The same operation can also be written with explicit column expressions, using either col() or column():

from datafusion import col, column, functions as f

df.sort(col('id'))
df.aggregate(column('id'), [f.count(col('value'))])

Note that column() is an alias of col(), so you can use either name; the example above shows both in action.

Whenever an argument represents an expression, such as in :py:meth:`~datafusion.DataFrame.filter` or :py:meth:`~datafusion.DataFrame.with_column`, use col() to reference columns. The comparison and arithmetic operators on Expr automatically convert any non-Expr value into a literal expression, so writing

from datafusion import col
df.filter(col("age") > 21)

is equivalent to using lit(21) explicitly. Use lit() (also available as literal()) when you need to construct a literal expression directly.

Terminal Operations

To materialize the results of your DataFrame operations:

# Collect all data as PyArrow RecordBatches
result_batches = df.collect()

# Convert to various formats
pandas_df = df.to_pandas()        # Pandas DataFrame
polars_df = df.to_polars()        # Polars DataFrame
arrow_table = df.to_arrow_table() # PyArrow Table
py_dict = df.to_pydict()          # Python dictionary
py_list = df.to_pylist()          # Python list of dictionaries

# Display results
df.show()                         # Print tabular format to console

# Count rows
count = df.count()

# Collect a single column of data as a PyArrow Array
arr = df.collect_column("age")

Related Topics

DataFusion DataFrames implement the __arrow_c_stream__ protocol and can also be displayed in notebooks, streamed to Arrow-based libraries, and inspected after execution. These topics are covered separately so this overview can stay focused on the DataFrame lifecycle:

  • :doc:`arrow-interface` covers lazy PyArrow interop, direct RecordBatchStream usage, and partitioned stream processing.
  • :doc:`rendering` covers notebook and terminal display behavior, formatting options, and custom styling.
  • :doc:`execution-metrics` covers per-operator runtime statistics after execution, including row counts and compute time.

Core Classes

DataFrame

The main DataFrame class for building and executing queries.

See: :py:class:`datafusion.DataFrame`

SessionContext

The primary entry point for creating DataFrames from various data sources.

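Key methods for DataFrame creation, all shown earlier on this page, include sql(), table(), read_csv(), read_parquet(), read_json(), read_avro(), from_pandas(), and from_arrow().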

See: :py:class:`datafusion.SessionContext`

Expression Classes

Expr

Represents expressions that can be used in DataFrame operations.

See: :py:class:`datafusion.Expr`

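Functions for creating expressions include col() (also available as column()) for column references and lit() (also available as literal()) for literal values.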

Built-in Functions

DataFusion provides many built-in functions for data manipulation, such as the sum() and count() aggregates used earlier on this page.

For a complete list of available functions, see the :py:mod:`datafusion.functions` module documentation.

.. toctree::
   :maxdepth: 1

   arrow-interface
   rendering
   execution-metrics