In this section, we will cover a basic example to introduce a few key concepts.
import datafusion
from datafusion import col
import pyarrow
# create a context
ctx = datafusion.SessionContext()
# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])
# create a new statement
df = df.select(
col("a") + col("b"),
col("a") - col("b"),
)
# execute and collect the first (and only) batch
result = df.collect()[0]The first statement group:
# create a context
ctx = datafusion.SessionContext()creates a :py:class:`~datafusion.context.SessionContext`, that is, the main interface for executing queries with DataFusion. It maintains the state of the connection between a user and an instance of the DataFusion engine. Additionally it provides the following functionality:
- Create a DataFrame from a CSV or Parquet data source.
- Register a CSV or Parquet data source as a table that can be referenced from a SQL query.
- Register a custom data source that can be referenced from a SQL query.
- Execute a SQL query
The second statement group creates a DataFrame,
# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])A DataFrame refers to a (logical) set of rows that share the same column names, similar to a Pandas DataFrame.
DataFrames are typically created by calling a method on :py:class:`~datafusion.context.SessionContext`, such as read_csv, and can then be modified by
calling the transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`,
and :py:func:`~datafusion.dataframe.DataFrame.limit` to build up a query definition.
The third statement uses Expressions to build up a query definition.
df = df.select(
col("a") + col("b"),
col("a") - col("b"),
)Finally the :py:func:`~datafusion.dataframe.DataFrame.collect` method converts the logical plan represented by the DataFrame into a physical plan and execute it, collecting all results into a list of RecordBatch.