DataFusion provides a large number of built-in functions for performing complex queries without requiring user-defined functions. In here we will cover some of the more popular use cases. If you want to view all the functions go to the :py:mod:`Functions <datafusion.functions>` API Reference.
We'll use the pokemon dataset in the following examples.
.. ipython:: python
from datafusion import SessionContext
ctx = SessionContext()
ctx.register_csv("pokemon", "pokemon.csv")
df = ctx.table("pokemon")
DataFusion offers mathematical functions such as :py:func:`~datafusion.functions.pow` or :py:func:`~datafusion.functions.log`
.. ipython:: python
from datafusion import col, literal, string_literal, str_lit
from datafusion import functions as f
df.select(
f.pow(col('"Attack"'), literal(2)) - f.pow(col('"Defense"'), literal(2))
).limit(10)
There 3 conditional functions in DataFusion :py:func:`~datafusion.functions.coalesce`, :py:func:`~datafusion.functions.nullif` and :py:func:`~datafusion.functions.case`.
.. ipython:: python
df.select(
f.coalesce(col('"Type 1"'), col('"Type 2"')).alias("dominant_type")
).limit(10)
For selecting the current time use :py:func:`~datafusion.functions.now`
.. ipython:: python
df.select(f.now())
Convert to timestamps using :py:func:`~datafusion.functions.to_timestamp`
.. ipython:: python
df.select(f.to_timestamp(col('"Total"')).alias("timestamp"))
Extracting parts of a date using :py:func:`~datafusion.functions.date_part` (alias :py:func:`~datafusion.functions.extract`)
.. ipython:: python
df.select(
f.date_part(literal("month"), f.to_timestamp(col('"Total"'))).alias("month"),
f.extract(literal("day"), f.to_timestamp(col('"Total"'))).alias("day")
)
In the field of data science, working with textual data is a common task. To make string manipulation easier, DataFusion offers a range of helpful options.
.. ipython:: python
df.select(
f.char_length(col('"Name"')).alias("len"),
f.lower(col('"Name"')).alias("lower"),
f.left(col('"Name"'), literal(4)).alias("code")
)
This also includes the functions for regular expressions like :py:func:`~datafusion.functions.regexp_replace` and :py:func:`~datafusion.functions.regexp_match`
.. ipython:: python
df.select(
f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
)
Casting expressions to different data types using :py:func:`~datafusion.functions.arrow_cast`
.. ipython:: python
df.select(
f.arrow_cast(col('"Total"'), string_literal("Float64")).alias("total_as_float"),
f.arrow_cast(col('"Total"'), str_lit("Int32")).alias("total_as_int")
)
The function :py:func:`~datafusion.functions.in_list` allows to check a column for the presence of multiple values:
.. ipython:: python
types = [literal("Grass"), literal("Fire"), literal("Water")]
(
df.select(f.in_list(col('"Type 1"'), types, negated=False).alias("basic_types"))
.limit(20)
.to_pandas()
)
DataFusion provides methods to handle missing values in DataFrames:
The fill_null() method replaces NULL values in specified columns with a provided value:
# Fill all NULL values with 0 where possible
df = df.fill_null(0)
# Fill NULL values only in specific string columns
df = df.fill_null("missing", subset=["name", "category"])The fill value will be cast to match each column's type. If casting fails for a column, that column remains unchanged.
The fill_nan() method replaces NaN values in floating-point columns with a provided numeric value:
# Fill all NaN values with 0 in numeric columns
df = df.fill_nan(0)
# Fill NaN values in specific numeric columns
df = df.fill_nan(99.9, subset=["price", "score"])This only works on floating-point columns (float32, float64). The fill value must be numeric (int or float).