Skip to content

Latest commit

 

History

History
167 lines (109 loc) · 5.01 KB

File metadata and controls

167 lines (109 loc) · 5.01 KB

Functions

DataFusion provides a large number of built-in functions for performing complex queries without requiring user-defined functions. In here we will cover some of the more popular use cases. If you want to view all the functions go to the :py:mod:`Functions <datafusion.functions>` API Reference.

We'll use the pokemon dataset in the following examples.

.. ipython:: python

    from datafusion import SessionContext

    ctx = SessionContext()
    ctx.register_csv("pokemon", "pokemon.csv")
    df = ctx.table("pokemon")

Mathematical

DataFusion offers mathematical functions such as :py:func:`~datafusion.functions.pow` or :py:func:`~datafusion.functions.log`

.. ipython:: python

    from datafusion import col, literal, string_literal, str_lit
    from datafusion import functions as f

    df.select(
        f.pow(col('"Attack"'), literal(2)) - f.pow(col('"Defense"'), literal(2))
    ).limit(10)


Conditional

There 3 conditional functions in DataFusion :py:func:`~datafusion.functions.coalesce`, :py:func:`~datafusion.functions.nullif` and :py:func:`~datafusion.functions.case`.

.. ipython:: python

    df.select(
        f.coalesce(col('"Type 1"'), col('"Type 2"')).alias("dominant_type")
    ).limit(10)

Temporal

For selecting the current time use :py:func:`~datafusion.functions.now`

.. ipython:: python

    df.select(f.now())

Convert to timestamps using :py:func:`~datafusion.functions.to_timestamp`

.. ipython:: python

    df.select(f.to_timestamp(col('"Total"')).alias("timestamp"))

Extracting parts of a date using :py:func:`~datafusion.functions.date_part` (alias :py:func:`~datafusion.functions.extract`)

.. ipython:: python

     df.select(
        f.date_part(literal("month"), f.to_timestamp(col('"Total"'))).alias("month"),
        f.extract(literal("day"), f.to_timestamp(col('"Total"'))).alias("day")
     )

String

In the field of data science, working with textual data is a common task. To make string manipulation easier, DataFusion offers a range of helpful options.

.. ipython:: python

    df.select(
        f.char_length(col('"Name"')).alias("len"),
        f.lower(col('"Name"')).alias("lower"),
        f.left(col('"Name"'), literal(4)).alias("code")
    )

This also includes the functions for regular expressions like :py:func:`~datafusion.functions.regexp_replace` and :py:func:`~datafusion.functions.regexp_match`

.. ipython:: python

    df.select(
        f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
        f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
    )

Casting

Casting expressions to different data types using :py:func:`~datafusion.functions.arrow_cast`

.. ipython:: python

    df.select(
        f.arrow_cast(col('"Total"'), string_literal("Float64")).alias("total_as_float"),
        f.arrow_cast(col('"Total"'), str_lit("Int32")).alias("total_as_int")
    )

Other

The function :py:func:`~datafusion.functions.in_list` allows to check a column for the presence of multiple values:

.. ipython:: python

    types = [literal("Grass"), literal("Fire"), literal("Water")]
    (
        df.select(f.in_list(col('"Type 1"'), types, negated=False).alias("basic_types"))
          .limit(20)
          .to_pandas()
    )


Handling Missing Values

DataFusion provides methods to handle missing values in DataFrames:

fill_null

The fill_null() method replaces NULL values in specified columns with a provided value:

# Fill all NULL values with 0 where possible
df = df.fill_null(0)

# Fill NULL values only in specific string columns
df = df.fill_null("missing", subset=["name", "category"])

The fill value will be cast to match each column's type. If casting fails for a column, that column remains unchanged.

fill_nan

The fill_nan() method replaces NaN values in floating-point columns with a provided numeric value:

# Fill all NaN values with 0 in numeric columns
df = df.fill_nan(0)

# Fill NaN values in specific numeric columns
df = df.fill_nan(99.9, subset=["price", "score"])

This only works on floating-point columns (float32, float64). The fill value must be numeric (int or float).