Skip to content

Latest commit

 

History

History
64 lines (46 loc) · 2.28 KB

File metadata and controls

64 lines (46 loc) · 2.28 KB

SQL

DataFusion also offers a SQL API, read the full reference here

.. ipython:: python

    import datafusion
    from datafusion import col
    import pyarrow

    # create a context
    ctx = datafusion.SessionContext()

    # register a CSV
    ctx.register_csv('pokemon', 'pokemon.csv')

    # create a new statement via SQL
    df = ctx.sql('SELECT "Attack"+"Defense", "Attack"-"Defense" FROM pokemon')

    # collect and convert to pandas DataFrame
    df.to_pandas()

Automatic variable registration

You can opt-in to DataFusion automatically registering Arrow-compatible Python objects that appear in SQL queries. This removes the need to call register_* helpers explicitly when working with in-memory data structures.

import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext(auto_register_python_objects=True)

orders = pa.Table.from_pydict({"item": ["apple", "pear"], "qty": [5, 2]})

result = ctx.sql("SELECT item, qty FROM orders WHERE qty > 2")
print(result.to_pandas())

The feature inspects the call stack for variables whose names match missing tables and registers them if they expose Arrow data (including pandas and Polars DataFrames). Existing contexts can enable or disable the behavior at runtime through :py:meth:`SessionContext.set_python_table_lookup` or by passing auto_register_python_objects when constructing the session.