In this section, we will cover a basic example to introduce a few key concepts. We will use the [2021 Yellow Taxi Trip Records](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet) dataset from the [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

.. ipython:: python

    from datafusion import SessionContext, col, lit, functions as f

    ctx = SessionContext()

    df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

    df = df.select(
        "trip_distance",
        col("total_amount").alias("total"),
        (f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
    )

    df.show()

The first statement group creates a :py:class:`~datafusion.context.SessionContext`.

.. code-block:: python

    # create a context
    ctx = SessionContext()

A Session Context is the main interface for executing queries with DataFusion. It maintains the state of the connection between a user and an instance of the DataFusion engine. Additionally, it provides the following functionality:

- Create a DataFrame from a data source.
- Register a data source as a table that can be referenced from a SQL query.
- Execute a SQL query.

The second statement group creates a DataFrame.

.. code-block:: python

    # Create a DataFrame from a file
    df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

A DataFrame refers to a (logical) set of rows that share the same column names, similar to a pandas DataFrame.
DataFrames are typically created by calling a method on :py:class:`~datafusion.context.SessionContext`, such as ``read_csv`` or ``read_parquet``, and can then be modified by
calling transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`,
and :py:func:`~datafusion.dataframe.DataFrame.limit`, to build up a query definition.

The third statement uses Expressions to build up a query definition. You can find
explanations of what the functions below do in the user documentation for
:py:func:`~datafusion.col`, :py:func:`~datafusion.lit`, :py:func:`~datafusion.functions.round`,
and :py:func:`~datafusion.expr.Expr.alias`.

.. code-block:: python

    df = df.select(
        "trip_distance",
        col("total_amount").alias("total"),
        (f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
    )

Finally, the :py:func:`~datafusion.dataframe.DataFrame.show` method converts the logical plan represented by the DataFrame into a physical plan and executes it, collecting all results and displaying them to the user. It is important to note that DataFusion evaluates DataFrames lazily: until you call a method such as :py:func:`~datafusion.dataframe.DataFrame.show` or :py:func:`~datafusion.dataframe.DataFrame.collect`, DataFusion will not run the query.