The DataFrame class is the core abstraction in DataFusion that represents tabular data and operations
on that data. DataFrames provide a flexible API for transforming data through various operations such as
filtering, projection, aggregation, joining, and more.
A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
terminal operations like collect(), show(), or to_pandas() are called.
DataFrames can be created in several ways:
From SQL queries via a
SessionContext:from datafusion import SessionContext ctx = SessionContext() df = ctx.sql("SELECT * FROM your_table")
From registered tables:
df = ctx.table("your_table")
From various data sources:
# From CSV files (see :ref:`io_csv` for detailed options) df = ctx.read_csv("path/to/data.csv") # From Parquet files (see :ref:`io_parquet` for detailed options) df = ctx.read_parquet("path/to/data.parquet") # From JSON files (see :ref:`io_json` for detailed options) df = ctx.read_json("path/to/data.json") # From Avro files (see :ref:`io_avro` for detailed options) df = ctx.read_avro("path/to/data.avro") # From Pandas DataFrame import pandas as pd pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) df = ctx.from_pandas(pandas_df) # From Arrow data import pyarrow as pa batch = pa.RecordBatch.from_arrays( [pa.array([1, 2, 3]), pa.array([4, 5, 6])], names=["a", "b"] ) df = ctx.from_arrow(batch)
For detailed information about reading from different data sources, see the :doc:`I/O Guide <../user-guide/io/index>`. For custom data sources, see :ref:`io_custom_table_provider`.
DataFusion's DataFrame API offers a wide range of operations:
from datafusion import column, literal
# Select specific columns
df = df.select("col1", "col2")
# Select with expressions
df = df.select(column("a") + column("b"), column("a") - column("b"))
# Filter rows
df = df.filter(column("age") > literal(25))
# Add computed columns
df = df.with_column("full_name", column("first_name") + literal(" ") + column("last_name"))
# Multiple column additions
df = df.with_columns(
(column("a") + column("b")).alias("sum"),
(column("a") * column("b")).alias("product")
)
# Sort data
df = df.sort(column("age").sort(ascending=False))
# Join DataFrames
df = df1.join(df2, on="user_id", how="inner")
# Aggregate data
from datafusion import functions as f
df = df.aggregate(
[], # Group by columns (empty for global aggregation)
[f.sum(column("amount")).alias("total_amount")]
)
# Limit rows
df = df.limit(100)
# Drop columns
df = df.drop("temporary_column")To materialize the results of your DataFrame operations:
# Collect all data as PyArrow RecordBatches
result_batches = df.collect()
# Convert to various formats
pandas_df = df.to_pandas() # Pandas DataFrame
polars_df = df.to_polars() # Polars DataFrame
arrow_table = df.to_arrow_table() # PyArrow Table
py_dict = df.to_pydict() # Python dictionary
py_list = df.to_pylist() # Python list of dictionaries
# Display results
df.show() # Print tabular format to console
# Count rows
count = df.count()When working in Jupyter notebooks or other environments that support rich HTML display,
DataFusion DataFrames automatically render as nicely formatted HTML tables. This functionality
is provided by the _repr_html_ method, which is automatically called by Jupyter.
In a Jupyter environment, simply displaying a DataFrame object will trigger HTML rendering:
# Will display as HTML table in Jupyter
df
# Explicit display also uses HTML rendering
display(df)DataFusion provides extensive customization options for HTML table rendering through the
datafusion.html_formatter module.
You can customize how DataFrames are rendered by configuring the formatter:
from datafusion.html_formatter import configure_formatter
configure_formatter(
max_cell_length=30, # Maximum length of cell content before truncation
max_width=800, # Maximum width of table in pixels
max_height=400, # Maximum height of table in pixels
max_memory_bytes=2 * 1024 * 1024,# Maximum memory used for rendering (2MB)
min_rows_display=10, # Minimum rows to display
repr_rows=20, # Number of rows to display in representation
enable_cell_expansion=True, # Allow cells to be expandable on click
custom_css=None, # Custom CSS to apply
show_truncation_message=True, # Show message when data is truncated
style_provider=None, # Custom style provider class
use_shared_styles=True # Share styles across tables to reduce duplication
)For advanced styling needs, you can create a custom style provider class:
from datafusion.html_formatter import configure_formatter
class CustomStyleProvider:
def get_cell_style(self) -> str:
return "background-color: #f5f5f5; color: #333; padding: 8px; border: 1px solid #ddd;"
def get_header_style(self) -> str:
return "background-color: #4285f4; color: white; font-weight: bold; padding: 10px;"
# Apply custom styling
configure_formatter(style_provider=CustomStyleProvider())You can register custom formatters for specific data types:
from datafusion.html_formatter import get_formatter
formatter = get_formatter()
# Format integers with color based on value
def format_int(value):
return f'<span style="color: {"red" if value > 100 else "blue"}">{value}</span>'
formatter.register_formatter(int, format_int)
# Format date values
def format_date(value):
return f'<span class="date-value">{value.isoformat()}</span>'
formatter.register_formatter(datetime.date, format_date)For complete control over cell rendering:
formatter = get_formatter()
def custom_cell_builder(value, row, col, table_id):
try:
num_value = float(value)
if num_value > 0: # Positive values get green
return f'<td style="background-color: #d9f0d3">{value}</td>'
if num_value < 0: # Negative values get red
return f'<td style="background-color: #f0d3d3">{value}</td>'
except (ValueError, TypeError):
pass
# Default styling for non-numeric or zero values
return f'<td style="border: 1px solid #ddd">{value}</td>'
formatter.set_custom_cell_builder(custom_cell_builder)Similarly, you can customize the rendering of table headers:
def custom_header_builder(field):
tooltip = f"Type: {field.type}"
return f'<th style="background-color: #333; color: white" title="{tooltip}">{field.name}</th>'
formatter.set_custom_header_builder(custom_header_builder)Managing Formatter State -----------------------~
The HTML formatter maintains global state that can be managed:
from datafusion.html_formatter import reset_formatter, reset_styles_loaded_state, get_formatter
# Reset the formatter to default settings
reset_formatter()
# Reset only the styles loaded state (useful when styles were loaded but need reloading)
reset_styles_loaded_state()
# Get the current formatter instance to make changes
formatter = get_formatter()Advanced Example: Dashboard-Style Formatting ------------------------------------------~~
This example shows how to create a dashboard-like styling for your DataFrames:
from datafusion.html_formatter import configure_formatter, get_formatter
# Define custom CSS
custom_css = """
.datafusion-table {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
border-collapse: collapse;
width: 100%;
box-shadow: 0 2px 3px rgba(0,0,0,0.1);
}
.datafusion-table th {
position: sticky;
top: 0;
z-index: 10;
}
.datafusion-table tr:hover td {
background-color: #f1f7fa !important;
}
.datafusion-table .numeric-positive {
color: #0a7c00;
}
.datafusion-table .numeric-negative {
color: #d13438;
}
"""
class DashboardStyleProvider:
def get_cell_style(self) -> str:
return "padding: 8px 12px; border-bottom: 1px solid #e0e0e0;"
def get_header_style(self) -> str:
return ("background-color: #0078d4; color: white; font-weight: 600; "
"padding: 12px; text-align: left; border-bottom: 2px solid #005a9e;")
# Apply configuration
configure_formatter(
max_height=500,
enable_cell_expansion=True,
custom_css=custom_css,
style_provider=DashboardStyleProvider(),
max_cell_length=50
)
# Add custom formatters for numbers
formatter = get_formatter()
def format_number(value):
try:
num = float(value)
cls = "numeric-positive" if num > 0 else "numeric-negative" if num < 0 else ""
return f'<span class="{cls}">{value:,}</span>' if cls else f'{value:,}'
except (ValueError, TypeError):
return str(value)
formatter.register_formatter(int, format_number)
formatter.register_formatter(float, format_number)- Memory Management: For large datasets, use
max_memory_bytesto limit memory usage. - Responsive Design: Set reasonable
max_widthandmax_heightvalues to ensure tables display well on different screens. - Style Optimization: Use
use_shared_styles=Trueto avoid duplicate style definitions when displaying multiple tables. - Reset When Needed: Call
reset_formatter()when you want to start fresh with default settings. - Cell Expansion: Use
enable_cell_expansion=Truewhen cells might contain longer content that users may want to see in full.
- :doc:`../user-guide/dataframe` - Complete guide to using DataFrames
- :doc:`../user-guide/io/index` - I/O Guide for reading data from various sources
- :doc:`../user-guide/data-sources` - Comprehensive data sources guide
- :ref:`io_csv` - CSV file reading
- :ref:`io_parquet` - Parquet file reading
- :ref:`io_json` - JSON file reading
- :ref:`io_avro` - Avro file reading
- :ref:`io_custom_table_provider` - Custom table providers
- API Reference - Full API reference