BigQuery DataFrames (bigframes) is an open-source Python library that brings the power of distributed computing to your data science workflow. By providing a familiar pandas and scikit-learn compatible API, BigFrames allows you to analyze and model massive datasets where they live—directly in BigQuery.
BigFrames eliminates the "data movement bottleneck." Instead of downloading large datasets to a local environment, BigFrames translates your Python code into optimized SQL, executing complex transformations across the BigQuery fleet.
- Petabyte-Scale Scalability: Effortlessly process datasets that far exceed local memory limits.
- Familiar Python Ecosystem: Use the same
read_gbq,groupby,merge, andpivot_tablefunctions you already know from pandas. - Integrated Machine Learning: Access BigQuery ML's powerful algorithms via a scikit-learn-like interface (
bigframes.ml), including seamless Gemini AI integration. - Enterprise-Grade Security: Maintain data governance and security by keeping your data within the BigQuery perimeter.
- Hybrid Flexibility: Easily move between distributed BigQuery processing and local pandas analysis with
to_pandas().
BigQuery DataFrames is organized into specialized modules designed for the modern data stack:
- :mod:`bigframes.pandas`: A high-performance, pandas-compatible API for scalable data exploration, cleaning, and transformation.
- :mod:`bigframes.bigquery`: Specialized utilities for direct BigQuery resource management, including integrations with Gemini and other AI models in the :mod:`bigframes.bigquery.ai` submodule.
Install BigQuery DataFrames via pip:
pip install --upgrade bigframesThe following example demonstrates how to perform a distributed aggregation on a public dataset with millions of rows using just a few lines of Python:
import bigframes.pandas as bpd
# Initialize BigFrames and load a public dataset
df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")
# Perform familiar pandas operations that execute in the cloud
top_names = (
df.groupby("name")
.agg({"number": "sum"})
.sort_values("number", ascending=False)
.head(10)
)
# Bring the final, aggregated results back to local memory if needed
print(top_names.to_pandas()).. toctree::
:maxdepth: 2
:caption: User Documentation
user_guide/index
.. toctree::
:maxdepth: 2
:caption: API Reference
reference/index
supported_pandas_apis
.. toctree::
:maxdepth: 1
:caption: Community & Updates
changelog