BigQuery DataFrames (bigframes) is an open-source Python library that brings the power of distributed computing to your data science and data engineering workflows. By providing a familiar pandas and scikit-learn compatible API, BigFrames allows data scientists, data engineers, and data analysts to analyze, transform, and model massive datasets where they live—directly in BigQuery.
BigFrames eliminates the "data movement bottleneck." Instead of downloading large datasets to a local environment, BigFrames translates your Python code into optimized SQL, executing complex transformations across the BigQuery fleet.
- Petabyte-Scale Scalability: Effortlessly process datasets that far exceed local memory limits.
- Familiar Python Ecosystem: Use the same
read_gbq,groupby,merge, andpivot_tablefunctions you already know from pandas. - Integrated Machine Learning: Access BigQuery ML's powerful algorithms via a scikit-learn-like interface (
bigframes.ml), including seamless Gemini AI integration for generative AI workflows and MLOps. - Enterprise-Grade Security: Maintain data governance and security by keeping your data within the BigQuery perimeter.
- Hybrid Flexibility: Easily move between distributed BigQuery processing and local pandas analysis with
to_pandas().
BigQuery DataFrames is organized into specialized modules designed for the modern data stack, empowering big data analytics, AI/ML pipelines, and data engineering:
- :mod:`bigframes.pandas`: A high-performance, pandas-compatible API for scalable data exploration, cleaning, and transformation for data analysts.
- :mod:`bigframes.bigquery`: Specialized utilities for direct BigQuery resource management, including integrations with Gemini and other AI models in the :mod:`bigframes.bigquery.ai` submodule for data engineers and AI developers.
Install BigQuery DataFrames via pip:
pip install --upgrade bigframesThe following example demonstrates how to perform a distributed aggregation on a public dataset with millions of rows using just a few lines of Python:
import bigframes.pandas as bpd
# Initialize BigFrames and load a public dataset
df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")
# Perform familiar pandas operations that execute in the cloud
top_names = (
df.groupby("name")
.agg({"number": "sum"})
.sort_values("number", ascending=False)
.head(10)
)
# Bring the final, aggregated results back to local memory if needed
print(top_names.to_pandas()).. toctree::
:maxdepth: 2
:caption: User Documentation
user_guide/index
.. toctree::
:maxdepth: 2
:caption: API Reference
reference/index
supported_pandas_apis
.. toctree::
:maxdepth: 1
:caption: Community & Updates
changelog