Scalable Python Data Analysis with BigQuery DataFrames (BigFrames)

BigQuery DataFrames (bigframes) is an open-source Python library that brings the power of distributed computing to your data science and data engineering workflows. By providing a familiar pandas and scikit-learn compatible API, BigFrames allows data scientists, data engineers, and data analysts to analyze, transform, and model massive datasets where they live—directly in BigQuery.

Why Choose BigQuery DataFrames?

BigFrames eliminates the "data movement bottleneck." Instead of downloading large datasets to a local environment, BigFrames translates your Python code into optimized SQL, executing complex transformations across the BigQuery fleet.

Petabyte-Scale Scalability: Effortlessly process datasets that far exceed local memory limits.
Familiar Python Ecosystem: Use the same read_gbq, groupby, merge, and pivot_table functions you already know from pandas.
Integrated Machine Learning: Access BigQuery ML's powerful algorithms via a scikit-learn-like interface (bigframes.ml), including seamless Gemini AI integration for generative AI workflows and MLOps.
Enterprise-Grade Security: Maintain data governance and security by keeping your data within the BigQuery perimeter.
Hybrid Flexibility: Easily move between distributed BigQuery processing and local pandas analysis with to_pandas().

Core Components of BigFrames

BigQuery DataFrames is organized into specialized modules designed for the modern data stack, empowering big data analytics, AI/ML pipelines, and data engineering:

:mod:`bigframes.pandas`: A high-performance, pandas-compatible API for scalable data exploration, cleaning, and transformation for data analysts.
:mod:`bigframes.bigquery`: Specialized utilities for direct BigQuery resource management, including integrations with Gemini and other AI models in the :mod:`bigframes.bigquery.ai` submodule for data engineers and AI developers.

Quickstart: Scalable Data Analysis in Seconds

Install BigQuery DataFrames via pip:

pip install --upgrade bigframes

The following example demonstrates how to perform a distributed aggregation on a public dataset with millions of rows using just a few lines of Python:

import bigframes.pandas as bpd

# Initialize BigFrames and load a public dataset
df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_2013")

# Perform familiar pandas operations that execute in the cloud
top_names = (
    df.groupby("name")
    .agg({"number": "sum"})
    .sort_values("number", ascending=False)
    .head(10)
)

# Bring the final, aggregated results back to local memory if needed
print(top_names.to_pandas())

Explore the Documentation

.. toctree::
    :maxdepth: 2
    :caption: User Documentation

    user_guide/index

.. toctree::
    :maxdepth: 2
    :caption: API Reference

    reference/index
    supported_pandas_apis

.. toctree::
    :maxdepth: 1
    :caption: Community & Updates

    changelog

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scalable Python Data Analysis with BigQuery DataFrames (BigFrames)

Why Choose BigQuery DataFrames?

Core Components of BigFrames

Quickstart: Scalable Data Analysis in Seconds

Explore the Documentation

FilesExpand file tree

index.rst

Latest commit

History

index.rst

File metadata and controls

Scalable Python Data Analysis with BigQuery DataFrames (BigFrames)

Why Choose BigQuery DataFrames?

Core Components of BigFrames

Quickstart: Scalable Data Analysis in Seconds

Explore the Documentation