Skip to content

Processing on a laptop with polars streaming API (untested on full dataset) #4

@marius-mather

Description

@marius-mather

I've only had time (and disk space) to try this on 10 billion rows, but on my M1 Macbook (with 10 cores), I can process the 10 billion rows in 85 seconds. As far as I can tell, the time is scaling more or less linearly with increasing size, which would put 1 trillion rows at roughly 2.5 hours. This could probably be brought down with more cores. But as far as I can tell, it makes processing the dataset viable on a single laptop.

If anyone is able to give it a try on the full dataset, please do!

import glob
import time
import polars as pl


def aggregate(df: pl.LazyFrame) -> pl.LazyFrame:
    return (
        df.group_by("station")
        .agg(pl.col("measure").min().alias("min"),
             pl.col("measure").max().alias("max"),
             pl.col("measure").mean().alias("mean"))
        .sort("station")
    )


if __name__ == "__main__":
    print("Polars config:", pl.Config.state())
    start = time.time()
    files = glob.glob("data/*.parquet")
    data = pl.scan_parquet(list(files), rechunk=True)
    query = aggregate(data)
    result = query.collect(streaming=True)
    end = time.time()
    print(f"Time: {end - start}")
    print(result)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions