performance of median and iqr compared to python libraries #3462

@lampretl

I'd like to compute the median (q_0.5) and the IQR (q_0.75 - q_0.25) of each column of a dataframe, efficiently and in parallel. Let's compare the three most used libraries:
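To pin down the two statistics being timed, here is a minimal pure-Python sketch on toy data. It is not part of the benchmarks: the helper name `median_and_iqr` and the two small columns are hypothetical, and it assumes the linear-interpolation quantile definition that the numpy and pandas calls use by default.

```python
from statistics import quantiles

def median_and_iqr(column):
    """Return (median, IQR) of one column, i.e. (q_0.5, q_0.75 - q_0.25)."""
    # method="inclusive" interpolates linearly between order statistics,
    # the same convention as the default quantile method in numpy/pandas.
    q25, q50, q75 = quantiles(column, n=4, method="inclusive")
    return q50, q75 - q25

# Hypothetical toy columns, small enough to check by hand.
columns = {"x0": [5, 1, 4, 2, 3], "x1": [40, 10, 30, 20]}
summary = {name: median_and_iqr(col) for name, col in columns.items()}
print(summary)
```

All three quartiles come out of a single pass over the sorted column here, so the median and the IQR share one sort instead of needing separate quantile calls.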

pandas:

import numpy as np, pandas as pd, scipy
n,m=10**8,10;   df = pd.DataFrame(np.random.rand(n,m))
%time df.median(axis=0)
%time df.quantile(0.5)
%time df.quantile(0.75)-df.quantile(0.25)
%time scipy.stats.iqr(df,axis=0)
CPU times: user 23.4 s, sys: 921 ms, total: 24.4 s
Wall time: 24.4 s
CPU times: user 20.3 s, sys: 830 ms, total: 21.1 s
Wall time: 21.2 s
CPU times: user 39.9 s, sys: 1.71 s, total: 41.6 s
Wall time: 41.6 s
CPU times: user 25.6 s, sys: 5.28 s, total: 30.9 s
Wall time: 31 s

polars:

import numpy as np, polars as pl
n,m=10**8,10;   df = pl.DataFrame(np.random.rand(n,m), schema=[f"x{k}" for k in range(m)])
%time df.median()
%time df.quantile(0.75,interpolation='linear')
%time df.quantile(0.75,interpolation='linear') - df.quantile(0.25,interpolation='linear')
CPU times: user 21.4 s, sys: 3.51 s, total: 24.9 s
Wall time: 2.95 s
CPU times: user 19.2 s, sys: 3.86 s, total: 23.1 s
Wall time: 2.95 s
CPU times: user 43.8 s, sys: 11.4 s, total: 55.2 s
Wall time: 6.44 s

DataFrames.jl + Julia:

using DataFrames, StatsBase
n,m=10^8,10;   df = DataFrame(rand(n,m), :auto);
function f1(df::DataFrame) ::Vector{Float64}  return map(median, eachcol(df)) end
function f2(df::DataFrame) ::Vector{Float64}  return map(iqr, eachcol(df)) end
function f3(df::DataFrame) ::Vector{Float64}  m=size(df,2);  res=fill(NaN,m);  Threads.@threads for j=1:m res[j] = median(df[:,j]) end; return res end
function f4(df::DataFrame) ::Vector{Float64}  m=size(df,2);  res=fill(NaN,m);  Threads.@threads for j=1:m res[j] = iqr(df[:,j]) end; return res end
@time f1(df);
@time f2(df);
@time f3(df);
@time f4(df);
14.686185 seconds (53 allocations: 14.901 GiB, 4.56% gc time)
86.758428 seconds (53 allocations: 7.451 GiB, 0.36% gc time)
8.259288 seconds (146 allocations: 22.352 GiB, 9.15% gc time)
50.395623 seconds (144 allocations: 14.901 GiB, 0.47% gc time)

Is there a better, more efficient way to compute medians and IQRs in Julia?
