Skip to content

faster bmerge numeric and roll#5187

Draft
mattdowle wants to merge 25 commits into
masterfrom
bmerge_numeric
Draft

faster bmerge numeric and roll#5187
mattdowle wants to merge 25 commits into
masterfrom
bmerge_numeric

Conversation

@mattdowle

@mattdowle mattdowle commented Sep 30, 2021

Copy link
Copy Markdown
Member

Thanks to a report comparing a single column case to base::findInterval, I had a look.
Some low hanging fruit was that INHERITS() (an R API call looking up class attribute and looping through the class vector) was being calling deeply in the recursive call to support integer64. Also the XIND branch, although a trinary, is called in many places and the branch could be raised. dtwiddle() has long been sand in the cogs given that the default rounding is 0 anyway, and there was a todo to get NA for double out of the way up front. Tackling those yielded a 2x speedup. There's too much recursion in the case of a large table rolling to small (many rows rolling to the same row) and I'm hoping for another 2x speedup on that.
Related #2490

@mattdowle mattdowle added the WIP label Sep 30, 2021
@mattdowle mattdowle added this to the 1.14.3 milestone Sep 30, 2021
@codecov

codecov Bot commented Sep 30, 2021

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.85%. Comparing base (deb9acc) to head (8e8cbbe).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5187   +/-   ##
=======================================
  Coverage   98.85%   98.85%           
=======================================
  Files          87       87           
  Lines       17140    17176   +36     
=======================================
+ Hits        16944    16980   +36     
  Misses        196      196           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@MichaelChirico

Copy link
Copy Markdown
Member

LMK if you'd like an extra pair of eyes on this / when it's ready

@mattdowle

Copy link
Copy Markdown
Member Author

Status so far in the branch on this particular large-to-small benchmark. Low hanging fruit without tackling the too-much-recursion yet. The code rework allows the too-much-recursion to be done next.

setDTthreads(1)
n <- 1e7 
x.df <- data.frame(date = as.POSIXct("2000-1-1") + 1:n, price = rnorm(n))
x.dt <- data.table(x.df, key = "date")

n2 <- 100
y.df <- data.frame(date = sort(sample(x.df$date, n2)), adjustment = runif(n2))
y.dt <- data.table(y.df, key = "date")

falign <- function(x, y) {
  i <- findInterval(as.numeric(x$date), as.numeric(y$date))
  i[which(i == 0)] <- NA
  x$price.adj <- x[, "price"] + y[i, "adjustment"]
  x
}

falign.dt <- function(x, y) {
  i <- findInterval(as.numeric(x$date), as.numeric(y$date))
  i[which(i == 0)] <- NA
  x[, price.adj := x[, "price"] + y[i, "adjustment"]]
  x
}

microbenchmark(
  falign(x.df, y.df),
  falign.dt(x.dt, y.dt),
  x.dt[, price.adj := y.dt[x.dt, price+adjustment, roll=TRUE]],
  times=10, unit="s"
))
# v1.14.0
Unit: seconds
                                                              expr       min        lq      mean    median        uq       max neval
                                                falign(x.df, y.df) 0.1488798 0.1542454 0.1617588 0.1593861 0.1620447 0.1992783    10
                                             falign.dt(x.dt, y.dt) 0.2815458 0.2901276 0.2958419 0.2960433 0.3009906 0.3185344    10
 x.dt[, `:=`(price.adj, y.dt[x.dt, price + adjustment, roll = T])] 0.8177566 0.8206021 0.8365346 0.8349606 0.8461528 0.8593044    10

# this branch as of now (3rd line down from 0.83 to 0.57 but still a lot slower than `findInterval`) : 
                                                              expr       min        lq      mean    median        uq       max neval
                                                falign(x.df, y.df) 0.1460495 0.1486709 0.1554336 0.1537538 0.1616508 0.1732326    10
                                             falign.dt(x.dt, y.dt) 0.2766880 0.3006418 0.3010593 0.3025785 0.3060092 0.3161539    10
 x.dt[, `:=`(price.adj, y.dt[x.dt, price + adjustment, roll = T])] 0.5418650 0.5635949 0.5735462 0.5717389 0.5876390 0.5971523    10

@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@jangorecki jangorecki modified the milestones: 1.15.0, 1.15.1 Nov 6, 2023
@MichaelChirico

Copy link
Copy Markdown
Member

Anyone want to take over this PR? @jangorecki and @tlapak are the only two I see with commits to this file in recent memory.

@jangorecki

Copy link
Copy Markdown
Member

This PR will have conflicts with mergelist, which has been already merged and conflicts resolved to recent master

@tdhock

tdhock commented Jan 8, 2024

Copy link
Copy Markdown
Member

@Anirban166 @DorisAmoakohene please investigate speedups with atime

@MichaelChirico

Copy link
Copy Markdown
Member

Marking as draft over label:WIP

@MichaelChirico MichaelChirico marked this pull request as draft February 19, 2024 02:45
@MichaelChirico MichaelChirico modified the milestones: 1.16.0, 1.17.0 Jul 10, 2024
@MichaelChirico MichaelChirico modified the milestones: 1.17.0, 1.18.0 Dec 3, 2024
@jangorecki jangorecki modified the milestones: 1.18.0, 1.19.0 Nov 30, 2025
@github-actions

Copy link
Copy Markdown

No obvious timing issues in HEAD=bmerge_numeric
Comparison Plot

Generated via commit 8e8cbbe

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 6 minutes and 28 seconds
Installing different package versions 11 minutes and 40 seconds
Running and plotting the test cases 5 minutes and 18 seconds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants