faster bmerge numeric and roll#5187
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #5187 +/- ##
=======================================
Coverage 98.85% 98.85%
=======================================
Files 87 87
Lines 17140 17176 +36
=======================================
+ Hits 16944 16980 +36
Misses 196 196 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
LMK if you'd like an extra pair of eyes on this / when it's ready |
|
Status so far in the branch on this particular large-to-small benchmark. Low hanging fruit without tackling the too-much-recursion yet. The code rework allows the too-much-recursion to be done next. setDTthreads(1)
n <- 1e7
x.df <- data.frame(date = as.POSIXct("2000-1-1") + 1:n, price = rnorm(n))
x.dt <- data.table(x.df, key = "date")
n2 <- 100
y.df <- data.frame(date = sort(sample(x.df$date, n2)), adjustment = runif(n2))
y.dt <- data.table(y.df, key = "date")
falign <- function(x, y) {
i <- findInterval(as.numeric(x$date), as.numeric(y$date))
i[which(i == 0)] <- NA
x$price.adj <- x[, "price"] + y[i, "adjustment"]
x
}
falign.dt <- function(x, y) {
i <- findInterval(as.numeric(x$date), as.numeric(y$date))
i[which(i == 0)] <- NA
x[, price.adj := x[, "price"] + y[i, "adjustment"]]
x
}
microbenchmark(
falign(x.df, y.df),
falign.dt(x.dt, y.dt),
x.dt[, price.adj := y.dt[x.dt, price+adjustment, roll=TRUE]],
times=10, unit="s"
)) |
|
Anyone want to take over this PR? @jangorecki and @tlapak are the only two I see with commits to this file in recent memory. |
|
This PR will have conflicts with mergelist, which has been already merged and conflicts resolved to recent master |
|
@Anirban166 @DorisAmoakohene please investigate speedups with atime |
|
Marking as draft over |
|
No obvious timing issues in HEAD=bmerge_numeric Generated via commit 8e8cbbe Download link for the artifact containing the test results: ↓ atime-results.zip
|

Thanks to a report comparing a single column case to
base::findInterval, I had a look.Some low hanging fruit was that INHERITS() (an R API call looking up class attribute and looping through the class vector) was being calling deeply in the recursive call to support integer64. Also the XIND branch, although a trinary, is called in many places and the branch could be raised.
dtwiddle()has long been sand in the cogs given that the default rounding is 0 anyway, and there was a todo to get NA for double out of the way up front. Tackling those yielded a 2x speedup. There's too much recursion in the case of a large table rolling to small (many rows rolling to the same row) and I'm hoping for another 2x speedup on that.Related #2490