Commit bb6c6ed

Merge branch 'master' into modular_gforce
2 parents 2ea0174 + 20b463c commit bb6c6ed

4 files changed

Lines changed: 54 additions & 43 deletions

NEWS.md

Lines changed: 9 additions & 7 deletions
@@ -18,6 +18,14 @@
 
 2. `[,showProgress=]` and `options(datatable.showProgress)` now accept an integer to control the progress bar update interval in seconds, allowing finer control over progress reporting frequency; `TRUE` uses the default 3-second interval, [#6514](https://github.com/Rdatatable/data.table/issues/6514). Thanks @ethanbsmith for the report and @ben-schwen for the PR.
 
+3. GForce and lapply optimization detection has been refactored to use modular optimization paths and an AST (Abstract Syntax Tree) walker for improved maintainability and extensibility. The new architecture separates optimization detection into distinct, composable phases. This makes future optimization enhancements a lot easier. Thanks to @grantmcdermott, @jangorecki, @MichaelChirico, and @HughParsonage for the suggestions and @ben-schwen for the implementation.
+
+This rewrite also introduces several new optimizations:
+- Enables Map in addition to lapply optimizations (e.g., `Map(fun, .SD)` -> `list(fun(col1), fun(col2), ...)`) [#5336](https://github.com/Rdatatable/data.table/issues/5336)
+- lapply optimization works without .SD (e.g., `lapply(list(col1, col2), fun)` -> `list(fun(col1), fun(col2))` [#5032](https://github.com/Rdatatable/data.table/issues/5032)
+- Type conversion support in GForce expressions (e.g., `sum(as.numeric(x))` will use GForce, saving the need to coerce `x` in a setup step) [#2934](https://github.com/Rdatatable/data.table/issues/2934)
+- Arithmetic operation support in GForce (e.g., `max(x) - min(x)` will use GForce on both `max(x)` and `min(x)`, saving the need to do the subtraction in a follow-up step) [#3815](https://github.com/Rdatatable/data.table/issues/3815)
+
 ### BUG FIXES
 
 1. `fread()` with `skip=0` and `(header=TRUE|FALSE)` no longer skips the first row when it has fewer fields than subsequent rows, [#7463](https://github.com/Rdatatable/data.table/issues/7463). Thanks @emayerhofer for the report and @ben-schwen for the fix.
@@ -34,13 +42,7 @@
 
 7. Fixed compilation failure like "error: unknown type name 'siginfo_t'" in v1.18.0 in some strict environments, e.g., FreeBSD, where the header file declaring the POSIX function `waitid` does not transitively include the header file defining the `siginfo_t` type, [#7516](https://github.com/rdatatable/data.table/issues/7516). Thanks to @jszhao for the report and @aitap for the fix.
 
-8. GForce and lapply optimization detection has been refactored to use modular optimization paths and an AST (Abstract Syntax Tree) walker for improved maintainability and extensibility. The new architecture separates optimization detection into distinct, composable phases. This makes future optimization enhancements a lot easier. Thanks to @grantmcdermott, @jangorecki, @MichaelChirico, and @HughParsonage for the suggestions and @ben-schwen for the implementation.
-
-This rewrite also introduces several new optimizations:
-- Enables Map in addition to lapply optimizations (e.g., `Map(fun, .SD)` -> `list(fun(col1), fun(col2), ...)`) [#5336](https://github.com/Rdatatable/data.table/issues/5336)
-- lapply optimization works without .SD (e.g., `lapply(list(col1, col2), fun)` -> `list(fun(col1), fun(col2))` [#5032](https://github.com/Rdatatable/data.table/issues/5032)
-- Type conversion support in GForce expressions (e.g., `sum(as.numeric(x))` will use GForce, saving the need to coerce `x` in a setup step) [#2934](https://github.com/Rdatatable/data.table/issues/2934)
-- Arithmetic operation support in GForce (e.g., `max(x) - min(x)` will use GForce on both `max(x)` and `min(x)`, saving the need to do the subtraction in a follow-up step) [#3815](https://github.com/Rdatatable/data.table/issues/3815)
+8. When fixing duplicate factor levels, `setattr()` no longer crashes upon encountering missing factor values, [#7595](https://github.com/Rdatatable/data.table/issues/7595). Thanks to @sindribaldur for the report and @aitap for the fix.
 
 ### Notes
 

inst/tests/tests.Rraw

Lines changed: 33 additions & 29 deletions
@@ -21432,51 +21432,55 @@ test(2360.4, rowwiseDT(x =, expr =, 1, quote(a + b)),
 test(2360.5, rowwiseDT(x =, plist =, 1, as.pairlist(list(123))),
      error = "Column 'plist' is type 'pairlist'. Non-atomic, non-list objects must be wrapped in list\\(\\)")
 
+# setattr() must not crash for out-of-bounds factor indices when fixing duplicate levels, #7595
+test(2361.1, setattr(factor(c(1, NA), levels = 1), "levels", c("1", "1")), factor(c(1, NA)))
+test(2361.2, setattr(structure(c(-999L, 999L), class = "factor", levels = "a"), "levels", c("b", "b")), factor(c(NA, NA), levels = "b"))
+
 # gforce should also work with Map in j #5336
 # conversions should not turn gforce off #2934
 # lapply gforce should also work without .SD #5032
 # support arithmetic in j with gforce #3815
 out = c("GForce FALSE", "GForce FALSE", "GForce TRUE")
 # unwrap type conversions
 dt = data.table(a=1:4, b=1:2)
-test(2361.01, optimize=0:2, dt[, max(as.character(a)), by=b, verbose=TRUE], data.table(b=1:2, V1=c("3","4")), output=out)
-test(2361.02, optimize=0:2, dt[, max(as.numeric(a)), by=b, verbose=TRUE], data.table(b=1:2, V1=c(3,4)), output=out)
-test(2361.03, optimize=0:2, dt[, max(as.integer(as.integer(as.integer(a)))), by=b, verbose=TRUE], data.table(b=1:2, V1=3:4), output=out)
+test(2362.01, optimize=0:2, dt[, max(as.character(a)), by=b, verbose=TRUE], data.table(b=1:2, V1=c("3","4")), output=out)
+test(2362.02, optimize=0:2, dt[, max(as.numeric(a)), by=b, verbose=TRUE], data.table(b=1:2, V1=c(3,4)), output=out)
+test(2362.03, optimize=0:2, dt[, max(as.integer(as.integer(as.integer(a)))), by=b, verbose=TRUE], data.table(b=1:2, V1=3:4), output=out)
 # Map in j
 dt = data.table(a=1:4, b=1:2)
-test(2361.11, optimize=0:2, dt[, Map(sum, .SD), b, verbose=TRUE], dt[, lapply(.SD, sum), b], output=out)
-test(2361.12, optimize=0:2, dt[, Map(sum, .SD, .SD), by=b, verbose=TRUE], output="GForce FALSE")
+test(2362.11, optimize=0:2, dt[, Map(sum, .SD), b, verbose=TRUE], dt[, lapply(.SD, sum), b], output=out)
+test(2362.12, optimize=0:2, dt[, Map(sum, .SD, .SD), by=b, verbose=TRUE], output="GForce FALSE")
 # lapply without .SD
 dt = data.table(a = NA_integer_, b = 1:2, c = c(TRUE, FALSE))
-test(2361.13, optimize=0:2, dt[, Map(weighted.mean, .SD, na.rm=c), b, .SDcols="a", verbose=TRUE], data.table(b=1:2, a=c(NaN, NA_real_)), output="GForce FALSE")
-test(2361.14, optimize=0:2, dt[,list(weighted.mean(a, na.rm=c)), b, verbose=TRUE], data.table(b=1:2, V1=c(NaN, NA_real_)), output="GForce FALSE")
-test(2361.15, optimize=0:2, dt[, Map(sum, .SD), by=b, .SDcols=c("a","c"), verbose=TRUE], dt[, lapply(.SD, sum), by=b, .SDcols=c("a","c")], output=out)
+test(2362.13, optimize=0:2, dt[, Map(weighted.mean, .SD, na.rm=c), b, .SDcols="a", verbose=TRUE], data.table(b=1:2, a=c(NaN, NA_real_)), output="GForce FALSE")
+test(2362.14, optimize=0:2, dt[,list(weighted.mean(a, na.rm=c)), b, verbose=TRUE], data.table(b=1:2, V1=c(NaN, NA_real_)), output="GForce FALSE")
+test(2362.15, optimize=0:2, dt[, Map(sum, .SD), by=b, .SDcols=c("a","c"), verbose=TRUE], dt[, lapply(.SD, sum), by=b, .SDcols=c("a","c")], output=out)
 dt = data.table(a=1:2, b=1, c=1:4)
-test(2361.21, optimize=0:2, dt[, lapply(list(b, c), sum), by=a, verbose=TRUE], output=out)
-test(2361.22, optimize=0:2, dt[, c(list(sum(b), sum(c))), by=a, verbose=TRUE], output=out)
+test(2362.21, optimize=0:2, dt[, lapply(list(b, c), sum), by=a, verbose=TRUE], output=out)
+test(2362.22, optimize=0:2, dt[, c(list(sum(b), sum(c))), by=a, verbose=TRUE], output=out)
 # support arithmetic in j
 dt = data.table(a=1:4, b=1:2)
-test(2361.31, optimize=0:2, dt[, .(max(a)-min(a)), by=b, verbose=TRUE], output=out)
-test(2361.32, optimize=0:2, dt[, .((max(a) - min(a)) / (max(a) + min(a))), by=b, verbose=TRUE], data.table(b=1:2, V1=c(0.5, 1/3)), output=out)
-test(2361.33, optimize=0:2, dt[, sum(a) / .N, b, verbose=TRUE], output=out)
-test(2361.34, optimize=0:2, dt[, mean(a) * 2L + sum(a), b, verbose=TRUE], output=out)
-test(2361.35, optimize=0:2, dt[, list(range=max(a)-min(a), avg=mean(a)), by=b, verbose=TRUE], output=out)
-test(2361.36, optimize=0:2, dt[, .(max(a)-sqrt(min(a))), by=b, verbose=TRUE], output="GForce FALSE")
-test(2361.37, optimize=0:2, dt[, sum(a) %% 2, b, verbose=TRUE], output=out)
-test(2361.38, optimize=0:2, dt[, sum(a) %/% 2, b, verbose=TRUE], output=out)
-test(2361.39, optimize=0:2, dt[, -sum(a), b, verbose=TRUE], output=out)
-test(2361.40, optimize=0:2, dt[, .(sum(a)-sum(b)), b, verbose=TRUE], output="GForce FALSE")
+test(2362.31, optimize=0:2, dt[, .(max(a)-min(a)), by=b, verbose=TRUE], output=out)
+test(2362.32, optimize=0:2, dt[, .((max(a) - min(a)) / (max(a) + min(a))), by=b, verbose=TRUE], data.table(b=1:2, V1=c(0.5, 1/3)), output=out)
+test(2362.33, optimize=0:2, dt[, sum(a) / .N, b, verbose=TRUE], output=out)
+test(2362.34, optimize=0:2, dt[, mean(a) * 2L + sum(a), b, verbose=TRUE], output=out)
+test(2362.35, optimize=0:2, dt[, list(range=max(a)-min(a), avg=mean(a)), by=b, verbose=TRUE], output=out)
+test(2362.36, optimize=0:2, dt[, .(max(a)-sqrt(min(a))), by=b, verbose=TRUE], output="GForce FALSE")
+test(2362.37, optimize=0:2, dt[, sum(a) %% 2, b, verbose=TRUE], output=out)
+test(2362.38, optimize=0:2, dt[, sum(a) %/% 2, b, verbose=TRUE], output=out)
+test(2362.39, optimize=0:2, dt[, -sum(a), b, verbose=TRUE], output=out)
+test(2362.40, optimize=0:2, dt[, .(sum(a)-sum(b)), b, verbose=TRUE], output="GForce FALSE")
 # mix cases of the above
 dt = data.table(a=1:4, b=1:2)
-test(2361.41, optimize=0:2, dt[, sum(as.numeric(a)) + mean(as.integer(a)), by=b, verbose=TRUE], output=out)
-test(2361.42, optimize=0:2, dt[, Map(sum, .SD), by=b, .SDcols="a", verbose=TRUE], output=out)
-test(2361.43, optimize=0:2, dt[, lapply(list(as.numeric(a)), sum), by=b, verbose=TRUE], output="GForce FALSE")
-test(2361.44, optimize=0:2, dt[, sum(a) + as.numeric("5"), by=b, verbose=TRUE], data.table(b=1:2, V1=c(9, 11)), output="GForce FALSE")
-test(2361.45, optimize=0:2, dt[, sum(a[a > 2]), by=b, verbose=TRUE], output="GForce FALSE")
+test(2362.41, optimize=0:2, dt[, sum(as.numeric(a)) + mean(as.integer(a)), by=b, verbose=TRUE], output=out)
+test(2362.42, optimize=0:2, dt[, Map(sum, .SD), by=b, .SDcols="a", verbose=TRUE], output=out)
+test(2362.43, optimize=0:2, dt[, lapply(list(as.numeric(a)), sum), by=b, verbose=TRUE], output="GForce FALSE")
+test(2362.44, optimize=0:2, dt[, sum(a) + as.numeric("5"), by=b, verbose=TRUE], data.table(b=1:2, V1=c(9, 11)), output="GForce FALSE")
+test(2362.45, optimize=0:2, dt[, sum(a[a > 2]), by=b, verbose=TRUE], output="GForce FALSE")
 dt = data.table(a=1:4, b=1:2, c=2:5)
-test(2361.46, optimize=0:2, dt[, .(sum(a) + sum(c)), by=b, verbose=TRUE], output=out)
+test(2362.46, optimize=0:2, dt[, .(sum(a) + sum(c)), by=b, verbose=TRUE], output=out)
 # coverage and edge cases for lapply(.SD, ...)
 dt = data.table(a=1:4, b=1:2)
-test(2361.51, optimize=0:2, dt[, c(list()), b, verbose=TRUE], data.table(b=integer(0L)), output="GForce FALSE")
-test(2361.52, optimize=0:2, dt[, c(lapply(.SD, sum), list()), b, verbose=TRUE], output=out)
-test(2361.53, optimize=0:2, dt[, list(lapply(.SD, sum), list()), b, verbose=TRUE], output="GForce FALSE")
+test(2362.51, optimize=0:2, dt[, c(list()), b, verbose=TRUE], data.table(b=integer(0L)), output="GForce FALSE")
+test(2362.52, optimize=0:2, dt[, c(lapply(.SD, sum), list()), b, verbose=TRUE], output=out)
+test(2362.53, optimize=0:2, dt[, list(lapply(.SD, sum), list()), b, verbose=TRUE], output="GForce FALSE")

src/froll.c

Lines changed: 7 additions & 5 deletions
@@ -1707,11 +1707,13 @@ void frollmedianFast(const double *x, uint64_t nx, ans_t *ans, int k, double fil
         snprintf(end(ans->message[3]), 500, _("%s: 's[A] + s[B] == h' is not true\n"), "frollmedianFast");
         return;
       }*/
-      if (n[A]!=tail && m[A] == n[A]) {
-        n[A] = tail;
-      }
-      if (n[B]!=tail && m[B] == n[B]) {
-        n[B] = tail;
+      if (even) {
+        if (n[A]!=tail && m[A] == n[A]) {
+          n[A] = tail;
+        }
+        if (n[B]!=tail && m[B] == n[B]) {
+          n[B] = tail;
+        }
       }
       ansv[j*k+i] = even ? MED2(A, B) : MED(A, B);
     }

src/wrappers.c

Lines changed: 5 additions & 2 deletions
@@ -44,8 +44,11 @@ SEXP setlevels(SEXP x, SEXP levels, SEXP ulevels) {
   SEXP xchar, newx;
   xchar = PROTECT(allocVector(STRSXP, nx));
   int *ix = INTEGER(x);
-  for (int i=0; i<nx; ++i)
-    SET_STRING_ELT(xchar, i, STRING_ELT(levels, ix[i]-1));
+  const int nlevels = length(levels);
+  for (int i=0; i<nx; ++i) {
+    const int ixi = ix[i];
+    SET_STRING_ELT(xchar, i, (ixi >= 1 && ixi <= nlevels) ? STRING_ELT(levels, ix[i]-1) : NA_STRING);
+  }
   newx = PROTECT(chmatch(xchar, ulevels, NA_INTEGER));
   int *inewx = INTEGER(newx);
   for (int i=0; i<nx; ++i) ix[i] = inewx[i];
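
The bounds check added to `setlevels` can be illustrated outside of R with a minimal, standalone sketch. This is not the data.table implementation: `NA_CODE`, `match_level`, and `remap_codes` are hypothetical names, plain C strings stand in for `STRSXP` elements, `NULL` stands in for `NA_STRING`, and a linear search stands in for `chmatch`. The sketch assumes, as in R, that the missing integer is represented by `INT_MIN`, so it fails the `>= 1` test along with any other out-of-range code.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for R's NA_INTEGER (assumed to be INT_MIN). */
#define NA_CODE INT_MIN

/* Return the 1-based position of s in ulevels, or NA_CODE when s is
   NULL (our stand-in for NA_STRING) or is not found. */
static int match_level(const char *s, const char *const *ulevels, int nulevels) {
  if (s == NULL) return NA_CODE;
  for (int j = 0; j < nulevels; ++j)
    if (strcmp(s, ulevels[j]) == 0) return j + 1;
  return NA_CODE;
}

/* Remap 1-based factor codes from `levels` (may hold duplicates) to
   positions in `ulevels` (unique), in place. Codes outside
   [1, nlevels], including NA_CODE, are never used as an index;
   they map to NA_CODE instead. */
static void remap_codes(int *codes, int n,
                        const char *const *levels, int nlevels,
                        const char *const *ulevels, int nulevels) {
  for (int i = 0; i < n; ++i) {
    const int c = codes[i];
    const char *s = (c >= 1 && c <= nlevels) ? levels[c - 1] : NULL;
    codes[i] = match_level(s, ulevels, nulevels);
  }
}
```

Under these assumptions, remapping the codes `{2, NA_CODE, -999, 999}` from levels `{"b", "b"}` to unique levels `{"b"}` yields `{1, NA_CODE, NA_CODE, NA_CODE}`, which mirrors the shape of tests 2361.1 and 2361.2: valid codes collapse onto the deduplicated level, and missing or out-of-bounds codes become missing rather than triggering an out-of-range read.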
