Skip to content

Commit 9af40c1

Browse files
authored
Merge branch 'Rdatatable:master' into fix-#7354
2 parents c6277c9 + 8364344 commit 9af40c1

5 files changed

Lines changed: 35 additions & 25 deletions

File tree

NEWS.md

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,8 @@
4848

4949
8. `frollapply()` no longer produces output longer than the input when the window length is also longer than the input [#7646](https://github.com/Rdatatable/data.table/issues/7646). Thanks to @hadley-johnson for reporting and @jangorecki for the fix.
5050

51+
9. `fread()` no longer replaces a literal header column name `"NA"` with an auto-generated `Vn` name when `na.strings` includes `"NA"`, [#5124](https://github.com/Rdatatable/data.table/issues/5124). Data rows still continue to parse `"NA"` as missing. Thanks @Mashin6 for the report and @shrektan for the fix.
52+
5153
### Notes
5254

5355
1. {data.table} now depends on R 3.5.0 (2018).
@@ -118,15 +120,15 @@
118120
119121
5. Negative and missing values of `n` argument of adaptive rolling functions trigger an error.
120122
121-
### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES
123+
### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES
122124
123125
1. `data.table(x=1, <expr>)`, where `<expr>` is an expression resulting in a 1-column matrix without column names, will eventually have names `x` and `V2`, not `x` and `V1`, consistent with `data.table(x=1, <expr>)` where `<expr>` results in an atomic vector, for example `data.table(x=1, cbind(1))` and `data.table(x=1, 1)` will both have columns named `x` and `V2`. In this release, the matrix case continues to be named `V1`, but the new behavior can be activated by setting `options(datatable.old.matrix.autoname)` to `FALSE`. See point 5 under Bug Fixes for more context; this change will provide more internal consistency as well as more consistency with `data.frame()`.
124126
125127
2. The behavior of `week()` will be changed in a future release to calculate weeks sequentially (days 1-7 as week 1), which is a potential breaking change. For now, the current "legacy" behavior, where week numbers advance every 7th day of the year (e.g., day 7 starts week 2), remains the default, and a deprecation warning will be issued when the old and new behaviors differ. Users can control this behavior with the temporary option `options(datatable.week = "...")`:
126128
* `"sequential"`: Opt-in to the new, sequential behavior (no warning).
127129
* `"legacy"`: Continue using the legacy behavior but suppress the deprecation warning.
128130
See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. Thanks @MichaelChirico for the report and @venom1204 for the implementation.
129-
131+
130132
### NEW FEATURES
131133
132134
1. New `sort_by()` method for data.tables, [#6662](https://github.com/Rdatatable/data.table/issues/6662). It uses `forder()` to improve upon the data.frame method and also matches `DT[order(...)]` behavior with respect to locale. Thanks @rikivillalba for the suggestion and PR.
@@ -405,7 +407,7 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T
405407
9. Fixed incorrect sorting of merges where the first column of a key is a factor with non-`sort()`-ed levels (e.g. `factor(1:2, 2:1)` and it is joined to a character column, [#5361](https://github.com/Rdatatable/data.table/issues/5361). Thanks to @gbrunick for the report, Benjamin Schwendinger for the fix, and @MichaelChirico for a follow-up fix caught by revdep testing.
406408
407409
10. Spurious warnings from internal code in `cube()`, `rollup()`, and `groupingsets()` are no longer surfaced to the caller, [#6964](https://github.com/Rdatatable/data.table/issues/6964). Thanks @ferenci-tamas for the report and @venom1204 for the fix.
408-
410+
409411
11. `droplevels()` works on 0-row data.tables, [#7043](https://github.com/Rdatatable/data.table/issues/7043). The result will have factor columns `factor(character())`, consistent with the data.frame method. Thanks @advieser for the report and @MichaelChirico for the fix.
410412
411413
12. `print(..., col.names = 'none')` now correctly adapts column widths to the data content, ignoring the original column names and producing a more compact output, [#6882](https://github.com/Rdatatable/data.table/issues/6882). Thanks to @brooksambrose for the report and @venom1204 for the PR.
@@ -587,7 +589,7 @@ rowwiseDT(
587589
3. Tagging/naming arguments of `c()` in `j=c()` should now more closely follow base R conventions for concatenation of named lists during grouping, [#2311](https://github.com/Rdatatable/data.table/issues/2311). Naming an `lapply(.SD, FUN)` call as an argument of `c()` in `j` will now always cause that tag to get prepended (with a single dot separator) to the resulting column names. Additionally, naming a `list()` call as an argument of `c()` in `j` will now always cause that tag to get prepended to any names specified within the list call. This bug only affected queries with (1) `by=` grouping (2) `getOption("datatable.optimize") >= 1L` and (3) `lapply(.SD, FUN)` in `j`.
588590
589591
While the names returned by `data.table` when `j=c()` will now mostly follow base R conventions for concatenating lists, note that names which are completely unspecified will still be named positionally, matching the typical behavior in `j` and `data.table()`. according to position in `j` (e.g. `V1`, `V2`).
590-
592+
591593
Thanks to @franknarf1 for reporting and @myoung3 for the PR.
592594
593595
```r

inst/tests/froll.Rraw

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -9,11 +9,9 @@ if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) {
99
froll = data.table:::froll
1010
}
1111

12-
exact_NaN = isTRUE(capabilities()["long.double"]) && identical(as.integer(.Machine$longdouble.digits), 64L)
12+
exact_NaN = identical(NA_real_+0, NA_real_)
1313
if (!exact_NaN) {
14-
cat("\n**** Skipping 8 NaN/NA algo='exact' tests because .Machine$longdouble.digits==", .Machine$longdouble.digits, " (!=64); e.g. under valgrind\n\n", sep="")
15-
# for Matt when he runs valgrind it is 53, but 64 when running regular R
16-
# froll.c uses long double and appears to require full long double accuracy in the algo='exact'
14+
cat("\n**** Skipping 10 NaN/NA algo='exact' tests because NaN payload doesn't propagate through arithmetic operations\n\n")
1715
}
1816

1917
## rolling features
@@ -1456,8 +1454,10 @@ test(6001.731, between(frollvar(y, 3)[4L], 0, 1e-7))
14561454
test(6001.732, between(frollsd(y, 3)[4L], 0, 1e-7))
14571455
test(6001.733, frollvar(y, c(3,3,3,3), adaptive=TRUE)[4L], 0)
14581456
test(6001.734, frollsd(y, c(3,3,3,3), adaptive=TRUE)[4L], 0)
1459-
test(6001.740, frollvar(c(1.5,2.5,2,NA), c(3,3)), list(c(NA,NA,0.25,NA), c(NA,NA,0.25,NA)), output="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE)) # ensure no nested parallelism in rolling functions #7352
1460-
test(6001.741, frollsd(c(1.5,2.5,2,NA), c(3,3)), list(c(NA,NA,0.5,NA), c(NA,NA,0.5,NA)), output="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE))
1457+
if (exact_NaN) {
1458+
test(6001.740, frollvar(c(1.5,2.5,2,NA), c(3,3)), list(c(NA,NA,0.25,NA), c(NA,NA,0.25,NA)), output="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE)) # ensure no nested parallelism in rolling functions #7352
1459+
test(6001.741, frollsd(c(1.5,2.5,2,NA), c(3,3)), list(c(NA,NA,0.5,NA), c(NA,NA,0.5,NA)), output="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE))
1460+
}
14611461
test(6001.742, frollvar(c(1.5,2.5,2,1.5), c(3,3)), list(c(NA,NA,0.25,0.25), c(NA,NA,0.25,0.25)), notOutput="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE)) # no NA - no fallback to exact
14621462
test(6001.743, frollsd(c(1.5,2.5,2,1.5), c(3,3)), list(c(NA,NA,0.5,0.5), c(NA,NA,0.5,0.5)), notOutput="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE))
14631463
test(6001.744, frollvar(c(1.5,2.5,2,NA), 3), c(NA,NA,0.25,NA), notOutput="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE)) # not vectorized - no outer parallelism

inst/tests/frollBatch.Rraw

Lines changed: 0 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -9,14 +9,6 @@ if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) {
99
froll = data.table:::froll
1010
}
1111

12-
exact_NaN = isTRUE(capabilities()["long.double"]) && identical(as.integer(.Machine$longdouble.digits), 64L)
13-
if (!exact_NaN) {
14-
cat("\n**** Skipping 7 NaN/NA algo='exact' tests because .Machine$longdouble.digits==", .Machine$longdouble.digits, " (!=64); e.g. under valgrind\n\n", sep="")
15-
# for Matt when he runs valgrind it is 53, but 64 when running regular R
16-
# froll.c uses long double and appears to require full long double accuracy in the algo='exact'
17-
}
18-
19-
2012
## batch validation
2113
set.seed(108)
2214
makeNA = function(x, ratio=0.1, nf=FALSE) {

inst/tests/tests.Rraw

Lines changed: 15 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -144,6 +144,11 @@ if (!test_longdouble) {
144144
cat("\n**** Full long double accuracy is not available. Tests using this will be skipped.\n\n")
145145
# e.g. under valgrind, longdouble.digits==53; causing these to fail: 1262, 1729.04, 1729.08, 1729.09, 1729.11, 1729.13, 1830.7; #4639
146146
}
147+
exact_NA = identical(NA_real_ + 0, NA_real_)
148+
if (!exact_NA) {
149+
cat("\n**** NaN payload does not propagate through arithmetic operations. Tests requiring NA not to decay to NaN will be skipped.\n\n")
150+
# e.g. on riscv64
151+
}
147152

148153
tt = Sys.getenv("TZ", unset=NA)
149154
TZnotUTC = !identical(tt,"") && !is_utc(tt)
@@ -1957,12 +1962,9 @@ basemean = base::mean # to isolate time of `::` itself
19571962
ans3 = DT[,list(basemean(x),basemean(y)),by=list(grp1,grp2)]
19581963
test(646, ans1, ans2)
19591964
test(647, ans1, ans3)
1960-
if (test_longdouble) {
1965+
if (exact_NA) {
19611966
test(648, anyNA(ans1$V1) && !any(is.nan(ans1$V1)))
1962-
# used to error with `valgrind` because of the 'long double' usage in gsumm.c (although I wonder if we need long double precision).
1963-
# it doesn't seem to error under valgrind anymore so the test_longdouble may be removable
1964-
# http://valgrind.org/docs/manual/manual-core.html#manual-core.limits
1965-
# http://comments.gmane.org/gmane.comp.debugging.valgrind/10340
1967+
# Valgrind may have had NaN payload propagation problems in the past; RISC-V will in the future
19661968
}
19671969
ans1 = DT[,list(mean(x,na.rm=TRUE),mean(y,na.rm=TRUE)),by=list(grp1,grp2)]
19681970
ans2 = DT[,list(mean.default(x,na.rm=TRUE),mean.default(y,na.rm=TRUE)),by=list(grp1,grp2)]
@@ -2767,6 +2769,12 @@ test(946, fread('A,B,,D\n1,3,foo,5\n2,4,bar,6\n'), data.table(A=1:2,B=3:4,c("foo
27672769
test(947, fread('0,2,,4\n1,3,foo,5\n2,4,bar,6\n'), data.table(0:2,2:4,c("","foo","bar"),4:6))
27682770
test(948, fread('A,B,C\nD,E,F\n',header=TRUE), data.table(A="D",B="E",C="F"))
27692771
test(949, fread('A,B,\nD,E,F\n',header=TRUE), data.table(A="D",B="E",V3="F"))
2772+
# #5124 fread should preserve literal "NA" header names while still parsing data "NA" as missing
2773+
test(949.1, names(fread('A,NA,C\n1,NA,3\n', header=TRUE)), c("A", "NA", "C"))
2774+
ans = data.table(A=1L, tmp=as.logical(NA), C=3L)
2775+
setnames(ans, "tmp", "NA")
2776+
test(949.2, fread('A,NA,C\n1,NA,3\n', header=TRUE), ans)
2777+
test(949.3, names(fread('"A","NA","C"\n1,NA,3\n', header=TRUE)), c("A", "NA", "C"))
27702778

27712779
# +/- with no numbers afterwards should read as character
27722780
test(950, fread('A,B,C\n1,+,4\n2,-,5\n3,-,6\n'), data.table(A=1:3,B=c("+","-","-"),C=4:6))
@@ -4243,7 +4251,7 @@ test(1163, last(x), character(0))
42434251
# Bug fix for #5159 - chmatch and character encoding (for some reason this seems to pass the test on a mac as well)
42444252
a = c("a","\u00E4","\u00DF","z")
42454253
au = iconv(a,"UTF8","latin1")
4246-
test(1164.1, requires_utf8=c("\u00E4", "\u00DF"), chmatch(a, au), match(a, au))
4254+
test(1164.1, chmatch(a, au), match(a, au))
42474255

42484256
# Bug fix for #73 - segfault when rbindlist on empty data.tables
42494257
x <- as.data.table(BOD)
@@ -13560,7 +13568,7 @@ alloc.col(ans)
1356013568
test(1965.3, setDT(list(1:2, M)), ans, warning='Some columns are a multi-column type.*for example column 2')
1356113569

1356213570
# fread/fwrite file name in native and utf-8 encoding, #3078
13563-
if (.Platform$OS.type=="windows") {
13571+
if (.Platform$OS.type=="windows" && utf8_check("\u00c3\u00b6\u00fc")) {
1356413572
f = tempfile("\u00f6"); cat("3.14", file = f)
1356513573
fn = enc2native(f); f8 = enc2utf8(f)
1356613574
test(1966.1, fread(fn), data.table(V1=3.14))

src/fread.c

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2325,6 +2325,12 @@ int freadMain(freadMainArgs _args)
23252325
.targets = targets,
23262326
.anchor = colNamesAnchor,
23272327
};
2328+
const char * const* savedNAstrings = NAstrings;
2329+
const bool savedBlankIsNAString = blank_is_a_NAstring;
2330+
// Column names should preserve literal header text, even when it matches na.strings.
2331+
// Blank headers still keep len==0 from Field() and are assigned default V<n> names later.
2332+
NAstrings = NULL;
2333+
blank_is_a_NAstring = false;
23282334
ch--;
23292335
for (int i = 0; i < ncol; i++) {
23302336
// Use Field() here as it handles quotes, leading space etc inside it
@@ -2345,6 +2351,8 @@ int freadMain(freadMainArgs _args)
23452351
if (ch[1] == '\r' || ch[1] == '\n' || ch[1] == '\0') { ch++; break; }
23462352
}
23472353
}
2354+
NAstrings = savedNAstrings;
2355+
blank_is_a_NAstring = savedBlankIsNAString;
23482356
if (eol(&ch)) pos = ++ch;
23492357
else if (*ch == '\0') pos = ch;
23502358
else INTERNAL_STOP("reading colnames ending on '%c'", *ch); // # nocov

0 commit comments

Comments
 (0)