Merge branch 'Rdatatable:master' into fix-#7354

coffeecookey · web-flow · commit 9af40c12b3d2 · 2026-04-10T11:33:50.000+05:30
diff --git a/NEWS.md b/NEWS.md
@@ -48,6 +48,8 @@
 
 8. `frollapply()` no longer produces output longer than the input when the window length is also longer than the input [#7646](https://github.com/Rdatatable/data.table/issues/7646). Thanks to @hadley-johnson for reporting and @jangorecki for the fix.
 
+9. `fread()` no longer replaces a literal header column name `"NA"` with an auto-generated `Vn` name when `na.strings` includes `"NA"`, [#5124](https://github.com/Rdatatable/data.table/issues/5124). Data rows still continue to parse `"NA"` as missing. Thanks @Mashin6 for the report and @shrektan for the fix.
+
 ### Notes
 
 1. {data.table} now depends on R 3.5.0 (2018).
@@ -118,15 +120,15 @@
 
 5. Negative and missing values of `n` argument of adaptive rolling functions trigger an error.
 
-### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES 
+### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES
 
 1. `data.table(x=1, <expr>)`, where `<expr>` is an expression resulting in a 1-column matrix without column names, will eventually have names `x` and `V2`, not `x` and `V1`, consistent with `data.table(x=1, <expr>)` where `<expr>` results in an atomic vector, for example `data.table(x=1, cbind(1))` and `data.table(x=1, 1)` will both have columns named `x` and `V2`. In this release, the matrix case continues to be named `V1`, but the new behavior can be activated by setting `options(datatable.old.matrix.autoname)` to `FALSE`. See point 5 under Bug Fixes for more context; this change will provide more internal consistency as well as more consistency with `data.frame()`.
 
 2. The behavior of `week()` will be changed in a future release to calculate weeks sequentially (days 1-7 as week 1), which is a potential breaking change. For now, the current "legacy" behavior, where week numbers advance every 7th day of the year (e.g., day 7 starts week 2), remains the default, and a deprecation warning will be issued when the old and new behaviors differ. Users can control this behavior with the temporary option `options(datatable.week = "...")`:
     *   `"sequential"`: Opt-in to the new, sequential behavior (no warning).
     *   `"legacy"`: Continue using the legacy behavior but suppress the deprecation warning.
 See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. Thanks @MichaelChirico for the report and @venom1204 for the implementation.
-    
+
 ### NEW FEATURES
 
 1. New `sort_by()` method for data.tables, [#6662](https://github.com/Rdatatable/data.table/issues/6662). It uses `forder()` to improve upon the data.frame method and also matches `DT[order(...)]` behavior with respect to locale. Thanks @rikivillalba for the suggestion and PR.
@@ -405,7 +407,7 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T
 9. Fixed incorrect sorting of merges where the first column of a key is a factor with non-`sort()`-ed levels (e.g. `factor(1:2, 2:1)` and it is joined to a character column, [#5361](https://github.com/Rdatatable/data.table/issues/5361). Thanks to @gbrunick for the report, Benjamin Schwendinger for the fix, and @MichaelChirico for a follow-up fix caught by revdep testing.
 
 10. Spurious warnings from internal code in `cube()`, `rollup()`, and `groupingsets()` are no longer surfaced to the caller, [#6964](https://github.com/Rdatatable/data.table/issues/6964). Thanks @ferenci-tamas for the report and @venom1204 for the fix.
- 
+
 11. `droplevels()` works on 0-row data.tables, [#7043](https://github.com/Rdatatable/data.table/issues/7043). The result will have factor columns `factor(character())`, consistent with the data.frame method. Thanks @advieser for the report and @MichaelChirico for the fix.
 
 12. `print(..., col.names = 'none')` now correctly adapts column widths to the data content, ignoring the original column names and producing a more compact output, [#6882](https://github.com/Rdatatable/data.table/issues/6882). Thanks to @brooksambrose for the report and @venom1204 for the PR.
@@ -587,7 +589,7 @@ rowwiseDT(
 3. Tagging/naming arguments of `c()` in `j=c()` should now more closely follow base R conventions for concatenation of named lists during grouping, [#2311](https://github.com/Rdatatable/data.table/issues/2311). Naming an `lapply(.SD, FUN)` call as an argument of `c()` in `j` will now always cause that tag to get prepended (with a single dot separator) to the resulting column names. Additionally, naming a `list()` call as an argument of `c()` in `j` will now always cause that tag to get prepended to any names specified within the list call. This bug only affected queries with (1) `by=` grouping (2) `getOption("datatable.optimize") >= 1L` and (3) `lapply(.SD, FUN)` in `j`.
 
     While the names returned by `data.table` when `j=c()` will now mostly follow base R conventions for concatenating lists, note that names which are completely unspecified will still be named positionally, matching the typical behavior in `j` and `data.table()`. according to position in `j` (e.g. `V1`, `V2`).
-    
+
     Thanks to @franknarf1 for reporting and @myoung3 for the PR.
 
     ```r
diff --git a/inst/tests/froll.Rraw b/inst/tests/froll.Rraw
@@ -9,11 +9,9 @@ if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) {
   froll = data.table:::froll
 }
 
-exact_NaN = isTRUE(capabilities()["long.double"]) && identical(as.integer(.Machine$longdouble.digits), 64L)
+exact_NaN = identical(NA_real_+0, NA_real_)
 if (!exact_NaN) {
-  cat("\n**** Skipping 8 NaN/NA algo='exact' tests because .Machine$longdouble.digits==", .Machine$longdouble.digits, " (!=64); e.g. under valgrind\n\n", sep="")
-  # for Matt when he runs valgrind it is 53, but 64 when running regular R
-  # froll.c uses long double and appears to require full long double accuracy in the algo='exact'
+  cat("\n**** Skipping 10 NaN/NA algo='exact' tests because NaN payload doesn't propagate through arithmetic operations\n\n")
 }
 
 ## rolling features
@@ -1456,8 +1454,10 @@ test(6001.731, between(frollvar(y, 3)[4L], 0, 1e-7))
 test(6001.732, between(frollsd(y, 3)[4L], 0, 1e-7))
 test(6001.733, frollvar(y, c(3,3,3,3), adaptive=TRUE)[4L], 0)
 test(6001.734, frollsd(y, c(3,3,3,3), adaptive=TRUE)[4L], 0)
-test(6001.740, frollvar(c(1.5,2.5,2,NA), c(3,3)), list(c(NA,NA,0.25,NA), c(NA,NA,0.25,NA)), output="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE)) # ensure no nested parallelism in rolling functions #7352
-test(6001.741, frollsd(c(1.5,2.5,2,NA), c(3,3)), list(c(NA,NA,0.5,NA), c(NA,NA,0.5,NA)), output="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE))
+if (exact_NaN) {
+  test(6001.740, frollvar(c(1.5,2.5,2,NA), c(3,3)), list(c(NA,NA,0.25,NA), c(NA,NA,0.25,NA)), output="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE)) # ensure no nested parallelism in rolling functions #7352
+  test(6001.741, frollsd(c(1.5,2.5,2,NA), c(3,3)), list(c(NA,NA,0.5,NA), c(NA,NA,0.5,NA)), output="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE))
+}
 test(6001.742, frollvar(c(1.5,2.5,2,1.5), c(3,3)), list(c(NA,NA,0.25,0.25), c(NA,NA,0.25,0.25)), notOutput="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE)) # no NA - no fallback to exact
 test(6001.743, frollsd(c(1.5,2.5,2,1.5), c(3,3)), list(c(NA,NA,0.5,0.5), c(NA,NA,0.5,0.5)), notOutput="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE))
 test(6001.744, frollvar(c(1.5,2.5,2,NA), 3), c(NA,NA,0.25,NA), notOutput="running sequentially, because outer parallelism has been used", options=c(datatable.verbose=TRUE)) # not vectorized - no outer parallelism
diff --git a/inst/tests/frollBatch.Rraw b/inst/tests/frollBatch.Rraw
@@ -9,14 +9,6 @@ if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) {
   froll = data.table:::froll
 }
 
-exact_NaN = isTRUE(capabilities()["long.double"]) && identical(as.integer(.Machine$longdouble.digits), 64L)
-if (!exact_NaN) {
-  cat("\n**** Skipping 7 NaN/NA algo='exact' tests because .Machine$longdouble.digits==", .Machine$longdouble.digits, " (!=64); e.g. under valgrind\n\n", sep="")
-  # for Matt when he runs valgrind it is 53, but 64 when running regular R
-  # froll.c uses long double and appears to require full long double accuracy in the algo='exact'
-}
-
-
 ## batch validation
 set.seed(108)
 makeNA = function(x, ratio=0.1, nf=FALSE) {
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
@@ -144,6 +144,11 @@ if (!test_longdouble) {
   cat("\n**** Full long double accuracy is not available. Tests using this will be skipped.\n\n")
   # e.g. under valgrind, longdouble.digits==53; causing these to fail: 1262, 1729.04, 1729.08, 1729.09, 1729.11, 1729.13, 1830.7; #4639
 }
+exact_NA = identical(NA_real_ + 0, NA_real_)
+if (!exact_NA) {
+  cat("\n**** NaN payload does not propagate through arithmetic operations. Tests requiring NA not to decay to NaN will be skipped.\n\n")
+  # e.g. on riscv64
+}
 
 tt = Sys.getenv("TZ", unset=NA)
 TZnotUTC = !identical(tt,"") && !is_utc(tt)
@@ -1957,12 +1962,9 @@ basemean = base::mean  # to isolate time of `::` itself
 ans3 = DT[,list(basemean(x),basemean(y)),by=list(grp1,grp2)]
 test(646, ans1, ans2)
 test(647, ans1, ans3)
-if (test_longdouble) {
+if (exact_NA) {
   test(648, anyNA(ans1$V1) && !any(is.nan(ans1$V1)))
-  # used to error with `valgrind` because of the 'long double' usage in gsumm.c (although I wonder if we need long double precision).
-  # it doesn't seem to error under valgrind anymore so the test_longdouble may be removable
-  # http://valgrind.org/docs/manual/manual-core.html#manual-core.limits
-  # http://comments.gmane.org/gmane.comp.debugging.valgrind/10340
+  # Valgrind may have had NaN payload propagation problems in the past; RISC-V will in the future
 }
 ans1 = DT[,list(mean(x,na.rm=TRUE),mean(y,na.rm=TRUE)),by=list(grp1,grp2)]
 ans2 = DT[,list(mean.default(x,na.rm=TRUE),mean.default(y,na.rm=TRUE)),by=list(grp1,grp2)]
@@ -2767,6 +2769,12 @@ test(946, fread('A,B,,D\n1,3,foo,5\n2,4,bar,6\n'), data.table(A=1:2,B=3:4,c("foo
 test(947, fread('0,2,,4\n1,3,foo,5\n2,4,bar,6\n'), data.table(0:2,2:4,c("","foo","bar"),4:6))
 test(948, fread('A,B,C\nD,E,F\n',header=TRUE), data.table(A="D",B="E",C="F"))
 test(949, fread('A,B,\nD,E,F\n',header=TRUE), data.table(A="D",B="E",V3="F"))
+# #5124 fread should preserve literal "NA" header names while still parsing data "NA" as missing
+test(949.1, names(fread('A,NA,C\n1,NA,3\n', header=TRUE)), c("A", "NA", "C"))
+ans = data.table(A=1L, tmp=as.logical(NA), C=3L)
+setnames(ans, "tmp", "NA")
+test(949.2, fread('A,NA,C\n1,NA,3\n', header=TRUE), ans)
+test(949.3, names(fread('"A","NA","C"\n1,NA,3\n', header=TRUE)), c("A", "NA", "C"))
 
 # +/- with no numbers afterwards should read as character
 test(950, fread('A,B,C\n1,+,4\n2,-,5\n3,-,6\n'), data.table(A=1:3,B=c("+","-","-"),C=4:6))
@@ -4243,7 +4251,7 @@ test(1163, last(x), character(0))
 # Bug fix for #5159 - chmatch and character encoding (for some reason this seems to pass the test on a mac as well)
 a = c("a","\u00E4","\u00DF","z")
 au = iconv(a,"UTF8","latin1")
-test(1164.1, requires_utf8=c("\u00E4", "\u00DF"), chmatch(a, au), match(a, au))
+test(1164.1, chmatch(a, au), match(a, au))
 
 # Bug fix for #73 - segfault when rbindlist on empty data.tables
 x <- as.data.table(BOD)
@@ -13560,7 +13568,7 @@ alloc.col(ans)
 test(1965.3, setDT(list(1:2, M)), ans, warning='Some columns are a multi-column type.*for example column 2')
 
 # fread/fwrite file name in native and utf-8 encoding, #3078
-if (.Platform$OS.type=="windows") {
+if (.Platform$OS.type=="windows" && utf8_check("\u00c3\u00b6\u00fc")) {
   f = tempfile("\u00f6"); cat("3.14", file = f)
   fn = enc2native(f); f8 = enc2utf8(f)
   test(1966.1, fread(fn), data.table(V1=3.14))
diff --git a/src/fread.c b/src/fread.c
@@ -2325,6 +2325,12 @@ int freadMain(freadMainArgs _args)
         .targets = targets,
         .anchor = colNamesAnchor,
       };
+      const char * const* savedNAstrings = NAstrings;
+      const bool savedBlankIsNAString = blank_is_a_NAstring;
+      // Column names should preserve literal header text, even when it matches na.strings.
+      // Blank headers still keep len==0 from Field() and are assigned default V<n> names later.
+      NAstrings = NULL;
+      blank_is_a_NAstring = false;
       ch--;
       for (int i = 0; i < ncol; i++) {
         // Use Field() here as it handles quotes, leading space etc inside it
@@ -2345,6 +2351,8 @@ int freadMain(freadMainArgs _args)
           if (ch[1] == '\r' || ch[1] == '\n' || ch[1] == '\0') { ch++; break; }
         }
       }
+      NAstrings = savedNAstrings;
+      blank_is_a_NAstring = savedBlankIsNAString;
       if (eol(&ch)) pos = ++ch;
       else if (*ch == '\0') pos = ch;
       else INTERNAL_STOP("reading colnames ending on '%c'", *ch); // # nocov