Commit c02aa3c

Asa-Henry, skitsy24
authored and committed
Throw Warning When Reading File with Blank Lines (#7707)
* Added a check to detect when blank lines should have been skipped, but it doesn't solve the problem yet.
* Fixed the added check so that it only tests whether 'topSkip' is greater than 0. 'topSkip' is greater than 0 when blank lines are present, so I also check whether blank lines should be skipped so that I can throw a warning to let the user know.
* Updated the check for blank lines to test whether 'topSkip' is greater than 1, to accommodate the situation where the header and data are separated by a blank line.
* Used the 'prevStart' variable to detect when each line is separated by a blank line. In that case, 'prevStart' is always NULL because each line could be a possible header.
* Added test '1578.10' for the initial case which issue #3339 pointed out. Causes an error in test 1578.1?
* Changed the number of the test written to verify the fix for #3339.
* Removed the newline from the end of the expected warning message.
* Updated NEWS.md with news of the new warning.
* Updated the sub-test numbers for test '1578' to match the established format, where 0s are prepended when there is more than one significant digit following the decimal.
1 parent a803bb1 commit c02aa3c

3 files changed

Lines changed: 18 additions & 9 deletions

NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -46,6 +46,8 @@

 6. `fread()` no longer replaces a literal header column name `"NA"` with an auto-generated `Vn` name when `na.strings` includes `"NA"`, [#5124](https://github.com/Rdatatable/data.table/issues/5124). Data rows still continue to parse `"NA"` as missing. Thanks @Mashin6 for the report and @shrektan for the fix.

+11. `fread()` would not give a warning when every second line of input was empty, [#3339](https://github.com/Rdatatable/data.table/issues/3339). Now, a warning message 'The rows in this file appear to be separated by blank lines.' is given and suggests to set `blank.lines.skip` to `TRUE`. Thanks to @Henrik-P for the report and @Asa-Henry for the fix.
+
 ### Notes

 1. {data.table} now depends on R 3.5.0 (2018).
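
The remedy suggested by the new warning is to have blank lines ignored. As a rough conceptual illustration only, the sketch below removes empty lines from a text buffer before any parsing would happen. It is not how `fread()` implements `blank.lines.skip` (the function name `drop_blank_lines` is hypothetical, and only completely empty lines are treated as blank); it just shows the effect on an input like the one in issue #3339.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of data.table.
 * Returns a newly allocated copy of `text` with every empty line removed.
 * The caller must free() the result. Returns NULL if allocation fails. */
static char *drop_blank_lines(const char *text)
{
    size_t n = strlen(text);
    char *out = malloc(n + 1);          /* output is never longer than input */
    if (!out) return NULL;

    char *w = out;                      /* write cursor */
    const char *p = text;               /* start of the current line */
    while (*p) {
        const char *eol = p;
        while (*eol && *eol != '\n') eol++;
        size_t len = (size_t)(eol - p);
        if (len > 0) {                  /* keep only non-empty lines */
            memcpy(w, p, len);
            w += len;
            if (*eol == '\n') *w++ = '\n';
        }
        p = *eol ? eol + 1 : eol;       /* step past the newline, if any */
    }
    *w = '\0';
    return out;
}
```

Applied to `"x y\n\n1 a\n\n2 b\n\n3 c"`, this yields `"x y\n1 a\n2 b\n3 c"`, a contiguous header-plus-rows block.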

inst/tests/tests.Rraw

Lines changed: 11 additions & 9 deletions
@@ -8059,20 +8059,22 @@ test(1577.3, levels(X$b), character(0))

 # FR #530, skip blank lines
 input = "Header not 2 columns\n\n1,3\n2,4"
-test(1578.1, fread(input), data.table(V1=1:2, V2=3:4))
+test(1578.01, fread(input), data.table(V1=1:2, V2=3:4))
 input = "a,b\n\n1,3\n2,4"
-test(1578.2, fread(input), data.table(V1=1:2, V2=3:4)) # the block of 2x2 dominates the one line with sep in auto-removed header section
-test(1578.3, fread(input, blank.lines.skip=TRUE), data.table( a=1:2, b=3:4))
+test(1578.02, fread(input), data.table(V1=1:2, V2=3:4)) # the block of 2x2 dominates the one line with sep in auto-removed header section
+test(1578.03, fread(input, blank.lines.skip=TRUE), data.table( a=1:2, b=3:4))
 input = "a,b\n\n\n1,3\n2,4"
-test(1578.4, fread(input, blank.lines.skip=TRUE), data.table( a=1:2, b=3:4))
+test(1578.04, fread(input, blank.lines.skip=TRUE), data.table( a=1:2, b=3:4))
 input = "a,b\n\n\n1,3\n\n2,4\n\n"
-test(1578.5, fread(input, blank.lines.skip=TRUE), data.table( a=1:2, b=3:4))
+test(1578.05, fread(input, blank.lines.skip=TRUE), data.table( a=1:2, b=3:4))

 f = testDir("530_fread.txt")
-test(1578.6, fread(f, skip=47L, verbose=TRUE), data.table(V1=1:2, V2=3:4), output="Positioned on line 48 starting: <<a,b>>")
-test(1578.7, fread(f, skip=49L), data.table(V1=1:2, V2=3:4))
-test(1578.8, fread(f, skip=47L, blank.lines.skip=TRUE), data.table(a=1:2, b=3:4))
-test(1578.9, fread(f, skip=48L), data.table(V1=1:2, V2=3:4)) # start on blank line 49 and skip="auto" to first data row on line 50
+test(1578.06, fread(f, skip=47L, verbose=TRUE), data.table(V1=1:2, V2=3:4), output="Positioned on line 48 starting: <<a,b>>")
+test(1578.07, fread(f, skip=49L), data.table(V1=1:2, V2=3:4))
+test(1578.08, fread(f, skip=47L, blank.lines.skip=TRUE), data.table(a=1:2, b=3:4))
+test(1578.09, fread(f, skip=48L), data.table(V1=1:2, V2=3:4)) # start on blank line 49 and skip="auto" to first data row on line 50
+input = "x y\n\n1 a\n\n2 b\n\n3 c"
+test(1578.10, fread(input), data.table(V1=3L, V2="c"), warning="The rows in this file appear to be separated by blank lines. This resulted in most rows being skipped. If this was not the intended outcome, please consider setting 'blank.lines.skip' to TRUE.")

 # test 1579 moved to optimize.Rraw
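
The test inputs above differ in one structural property: in the new `1578.10` input every non-empty line is isolated by a blank line, whereas the earlier inputs contain at least one pair of adjacent data lines. The following standalone sketch checks that property directly on a string. It is a simplified illustration, not the detection logic used in `fread.c`: the function name `rows_separated_by_blank_lines` is hypothetical, and only completely empty lines count as blank.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of data.table.
 * Returns true when the text has at least two non-empty lines and no two
 * non-empty lines are directly adjacent (each is separated by a blank line). */
static bool rows_separated_by_blank_lines(const char *text)
{
    int nonEmpty = 0;           /* number of non-empty lines seen */
    bool prevWasBlank = true;   /* start of input counts as a preceding blank */
    const char *p = text;       /* start of the current line */

    while (*p) {
        const char *eol = p;
        while (*eol && *eol != '\n') eol++;
        bool blank = (eol == p);
        if (!blank) {
            if (!prevWasBlank) return false;  /* two adjacent data lines */
            nonEmpty++;
        }
        prevWasBlank = blank;
        p = *eol ? eol + 1 : eol;             /* step past the newline, if any */
    }
    return nonEmpty >= 2;
}
```

With this definition, `"x y\n\n1 a\n\n2 b\n\n3 c"` is classified as blank-separated, while `"a,b\n\n1,3\n2,4"` is not, because `1,3` and `2,4` are adjacent.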

src/fread.c

Lines changed: 5 additions & 0 deletions
@@ -1847,6 +1847,7 @@ int freadMain(freadMainArgs _args)
   int topNumFields = 1; // how many fields that was, to resolve ties
   enum quote_rule_t topQuoteRule = -1; // which quote rule that was
   int topSkip = 0; // how many rows to auto-skip
+  // #7707 'topSkip' accumulates as blank lines are encountered; can be used to differentiate between a file where the header and data are separated by a blank line and a file where block(s) of lines or each line is separated by a blank line
   const char *topStart = NULL;

   for (quoteRule = quote ? QUOTE_RULE_EMBEDDED_QUOTES_DOUBLED : QUOTE_RULE_IGNORE_QUOTES; quoteRule < QUOTE_RULE_COUNT; quoteRule++) { // #loop_counter_not_local_scope_ok
@@ -1950,6 +1951,10 @@ int freadMain(freadMainArgs _args)
       }
     }
   }
+  if (!prevStart && topSkip > 1 && !skipEmptyLines)
+  {
+    DTWARN(_("The rows in this file appear to be separated by blank lines. This resulted in most rows being skipped. If this was not the intended outcome, please consider setting 'blank.lines.skip' to TRUE.\n"));
+  }
   if (!firstJumpEnd) {
     if (verbose) DTPRINT(_(" No sep and quote rule found a block of 2x2 or greater. Single column input.\n"));
     topNumFields = 1;
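
The added condition can be read in isolation as a three-part predicate. The sketch below restates it as a standalone function so that each part can be exercised separately. The function name `warn_blank_separated` is hypothetical and does not exist in `fread.c`; the parameters mirror the variables `prevStart`, `topSkip` and `skipEmptyLines` used in the diff, with their meaning taken from the commit message and code comments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical restatement of the condition added in this commit.
 *  prevStart      - NULL when, per the commit message, every line is
 *                   separated by a blank line
 *  topSkip        - number of rows auto-skipped; must exceed 1 so that a
 *                   single blank line between header and data does not warn
 *  skipEmptyLines - true when the user already asked to skip blank lines */
static bool warn_blank_separated(const char *prevStart, int topSkip,
                                 bool skipEmptyLines)
{
    return prevStart == NULL && topSkip > 1 && !skipEmptyLines;
}
```

For example, `warn_blank_separated(NULL, 2, false)` is true, while the same call with `topSkip` equal to 1, or with `skipEmptyLines` set, is false.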
