Skip to content

rbindlist: unprotect longestLevels inside the loop#7794

Merged
ben-schwen merged 7 commits into
masterfrom
fix7793
Jun 23, 2026
Merged

rbindlist: unprotect longestLevels inside the loop#7794
ben-schwen merged 7 commits into
masterfrom
fix7793

Conversation

@aitap

@aitap aitap commented Jun 18, 2026

Copy link
Copy Markdown
Member

Similar to how other protected objects are managed inside the per-column loop (cn, getAttrib(..., R_ClassSymbol)), marks, thisCol), keep longestLevels at a known position in the protection stack and unprotect at the end of the loop iteration when it reaches the top of the stack. This requires moving coercedForFactor deeper in the protection stack, because it needs to remain protected between the loop iterations. As a result, nprotect remains independent from the input size.

Fixes: #7793

@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.85%. Comparing base (0b57d38) to head (ba8a7a7).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7794   +/-   ##
=======================================
  Coverage   98.85%   98.85%           
=======================================
  Files          87       87           
  Lines       17137    17140    +3     
=======================================
+ Hits        16941    16944    +3     
  Misses        196      196           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown
  • HEAD=fix7793 much slower for transform improved in #5493
  • HEAD=fix7793 faster P<0.001 for DT[by,verbose=TRUE] improved in #6296
    Comparison Plot

Generated via commit ba8a7a7

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 3 minutes and 38 seconds
Installing different package versions 21 seconds
Running and plotting the test cases 5 minutes and 21 seconds

Comment thread inst/tests/tests.Rraw Outdated
test(2375.6, print(data.table(x=rep("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 1e6)), topn=1), output="1000000: ABCDEFGHIJKLM...", options=list(width=25, datatable.prettyprint.char=NULL))

# rbindlist() used to put O(ncol(...)) > R_PPStackSize = 50000 items on the protect stack, #7793
x = as.data.table(as.list(1:50001))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per ?Memory, max-ppsize can be set up to 500,000, and I'm not aware of a way to poll it (at a glance R_PPStackSize appears to be non-API)

Probably best to just point to ?Memory and mention here "this test assumes the default value of 50,000", which has been the default for 15+ years, but if it does change in the future, it would be good to know the reason 50,001 was chosen.

@ben-schwen ben-schwen Jun 22, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I start a fresh R session with R --max-ppsize=500000, I can query it with commandArgs(). Is there any other API safe version to set it?

So maybe something like

max_ppsize = local({
  arg = grep("^--max-ppsize=", commandArgs(FALSE), value = TRUE)
  if (length(arg)) {
    v = suppressWarnings(as.integer(sub("^--max-ppsize=", "", tail(arg, 1L))))
    if (!is.na(v)) return(v)
  }
  50000L
})

and then guard the test with max_ppsize?

@MichaelChirico

Copy link
Copy Markdown
Member

Does atime find a performance / memory improvement from the change? Does bench::mark() find less gc() thrashing from a loop like for (j in 10000:20000) rbind(<j-col DT>, <j-col DT>)?

@MichaelChirico MichaelChirico left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code change LGTM

@aitap

aitap commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

Unless the columns are ordered factors with progressively longer levels attributes, the protection stack is filled with R_NilValues. The effect from having to mark and sweep ~50000 additional R_NilValues can be detected at the margins of the time distributions:

parse('.ci/atime/tests.R') |> _[[6]][[3]][[3]] |> eval() -> pkg.edit.fun
atime_versions(
 '.', seq(5000, 49900, 500),
 { gc(full=TRUE); x <- list(as.list(1:N)) }, # need a gc() between runs
 data.table::rbindlist(list(x)),
 sha.vec=c(fix7793='fix7793', master='master'),
 pkg.edit.fun = pkg.edit.fun, seconds.limit=10, times = 100
) -> res

Co-Authored-By: Michael Chirico <chiricom@google.com>

@ben-schwen ben-schwen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change LGTM.

Maybe guard with max_ppsize as suggested in comment above

Co-Authored-By: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>
@ben-schwen ben-schwen merged commit deb9acc into master Jun 23, 2026
15 checks passed
@ben-schwen ben-schwen deleted the fix7793 branch June 23, 2026 14:25
venom1204 pushed a commit that referenced this pull request Jun 23, 2026
* test case

* Protect coercedForFactor deeper in the stack

* Unprotect longestLevels inside the loop

* NEWS entry

* Keep listNames protected

* Clarify test comment

Co-Authored-By: Michael Chirico <chiricom@google.com>

* Detect --max-ppsize

Co-Authored-By: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>

---------

Co-authored-by: Michael Chirico <chiricom@google.com>
Co-authored-by: Benjamin Schwendinger <52290390+ben-schwen@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rbindlist throws errors when combining tables with many columns

3 participants