Skip to content

Simplified and Extended "Updating by Reference" Section in Joins Vignette#6847

Merged
MichaelChirico merged 23 commits into
Rdatatable:masterfrom
venom1204:issue6846
Jun 23, 2025
Merged

Simplified and Extended "Updating by Reference" Section in Joins Vignette#6847
MichaelChirico merged 23 commits into
Rdatatable:masterfrom
venom1204:issue6846

Conversation

@venom1204
Copy link
Copy Markdown
Contributor

@venom1204 venom1204 commented Mar 3, 2025

Simplified and extended the Updating by Reference section in the Joins in data.table vignette.
closes #6846
in this i

  • Added a simpler one-to-one update example for clarity.
  • Provided an efficient solution for updating multiple columns using mget(cols), reducing unnecessary copies.
  • Clarified RIGHT JOIN limitations and how := modifies x in place but does not update i.
  • Mentioned performance improvements observed in large datasets.

@jangorecki can you please review this when you have time and suggest me if any changes are needed
thank you

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.69%. Comparing base (f80867b) to head (ef8d081).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6847   +/-   ##
=======================================
  Coverage   98.69%   98.69%           
=======================================
  Files          79       79           
  Lines       14677    14677           
=======================================
  Hits        14486    14486           
  Misses        191      191           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jangorecki
Copy link
Copy Markdown
Member

I think it is best to ping author of this vignette. I haven't had yet opportunity to read the original version.

Comment thread vignettes/datatable-joins.Rmd Outdated
Comment thread vignettes/datatable-joins.Rmd Outdated
- The on = .(id = product_id) ensures that updates happen based on matching IDs.
- This method modifies Products in place, avoiding unnecessary copies.

2) If we need to get the latest price and date (instead of all matches), we can still use := efficiently:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we've cleanly explained the difference between this join and the previous one. The only difference I see is tail() vs. last(), am I missing something?

IIUC the difference between tail() and last() would be that last() can skip NA values, right?

Comment thread vignettes/datatable-joins.Rmd Outdated
- We add a new `last_updated` column to track when the price was last changed.
- The `by = .EACHI` ensures that the `tail` function is applied for each product in `ProductPriceHistory`.
#### Let's update our `Products` table with the latest price from `ProductPriceHistory`:
```{r Simple One-to-One Update}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these valid {knitr} chunk names? regardless, please use machine-readable names (a la https://style.tidyverse.org/files.html)

Comment thread vignettes/datatable-joins.Rmd Outdated
Comment thread vignettes/datatable-joins.Rmd Outdated
@venom1204 venom1204 requested a review from MichaelChirico March 7, 2025 08:39
@venom1204
Copy link
Copy Markdown
Contributor Author

@lucasmation @MichaelChirico
I've refined and modified the section to align with the issue requirements. Kindly review and share any suggestions when you have time.

thanks.

@Advaitgaur004
Copy link
Copy Markdown

Advaitgaur004 commented Mar 18, 2025

@venom1204, what is the update on this, have you tested the code .

data.table::last(c(1, NA))
# [1] NA

data.table::data.table(a = c(1, 1, 2, 2), b = c(1, NA, 2, NA))[, last(b), by=a]
#        a    V1
#    <num> <num>
# 1:     1    NA
# 2:     2    NA

@venom1204
Copy link
Copy Markdown
Contributor Author

@MichaelChirico I’ve minimized the changes I introduced while ensuring they contain sufficient information and remain easy to understand. Could you review them and let me know if any further simplification is needed, or if they are good as they are?

@venom1204
Copy link
Copy Markdown
Contributor Author

@iagogv3 I’ve incorporated your suggestions into my latest version. Could you review the updates and let me know if any further improvements are needed? Your feedback would be greatly appreciated!
thank you.

@MichaelChirico
Copy link
Copy Markdown
Member

cc @lucasmation, could you take a look at the updated vignette and see if it helps your understanding?

https://github.com/venom1204/data.table/blob/5a3f19cb1ce3926a0758164412dad7fa8ede2282/vignettes/datatable-joins.Rmd#L701-L775

Comment thread vignettes/datatable-joins.Rmd Outdated
`:=`(price = last(i.price), last_updated = last(i.date)),
by = .EACHI]
```
- `by = .EACHI` groups by i's rows (1 group per Products row).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be 1 group per ProductPriceHistory row?

Comment thread vignettes/datatable-joins.Rmd Outdated
Products[ProductPriceHistory, on = .(id = product_id),
(my_var_name) := i.price]
```
- This approach allows flexibility in specifying columns programmatically.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples in https://stackoverflow.com/a/44592473 include what I believe is another interesting point, that the on = option may also depend on the my_var_name (on = paste0(my_var_name, "==from"))

Comment thread vignettes/datatable-joins.Rmd Outdated
update_cols <- intersect(c("price", "category", "stock"), names(ProductPriceHistory))

for (col in update_cols) {
Products[ProductPriceHistory, on = .(id = product_id), (col) := get(paste0("i.", col))]}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iagogv3
Copy link
Copy Markdown
Contributor

iagogv3 commented Apr 6, 2025

Thank you! @venom1204 I added some comments inline.

@venom1204
Copy link
Copy Markdown
Contributor Author

@iagogv3, @MichaelChirico

I’ve updated the entire section accordingly.
When you have some time, could you please review it?

Apologies for being away for a while — I was engaged elsewhere.

Comment thread vignettes/datatable-joins.Rmd Outdated
Comment thread vignettes/datatable-joins.Rmd Outdated
Comment thread vignettes/datatable-joins.Rmd Outdated
Comment thread vignettes/datatable-joins.Rmd Outdated
Comment thread vignettes/datatable-joins.Rmd Outdated
Comment thread vignettes/datatable-joins.Rmd Outdated
Comment thread vignettes/datatable-joins.Rmd Outdated
Comment thread vignettes/datatable-joins.Rmd Outdated
Copy link
Copy Markdown
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think this is in a "good enough" state now.

@MichaelChirico MichaelChirico merged commit 0f166cb into Rdatatable:master Jun 23, 2025
8 checks passed
@trobx
Copy link
Copy Markdown

trobx commented Jun 25, 2025

@MichaelChirico Sorry to chime in - I don't know whether it escaped notice (possibly down to my hopeless github skills) but I'd commented that the by=.EACHI update example might be better left out because:

  1. the example just does what mult is for but very much (500x) slower
  2. it's confusing imo to introduce by=.EACHI in the context of an update join as it ends up being effectively "by=.EACHX"

Mentioned at #6846 (comment) and the benchmark at #6997 showing a mult/fifelse solution that takes ~0.5 secs vs ~2.5 mins for tail/by=.EACHI.

@MichaelChirico
Copy link
Copy Markdown
Member

@MichaelChirico Sorry to chime in - I don't know whether it escaped notice (possibly down to my hopeless github skills) but I'd commented that the by=.EACHI update example might be better left out because:

  1. the example just does what mult is for but very much (500x) slower
  2. it's confusing imo to introduce by=.EACHI in the context of an update join as it ends up being effectively "by=.EACHX"

Mentioned at #6846 (comment) and the benchmark at #6997 showing a mult/fifelse solution that takes ~0.5 secs vs ~2.5 mins for tail/by=.EACHI.

could you file a new issue & ideally a follow-up PR? thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Joins in data.table vignette. Simplify and extend the "update by reference section"

6 participants