Merged
Changes from 4 commits
56 changes: 39 additions & 17 deletions vignettes/datatable-joins.Rmd
@@ -702,23 +702,45 @@ Products[!"popcorn",

The `:=` operator in `data.table` is used for updating or adding columns by reference. This means it modifies the original `data.table` without creating a copy, which is very memory-efficient, especially for large datasets. When used inside a `data.table`, `:=` allows you to **add new columns** or **modify existing ones** as part of your query.
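
As a minimal sketch of what "by reference" means in practice, the following uses a small hypothetical table (`dt`, `alias`, and `snapshot` are illustrative names, not part of the vignette's data). It contrasts a plain assignment, which only binds a second name to the same table, with `copy()`, which creates an independent table.

```r
library(data.table)

# Hypothetical toy table, not the vignette's data
dt <- data.table(id = 1:3, price = c(1.00, 2.00, 3.00))

alias <- dt           # second name for the same table; no copy is made
snapshot <- copy(dt)  # independent deep copy

# := modifies the existing price column in place
dt[id == 2L, price := 9.99]

alias$price[2]     # 9.99: alias refers to the modified table
snapshot$price[2]  # 2: the deep copy is unaffected
```

This is why the examples that should leave `Products` untouched wrap it in `copy()` first.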

Let's update our `Products` table with the latest price from `ProductPriceHistory`:

```{r}
copy(Products)[ProductPriceHistory,
on = .(id = product_id),
j = `:=`(price = tail(i.price, 1),
last_updated = tail(i.date, 1)),
by = .EACHI][]
```

In this operation:

- The `copy` function creates a ***deep*** copy of the `Products` table, preventing modifications made by `:=` from changing the original table by reference.
- We join `Products` with `ProductPriceHistory` based on `id` and `product_id`.
- We update the `price` column with the latest price from `ProductPriceHistory`.
- We add a new `last_updated` column to track when the price was last changed.
- The `by = .EACHI` ensures that the `tail` function is applied for each product in `ProductPriceHistory`.
#### Let's update our `Products` table with the latest price from `ProductPriceHistory`:
```{r Simple One-to-One Update}
# Review comment (Member): are these valid {knitr} chunk names? Regardless, please use machine-readable names (a la https://style.tidyverse.org/files.html)

Products[ProductPriceHistory, on = .(id = product_id), price := i.price]
```
- The `price` column in `Products` is updated using the `price` column from `ProductPriceHistory`.
- `on = .(id = product_id)` ensures that updates happen only for rows with matching IDs.
- This method modifies `Products` in place, avoiding unnecessary copies.
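
The points above can be sketched on a pair of hypothetical toy tables (`products` and `new_prices` are illustrative stand-ins with one price row per product, not the vignette's actual data):

```r
library(data.table)

# Hypothetical stand-ins for Products and a one-row-per-product price table
products <- data.table(id = 1:3,
                       name = c("banana", "carrots", "popcorn"),
                       price = c(0.63, 0.89, 2.99))
new_prices <- data.table(product_id = c(1L, 2L),
                         price = c(0.65, 0.99))

# Update join: only rows of products with a matching id are changed
products[new_prices, on = .(id = product_id), price := i.price]

products$price  # 0.65 0.99 2.99 (popcorn has no match and keeps its price)
```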

#### If we need to get the latest price and date (instead of all matches), we can still use `:=` efficiently:
```{r Updating with the Latest Record}
Products[ProductPriceHistory,
on = .(id = product_id),
`:=`(price = last(i.price), last_updated = last(i.date)),
by = .EACHI]
```
- `last(i.price)` selects the last price among the matching rows, which is the latest one when `ProductPriceHistory` is sorted by date.
- A `last_updated` column is added to track the last update date.
- `by = .EACHI` ensures that the last price is picked for each product.
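
A minimal sketch of the same pattern on hypothetical toy data (`products` and `history` are illustrative names and values). It assumes "latest" means the last row per product after sorting by date, so the history table is ordered explicitly first.

```r
library(data.table)

# Hypothetical stand-ins for Products and ProductPriceHistory
products <- data.table(id = 1:3, price = c(0.60, 0.80, 2.99))
history <- data.table(
  product_id = rep(1:2, each = 3),
  date = rep(as.IDate(c("2024-01-01", "2024-02-01", "2024-03-01")), 2),
  price = c(0.59, 0.63, 0.65,   # product 1
            0.79, 0.89, 0.99)   # product 2
)

# Make sure the last row per product is the most recent one
setorder(history, product_id, date)

products[history,
         on = .(id = product_id),
         `:=`(price = last(i.price), last_updated = last(i.date)),
         by = .EACHI]

products[]  # product 3 has no history: price unchanged, last_updated is NA
```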

#### Understanding `last()` vs. `tail()`

Both functions return the final element of a vector, and neither skips `NA`s:

- `last(x)`: the `data.table` helper. Returns the last element of a vector (or the last row of a `data.table`), including `NA` if present.
- `tail(x, 1)`: the base R function. Returns the same last element (or last row), including `NA` if present.

In this case, `last(i.price)` and `tail(i.price, 1)` give the same result. If the most recent price may be missing and the latest non-`NA` price is wanted, drop the missing values first, for example with `last(i.price[!is.na(i.price)])`.
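
A quick check on a plain vector shows how both functions treat a trailing `NA`. The vector `x` is a hypothetical example, and `last()` here is the `data.table` version (other packages that define `last()` may behave differently).

```r
library(data.table)

# Hypothetical price vector whose most recent entry is missing
x <- c(0.59, 0.63, NA)

last(x)             # NA: the last element, missing or not
tail(x, 1)          # NA: the same element
last(x[!is.na(x)])  # 0.63: the last non-missing value
```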

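#### When we need to update `Products` with multiple columns from `ProductPriceHistory`
```{r Efficient Right Join Update }
cols <- setdiff(names(ProductPriceHistory), 'product_id')
# Prefix with "i." so that columns present in both tables (such as price)
# are read from ProductPriceHistory rather than from Products
Products[ProductPriceHistory,
on = .(id = product_id),
(cols) := mget(paste0("i.", cols))]
```
- Updates multiple columns in `Products` from `ProductPriceHistory` in a single join.
- `mget(paste0("i.", cols))` retrieves the listed columns from `ProductPriceHistory` dynamically. The `i.` prefix matters because an unprefixed name that exists in both tables refers to the column in `Products`.
- This method is typically faster and more memory-efficient than `Products <- ProductPriceHistory[Products, on=...]`, which builds a new table.
- Note: `:=` updates `Products` in place, but does not modify `ProductPriceHistory`.
- Unlike a traditional RIGHT JOIN, `data.table` does not allow `i` (the right table) to be updated directly.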
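
A minimal sketch of a multi-column update join on hypothetical toy tables (`products` and `latest` are illustrative names and values). It uses the `i.` prefix so that a column name shared by both tables, here `price`, is taken from the right-hand table.

```r
library(data.table)

# Hypothetical stand-ins: one "latest" row per product
products <- data.table(id = 1:3, price = c(1.0, 2.0, 3.0))
latest <- data.table(product_id = c(1L, 2L),
                     date = as.IDate(c("2024-03-01", "2024-03-01")),
                     price = c(1.5, 2.5))

# Every column of latest except the join key
cols <- setdiff(names(latest), "product_id")

# "i."-prefixed names refer to columns of the table passed as i (latest)
products[latest,
         on = .(id = product_id),
         (cols) := mget(paste0("i.", cols))]

products[]  # price updated for ids 1 and 2; date added, NA for id 3
```

Only `products` changes here; `latest` keeps its original columns and values.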

***
