data.table/vignettes/datatable-reference-semantics.Rmd at dcf710dcb7ad201a6cc7a370a5fc9af88d8fe5f4 · Rdatatable/data.table · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
---
title: "Reference semantics"
date: "`{r} Sys.Date()`"
output:
  litedown::html_format
vignette: >
  %\VignetteIndexEntry{Reference semantics}
  %\VignetteEngine{litedown::vignette}
  \usepackage[utf8]{inputenc}
---

```{r, echo=FALSE, file='_translation_links.R'}
```
`{r} .write.translation.links("Translations of this document are available in: %s")`

```{r, echo = FALSE, message = FALSE}
library(data.table)
litedown::reactor(comment = "# ")
.old.th = setDTthreads(1)
```

This vignette discusses *data.table*'s reference semantics which allows to *add/update/delete* columns of a *data.table by reference*, and also combine them with `i` and `by`. It is aimed at those who are already familiar with *data.table* syntax, its general form, how to subset rows in `i`, select and compute on columns, and perform aggregations by group. If you're not familiar with these concepts, please read the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette first.

***

## Data {#data}

We will use the same `flights` data as in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

```{r, echo = FALSE}
options(width = 100L)
```

```{r}
flights <- fread("flights14.csv")
flights
dim(flights)
```

## Introduction

In this vignette, we will

1. first discuss reference semantics briefly and look at the two different forms in which the `:=` operator can be used

2. then see how we can *add/update/delete* columns *by reference* in `j` using the `:=` operator and how to combine with `i` and `by`.

3. and finally we will look at using `:=` for its *side-effect* and how we can avoid the side effects using `copy()`.

## 1. Reference semantics

All the operations we have seen so far in the previous vignette resulted in a new data set. We will see how to *add* new column(s), *update* or *delete* existing column(s) on the original data.

### a) Background

Before we look at *reference semantics*, consider the *data.frame* shown below:

```{r}
DF = data.frame(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
DF
```

When we did:

```r
DF$c <- 18:13               # (1) -- replace entire column
# or
DF$c[DF$ID == "b"] <- 15:13 # (2) -- subassign in column 'c'
```

both (1) and (2) resulted in deep copy of the entire data.frame in versions of `R < 3.1`. [It copied more than once](https://stackoverflow.com/q/23898969/559784). To improve performance by avoiding these redundant copies, *data.table* utilised the [available but unused `:=` operator in R](https://stackoverflow.com/q/7033106/559784).

Great performance improvements were made in `R v3.1` as a result of which only a *shallow* copy is made for (1) and not *deep* copy. However, for (2) still, the entire column is *deep* copied even in `R v3.1+`. This means the more columns one subassigns to in the *same query*, the more *deep* copies R does.

#### *shallow* vs *deep* copy

A *shallow* copy is just a copy of the vector of column pointers (corresponding to the columns in a *data.frame* or *data.table*). The actual data is not physically copied in memory.

A *deep* copy on the other hand copies the entire data to another location in memory.

When subsetting a *data.table* using `i` (e.g., `DT[1:10]`), a *deep* copy is made. However, when `i` is not provided or equals `TRUE`, a *shallow* copy is made.

#
With *data.table's* `:=` operator, absolutely no copies are made in *both* (1) and (2), irrespective of R version you are using. This is because `:=` operator updates *data.table* columns *in-place* (by reference).

### b) The `:=` operator

It can be used in `j` in two ways:

(a) The `LHS := RHS` form

```r
DT[, c("colA", "colB", ...) := list(valA, valB, ...)]

# when you have only one column to assign to you
# can drop the quotes and list(), for convenience
DT[, colA := valA]
```

(b) The functional form

```r
DT[, `:=`(colA = valA, # valA is assigned to colA
          colB = valB, # valB is assigned to colB
          ...
)]
```

Note that the code above explains how `:=` can be used. They are not working examples. We will start using them on `flights` *data.table* from the next section.

#

* In (a), `LHS` takes a character vector of column names and `RHS` a *list of values*. `RHS` just needs to be a `list`, irrespective of how its generated (e.g., using `lapply()`, `list()`, `mget()`, `mapply()` etc.). This form is usually easy to program with and is particularly useful when you don't know the columns to assign values to in advance.

* On the other hand, (b) is handy if you would like to jot some comments down for later.

* The result is returned *invisibly*.

* Since `:=` is available in `j`, we can combine it with `i` and `by` operations just like the aggregation operations we saw in the previous vignette.

#

In the two forms of `:=` shown above, note that we don't assign the result back to a variable. Because we don't need to. The input *data.table* is modified by reference. Let's go through examples to understand what we mean by this.

For the rest of the vignette, we will work with `flights` *data.table*.

## 2. Add/update/delete columns *by reference*

### a) Add columns by reference {#ref-j}

#### -- How can we add columns *speed* and *total delay* of each flight to `flights` *data.table*?

```{r}
flights[, `:=`(speed = distance / (air_time/60), # speed in mph (mi/h)
               delay = arr_delay + dep_delay)]   # delay in minutes
head(flights)

## alternatively, using the 'LHS := RHS' form
# flights[, c("speed", "delay") := list(distance/(air_time/60), arr_delay + dep_delay)]
```

#### Note that

* We did not have to assign the result back to `flights`.

* The `flights` *data.table* now contains the two newly added columns. This is what we mean by *added by reference*.

* We used the functional form so that we could add comments on the side to explain what the computation does. You can also see the `LHS := RHS` form (commented).

### b) Update some rows of columns by reference - *sub-assign* by reference  {#ref-i-j}

Let's take a look at all the `hours` available in the `flights` *data.table*:

```{r}
# get all 'hours' in flights
flights[, sort(unique(hour))]
```

We see that there are totally `25` unique values in the data. Both *0* and *24* hours seem to be present. Let's go ahead and replace *24* with *0*.

#### -- Replace those rows where `hour == 24` with the value `0`

```{r}
# subassign by reference
flights[hour == 24L, hour := 0L]
```

* We can use `i` along with `:=` in `j` the very same way as we have already seen in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette.

* Column `hour` is replaced with `0` only on those *row indices* where the condition `hour == 24L` specified in `i` evaluates to `TRUE`.

* `:=` returns the result invisibly. Sometimes it might be necessary to see the result after the assignment. We can accomplish that by adding an empty `[]` at the end of the query as shown below:

    ```{r}
    flights[hour == 24L, hour := 0L][]
    ```

#
Let's look at all the `hours` to verify.

```{r}
# check again for '24'
flights[, sort(unique(hour))]
```

#### Exercise: {#update-by-reference-question}

What is the difference between `flights[hour == 24L, hour := 0L]` and `flights[hour == 24L][, hour := 0L]`? Hint: The latter needs an assignment (`<-`) if you would want to use the result later.

If you can't figure it out, have a look at the `Note` section of `?":="`.

### c) Delete column by reference

#### -- Remove `delay` column

```{r}
flights[, c("delay") := NULL]
head(flights)

## or using the functional form
# flights[, `:=`(delay = NULL)]
```

#### {#delete-convenience}

* Assigning `NULL` to a column *deletes* that column. And it happens *instantly*.

* We can also pass column numbers instead of names in the `LHS`, although it is good programming practice to use column names.

* When there is just one column to delete, we can drop the `c()` and double quotes and just use the column name *unquoted*, for convenience. That is:

    ```r
    flights[, delay := NULL]
    ```

    is equivalent to the code above.

### d) `:=` along with grouping using `by` {#ref-j-by}

We have already seen the use of `i` along with `:=` in [Section 2b](#ref-i-j). Let's now see how we can use `:=` along with `by`.

#### -- How can we add a new column which contains for each `orig,dest` pair the maximum speed?

```{r}
flights[, max_speed := max(speed), by = .(origin, dest)]
head(flights)
```

* We add a new column `max_speed` using the `:=` operator by reference.

* We provide the columns to group by the same way as shown in the *Introduction to data.table* vignette. For each group, `max(speed)` is computed, which returns a single value. That value is recycled to fit the length of the group. Once again, no copies are being made at all. `flights` *data.table* is modified *in-place*.

* We could have also provided `by` with a *character vector* as we saw in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette, e.g., `by = c("origin", "dest")`.

#### Note on zero-length RHS and `by`

* If the `RHS` of a `:=` assignment evaluates to a zero-length vector, an error is normally raised. When `:=` is used with `by`, however, a zero-length result for a group is treated as a no-op for that group and no error is thrown. This allows grouped operations to continue even when some groups produce no result.

#

### e) Multiple columns and `:=`

#### -- How can we add two more columns computing `max()` of `dep_delay` and `arr_delay` for each month, using `.SD`?

```{r}
in_cols  = c("dep_delay", "arr_delay")
out_cols = c("max_dep_delay", "max_arr_delay")
flights[, c(out_cols) := lapply(.SD, max), by = month, .SDcols = in_cols]
head(flights)
```

* We use the `LHS := RHS` form. We store the input column names and the new columns to add in separate variables and provide them to `.SDcols` and for `LHS` (for better readability).

* Note that since we allow assignment by reference without quoting column names when there is only one column as explained in [Section 2c](#delete-convenience), we can not do `out_cols := lapply(.SD, max)`. That would result in adding one new column named `out_cols`. Instead we should do either `c(out_cols)` or simply `(out_cols)`. Wrapping the variable name with `(` is enough to differentiate between the two cases.

* The `LHS := RHS` form allows us to operate on multiple columns. In the RHS, to compute the `max` on columns specified in `.SDcols`, we make use of the base function `lapply()` along with `.SD` in the same way as we have seen before in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette. It returns a list of two elements, containing the maximum value corresponding to `dep_delay` and `arr_delay` for each group.

#
Before moving on to the next section, let's clean up the newly created columns `speed`, `max_speed`, `max_dep_delay` and `max_arr_delay`.

```{r}
# RHS gets automatically recycled to length of LHS
flights[, c("speed", "max_speed", "max_dep_delay", "max_arr_delay") := NULL]
head(flights)
```

#### -- How can we update multiple existing columns in place using `.SD`?

```{r}
flights[, names(.SD) := lapply(.SD, as.factor), .SDcols = is.character]
```
Let's clean up again and convert our newly-made factor columns back into character columns. This time we will make use of `.SDcols` accepting a function to decide which columns to include. In this case, `is.factor()` will return the columns which are factors. For more on the **S**ubset of the **D**ata, there is also an [SD Usage vignette](https://cran.r-project.org/package=data.table/vignettes/datatable-sd-usage.html).

Sometimes, it is also nice to keep track of columns that we transform. That way, even after we convert our columns we would be able to call the specific columns we were updating.
```{r}
factor_cols <- sapply(flights, is.factor)
flights[, names(.SD) := lapply(.SD, as.character), .SDcols = factor_cols]
str(flights[, ..factor_cols])
```

#### {.bs-callout .bs-callout-info}

* We also could have used `(factor_cols)` on the `LHS` instead of `names(.SD)`.

## 3. `:=` and `copy()`

`:=` modifies the input object by reference. Apart from the features we have discussed already, sometimes we might want to use the update by reference feature for its side effect. And at other times it may not be desirable to modify the original object, in which case we can use `copy()` function, as we will see in a moment.

### a) `:=` for its side effect

Let's say we would like to create a function that would return the *maximum speed* for each month. But at the same time, we would also like to add the column `speed` to *flights*. We could write a simple function as follows:

```{r}
foo <- function(DT) {
  DT[, speed := distance / (air_time/60)]
  DT[, .(max_speed = max(speed)), by = month]
}
ans = foo(flights)
head(flights)
head(ans)
```

* Note that the new column `speed` has been added to `flights` *data.table*. This is because `:=` performs operations by reference. Since `DT` (the function argument) and `flights` refer to the same object in memory, modifying `DT` also reflects on `flights`.

* And `ans` contains the maximum speed for each month.

### b) The `copy()` function

In the previous section, we used `:=` for its side effect. But of course this may not be always desirable. Sometimes, we would like to pass a *data.table* object to a function, and might want to use the `:=` operator, but *wouldn't* want to update the original object. We can accomplish this using the function `copy()`.

The `copy()` function *deep* copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object.

#

There are two particular places where `copy()` function is essential:

1. Contrary to the situation we have seen in the previous point, we may not want the input data.table to a function to be modified *by reference*. As an example, let's consider the task in the previous section, except we don't want to modify `flights` by reference.

    Let's first delete the `speed` column we generated in the previous section.

    ```{r}
    flights[, speed := NULL]
    ```
    Now, we could accomplish the task as follows:

    ```{r}
    foo <- function(DT) {
      DT <- copy(DT)                              ## deep copy
      DT[, speed := distance / (air_time/60)]     ## doesn't affect 'flights'
      DT[, .(max_speed = max(speed)), by = month]
    }
    ans <- foo(flights)
    head(flights)
    head(ans)
    ```

* Using `copy()` function did not update `flights` *data.table* by reference. It doesn't contain the column `speed`.

* And `ans` contains the maximum speed corresponding to each month.

However we could improve this functionality further by *shallow* copying instead of *deep* copying. In fact, we would very much like to [provide this functionality for `v1.9.8`](https://github.com/Rdatatable/data.table/issues/617). We will touch up on this again in the *data.table design* vignette.

#

2. When we store the column names on to a variable, e.g., `DT_n = names(DT)`, and then *add/update/delete* column(s) *by reference*. It would also modify `DT_n`, unless we do `copy(names(DT))`.

    ```{r}
    DT = data.table(x = 1L, y = 2L)
    DT_n = names(DT)
    DT_n

    ## add a new column by reference
    DT[, z := 3L]

    ## DT_n also gets updated
    DT_n

    ## use `copy()`
    DT_n = copy(names(DT))
    DT[, w := 4L]

    ## DT_n doesn't get updated
    DT_n
    ```
### c) Selecting columns: `$` / `[[...]]` vs `[, col]`

When you extract a single column as a vector, there is a subtle but important difference between standard R methods ($ and [[...]]) and data.table's j expression. DT$col and DT[['col']] may return a reference to the column, while DT[, col] always returns a copy.

A short example:
```{r}
DT = data.table(a = 1:3)

# three ways to get the column
x_ref = DT$a        # may be a reference
y_cpy = DT[, a]     # always a copy
z_cpy = copy(DT$a)  # forced copy

# modify DT by reference
DT[, a := a + 10L]

# observe results
x_ref   # may show 11 12 13
y_cpy   # 1 2 3
z_cpy   # 1 2 3
```

To select a single column as a vector, remember:
- `DT[, mycol]` is safer as it always returns a new, independent copy.
- `DT$mycol` is fast but may return a reference. Use `copy(DT$mycol)` to guarantee independence.

### d) Side effects and testing

Because `:=` modifies by reference, changes to a data.table persist after evaluation. If the same *data.table* is reused for example, in a loop or when using `test()` with multiple `optimization` levels, subsequent runs will start with the modified table from the previous run. Use `copy()` to ensure each run starts with the same data.

```{r}
DT = data.table(a = 1L)
# Subsequent runs are cumulative
DT[, a := a + 1L][]
DT[, a := a + 1L][]

# Use copy() to isolate runs
test_expr = function(x) copy(x)[, a := a + 1L][]
test_expr(DT)
test_expr(DT)
```

## Summary

#### The `:=` operator

* It is used to *add/update/delete* columns by reference.

* We have also seen how to use `:=` along with `i` and `by` the same way as we have seen in the [`vignette("datatable-intro", package="data.table")`](datatable-intro.html) vignette. We can in the same way use `keyby`, chain operations together, and pass expressions to `by` as well all in the same way. The syntax is *consistent*.

* We can use `:=` for its side effect or use `copy()` to not modify the original object while updating by reference.

```{r, echo=FALSE}
setDTthreads(.old.th)
```

#

So far we have seen a whole lot in `j`, and how to combine it with `by` and little of `i`. Let's turn our attention back to `i` in the [next vignette (`vignette("datatable-keys-fast-subset", package="data.table")`)](datatable-keys-fast-subset.html) to perform *blazing fast subsets* by *keying data.tables*.

***