BST430 Lecture 06-B
Reference based data manipulation with data.table
R objects are immutable
When we modify a data.frame (or tibble), a copy is made.
This is inefficient for large datasets.
We may want to pass the data structure to functions that modify it in place, without making a copy.
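To make the copy semantics concrete, here is a minimal base R sketch. The function name `modify_DF` is hypothetical (it does not appear in the lecture); it is only meant to show that a function receiving a data.frame changes its own copy, so the caller must capture the return value.

```r
# Hypothetical helper: upper-cases column y of a data.frame
modify_DF <- function(df) {
  df$y <- toupper(df$y)  # modifies the function's local copy only
  df                     # the modified copy has to be returned explicitly
}

DF <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = FALSE)
out <- modify_DF(DF)

DF$y   # still lower case: the caller's object was not touched
out$y  # upper case: the change lives only in the returned copy
```

For a large table, every call of this kind pays for at least one copy, which is the cost that modification by reference avoids.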
data.frame example
DF = data.frame(x = 1:5, y = letters[1:5])
DF
x y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
tracemem(DF) # Prints the old and new memory addresses whenever DF is copied.
tracemem[0x7fa1cd0a5a08 -> 0x7fa1cd485848]: eval eval withVisible ...
tracemem[0x7fa1cd485848 -> 0x7fa1cd485808]: $<-.data.frame $<- eval eval withVisible ...
The $<-.data.frame frame in the second trace shows that a column assignment on DF triggered the copies.
data.table and := operator
data.table is an R package that provides an enhanced version of data.frame; it allows modification in place.
library(data.table)
DT = data.table(x = 1:5, y = letters[1:5])
tracemem(DT)
x y
<int> <char>
1: 1 A
2: 2 B
3: 3 C
4: 4 D
5: 5 E
No memory copy is made: tracemem reports nothing when DT is modified with :=.
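The statement that modified `y` did not survive in these notes. The following is a minimal sketch, assuming the update upper-cased `y` (which is consistent with the printed output). It uses `data.table::address()` instead of `tracemem()` to compare memory addresses before and after the update.

```r
library(data.table)

# data.frame: column assignment goes through `$<-.data.frame`, which copies
DF <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = FALSE)
df_before <- address(DF)
DF$y <- toupper(DF$y)      # assumed modification (not shown in the notes)
df_after <- address(DF)
df_before == df_after      # FALSE: DF now refers to a new object

# data.table: `:=` updates the column by reference
DT <- data.table(x = 1:5, y = letters[1:5])
dt_before <- address(DT)
DT[, y := toupper(y)]      # assumed modification (not shown in the notes)
dt_after <- address(DT)
dt_before == dt_after      # TRUE: same object, modified in place
```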
Passing by reference
A data.table can be passed to functions that modify it in place.
set.seed(123)
num_elements <- 10000
DT = data.table(x = sample(letters, num_elements, replace = TRUE),
                y = sample(letters, num_elements, replace = TRUE))
DT
x y
<char> <char>
1: o h
2: s m
3: n f
4: c v
5: j o
---
9996: y r
9997: i f
9998: w n
9999: f a
10000: w d
modify_DT = function(DT) {
  DT[x == 'i' & y == 'j', y := "ij"]
}
tracemem(DT)
modify_DT(DT) # Modifies DT in place; nothing is printed.
DT[x == 'i' & y == 'ij']
x y
<char> <char>
1: i ij
2: i ij
3: i ij
4: i ij
5: i ij
6: i ij
7: i ij
8: i ij
9: i ij
10: i ij
11: i ij
12: i ij
13: i ij
14: i ij
15: i ij
16: i ij
Check that the memory address of DT remains the same after modification.
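One way to run this check is sketched below. It assumes `data.table::address()` is an acceptable substitute for reading `tracemem()` output; the helper variables `n_target`, `addr_before`, and `addr_after` are introduced here for illustration and are not part of the lecture code.

```r
library(data.table)

set.seed(123)
num_elements <- 10000
DT <- data.table(x = sample(letters, num_elements, replace = TRUE),
                 y = sample(letters, num_elements, replace = TRUE))

# Same function as above: updates the matching rows by reference
modify_DT <- function(DT) {
  DT[x == "i" & y == "j", y := "ij"]
}

n_target <- DT[x == "i" & y == "j", .N]  # rows the function should change
addr_before <- address(DT)
modify_DT(DT)
addr_after <- address(DT)

identical(addr_before, addr_after)        # TRUE when no copy was made
DT[x == "i" & y == "ij", .N] == n_target  # TRUE: the caller's DT was updated
```

Because the function receives a reference rather than a copy, the caller sees the update without reassigning the result.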
Modifying a subset of rows
DF = data.frame(x = sample(letters, num_elements, replace = TRUE),
                y = sample(letters, num_elements, replace = TRUE))
tracemem(DF)
DF[DF$x == 'i' & DF$y == 'i', 'y'] <- "ij"
tracemem[0x7fa1d82ed048 -> 0x7fa1d82eeb48]: eval eval withVisible ...
tracemem[0x7fa1d82eeb48 -> 0x7fa1d82eea88]: [<-.data.frame [<- eval eval withVisible ...
Delete a column
x
<char>
1: o
2: s
3: n
4: c
5: j
---
9996: y
9997: i
9998: w
9999: f
10000: w
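The code that produced this output is missing from the notes. The usual data.table idiom for dropping a column by reference is to assign `NULL` with `:=`. The sketch below assumes that is what was used, and runs it on a small three-row table (the first three rows printed earlier) rather than the full 10,000-row one.

```r
library(data.table)

DT <- data.table(x = c("o", "s", "n"), y = c("h", "m", "f"))
addr_before <- address(DT)

DT[, y := NULL]   # assumed idiom: removes column y in place, no copy of DT

addr_after <- address(DT)
names(DT)                            # only "x" remains
identical(addr_before, addr_after)   # TRUE: DT is still the same object
```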
Primary keys
data.table supports primary keys for fast subsetting.
Set a key using setkey().
This allows fast subsetting via binary search, O(log n), instead of a full vector scan, O(n).
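A minimal sketch of keyed subsetting follows, reusing the simulated letters table from the earlier slides. The choice of `(x, y)` as the key and the variable names `keyed` and `filtered` are assumptions made for illustration.

```r
library(data.table)

set.seed(123)
DT <- data.table(x = sample(letters, 10000, replace = TRUE),
                 y = sample(letters, 10000, replace = TRUE))

setkey(DT, x, y)   # sorts DT by reference and marks (x, y) as its key
key(DT)            # "x" "y"

# Keyed subset: rows with x == "i" and y == "j", located by binary search.
# nomatch = NULL drops the lookup row if no match exists.
keyed <- DT[.("i", "j"), nomatch = NULL]

# The same filter written as a logical expression, for comparison
filtered <- DT[x == "i" & y == "j"]
nrow(keyed) == nrow(filtered)   # TRUE: both select the same rows
```

Note that `setkey()` reorders the rows of `DT` itself, so any code relying on the original row order should take that into account.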
Notes
Fully supports features needed for data manipulation:
  table joins.
  reshaping (longer and wider).
Can be used with dplyr and ggplot2; much of what we learned for tibble applies to data.table.
tidyverse yields more readable code; data.table syntax is a bit more terse and cryptic.
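The join and reshaping features mentioned above can be sketched as follows. The tables `scores` and `people` and their columns are hypothetical examples, not data from the lecture. `merge()`, `melt()`, and `dcast()` are the data.table functions assumed here for joining, reshaping longer, and reshaping wider.

```r
library(data.table)

# Hypothetical example tables
scores <- data.table(id = c(1L, 2L, 3L), math = c(90, 80, 70), art = c(60, 75, 85))
people <- data.table(id = c(1L, 2L, 4L), name = c("Ann", "Bob", "Dee"))

# Table join: inner join on id (only ids 1 and 2 appear in both tables)
joined <- merge(scores, people, by = "id")

# Reshape longer: one row per (id, subject) pair
long <- melt(scores, id.vars = "id",
             variable.name = "subject", value.name = "score")

# Reshape wider: back to one column per subject
wide <- dcast(long, id ~ subject, value.var = "score")
```

These play roughly the same role as the tidyverse join and pivot verbs covered for tibbles, with a terser syntax.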