BST430 Lecture 06-B

Reference based data manipulation with data.table

Seong-Hwan Jun

R objects are immutable

  • R uses copy-on-modify semantics: when we modify a data.frame (or tibble), R makes a copy rather than changing the original.
  • Copying is inefficient for large datasets.
  • We may want to pass a data structure to a function that modifies it in place, without making a copy.
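
A minimal sketch of copy-on-modify in base R. The helper name add_flag and the toy data are hypothetical, chosen only for illustration: the function changes its argument, yet the caller's object is left untouched.

```r
# Hypothetical helper: modifies its argument inside the function body.
add_flag <- function(df) {
  df$flag <- TRUE   # R copies here; the caller's object is not changed
  df
}

DF <- data.frame(x = 1:3)
DF2 <- add_flag(DF)

names(DF)    # the original still has only column x
names(DF2)   # the returned copy has x and flag
```

To keep the change, the caller has to reassign the result (DF <- add_flag(DF)), which is exactly the copy we would like to avoid on large data.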

data.frame example

DF = data.frame(x = 1:5, y = letters[1:5])
DF
  x y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
tracemem(DF) # Prints the old and new memory addresses of DF whenever it is copied.
[1] "<0x7fa1cd0a5a08>"
DF$y <- toupper(DF$y)
tracemem[0x7fa1cd0a5a08 -> 0x7fa1cd485848]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers withCallingHandlers handle_error process_file <Anonymous> <Anonymous> execute .main 
tracemem[0x7fa1cd485848 -> 0x7fa1cd485808]: $<-.data.frame $<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers withCallingHandlers handle_error process_file <Anonymous> <Anonymous> execute .main 

data.table and := operator

data.table is an R package that provides an enhanced version of data.frame; it supports modification in place through the := operator.

library(data.table)
DT = data.table(x = 1:5, y = letters[1:5])
tracemem(DT)
[1] "<0x7fa1d6b1c000>"
DT[,y:= toupper(y)]
DT
       x      y
   <int> <char>
1:     1      A
2:     2      B
3:     3      C
4:     4      D
5:     5      E

No memory copy: tracemem() prints nothing after the update.
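
As a complementary check, the sketch below compares the address of the table before and after the update using address(), which data.table exports. The variable names addr_before and addr_after are illustrative only.

```r
library(data.table)

DT <- data.table(x = 1:5, y = letters[1:5])
addr_before <- address(DT)   # address of the data.table object
DT[, y := toupper(y)]        # update column y by reference
addr_after <- address(DT)

identical(addr_before, addr_after)   # the table itself was not copied
```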

Passing by reference

data.table can be passed to functions that perform modification in place.

set.seed(123)
num_elements <- 10000
DT = data.table(x = sample(letters, num_elements, replace = TRUE), 
                y = sample(letters, num_elements, replace = TRUE))
DT
            x      y
       <char> <char>
    1:      o      h
    2:      s      m
    3:      n      f
    4:      c      v
    5:      j      o
   ---              
 9996:      y      r
 9997:      i      f
 9998:      w      n
 9999:      f      a
10000:      w      d

modify_DT = function(DT) {
  DT[x == 'i' & y == 'j', y := "ij"]
}
tracemem(DT)
[1] "<0x7fa1d799fa00>"
modify_DT(DT) # Modifies DT in place; the result is returned invisibly, so nothing is printed.
DT[x=='i' & y == 'ij']
         x      y
    <char> <char>
 1:      i     ij
 2:      i     ij
 3:      i     ij
 4:      i     ij
 5:      i     ij
 6:      i     ij
 7:      i     ij
 8:      i     ij
 9:      i     ij
10:      i     ij
11:      i     ij
12:      i     ij
13:      i     ij
14:      i     ij
15:      i     ij
16:      i     ij

Check that the memory address of DT remains the same after modification.

lobstr::obj_addr(DT)
[1] "0x7fa1d799fa00"
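
One consequence of reference semantics is that plain assignment does not create an independent table. The sketch below, with the hypothetical names alias and snapshot, contrasts assignment with data.table's copy(), which is the usual way to protect a table from in-place changes.

```r
library(data.table)

DT <- data.table(x = 1:3)
alias <- DT            # no copy: both names refer to the same table
snapshot <- copy(DT)   # explicit copy, independent of DT

DT[, x2 := x * 2L]     # add a column by reference

names(alias)      # also sees the new column x2
names(snapshot)   # unaffected by the update
```

A function that should not alter its input can call copy() on the argument first, at the cost of the memory we saved by passing the table by reference.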

Modifying a subset of rows

DF = data.frame(x = sample(letters, num_elements, replace = TRUE), y = sample(letters, num_elements, replace = TRUE))
tracemem(DF)
[1] "<0x7fa1d82ed048>"
DF[DF$x == 'i' & DF$y == 'j', 'y'] <- "ij"
tracemem[0x7fa1d82ed048 -> 0x7fa1d82eeb48]: eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers withCallingHandlers handle_error process_file <Anonymous> <Anonymous> execute .main 
tracemem[0x7fa1d82eeb48 -> 0x7fa1d82eea88]: [<-.data.frame [<- eval eval withVisible withCallingHandlers eval eval with_handlers doWithOneRestart withOneRestart withRestartList doWithOneRestart withOneRestart withRestartList withRestarts <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers withCallingHandlers handle_error process_file <Anonymous> <Anonymous> execute .main 
lobstr::obj_addr(DF)
[1] "0x7fa1d82eea88"

Delete a column

DT[, y := NULL]
DT
            x
       <char>
    1:      o
    2:      s
    3:      n
    4:      c
    5:      j
   ---       
 9996:      y
 9997:      i
 9998:      w
 9999:      f
10000:      w
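
The same operator handles several columns at once. A minimal sketch on a toy table (the column names x_sq and y_up are made up for illustration), using the functional form of := to add columns and a character vector of names to delete them:

```r
library(data.table)

DT <- data.table(x = 1:3, y = c("a", "b", "c"))

# Add two columns in one call using the functional form of `:=`.
DT[, `:=`(x_sq = x^2, y_up = toupper(y))]

# Remove several columns at once by giving a character vector of names.
DT[, c("y", "x_sq") := NULL]

names(DT)   # "x" "y_up"
```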

Input and output

DT <- fread("l05/data/trump/approval_polllist.csv")
temp <- DT[,.(avg_approval=mean(approve)),by=.(population)]
fwrite(temp, "l05/data/trump/approval_by_population.csv")

The input and output functions, fread() and fwrite(), are versatile and just “work” in most cases.
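
A self-contained round trip, sketched with a temporary file rather than the course data. The table contents and the variable path are assumptions for illustration only.

```r
library(data.table)

DT <- data.table(id = 1:3, grp = c("a", "b", "a"))
path <- tempfile(fileext = ".csv")   # temporary file, not a course path

fwrite(DT, path)     # write the table as CSV
DT2 <- fread(path)   # read it back; column types are detected automatically

str(DT2)             # inspect the detected column types
unlink(path)         # clean up the temporary file
```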

Primary keys

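  • data.table supports keys for fast subsetting. Unlike a database primary key, a data.table key need not be unique.
  • Set a key using setkey(); this sorts the table by the key columns, by reference.
  • Keyed subsetting uses binary search, O(log n), instead of a full vector scan, O(n).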
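
The bullets above can be sketched on a toy table as follows; the data are made up, and the calls shown (setkey(), key(), haskey(), and subsetting with .()) are standard data.table functions.

```r
library(data.table)

DT <- data.table(x = c("c", "a", "b", "a"), v = 1:4)
setkey(DT, x)        # sorts DT by x in place and marks x as the key

key(DT)              # name of the key column
haskey(DT)           # TRUE once a key is set
DT[.("a")]           # keyed subset: rows where x == "a"
DT[.("a"), sum(v)]   # aggregate over the matching rows
```

Note that the key value "a" matches two rows here, which is allowed.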

Notes

data.table fully supports the features needed for data manipulation:

  • table joins.
  • reshaping (longer and wider).
  • it works with dplyr and ggplot2; much of what we learned for tibbles applies to data.table.
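
A minimal sketch of a join and a reshape on toy tables. The table names scores and groups and their columns are hypothetical; the join uses the on= argument, and melt() and dcast() are data.table's reshaping functions.

```r
library(data.table)

scores <- data.table(id = 1:3, pre = c(10, 20, 30), post = c(15, 25, 35))
groups <- data.table(id = 1:3, grp = c("a", "b", "a"))

# Join: look up each row of `groups` in `scores` by id.
joined <- scores[groups, on = "id"]

# Wide to long, then back to wide.
long <- melt(joined, id.vars = c("id", "grp"),
             measure.vars = c("pre", "post"),
             variable.name = "time", value.name = "score")
wide <- dcast(long, id + grp ~ time, value.var = "score")
```

melt() plays the role of pivot_longer() and dcast() the role of pivot_wider() from tidyr.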

tidyverse yields more readable code. data.table syntax is a bit more terse and cryptic.
