Data.table: why data.table is faster with vectorized column subset than list column subset

Created on 27 Mar 2019  ·  3 comments  ·  Source: Rdatatable/data.table

I like this data.table stuff, both for its execution speed and for its parsimonious way of scripting.
I use it on small tables as well.
I regularly subset tables this way: DT[, .(id1, id5)]
and not this way: DT[, c("id1", "id5")]

Today I measured the speed of the two and was astonished by the difference on small tables. The parsimonious method is far slower.

Is this difference intended?

Is there any plan to make the parsimonious way converge in execution speed with the vectorized one?
(It matters when I have to subset several small tables repeatedly.)

Ubuntu 18.04
R version 3.5.3 (2019-03-11)
data.table 1.12.0
RAM 32GB
Intel® Core™ i7-8565U CPU @ 1.80GHz × 8

```
library(data.table)
library(microbenchmark)
N  <- 2e8
K  <- 100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),           # large groups (char)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),           # large groups (char)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),      # small groups (char)
  id4 = sample(K,   N, TRUE),                              # large groups (int)
  id5 = sample(K,   N, TRUE),                              # large groups (int)
  id6 = sample(N/K, N, TRUE),                              # small groups (int)
  v1  = sample(5,   N, TRUE),                              # int in range [1,5]
  v2  = sample(5,   N, TRUE),                              # int in range [1,5]
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)   # numeric e.g. 23.5749
)

microbenchmark(
  DT[, .(id1, id5)],
  DT[, c("id1", "id5")]
)

Unit: seconds
                  expr      min       lq     mean   median       uq      max neval
     DT[, .(id1, id5)] 1.588367 1.614645 1.929348 1.626847 1.659698 12.33872   100
 DT[, c("id1", "id5")] 1.592154 1.613800 1.937548 1.628082 2.184456 11.74581   100
```


```
N   <- 2e5
DT2 <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),           # large groups (char)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),           # large groups (char)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),      # small groups (char)
  id4 = sample(K,   N, TRUE),                              # large groups (int)
  id5 = sample(K,   N, TRUE),                              # large groups (int)
  id6 = sample(N/K, N, TRUE),                              # small groups (int)
  v1  = sample(5,   N, TRUE),                              # int in range [1,5]
  v2  = sample(5,   N, TRUE),                              # int in range [1,5]
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)   # numeric e.g. 23.5749
)

microbenchmark(
  DT2[, .(id1, id5)],
  DT2[, c("id1", "id5")]
)

Unit: microseconds
                   expr      min       lq      mean    median        uq      max neval
     DT2[, .(id1, id5)] 1405.042 1461.561 1525.5314 1491.7885 1527.8955 2220.860   100
 DT2[, c("id1", "id5")]  614.624  640.617  666.2426  659.0175  676.9355  906.966   100
```

All 3 comments

You can fix the formatting of your post by using a single line of three backticks before and after the code chunk:

```
code
```

> It matters when I have to subset several small tables repeatedly.

I guess repeatedly selecting columns from small tables is something that should, and in most cases can, be avoided...? Because j in DT[i, j, by] supports and optimizes such a wide variety of inputs, I think that it is natural that there is some overhead in parsing it.
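
For reference, here are a few of the j spellings that all resolve to the same two-column subset (all standard data.table usage; cols is just a helper variable). Each form has to be inspected and dispatched before any data is touched, which is where the fixed per-call overhead comes from:

```
cols <- c("id1", "id5")
DT[, .(id1, id5)]         # expression form, evaluated within the table
DT[, c("id1", "id5")]     # character vector of column names
DT[, ..cols]              # names held in a variable, via the .. prefix
DT[, cols, with = FALSE]  # older spelling equivalent to ..cols
```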


Regarding other ways to approach your problem (and maybe this would be a better fit for Stack Overflow if you want to talk about it more) ... Depending on what else you want to do with the table, you could just delete the other cols, DT[, setdiff(names(DT), cols) := NULL] and continue using DT directly.
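
For example, a minimal sketch of that in-place route (the toy table and names here are made up for illustration):

```
library(data.table)
tmp  <- data.table(id1 = 1:3, id2 = 4:6, id5 = 7:9)
cols <- c("id1", "id5")
# := NULL deletes the unwanted columns by reference;
# the remaining columns are not copied
tmp[, setdiff(names(tmp), cols) := NULL]
tmp
#    id1 id5
# 1:   1   7
# 2:   2   8
# 3:   3   9
```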

If you still prefer to take the subset, grabbing column pointers is much faster than either option you considered here, though this way edits to the result will affect the original table:

```
library(data.table)
library(microbenchmark)
N <- 2e8
K <- 100
set.seed(1)
DT <- data.table(
  id1 = sprintf("id%03d", 1:K),           # large groups (char)
  id2 = sprintf("id%03d", 1:K),           # large groups (char)
  id3 = sprintf("id%010d", 1:(N/K)),      # small groups (char)
  id4 = sample(K),                        # large groups (int)
  id5 = sample(K),                        # large groups (int)
  id6 = sample(N/K),                      # small groups (int)
  v1  = sample(5),                        # int in range [1,5]
  v2  = sample(5),                        # int in range [1,5]
  v3  = round(runif(100, max = 100), 4),  # numeric e.g. 23.5749
  row = seq_len(N)
)

cols = c("id1", "id5")
microbenchmark(times = 3,
  expression = DT[, .(id1, id5)],
  index      = DT[, c("id1", "id5")],
  dotdot     = DT[, ..cols],
  oddball    = setDT(lapply(setNames(cols, cols), function(x) DT[[x]]))[],
  oddball2   = setDT(unclass(DT)[cols])[]
)

Unit: microseconds
       expr         min           lq         mean      median           uq         max neval
 expression 1249753.580 1304355.3415 1417166.9297 1358957.103 1500873.6045 1642790.106     3
      index 1184056.302 1191334.4835 1396372.3483 1198612.665 1502530.3715 1806448.078     3
     dotdot 1084521.234 1240062.2370 1439680.6980 1395603.240 1617260.4300 1838917.620     3
    oddball      92.659     171.8635     568.5317     251.068     806.4680    1361.868     3
   oddball2      66.582     125.9505     150.7337     185.319     192.8095     200.300     3
```

(I took the randomization out of your example and reduced the number of benchmark runs because I was impatient.)

I've never found a way to directly call R's list subset (which gets used after the unclass above).
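
(Possibly base R's .subset(), which the docs describe as the non-dispatching equivalent of [, fits here; a sketch, untested against the benchmark above and with the same aliasing caveat as the other oddballs:)

```
# .subset() bypasses S3 dispatch and drops the class attribute, so the
# result is a plain list of shallow-copied column pointers
oddball3 <- setDT(.subset(DT, cols))[]
```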

Regarding "edits to the result will modify the original table", I mean:

```
myDT = data.table(a = 1:2, b = 3:4)

# standard way
res <- myDT[, "a"]
res[, a := 0]
myDT
#    a b
# 1: 1 3
# 2: 2 4

# oddball, grabbing pointers
res2 <- setDT(unclass(myDT)["a"])
res2[, a := 0]
myDT
#    a b
# 1: 0 3
# 2: 0 4
```
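
If the aliasing is a problem, wrapping the result in data.table's copy() gives a deep copy that is safe to modify, at the cost of giving back some of the speed win; a sketch:

```
myDT = data.table(a = 1:2, b = 3:4)

# copy() deep-copies the grabbed column, so := on the result
# no longer touches myDT
res3 <- copy(setDT(unclass(myDT)["a"]))
res3[, a := 0]
myDT
#    a b
# 1: 1 3
# 2: 2 4
```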

OK, I learned something new and speedy today (the oddballs), and I have taken note that there is a trade-off between speed and parsimonious coding. So the glass is half full! Thanks!

I guess #852 is related.
