I like this data.table stuff, both for its execution speed and for its parsimonious way of scripting.
I use it even on small tables.
I regularly subset tables this way: `DT[, .(id1, id5)]`
and not this way: `DT[, c("id1", "id5")]`
Today I measured the speed of the two and was astonished by the difference on small tables: the parsimonious method is much slower.
Is this difference intended?
Is there any aspiration to make the parsimonious way converge in execution speed with the other one?
(It counts when I have to subset several small tables repeatedly.)
Ubuntu 18.04
R version 3.5.3 (2019-03-11)
data.table 1.12.0
RAM 32GB
Intel® Core™ i7-8565U CPU @ 1.80GHz × 8
```
library(data.table)
library(microbenchmark)

N <- 2e8
K <- 100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),   # small groups (char)
  id4 = sample(K, N, TRUE),                             # large groups (int)
  id5 = sample(K, N, TRUE),                             # large groups (int)
  id6 = sample(N/K, N, TRUE),                           # small groups (int)
  v1 = sample(5, N, TRUE),                              # int in range [1,5]
  v2 = sample(5, N, TRUE),                              # int in range [1,5]
  v3 = sample(round(runif(100, max = 100), 4), N, TRUE) # numeric e.g. 23.5749
)

microbenchmark(
  DT[, .(id1, id5)],
  DT[, c("id1", "id5")]
)
```
```
Unit: seconds
                  expr      min       lq     mean   median       uq      max neval
     DT[, .(id1, id5)] 1.588367 1.614645 1.929348 1.626847 1.659698 12.33872   100
 DT[, c("id1", "id5")] 1.592154 1.613800 1.937548 1.628082 2.184456 11.74581   100
```
```
N <- 2e5
DT2 <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),        # large groups (char)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),   # small groups (char)
  id4 = sample(K, N, TRUE),                             # large groups (int)
  id5 = sample(K, N, TRUE),                             # large groups (int)
  id6 = sample(N/K, N, TRUE),                           # small groups (int)
  v1 = sample(5, N, TRUE),                              # int in range [1,5]
  v2 = sample(5, N, TRUE),                              # int in range [1,5]
  v3 = sample(round(runif(100, max = 100), 4), N, TRUE) # numeric e.g. 23.5749
)

microbenchmark(
  DT2[, .(id1, id5)],
  DT2[, c("id1", "id5")]
)
```
```
Unit: microseconds
                   expr      min       lq      mean    median        uq      max neval
     DT2[, .(id1, id5)] 1405.042 1461.561 1525.5314 1491.7885 1527.8955 2220.860   100
 DT2[, c("id1", "id5")]  614.624  640.617  666.2426  659.0175  676.9355  906.966   100
```
You can fix the formatting of your post by using a single line of three backticks before and after the code chunk:
```
code
```
> It counts when I have to subset several small tables repeatedly.
I guess repeatedly selecting columns from small tables is something that should, and in most cases can, be avoided...? Because `j` in `DT[i, j, by]` supports and optimizes such a wide variety of inputs, I think it is natural that there is some overhead in parsing it.
Regarding other ways to approach your problem (and maybe this would be a better fit for Stack Overflow if you want to talk about it more) ... Depending on what else you want to do with the table, you could just delete the other cols, `DT[, setdiff(names(DT), cols) := NULL]`, and continue using DT directly.
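To make the delete-in-place idea concrete, here is a minimal sketch on a toy table; `cols` is a hypothetical vector of the columns you want to keep:

```r
library(data.table)

DT   <- data.table(id1 = 1:3, id2 = 4:6, id5 = 7:9)
cols <- c("id1", "id5")

# Remove every column NOT in `cols` by reference -- no copy of DT is made
DT[, setdiff(names(DT), cols) := NULL]

print(names(DT))  # "id1" "id5"
```

Note that `:=` with `NULL` modifies `DT` itself, so this only fits when the dropped columns are no longer needed anywhere.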
If you still prefer to take the subset, grabbing column pointers is much faster than either option you considered here, though this way edits to the result will affect the original table:
```
library(data.table)
library(microbenchmark)

N <- 2e8
K <- 100
set.seed(1)
DT <- data.table(
  id1 = sprintf("id%03d", 1:K),         # large groups (char)
  id2 = sprintf("id%03d", 1:K),         # large groups (char)
  id3 = sprintf("id%010d", 1:(N/K)),    # small groups (char)
  id4 = sample(K),                      # large groups (int)
  id5 = sample(K),                      # large groups (int)
  id6 = sample(N/K),                    # small groups (int)
  v1 = sample(5),                       # int in range [1,5]
  v2 = sample(5),                       # int in range [1,5]
  v3 = round(runif(100, max = 100), 4), # numeric e.g. 23.5749
  row = seq_len(N)
)
cols = c("id1", "id5")

microbenchmark(times = 3,
  expression = DT[, .(id1, id5)],
  index = DT[, c("id1", "id5")],
  dotdot = DT[, ..cols],
  oddball = setDT(lapply(setNames(cols, cols), function(x) DT[[x]]))[],
  oddball2 = setDT(unclass(DT)[cols])[]
)
```
```
Unit: microseconds
       expr         min           lq         mean      median           uq         max neval
 expression 1249753.580 1304355.3415 1417166.9297 1358957.103 1500873.6045 1642790.106     3
      index 1184056.302 1191334.4835 1396372.3483 1198612.665 1502530.3715 1806448.078     3
     dotdot 1084521.234 1240062.2370 1439680.6980 1395603.240 1617260.4300 1838917.620     3
    oddball      92.659     171.8635     568.5317     251.068     806.4680    1361.868     3
   oddball2      66.582     125.9505     150.7337     185.319     192.8095     200.300     3
```
(I took randomization out of your example and reduced # times in the benchmark because I was impatient.)
I've never found a way to directly call R's list subset (which gets used after the `unclass` above).
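For what it's worth, base R does expose `.subset()`, which performs list subsetting while bypassing `[` method dispatch; whether it actually beats the `unclass` route here is an assumption you'd want to benchmark yourself. A sketch:

```r
library(data.table)

myDT <- data.table(a = 1:2, b = 3:4, c = 5:6)

# .subset() does plain list subsetting without S3 dispatch, returning a
# named list of column pointers (attributes other than names are dropped)
res <- setDT(.subset(myDT, c("a", "c")))[]
print(res)
```

Like the `unclass` trick, this grabs pointers, so the same caveat about edits affecting the original applies.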
Regarding "edits to the result will modify the original table", I mean:
```
myDT = data.table(a = 1:2, b = 3:4)

# standard way
res <- myDT[, "a"]
res[, a := 0]
myDT
#    a b
# 1: 1 3
# 2: 2 4

# oddball, grabbing pointers
res2 <- setDT(unclass(myDT)["a"])
res2[, a := 0]
myDT
#    a b
# 1: 0 3
# 2: 0 4
```
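If you want the oddball speed without the shared-pointer footgun, one option (a sketch, not the only way) is to `copy()` the slim result, paying only for duplicating the selected columns rather than the whole table:

```r
library(data.table)

myDT <- data.table(a = 1:2, b = 3:4)

# grab pointers to the selected column, then detach with copy()
# so that later := edits do not touch myDT
res3 <- copy(setDT(unclass(myDT)["a"]))
res3[, a := 0L]

print(myDT$a)  # unchanged: 1 2
```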
Ok, I learned something new and speedy (the oddballs) today, and I have taken note that there is a trade-off between speed and parsimonious coding. So the glass is half full! Thanks!
I guess #852 is related.