Data.table: Why data.table is faster on vectorized column subsetting than on list column subsetting

Created on 27 Mar 2019  ·  3 Comments  ·  Source: Rdatatable/data.table

๋‚˜๋Š” ์ด data.table ํ•ญ๋ชฉ์„ ์ข‹์•„ํ•ฉ๋‹ˆ๋‹ค. ์‹คํ–‰ ์†๋„์™€ ์Šคํฌ๋ฆฝํŒ…์˜ ๊ฐ„๊ฒฐํ•œ ๋ฐฉ๋ฒ•์„ ๊ณ ๋ฅด๊ฒŒ ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
์ž‘์€ ํ…Œ์ด๋ธ”์—๋„ ์ž˜ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์–ด์š”.
๋‚˜๋Š” ์ •๊ธฐ์ ์œผ๋กœ DT[, .(id1, id5)]์™€ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ํ…Œ์ด๋ธ”์˜ ํ•˜์œ„ ์ง‘ํ•ฉ์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
์ด ๋ฐฉ๋ฒ•์ด ์•„๋‹™๋‹ˆ๋‹ค: DT[, c("id1", "id5")]

์˜ค๋Š˜ ๋‚˜๋Š” ๋‘ ์‚ฌ๋žŒ์˜ ์†๋„๋ฅผ ์ธก์ •ํ–ˆ๊ณ  ๋‚˜๋Š” ์ž‘์€ ํ…Œ์ด๋ธ”์˜ ์†๋„ ์ฐจ์ด์— ๋†€๋ž์Šต๋‹ˆ๋‹ค. ๊ฐ„๊ฒฐํ•œ ๋ฐฉ๋ฒ•์€ ํ›จ์”ฌ ๋Š๋ฆฝ๋‹ˆ๋‹ค.

์ด ์ฐจ์ด๊ฐ€ ์˜๋„๋œ ๊ฒƒ์ž…๋‹ˆ๊นŒ?

์‹คํ–‰ ์†๋„ ์ธก๋ฉด์—์„œ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์œผ๋กœ ์ˆ˜๋ ดํ•˜๋Š” ๊ฐ„๊ฒฐํ•œ ๋ฐฉ๋ฒ•์„ ๋งŒ๋“ค๊ณ ์ž ํ•˜๋Š” ์—ด๋ง์ด ์žˆ์Šต๋‹ˆ๊นŒ?
(๋ฐ˜๋ณต์ ์ธ ๋ฐฉ์‹์œผ๋กœ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์ž‘์€ ํ…Œ์ด๋ธ”์„ ํ•˜์œ„ ์ง‘ํ•ฉ์œผ๋กœ ๋งŒ๋“ค์–ด์•ผ ํ•  ๋•Œ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค.)

Ubuntu 18.04
R version 3.5.3 (2019-03-11)
data.table 1.12.0
RAM 32GB
Intel® Core™ i7-8565U CPU @ 1.80GHz × 8

library(data.table)
library(microbenchmark)
N  <- 2e8
K  <- 100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),          # large groups (char)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),          # large groups (char)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),     # small groups (char)
  id4 = sample(K,   N, TRUE),                             # large groups (int)
  id5 = sample(K,   N, TRUE),                             # large groups (int)
  id6 = sample(N/K, N, TRUE),                             # small groups (int)
  v1 =  sample(5,   N, TRUE),                             # int in range [1,5]
  v2 =  sample(5,   N, TRUE),                             # int in range [1,5]
  v3 =  sample(round(runif(100, max = 100), 4), N, TRUE)  # numeric e.g. 23.5749
)

microbenchmark(
  DT[, .(id1, id5)],
  DT[, c("id1", "id5")]
)
Unit: seconds
                  expr      min       lq     mean   median       uq      max neval
     DT[, .(id1, id5)] 1.588367 1.614645 1.929348 1.626847 1.659698 12.33872   100
 DT[, c("id1", "id5")] 1.592154 1.613800 1.937548 1.628082 2.184456 11.74581   100


N  <- 2e5
DT2 <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),          # large groups (char)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),          # large groups (char)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),     # small groups (char)
  id4 = sample(K,   N, TRUE),                             # large groups (int)
  id5 = sample(K,   N, TRUE),                             # large groups (int)
  id6 = sample(N/K, N, TRUE),                             # small groups (int)
  v1 =  sample(5,   N, TRUE),                             # int in range [1,5]
  v2 =  sample(5,   N, TRUE),                             # int in range [1,5]
  v3 =  sample(round(runif(100, max = 100), 4), N, TRUE)  # numeric e.g. 23.5749
)

microbenchmark(
  DT2[, .(id1, id5)],
  DT2[, c("id1", "id5")]
)
Unit: microseconds
                   expr      min       lq      mean    median        uq      max neval
     DT2[, .(id1, id5)] 1405.042 1461.561 1525.5314 1491.7885 1527.8955 2220.860   100
 DT2[, c("id1", "id5")]  614.624  640.617  666.2426  659.0175  676.9355  906.966   100

Most helpful comment

You can fix the formatting of your post by putting a line of three backticks before and after each code chunk:

```
code
```

"It matters when I have to subset many small tables repeatedly."

I guess repeatedly selecting columns from small tables is something that should, and in most cases can, be avoided...? Because j in DT[i, j, by] supports and optimizes such a wide variety of inputs, I think it is natural that parsing it carries some overhead.
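To make that overhead concrete, here is a minimal sketch on a hypothetical small table (the table `small`, the helper `time_it`, and the repetition count are assumptions for illustration, not part of the original post). It checks that the three `j` forms return the same columns and then times each form over many repetitions. The absolute numbers will depend on the machine and the data.table version.

```r
library(data.table)

# Hypothetical small table, for illustration only
small <- data.table(id1 = letters[1:5], id5 = 1:5, v1 = runif(5))
cols  <- c("id1", "id5")

# The three forms select the same columns
res_expr  <- small[, .(id1, id5)]
res_index <- small[, c("id1", "id5")]
res_dots  <- small[, ..cols]
same_result <- isTRUE(all.equal(res_expr, res_index)) &&
  isTRUE(all.equal(res_index, res_dots))
print(same_result)

# Hypothetical helper: elapsed seconds for n repetitions of a call
time_it <- function(f, n = 1000L) {
  system.time(for (i in seq_len(n)) f())[["elapsed"]]
}

timings <- c(
  expression = time_it(function() small[, .(id1, id5)]),
  index      = time_it(function() small[, c("id1", "id5")]),
  dotdot     = time_it(function() small[, ..cols])
)
print(timings)
```

On a table this small the copy itself is negligible, so whatever difference shows up is mostly the cost of interpreting `j`.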


Regarding other ways to approach your problem (and this may be a better fit for Stack Overflow if you want to discuss it further): depending on what else you want to do with the table, you could just delete the other columns with DT[, setdiff(names(DT), cols) := NULL] and continue using DT directly.
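As a small sketch of that suggestion, using a hypothetical table `DT_demo` (the table and its values are assumptions for illustration): the `:=` assignment removes the unwanted columns by reference, so the kept columns are never copied.

```r
library(data.table)

# Hypothetical table, for illustration only
DT_demo <- data.table(
  id1 = c("a", "b"),
  id2 = c("x", "y"),
  id5 = 1:2,
  v1  = c(0.5, 1.5)
)
cols <- c("id1", "id5")

# Drop every column not listed in `cols`, in place
DT_demo[, setdiff(names(DT_demo), cols) := NULL]

print(names(DT_demo))
```

Because the table is modified in place, this only suits cases where the dropped columns are no longer needed. If they are, calling `copy()` on the table first keeps the original intact, at the price of the copy.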

If you still prefer to take the subset, grabbing column pointers is much faster than either of the two options you considered here, though edits to the result will then affect the original table.

library(data.table)
library(microbenchmark)
N <- 2e8
K <- 100
set.seed(1)
DT <- data.table(
  id1 = sprintf("id%03d", 1:K),          # large groups (char)
  id2 = sprintf("id%03d", 1:K),          # large groups (char)
  id3 = sprintf("id%010d", 1:(N/K)),     # small groups (char)
  id4 = sample(K),                       # large groups (int)
  id5 = sample(K),                       # large groups (int)
  id6 = sample(N/K),                     # small groups (int)
  v1 = sample(5),                        # int in range [1,5]
  v2 = sample(5),                        # int in range [1,5]
  v3 = round(runif(100, max = 100), 4),  # numeric e.g. 23.5749
  row = seq_len(N)
)

cols = c("id1", "id5")
microbenchmark(times = 3,
  expression = DT[, .(id1, id5)],
  index = DT[, c("id1", "id5")],
  dotdot = DT[, ..cols],
  oddball = setDT(lapply(setNames(cols, cols), function(x) DT[[x]]))[],
  oddball2 = setDT(unclass(DT)[cols])[]
)

Unit: microseconds
       expr         min           lq         mean      median           uq         max neval
 expression 1249753.580 1304355.3415 1417166.9297 1358957.103 1500873.6045 1642790.106     3
      index 1184056.302 1191334.4835 1396372.3483 1198612.665 1502530.3715 1806448.078     3
     dotdot 1084521.234 1240062.2370 1439680.6980 1395603.240 1617260.4300 1838917.620     3
    oddball      92.659     171.8635     568.5317     251.068     806.4680    1361.868     3
   oddball2      66.582     125.9505     150.7337     185.319     192.8095     200.300     3

(๋‚˜๋Š” ์ฐธ์„์„ฑ์ด ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ท€ํ•˜์˜ ์˜ˆ์‹œ์—์„œ ๋ฌด์ž‘์œ„ ์ถ”์ถœ์„ ์„ ํƒํ•˜๊ณ  ๋ฒค์น˜๋งˆํฌ์—์„œ #๋ฐฐ๋ฅผ ์ค„์˜€์Šต๋‹ˆ๋‹ค.)

I have never found a way to directly call R's list subset (which is what gets used after the unclass above).
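One candidate worth trying, offered here as an assumption rather than something confirmed in this thread: base R exposes `.subset()`, which behaves like `[` but skips S3 method dispatch, so it should apply the default list subset to a data.table without an explicit `unclass()`. The sketch below uses a hypothetical table `DT_demo` and only checks that both routes give the same result. It has not been benchmarked here, and as with the `unclass()` version the result is expected to share its columns with the original table.

```r
library(data.table)

# Hypothetical table, for illustration only
DT_demo <- data.table(id1 = c("a", "b", "c"), id5 = 1:3, v1 = c(0.1, 0.2, 0.3))
cols <- c("id1", "id5")

# Route from the comment above: drop the class, then use the list subset
via_unclass <- setDT(unclass(DT_demo)[cols])[]

# Assumed alternative: .subset() is `[` without method dispatch
via_subset <- setDT(.subset(DT_demo, cols))[]

print(isTRUE(all.equal(via_unclass, via_subset)))
```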

Regarding "edits to the result will modify the original table", I mean:

myDT = data.table(a = 1:2, b = 3:4)

# standard way
res <- myDT[, "a"]
res[, a := 0]
myDT
#    a b
# 1: 1 3
# 2: 2 4

# oddball, grabbing pointers
res2 <- setDT(unclass(myDT)["a"])
res2[, a := 0]
myDT
#    a b
# 1: 0 3
# 2: 0 4
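The sharing can also be checked without modifying anything. As a sketch, data.table's `address()` returns the memory address of an object, so comparing the addresses of column `a` shows which result shares storage with the original table (the names `copied` and `shared` are illustrative, not from the thread).

```r
library(data.table)

myDT <- data.table(a = 1:2, b = 3:4)

copied <- myDT[, "a"]                  # standard subset
shared <- setDT(unclass(myDT)["a"])    # oddball, grabbing pointers

# Does each result point at the same column vector as myDT?
copied_is_shared <- address(myDT$a) == address(copied$a)
shared_is_shared <- address(myDT$a) == address(shared$a)

print(copied_is_shared)  # FALSE: the standard subset copied the column
print(shared_is_shared)  # TRUE: the pointer grab reuses the original column
```

If the fast version is wanted but the result will be edited later, wrapping it in `copy()` before the edit restores independence, at the cost of the copy that was being avoided.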

OK, I have learned something new and fast (the oddballs) today, and I have noted that there is a tradeoff between speed and concise coding. So the glass is half full! Thanks!

I guess #852 is related.
