data.table: keyby / by does not return unique groups with a subset

Created on March 31, 2018  ·  4 comments  ·  Source: Rdatatable/data.table

Below is a simple example where keyby (and also by) does not return unique groups when used together with a subset.
When the subsetting is removed, however, keyby works properly.

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2018-03-21 23:49:00 UTC; travis
#  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#  Release notes, videos and slides: http://r-datatable.com

# small dataset
dat <- data.table(Group = rep(c("All", "Not All"), times = 4), count = 1:8, ID = rep(1:2, each = 4))

# keyby returning non unique IDs with subset
dat[Group == "All" ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
# Creating new index 'Group'
# Creating index Group done in ... 0.001sec 
# Optimized subsetting with index 'Group'
# on= matches existing index, using index
# Starting bmerge ...done in 0.000sec 
# i clause present and columns used in by detected, only these subset: ID 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   collecting discontiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1     4
# 2:  1    12

# keyby working fine without subset
dat[,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID] 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1    10
# 2:  2    26

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.4

All 4 comments

I agree this is a bug.

For the record, the recommended code in this case is:

dat[Group == "All", lapply(.SD, sum, na.rm = TRUE), .SDcols= c("count"), keyby = ID]

This version activates GForce and gives the correct answer, because the bug is not present in that case.

Of course, that does not help if your real code cannot be written this way.

Interestingly, the code works if the subset rows are passed directly:

dat[c(1, 3, 5, 7),
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID, verbose = TRUE]
# i clause present and columns used in by detected, only these subset: ID 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   collecting discontiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1     4
# 2:  2    12

The verbose output differs in the following line, which appears only in the failing call:

Optimized subsetting with index 'Group'

This led me to install from CRAN. The code runs without error on 1.10.4-3.

So this stems from @MarkusBonsch's subset optimization work.

Making the join explicit shows the same error:

dat[.('All'), on = 'Group',
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID]
#    ID count
# 1:  1     4
# 2:  1    12

However, the keyed version is fine:

setkey(dat, Group)
dat[.('All'),
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID]
#    ID count
# 1:  1     4
# 2:  2    12

Thanks to @cathine for reporting and to @MichaelChirico for investigating.
The root cause is the buggy behaviour of the join version that Michael pointed out:
dat[.('All'), on = 'Group', lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= "count", keyby = ID]

This will probably be fixed once issue #2591 is resolved.
With the new subset optimization, subsets are redirected to the join part of data.table, so this bug now affects subsets as well as joins. I will investigate as soon as possible whether I can fix it.
Until then, the subset can be done in a separate, chained call,
for example dat[Group == "All"][, lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE].
Sorry for the inconvenience.
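
The chained form can be checked against a result computed without data.table grouping. The following is only a minimal sketch on the toy data from the report; the base R reference via tapply is an illustration added here, not something proposed in the thread.

```r
library(data.table)

# Same toy data as in the original report
dat <- data.table(Group = rep(c("All", "Not All"), times = 4),
                  count = 1:8,
                  ID    = rep(1:2, each = 4))

# Workaround: subset in a first call, then group in a second, chained call,
# so the grouped query itself carries no i clause
chained <- dat[Group == "All"][, lapply(.SD, function(x) sum(x, na.rm = TRUE)),
                               .SDcols = "count", keyby = ID]

# Independent reference computed with base R only
sub <- as.data.frame(dat)[dat$Group == "All", ]
reference <- tapply(sub$count, sub$ID, sum)

# Each ID should appear exactly once and the sums should match the reference
stopifnot(anyDuplicated(chained$ID) == 0L,
          identical(chained$ID, as.integer(names(reference))),
          all(chained$count == as.vector(reference)))
print(chained)
```

The chained call materialises the subset as an intermediate table, so it costs an extra copy of the selected rows.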

@cathine Thanks! I have confirmed that this is dev-only and that it can be mitigated with options(datatable.optimize=2), since the problem appears to be in the level 3 optimization. I wonder how this passed the tests!
A simpler example, reported by another contact:

> DT = data.table(
    id = c("a","a","a","b","b","c","c","d","d"),
    group = c(1,1,1,1,1,2,2,2,2),
    num = 1)
> DT[, uniqueN(id), by=group]          # ok 
   group    V1
   <num> <int>
1:     1     2
2:     2     2
> DT[num==1, uniqueN(id), by=group]    # group column wrong
   group    V1
   <num> <int>
1:     1     2
2:     1     2
> options(datatable.optimize=2)
> DT[num==1, uniqueN(id), by=group]    # ok
   group    V1
   <num> <int>
1:     1     2
2:     2     2
> options(datatable.optimize=3)        # not ok
> DT[num==1, uniqueN(id), by=group]
   group    V1
   <num> <int>
1:     1     2
2:     1     2
> DT[num==1, sum(num), by=group]       # ok
   group    V1
   <num> <num>
1:     1     7
2:     2     4
> DT[num==1, length(num), by=group]    # not ok
   group    V1
   <num> <int>
1:     1     7
2:     1     4
> options(datatable.optimize=2)        # ok
> DT[num==1, length(num), by=group]
   group    V1
   <num> <int>
1:     1     7
2:     2     4
> 
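
If lowering the optimization level globally is undesirable, the option can be scoped to a single query. This is a rough sketch only: with_optimize is a hypothetical helper written for this illustration and is not part of data.table.

```r
library(data.table)

# Hypothetical helper (not part of data.table): evaluate `expr` under a
# temporary datatable.optimize level, then restore the previous setting
with_optimize <- function(level, expr) {
  old <- options(datatable.optimize = level)
  on.exit(options(old), add = TRUE)
  force(expr)  # the query is evaluated here, while the option is in effect
}

DT <- data.table(id = c("a","a","a","b","b","c","c","d","d"),
                 group = c(1,1,1,1,1,2,2,2,2),
                 num = 1)

before <- getOption("datatable.optimize")
res <- with_optimize(2L, DT[num == 1, uniqueN(id), by = group])

# The global option is unchanged after the call
stopifnot(identical(getOption("datatable.optimize"), before))
print(res)
```

This relies on R's lazy evaluation: the query passed as expr is not run until force(expr), by which point the temporary option is already set.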

Why did it pass the tests? Because it only happens when the grouping column is sorted (see the code below)! We did not specifically test grouping on sorted columns.

library(data.table)
DT = data.table(
  id = c("a","a","a","b","b","c","c","d","d"),
  group = c(1,1,1,1,1,2,2,2,2),
  group2 = c(1,1,1,1,1,2,2,2,1),
  num = 1)
DT[, uniqueN(id), by=group]          # ok 
# group    V1
# <num> <int>
# 1:     1     2
# 2:     2     2
DT[num==1, uniqueN(id), by=group]    # group column wrong
# group    V1
# <num> <int>
# 1:     1     2
# 2:     1     2
DT[num==1, uniqueN(id), by=group2]    # ok with other group column that is not sorted
# group2 V1
# 1:      1  3
# 2:      2  2

setkey(DT, group2)
DT[num==1, uniqueN(id), by=group2]    # not ok anymore since the group column is sorted now
# group2 V1
# 1:      1  3
# 2:      1  2
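
A check along the following lines would cover the sorted, unsorted and keyed cases described above. It is a sketch, not the test that was added to data.table: check_by is a hypothetical helper, and it relies on the i clause num == 1 keeping every row, so the grouped result must equal the one computed without a subset.

```r
library(data.table)

DT <- data.table(id = c("a","a","a","b","b","c","c","d","d"),
                 group  = c(1,1,1,1,1,2,2,2,2),   # already sorted
                 group2 = c(1,1,1,1,1,2,2,2,1),   # not sorted
                 num = 1)

# Hypothetical helper (not part of data.table): TRUE when grouping with an
# i-subset that keeps all rows agrees with grouping without any subset
check_by <- function(dt, col) {
  with_i    <- dt[num == 1, uniqueN(id), by = col]
  without_i <- dt[, uniqueN(id), by = col]
  anyDuplicated(with_i[[col]]) == 0L &&
    identical(with_i[[col]], without_i[[col]]) &&
    identical(with_i$V1, without_i$V1)
}

# Keyed copy, so that group2 becomes a sorted grouping column
DT_keyed <- copy(DT)
setkey(DT_keyed, group2)

results <- c(sorted   = check_by(DT, "group"),
             unsorted = check_by(DT, "group2"),
             keyed    = check_by(DT_keyed, "group2"))
print(results)
stopifnot(all(results))
```

On a version affected by the bug, the sorted and keyed cases would be expected to fail the stopifnot, while the unsorted case passes.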