Below is a simple example where keyby (and also by) does not return unique groups when a subset is used.
However, if the subset is removed, keyby works correctly.
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2018-03-21 23:49:00 UTC; travis
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
# small dataset
dat <- data.table(Group = rep(c("All", "Not All"), times = 4), count = 1:8, ID = rep(1:2, each = 4))
# keyby returning non unique IDs with subset
dat[Group == "All" ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
# Creating new index 'Group'
# Creating index Group done in ... 0.001sec
# Optimized subsetting with index 'Group'
# on= matches existing index, using index
# Starting bmerge ...done in 0.000sec
# i clause present and columns used in by detected, only these subset: ID
# Finding groups using forderv ... 0.000sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# collecting discontiguous groups took 0.000s for 2 groups
# eval(j) took 0.000s for 2 calls
# 0.000sec
# ID count
# 1: 1 4
# 2: 1 12
# keyby working fine without subset
dat[,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID]
# Finding groups using forderv ... 0.000sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.000s for 2 groups
# eval(j) took 0.000s for 2 calls
# 0.000sec
# ID count
# 1: 1 10
# 2: 2 26
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.4
I agree this is a bug.
For the record, the recommended code in this case is:
dat[Group == "All", lapply(.SD, sum, na.rm = TRUE), .SDcols= c("count"), keyby = ID]
This version activates GForce and gives the right answer, since the bug is not present in that case.
Of course, this does not help if your real code cannot be written this way.
Interestingly, the code works if the subset rows are passed directly:
dat[c(1, 3, 5, 7),
lapply(.SD, function(x) sum(x, na.rm = TRUE)),
.SDcols= "count", keyby = ID, verbose = TRUE]
# i clause present and columns used in by detected, only these subset: ID
# Finding groups using forderv ... 0.000sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# collecting discontiguous groups took 0.000s for 2 groups
# eval(j) took 0.000s for 2 calls
# 0.000sec
# ID count
# 1: 1 4
# 2: 2 12
The verbose output differs in the following line:
# Optimized subsetting with index 'Group'
This led me to install from CRAN; the code runs without error on 1.10.4-3.
So this comes from @MarkusBonsch's work on subset optimization.
Making the join explicit shows the same error:
dat[.('All'), on = 'Group',
lapply(.SD, function(x) sum(x, na.rm = TRUE)),
.SDcols= "count", keyby = ID]
# ID count
# 1: 1 4
# 2: 1 12
But the keyed version is fine:
setkey(dat, Group)
dat[.('All'),
lapply(.SD, function(x) sum(x, na.rm = TRUE)),
.SDcols= "count", keyby = ID]
# ID count
# 1: 1 4
# 2: 2 12
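The three ways of selecting the "All" rows compared above (row numbers, an explicit on= join, and a keyed join) are expected to agree. A minimal sketch of that comparison follows; it assumes a data.table version in which this bug is fixed, and the names f, keyed, by_rows, by_join and by_key are introduced here purely for illustration.

```r
library(data.table)

# Same small dataset as in the original report
dat <- data.table(Group = rep(c("All", "Not All"), times = 4),
                  count = 1:8,
                  ID    = rep(1:2, each = 4))

# Illustrative helper: the aggregation used throughout the thread
f <- function(x) sum(x, na.rm = TRUE)

# 1) subset by row numbers
by_rows <- dat[c(1, 3, 5, 7), lapply(.SD, f), .SDcols = "count", keyby = ID]

# 2) explicit join on Group (the form that showed the bug on the dev build)
by_join <- dat[.("All"), on = "Group", lapply(.SD, f), .SDcols = "count", keyby = ID]

# 3) keyed join, done on a copy so that dat itself is not reordered
keyed <- copy(dat)
setkey(keyed, Group)
by_key <- keyed[.("All"), lapply(.SD, f), .SDcols = "count", keyby = ID]

# All three should report each ID once, with identical sums
stopifnot(identical(by_rows$ID, by_join$ID), identical(by_rows$ID, by_key$ID))
stopifnot(identical(by_rows$count, by_join$count),
          identical(by_rows$count, by_key$count))
```

On the affected development build, the first stopifnot() would be expected to fail for the on= form, which matches the output shown above.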
Thanks to @cathine for reporting and to @MichaelChirico for investigating.
The root cause is the bug in the join version that Michael pointed out:
dat[.('All'), on = 'Group', lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= "count", keyby = ID]
This will probably be fixed once issue #2591 is resolved.
With the new subset optimization, subsets are redirected to the join part of data.table, so this bug now affects subsets as well as joins. I will investigate as soon as possible to see whether I can fix it.
Until then, a workaround is to subset first and then group, for example:
dat[Group == "All"][ ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
Sorry for the inconvenience.
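The chained form can be checked against a result computed without any data.table grouping. The following is only a sketch of such a check; the names res, sub and ref are mine, and tapply() is used simply as an independent base R reference.

```r
library(data.table)

# Same small dataset as in the original report
dat <- data.table(Group = rep(c("All", "Not All"), times = 4),
                  count = 1:8,
                  ID    = rep(1:2, each = 4))

# Workaround: materialise the subset first, then group in a second call,
# so the grouping call itself has no i clause
res <- dat[Group == "All"][, lapply(.SD, function(x) sum(x, na.rm = TRUE)),
                           .SDcols = "count", keyby = ID]

# Independent reference computed with base R only
sub <- as.data.frame(dat)[dat$Group == "All", ]
ref <- tapply(sub$count, sub$ID, sum)

# Each ID should appear exactly once, and the sums should match the reference
stopifnot(anyDuplicated(res$ID) == 0L)
stopifnot(all(res$count == as.vector(ref[as.character(res$ID)])))
```

The cost of chaining is an intermediate copy of the subset, which may matter for large tables.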
@cathine Thanks! I have confirmed that this is dev-only and that it can be mitigated with options(datatable.optimize=2), since the problem appears to be in level 3 optimization. I wonder how this passed the tests!
A simpler example from another contact who reported it:
> DT = data.table(
id = c("a","a","a","b","b","c","c","d","d"),
group = c(1,1,1,1,1,2,2,2,2),
num = 1)
> DT[, uniqueN(id), by=group] # ok
group V1
<num> <int>
1: 1 2
2: 2 2
> DT[num==1, uniqueN(id), by=group] # group column wrong
group V1
<num> <int>
1: 1 2
2: 1 2
> options(datatable.optimize=2)
> DT[num==1, uniqueN(id), by=group] # ok
group V1
<num> <int>
1: 1 2
2: 2 2
> options(datatable.optimize=3) # not ok
> DT[num==1, uniqueN(id), by=group]
group V1
<num> <int>
1: 1 2
2: 1 2
> DT[num==1, sum(num), by=group] # ok
group V1
<num> <num>
1: 1 7
2: 2 4
> DT[num==1, length(num), by=group] # not ok
group V1
<num> <int>
1: 1 7
2: 1 4
> options(datatable.optimize=2) # ok
> DT[num==1, length(num), by=group]
group V1
<num> <int>
1: 1 7
2: 2 4
>
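Because the transcript above switches datatable.optimize back and forth by hand, it can be convenient to limit the lower level to a single call. A minimal sketch, assuming a helper named with_optimize; the helper is hypothetical and not part of data.table, and it relies only on base R options() and on.exit().

```r
library(data.table)

# Hypothetical helper (not a data.table function): evaluate `expr` with
# datatable.optimize temporarily set to `level`, then restore the old value
with_optimize <- function(level, expr) {
  old <- options(datatable.optimize = level)
  on.exit(options(old), add = TRUE)
  force(expr)  # the lazy argument is evaluated here, while the option is set
}

DT <- data.table(
  id    = c("a", "a", "a", "b", "b", "c", "c", "d", "d"),
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  num   = 1)

before <- getOption("datatable.optimize")
res <- with_optimize(2, DT[num == 1, uniqueN(id), by = group])

# The option is back to its previous value and each group appears once
stopifnot(identical(getOption("datatable.optimize"), before))
stopifnot(anyDuplicated(res$group) == 0L)
```

This keeps the mitigation local, so the rest of a session still runs with the default optimization level.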
Why did it pass the tests? Because it only occurs if the grouping column is sorted (see the code below)! I did not specifically check grouping on sorted columns.
library(data.table)
DT = data.table(
id = c("a","a","a","b","b","c","c","d","d"),
group = c(1,1,1,1,1,2,2,2,2),
group2 = c(1,1,1,1,1,2,2,2,1),
num = 1)
DT[, uniqueN(id), by=group] # ok
# group V1
# <num> <int>
# 1: 1 2
# 2: 2 2
DT[num==1, uniqueN(id), by=group] # group column wrong
# group V1
# <num> <int>
# 1: 1 2
# 2: 1 2
DT[num==1, uniqueN(id), by=group2] # ok with other group column that is not sorted
# group2 V1
# 1: 1 3
# 2: 2 2
setkey(DT, group2)
DT[num==1, uniqueN(id), by=group2] # not ok anymore since the group column is sorted now
# group2 V1
# 1: 1 3
# 2: 1 2
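The cases above suggest what a regression check could cover: grouping with a subset in i on a sorted column, on an unsorted column, and on a column that becomes sorted through setkey. The following is only a sketch of such a check, not the test that was added to data.table; the names r_sorted, r_unsorted, r_keyed and chained are mine.

```r
library(data.table)

DT <- data.table(
  id     = c("a", "a", "a", "b", "b", "c", "c", "d", "d"),
  group  = c(1, 1, 1, 1, 1, 2, 2, 2, 2),   # already sorted
  group2 = c(1, 1, 1, 1, 1, 2, 2, 2, 1),   # not sorted
  num    = 1)

# Subset in i combined with by on a sorted and on an unsorted column
r_sorted   <- DT[num == 1, uniqueN(id), by = group]
r_unsorted <- DT[num == 1, uniqueN(id), by = group2]

# After keying, group2 is sorted as well
setkey(DT, group2)
r_keyed <- DT[num == 1, uniqueN(id), by = group2]

# Reference: subset first, then group (no i clause in the grouping call)
chained <- DT[num == 1][, uniqueN(id), by = group2]

# Group labels must be unique in every case
stopifnot(anyDuplicated(r_sorted$group)    == 0L,
          anyDuplicated(r_unsorted$group2) == 0L,
          anyDuplicated(r_keyed$group2)    == 0L)

# The keyed result must agree with the chained reference
stopifnot(identical(r_keyed$group2, chained$group2),
          identical(r_keyed$V1, chained$V1))
```

On a version where the bug is fixed all assertions should pass; on the affected development build the checks on the sorted columns would be expected to stop with an error.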