Below is a simple example where keyby (and also by) does not return unique groups when combined with a subset in i.
Once the subset is removed, keyby works correctly.
library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2018-03-21 23:49:00 UTC; travis
# The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
# Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
# Release notes, videos and slides: http://r-datatable.com
# small dataset
dat <- data.table(Group = rep(c("All", "Not All"), times = 4), count = 1:8, ID = rep(1:2, each = 4))
# keyby returning non unique IDs with subset
dat[Group == "All" ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
# Creating new index 'Group'
# Creating index Group done in ... 0.001sec
# Optimized subsetting with index 'Group'
# on= matches existing index, using index
# Starting bmerge ...done in 0.000sec
# i clause present and columns used in by detected, only these subset: ID
# Finding groups using forderv ... 0.000sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# collecting discontiguous groups took 0.000s for 2 groups
# eval(j) took 0.000s for 2 calls
# 0.000sec
# ID count
# 1: 1 4
# 2: 1 12
# keyby working fine without subset
dat[,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID]
# Finding groups using forderv ... 0.000sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# memcpy contiguous groups took 0.000s for 2 groups
# eval(j) took 0.000s for 2 calls
# 0.000sec
# ID count
# 1: 1 10
# 2: 2 26
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.4
Agreed, this is a bug.
For the record, the recommended code in this case is:
dat[Group == "All", lapply(.SD, sum, na.rm = TRUE), .SDcols= c("count"), keyby = ID]
This gives the correct answer because this version activates GForce, and the bug does not occur in that case.
Of course, that does not help if your real code cannot be rewritten in a similar way.
Interestingly, if we pass the subset rows directly, the code works:
dat[c(1, 3, 5, 7),
lapply(.SD, function(x) sum(x, na.rm = TRUE)),
.SDcols= "count", keyby = ID, verbose = TRUE]
# i clause present and columns used in by detected, only these subset: ID
# Finding groups using forderv ... 0.000sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ...
# collecting discontiguous groups took 0.000s for 2 groups
# eval(j) took 0.000s for 2 calls
# 0.000sec
# ID count
# 1: 1 4
# 2: 2 12
I see the following difference in the verbose output:
# Optimized subsetting with index 'Group'
That led me to install from CRAN; the code runs without error on 1.10.4-3.
I guess this comes from @MarkusBonsch's work on subset optimization?
I also see the same error if we make the join explicit:
dat[.('All'), on = 'Group',
lapply(.SD, function(x) sum(x, na.rm = TRUE)),
.SDcols= "count", keyby = ID]
# ID count
# 1: 1 4
# 2: 1 12
But the keyed version is fine:
setkey(dat, Group)
dat[.('All'),
lapply(.SD, function(x) sum(x, na.rm = TRUE)),
.SDcols= "count", keyby = ID]
# ID count
# 1: 1 4
# 2: 2 12
Thanks @cathine for reporting and @MichaelChirico for investigating.
The root cause is the erroneous behavior of the join version, as Michael pointed out:
dat[.('All'), on = 'Group', lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= "count", keyby = ID]
I will probably solve this when issue #2591 is solved.
With the new subset optimization, subsets are redirected to the join part of data.table, so this bug now affects subsets as well as joins. I will try to find out as soon as possible whether I can fix the problem.
In the meantime, you can resort to chaining the subset and the grouping, for example:
dat[Group == "All"][ ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
Sorry for the inconvenience.
Thanks, @cathine! Confirmed that this affects the development version only and can be worked around with options(datatable.optimize=2), since the problem appears to be tied to optimization level 3. I wonder how this managed to slip through the tests!
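If changing the option globally is undesirable, one possible approach is to lower it only around the affected call and restore it afterwards. The sketch below assumes a small helper named with_optimize2, which is hypothetical and not part of data.table; only the datatable.optimize option itself comes from the discussion above.

```r
library(data.table)

DT <- data.table(
  id    = c("a", "a", "a", "b", "b", "c", "c", "d", "d"),
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  num   = 1)

# Hypothetical helper (not a data.table function): evaluate `expr` with
# datatable.optimize set to 2, then restore the previous setting.
with_optimize2 <- function(expr) {
  old <- options(datatable.optimize = 2L)
  on.exit(options(old), add = TRUE)
  expr  # the lazily evaluated argument is forced here, after the option is set
}

before <- getOption("datatable.optimize")
res <- with_optimize2(DT[num == 1, uniqueN(id), by = group])
print(res)
```

The previous value of the option is put back when the helper exits, so other code keeps running at the default optimization level.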
Even simpler examples from another contact who also reported it:
> DT = data.table(
id = c("a","a","a","b","b","c","c","d","d"),
group = c(1,1,1,1,1,2,2,2,2),
num = 1)
> DT[, uniqueN(id), by=group] # ok
group V1
<num> <int>
1: 1 2
2: 2 2
> DT[num==1, uniqueN(id), by=group] # group column wrong
group V1
<num> <int>
1: 1 2
2: 1 2
> options(datatable.optimize=2)
> DT[num==1, uniqueN(id), by=group] # ok
group V1
<num> <int>
1: 1 2
2: 2 2
> options(datatable.optimize=3) # not ok
> DT[num==1, uniqueN(id), by=group]
group V1
<num> <int>
1: 1 2
2: 1 2
> DT[num==1, sum(num), by=group] # ok
group V1
<num> <num>
1: 1 7
2: 2 4
> DT[num==1, length(num), by=group] # not ok
group V1
<num> <int>
1: 1 7
2: 1 4
> options(datatable.optimize=2) # ok
> DT[num==1, length(num), by=group]
group V1
<num> <int>
1: 1 7
2: 2 4
>
Why did the tests not catch it? Because it only happens when the grouping column is sorted (see the code below)! I had not specifically tested grouping by sorted columns.
library(data.table)
DT = data.table(
id = c("a","a","a","b","b","c","c","d","d"),
group = c(1,1,1,1,1,2,2,2,2),
group2 = c(1,1,1,1,1,2,2,2,1),
num = 1)
DT[, uniqueN(id), by=group] # ok
# group V1
# <num> <int>
# 1: 1 2
# 2: 2 2
DT[num==1, uniqueN(id), by=group] # group column wrong
# group V1
# <num> <int>
# 1: 1 2
# 2: 1 2
DT[num==1, uniqueN(id), by=group2] # ok with other group column that is not sorted
# group2 V1
# 1: 1 3
# 2: 2 2
setkey(DT, group2)
DT[num==1, uniqueN(id), by=group2] # not ok anymore since the group column is sorted now
# group2 V1
# 1: 1 3
# 2: 1 2
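A check along the following lines could cover the sorted-grouping-column case. This is only a sketch of the idea, not the test that was added to data.table: it compares the single-call result against the chained form suggested as a workaround earlier in the thread, and it is expected to fail on an affected development build and pass on a version without the bug.

```r
library(data.table)

DT <- data.table(
  id     = c("a", "a", "a", "b", "b", "c", "c", "d", "d"),
  group2 = c(1, 1, 1, 1, 1, 2, 2, 2, 1),
  num    = 1)
setkey(DT, group2)  # the grouping column is now sorted

# Reference: materialise the subset first, then group (the chained workaround).
expected <- DT[num == 1][, uniqueN(id), by = group2]

# Under test: subset in i and grouping in the same call.
actual <- DT[num == 1, uniqueN(id), by = group2]

stopifnot(
  anyDuplicated(actual$group2) == 0L,           # every group appears once
  identical(actual$group2, expected$group2),    # same group values
  identical(actual$V1, expected$V1)             # same per-group results
)
```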