data.table: keyby / by not returning unique groups with subset

Created on 31 Mar 2018  ·  4 comments  ·  Source: Rdatatable/data.table

Below is a simple example in which keyby (and also by) does not return unique groups when combined with a subset.
After removing the subset, however, keyby works correctly.

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2018-03-21 23:49:00 UTC; travis
#  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#  Release notes, videos and slides: http://r-datatable.com

# small dataset
dat <- data.table(Group = rep(c("All", "Not All"), times = 4), count = 1:8, ID = rep(1:2, each = 4))

# keyby returning non unique IDs with subset
dat[Group == "All" ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
# Creating new index 'Group'
# Creating index Group done in ... 0.001sec 
# Optimized subsetting with index 'Group'
# on= matches existing index, using index
# Starting bmerge ...done in 0.000sec 
# i clause present and columns used in by detected, only these subset: ID 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   collecting discontiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1     4
# 2:  1    12

# keyby working fine without subset
dat[, lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols = c("count"), keyby = ID, verbose = TRUE]
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   memcpy contiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1    10
# 2:  2    26

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.4

Most helpful comment

Agreed, this is a bug.

For the record, the recommended code in this case is:

dat[Group == "All", lapply(.SD, sum, na.rm = TRUE), .SDcols= c("count"), keyby = ID]

This gives the correct answer because this version activates GForce, and the bug does not occur in that case.

Of course, this won't help if your real code can't be rewritten in a similar way.
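
As a rough sketch of the point above, the two spellings of `j` can be compared on the same data. The column-by-column comparison is an illustrative addition, not part of the original comment; whether the plain `sum` form actually takes the GForce path on a given version can be checked by adding `verbose = TRUE` to the call.

```r
library(data.table)

# Same small dataset as in the report
dat <- data.table(Group = rep(c("All", "Not All"), times = 4),
                  count = 1:8, ID = rep(1:2, each = 4))

# Anonymous-function form from the report
res_anon <- dat[Group == "All", lapply(.SD, function(x) sum(x, na.rm = TRUE)),
                .SDcols = "count", keyby = ID]

# Plain `sum` form recommended above
res_plain <- dat[Group == "All", lapply(.SD, sum, na.rm = TRUE),
                 .SDcols = "count", keyby = ID]

# On a version without the bug, both forms should agree column by column
same_ids    <- all(res_anon$ID == res_plain$ID)
same_counts <- all(res_anon$count == res_plain$count)
```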

Interestingly, if we pass the subset rows directly, the code works:

dat[c(1, 3, 5, 7),
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID, verbose = TRUE]
# i clause present and columns used in by detected, only these subset: ID 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# lapply optimization changed j from 'lapply(.SD, function(x) sum(x, na.rm = TRUE))' to 'list(..FUN1(count))'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
#   collecting discontiguous groups took 0.000s for 2 groups
#   eval(j) took 0.000s for 2 calls
# 0.000sec 
#    ID count
# 1:  1     4
# 2:  2    12

I see the following difference in the verbose output:

Optimized subsetting with index 'Group'

This led me to install from CRAN; the code works without error on 1.10.4-3.

I guess this comes from @MarkusBonsch's work on subset optimization?

I also see the same error if we make the join explicit:

dat[.('All'), on = 'Group',
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID]
#    ID count
# 1:  1     4
# 2:  1    12

But the keyed version is fine:

setkey(dat, Group)
dat[.('All'),
    lapply(.SD, function(x) sum(x, na.rm = TRUE)),
    .SDcols= "count", keyby = ID]
#    ID count
# 1:  1     4
# 2:  2    12


Thanks @cathine for reporting and @MichaelChirico for investigating.
The root cause is the erroneous behaviour of the join version, as Michael pointed out:
dat[.('All'), on = 'Group', lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= "count", keyby = ID]

This will probably be resolved once issue #2591 is resolved.
With the new subset optimization, subsets are redirected to the join part of data.table, so this bug now affects subsets as well as joins. I will try to find out as soon as possible whether I can fix it.
In the meantime, you can work around it by subsetting first and grouping in a chained call, for example:
dat[Group == "All"][ ,lapply(.SD, function(x) sum(x, na.rm = TRUE)), .SDcols= c("count"), keyby = ID, verbose = TRUE]
Sorry for the inconvenience.
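
The chained workaround can be sketched together with a small sanity check that each ID appears only once in the result. The `anyDuplicated` check is an illustrative addition and not part of the original comment.

```r
library(data.table)

# Same small dataset as in the report
dat <- data.table(Group = rep(c("All", "Not All"), times = 4),
                  count = 1:8,
                  ID = rep(1:2, each = 4))

# Step 1: subset in its own call; step 2: group the already-subsetted table
res <- dat[Group == "All"][, lapply(.SD, function(x) sum(x, na.rm = TRUE)),
                           .SDcols = "count", keyby = ID]

# Illustrative check (not in the original report): one row per ID
ids_unique <- anyDuplicated(res$ID) == 0L
```

Because the grouping runs on a table that no longer carries an `i` clause, the subset optimization described above should not come into play for the second call.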

Thanks, @cathine! Confirmed that this affects only the development version and can be worked around with options(datatable.optimize=2), since the problem appears to be tied to optimization level 3. Interesting how this managed to slip through the tests!
Even simpler examples from another contact who also reported it:

> DT = data.table(
    id = c("a","a","a","b","b","c","c","d","d"),
    group = c(1,1,1,1,1,2,2,2,2),
    num = 1)
> DT[, uniqueN(id), by=group]          # ok 
   group    V1
   <num> <int>
1:     1     2
2:     2     2
> DT[num==1, uniqueN(id), by=group]    # group column wrong
   group    V1
   <num> <int>
1:     1     2
2:     1     2
> options(datatable.optimize=2)
> DT[num==1, uniqueN(id), by=group]    # ok
   group    V1
   <num> <int>
1:     1     2
2:     2     2
> options(datatable.optimize=3)        # not ok
> DT[num==1, uniqueN(id), by=group]
   group    V1
   <num> <int>
1:     1     2
2:     1     2
> DT[num==1, sum(num), by=group]       # ok
   group    V1
   <num> <num>
1:     1     7
2:     2     4
> DT[num==1, length(num), by=group]    # not ok
   group    V1
   <num> <int>
1:     1     7
2:     1     4
> options(datatable.optimize=2)        # ok
> DT[num==1, length(num), by=group]
   group    V1
   <num> <int>
1:     1     7
2:     2     4
> 
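
A minimal sketch of applying the `datatable.optimize` workaround only around the affected call and then restoring the previous setting. The save-and-restore pattern and the uniqueness check are illustrative additions, not part of the original comment.

```r
library(data.table)

DT <- data.table(
  id = c("a","a","a","b","b","c","c","d","d"),
  group = c(1,1,1,1,1,2,2,2,2),
  num = 1)

# options() returns the previous values, so they can be restored afterwards
old <- options(datatable.optimize = 2)
res <- DT[num == 1, uniqueN(id), by = group]
options(old)

# Illustrative check: each group value should appear exactly once
groups_unique <- anyDuplicated(res$group) == 0L
```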

Why did it slip through the tests? Because it only happens when the grouping column is sorted (see the code below)! I did not specifically test grouping by sorted columns.

library(data.table)
DT = data.table(
  id = c("a","a","a","b","b","c","c","d","d"),
  group = c(1,1,1,1,1,2,2,2,2),
  group2 = c(1,1,1,1,1,2,2,2,1),
  num = 1)
DT[, uniqueN(id), by=group]          # ok 
# group    V1
# <num> <int>
# 1:     1     2
# 2:     2     2
DT[num==1, uniqueN(id), by=group]    # group column wrong
# group    V1
# <num> <int>
# 1:     1     2
# 2:     1     2
DT[num==1, uniqueN(id), by=group2]    # ok with other group column that is not sorted
# group2 V1
# 1:      1  3
# 2:      2  2

setkey(DT, group2)
DT[num==1, uniqueN(id), by=group2]    # not ok anymore since the group column is sorted now
# group2 V1
# 1:      1  3
# 2:      1  2
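
A sketch of a regression-style check that covers both the unsorted and the sorted grouping column by comparing the one-step call with the chained subset-then-group call. The helper `same_as_chained` is hypothetical and introduced only for illustration; it is not part of data.table or of the project's test suite.

```r
library(data.table)

DT <- data.table(
  id = c("a","a","a","b","b","c","c","d","d"),
  group2 = c(1,1,1,1,1,2,2,2,1),
  num = 1)

# Hypothetical helper: compare the one-step result with the chained result
same_as_chained <- function(x) {
  one_step <- x[num == 1, uniqueN(id), by = group2]
  chained  <- x[num == 1][, uniqueN(id), by = group2]
  identical(one_step$group2, chained$group2) &&
    identical(one_step$V1, chained$V1)
}

unsorted_ok <- same_as_chained(DT)   # grouping column not sorted
setkey(DT, group2)                   # sorts DT by group2 by reference
sorted_ok <- same_as_chained(DT)     # grouping column is now sorted
```

On a version affected by the bug, `sorted_ok` would be expected to be FALSE while `unsorted_ok` stays TRUE, which matches the behaviour described above.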