data.table: shift() in data.table v1.9.6 is slow for many groups

Created on 11 Feb 2016 · 3 comments · Source: Rdatatable/data.table

Hi!
With many distinct groups in by, shift is much slower than shifting manually.
See: http://stackoverflow.com/questions/35179911/shift-in-data-table-v1-9-6-is-slow-for-many-groups
and https://github.com/nachti/datatable_test/blob/master/leadtest.R for a detailed example.
Cheers,
Gerhard
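
For readers who cannot open the linked script, here is a minimal sketch of what a "manual" shift by group might look like next to shift(). The toy table and the column names (Grp, Value, LagShift, LagManual) are illustrative assumptions and are not taken from the linked example. The two expressions are expected to produce the same lagged column.

```r
library(data.table)

# Toy data (hypothetical): 3 groups of 4 rows each
dt <- data.table(Grp = rep(1:3, each = 4L), Value = 1:12)

# Lag by group using shift()
dt[, LagShift := shift(Value, type = "lag"), by = Grp]

# "Manual" lag by group: prepend NA and drop the last element
dt[, LagManual := c(NA, head(Value, -1L)), by = Grp]

# Check that both columns agree
identical(dt$LagShift, dt$LagManual)
```

On a table this small the timing difference is not visible; the sketch only shows that the two forms are interchangeable in their result.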

GForce performance

Source

nachti

All 3 comments

That's not surprising. This will go away once GForce is optimized for := . It's on the list for this release, I think.

arunsrinivasan on 4 Mar 2016

+1 for this performance improvement. shift() is the main bottleneck in much of my code. It looks like, for a fixed number of rows, the time it takes to run shift() is proportional to the number of groups in the data.

library(data.table)

# Build table to store timings
timings <- CJ(RowCount = 10^7, Groups = 10^c(0:7))
timings[, SizePerGroup := RowCount/Groups]

# Loop through each experiment
for(i in seq_len(nrow(timings))){  # iterate over the experiments in timings (dt does not exist yet)
  print(paste0("Iteration: ", i))

  # Build dataset
  timings_i <- timings[i]
  dt <- data.table(Grp = rep(seq_len(timings_i$Groups), each = timings_i$SizePerGroup))
  dt[, Value := sample(100, size = .N, replace = TRUE)]

  # Measure the time it takes to insert a column indicating the previous value by group
  elapsed <- system.time(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp])["elapsed"]
  timings[i, Elapsed := elapsed]
}

library(ggplot2)
ggplot(timings, aes(x = Groups, y = Elapsed))+geom_line()+geom_point()

[Image: screenshot of the resulting plot of Elapsed against Groups]

ben519 on 10 Nov 2018

@ben519 FYI, for the special case where your code looks like that, there's a shortcut:

library(data.table)
dt <- data.table(Grp = rep(seq_len(1e6), each=10L))
dt[, Value := sample(100L, size = .N, replace = TRUE)]

system.time(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp][])
#    user  system elapsed 
#   19.50    0.80   20.34
system.time(dt[, v := shift(Value, type = "lag")][rowid(Grp)==1L, v := NA][])
#    user  system elapsed 
#    1.00    0.87    1.25 

dt[, all.equal(v, PrevValueByGrp)]
# [1] TRUE
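
The same idea can plausibly be extended to a lag of more than one row. The sketch below is not part of the original comment. It assumes that the rows of each group are contiguous (as they are in the example above) and uses a hypothetical lag distance n and toy column names. The whole column is shifted once, and then the first n rows of every group are set to NA.

```r
library(data.table)

# Toy data (hypothetical); the rows of each group must be contiguous
dt <- data.table(Grp = rep(1:4, each = 5L), Value = 1:20)
n <- 2L  # lag distance (illustrative)

# Reference result: grouped shift
dt[, ref := shift(Value, n = n, type = "lag"), by = Grp]

# Shortcut: shift the whole column once, then blank the first n rows of each group
dt[, v := shift(Value, n = n, type = "lag")][rowid(Grp) <= n, v := NA]

# Compare the shortcut against the grouped result
dt[, all.equal(v, ref)]
```

If the groups are interleaved rather than contiguous, the ungrouped shift pulls values across group boundaries and the shortcut no longer matches the grouped result, so the table would need to be ordered by group first.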

franknarf1 on 10 Nov 2018

