Data.table: shift() in data.table v1.9.6 is slow for many groups

Created on 2016-02-11 · 3 comments · Source: Rdatatable/data.table

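Hi there!
When grouping with by over many distinct groups, shift() is much slower than shifting manually.
See: http :
and https://github.com/nachti/datatable_test/blob/master/leadtest.R for a detailed example.
Cheers,
Gerhard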

GForce performance

nachti

All 3 comments

This isn't surprising. It will go away once GForce is optimised for :=. I believe that is on the list for this release.

arunsrinivasan on 2016-03-04

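+1 for this performance enhancement. shift() is a major bottleneck in a lot of my code. It seems that, for a fixed number of rows, the time shift() takes to run is proportional to the number of groups in the data.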

library(data.table)

# Build table to store timings
timings <- CJ(RowCount = 10^7, Groups = 10^c(0:7))
timings[, SizePerGroup := RowCount/Groups]

# Loop through each experiment
for (i in seq_len(nrow(timings))) {  # iterate over the experiments, not over dt (which does not exist yet)
  print(paste0("Iteration: ", i))

  # Build dataset
  timings_i <- timings[i]
  dt <- data.table(Grp = rep(seq_len(timings_i$Groups), each = timings_i$SizePerGroup))
  dt[, Value := sample(100L, size = .N, replace = TRUE)]

  # Measure the time it takes to insert a column indicating the previous value by group
  elapsed <- system.time(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp])["elapsed"]
  timings[i, Elapsed := elapsed]
}

library(ggplot2)
ggplot(timings, aes(x = Groups, y = Elapsed))+geom_line()+geom_point()

[Screenshot: plot of Elapsed against Groups produced by the code above]

ben519 on 2018-11-10

@ben519 FYI, for the special case where your code looks like this, there is a shortcut:

library(data.table)
dt <- data.table(Grp = rep(seq_len(1e6), each=10L))
dt[, Value := sample(100L, size = .N, replace = TRUE)]

system.time(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp][])
#    user  system elapsed 
#   19.50    0.80   20.34
system.time(dt[, v := shift(Value, type = "lag")][rowid(Grp)==1L, v := NA][])
#    user  system elapsed 
#    1.00    0.87    1.25 

dt[, all.equal(v, PrevValueByGrp)]
# [1] TRUE

franknarf1 on 2018-11-10

👍3
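
The shortcut relies on each group's rows being adjacent, as they are in the rep(..., each = 10L) data. A minimal sketch of how the same idea might be applied when groups are interleaved follows. The toy data and the column names ref, naive and v are made up for illustration and do not come from the thread.

```r
library(data.table)

# Toy data with interleaved groups (hypothetical, for illustration only)
dt <- data.table(Grp   = c(1L, 2L, 1L, 2L, 1L),
                 Value = c(10L, 20L, 30L, 40L, 50L))

# Reference result: grouped shift()
dt[, ref := shift(Value, type = "lag"), by = Grp]

# The shortcut as written assumes contiguous groups, so it disagrees here
dt[, naive := shift(Value, type = "lag")][rowid(Grp) == 1L, naive := NA]

# Same idea applied in group order, then written back to the original rows
ord <- order(dt$Grp)                           # ties keep their original order
lagged <- shift(dt$Value[ord], type = "lag")   # one ungrouped shift
lagged[rowid(dt$Grp[ord]) == 1L] <- NA         # blank the first row of each group
dt[ord, v := lagged]

dt[, isTRUE(all.equal(naive, ref))]
# [1] FALSE
dt[, all.equal(v, ref)]
# [1] TRUE
```

Whether the extra ordering step still beats the grouped shift() on a given data.table version would need to be timed on real data.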
