Data.table: shift() in data.table v1.9.6 is slow for many groups

Created on 11 Feb 2016  ·  3 comments  ·  Source: Rdatatable/data.table

Hi!
When used with by over many groups, shift is much slower than shifting manually.
See: http://stackoverflow.com/questions/35179911/shift-in-data-table-v1-9-6-is-slow-for-many-groups
For a more detailed example, see https://github.com/nachti/datatable_test/blob/master/leadtest.R
Cheers,
Gerhard

GForce performance

Most helpful comment

@ben519 FYI, there is a shortcut for the special case where your code looks like this:

library(data.table)
dt <- data.table(Grp = rep(seq_len(1e6), each=10L))
dt[, Value := sample(100L, size = .N, replace = TRUE)]

system.time(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp][])
#    user  system elapsed 
#   19.50    0.80   20.34
system.time(dt[, v := shift(Value, type = "lag")][rowid(Grp)==1L, v := NA][])
#    user  system elapsed 
#    1.00    0.87    1.25 

dt[, all.equal(v, PrevValueByGrp)]
# [1] TRUE
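
The shortcut above replaces one shift() call per group with a single ungrouped shift(), then resets the first row of each group to NA. A minimal sketch on a tiny table, assuming the rows of each group are contiguous (as they are in the benchmark, where Grp is built with rep(..., each = 10L)). The tables small and mixed and the columns by_group and shortcut are made up for illustration and do not appear in the thread. As far as I know, rowid() needs data.table 1.9.8 or later.

```r
library(data.table)

# Hypothetical toy data: three groups stored in contiguous blocks.
small <- data.table(Grp   = c(1L, 1L, 1L, 2L, 2L, 3L),
                    Value = c(10L, 20L, 30L, 40L, 50L, 60L))

# Grouped lag: shift() is evaluated once per group.
small[, by_group := shift(Value, type = "lag"), by = Grp]

# Shortcut: one ungrouped shift(), then blank out the first row of each group.
small[, shortcut := shift(Value, type = "lag")]
small[rowid(Grp) == 1L, shortcut := NA]

identical(small$by_group, small$shortcut)
# [1] TRUE

# Counter-example: when groups are interleaved, the ungrouped lag
# pulls values from neighbouring groups and the results differ.
mixed <- data.table(Grp = c(1L, 2L, 1L, 2L), Value = c(10L, 20L, 30L, 40L))
mixed[, by_group := shift(Value, type = "lag"), by = Grp]
mixed[, shortcut := shift(Value, type = "lag")]
mixed[rowid(Grp) == 1L, shortcut := NA]

identical(mixed$by_group, mixed$shortcut)
# [1] FALSE
```

If the groups are not already contiguous, sorting by the grouping column first (for example with setorder(dt, Grp)) would be one way to make the shortcut applicable, at the cost of changing the row order.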

All 3 comments

Not surprising. This will go away once gforce is optimised for :=. I think it's on the list for this release.

+1 for this performance improvement. shift() is a major bottleneck in a lot of my code. The time it takes to execute shift() on a fixed number of rows appears to be proportional to the number of groups in the data.

library(data.table)

# Build table to store timings
timings <- CJ(RowCount = 10^7, Groups = 10^c(0:7))
timings[, SizePerGroup := RowCount/Groups]

# Loop through each experiment
for(i in seq_len(nrow(timings))){
  print(paste0("Iteration: ", i))

  # Build dataset
  timings_i <- timings[i]
  dt <- data.table(Grp = rep(seq_len(timings_i$Groups), each = timings_i$SizePerGroup))
  dt[, Value := sample(100, size = .N, replace = TRUE)]

  # Measure the time it takes to insert a column indicating the previous value by group
  elapsed <- system.time(dt[, PrevValueByGrp := shift(Value, type = "lag"), by = Grp])["elapsed"]
  timings[i, Elapsed := elapsed]
}

library(ggplot2)
ggplot(timings, aes(x = Groups, y = Elapsed))+geom_line()+geom_point()

[Screenshot of the resulting plot, taken 2018-11-10]

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰