Data.table: GForce๋Š” `:=`์—์„œ๋„ ์ž‘๋™ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์— ๋งŒ๋“  2015๋…„ 10์›” 29์ผ  ยท  3์ฝ”๋ฉ˜ํŠธ  ยท  ์ถœ์ฒ˜: Rdatatable/data.table

GForce enhancement performance

๊ฐ€์žฅ ์œ ์šฉํ•œ ๋Œ“๊ธ€

์ด ๊ธฐ๋Šฅ์„ ํ™œ์„ฑํ™”ํ•˜๋ฉด ์•ฝ๊ฐ„์˜ ์ž‘์—…์ด ํ•„์š”ํ•˜๋”๋ผ๋„ ๋ณต์žกํ•œ ํ‘œํ˜„์— GForce๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ๊ฐ•์กฐํ•˜๊ณ  ์‹ถ์—ˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ์ด ๊ฒŒ์‹œ๋ฌผ์—์„œ ๋‹ค์Œ ์„ ์œ„ํ•ด ํ™œ์„ฑํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„

slope <- function(x, y) {
  x_ux <- x - mean(x)
  uy <- mean(y)
  sum(x_ux * (y - uy)) / sum(x_ux ^ 2)
}

ํ•จ์œผ๋กœ์จ:

DT <- data.table(grp, x, y)
setkey(DT, grp)
DTsum <- DT[, .(ux=mean(x), uy=mean(y)), keyby=grp]
DT[DTsum, `:=`(x_ux=x - ux, y_uy=y - uy)]
DT[, `:=`(x_ux.y_uy=x_ux * y_uy, x_ux2=x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy=sum(x_ux.y_uy), x_ux2=sum(x_ux2)), keyby=grp]
res.slope.dt2 <- DTsum[, .(grp, V1=x_ux.y_uy / x_ux2)]

๋ฐ˜๋ฉด์— GForce๊ฐ€ := ์—์„œ ์ง€์›๋œ๋‹ค๋ฉด ์šฐ๋ฆฌ๋Š” ๋‹ค์Œ์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

DT <- data.table(grp, x, y)
DT[, `:=`(ux=mean(x), uy=mean(y)), keyby=grp]
DT[, `:=`(x_ux=x - ux, y_uy=y - uy)]
DT[, `:=`(x_ux.y_uy=x_ux * y_uy, x_ux2=x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy=sum(x_ux.y_uy), x_ux2=sum(x_ux2)), keyby=grp]
res.slope.dt3 <- DTsum[, .(grp, x_ux.y_uy/x_ux2)]

๋” ๊นจ๋—ํ•ด ๋ณด์ด๊ณ  ๋” ๋นจ๋ผ์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๋ชจ๋“  3 ๋Œ“๊ธ€

์˜ค๋Š˜ SO ์— ๋ฌธ์ œ์— ๋ถ€๋”ช์ณค์Šต๋‹ˆ๋‹ค.

actions = data.table(User_id = c("Carl","Carl","Carl","Lisa","Moe"),
                     category = c(1,1,2,2,1),
                     value= c(10,20,30,40,50))
users = actions[, other_var := 1, by=User_id]

# verbose says: the following is not optimized
users[, value_one := 0 ]
users[actions[category==1], value_one := sum(value), on="User_id", by=.EACHI, verbose=TRUE]

# verbose says: the following is optimized
rbind( 
    actions[category==1], 
    unique(actions[,"User_id", with=FALSE])[, value := 0 ],
fill=TRUE)[, sum(value), by=User_id, verbose=TRUE]

๋ณ€์ˆ˜๊ฐ€ ๊ฒฐ๊ตญ users ๋กœ ๋๋‚˜์•ผ ํ•œ๋‹ค๋Š” ์ ์„ ๊ณ ๋ คํ•  ๋•Œ ์ฒซ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ ๊ด€์šฉ์ ์œผ๋กœ ๋ณด์ž…๋‹ˆ๋‹ค.

๊ธฐํƒ€: https://stackoverflow.com/a/47338118/ (gtail)

๋˜ ๋‹ค๋ฅธ https://stackoverflow.com/a/51569126/ ์€ DT[, mx := max(pt), by=Subject][, diff := mx - pt][] ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๋ฉ”๋ชจ๋ฆฌ ์„ฑ๋Šฅ์— ํŠนํžˆ ๊ด€์‹ฌ์ด ์žˆ๋Š” ๋˜ ๋‹ค๋ฅธ ํ•ญ๋ชฉ: https://stackoverflow.com/q/52189712 "data.table ์ฐธ์กฐ ์˜๋ฏธ: ๋ชจ๋“  ์—ด์„ ํ†ตํ•œ ๋ฐ˜๋ณต์˜ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ"

๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” scale /demean ๋‹ค์ค‘ ๋ณ€์ˆ˜๋ฅผ ์›ํ•ฉ๋‹ˆ๋‹ค: https://stackoverflow.com/q/52528123

ํ•˜์œ„ ์ง‘ํ•ฉ ์กฐ๊ฑด์œผ๋กœ ๊ทธ๋ฃน๋ณ„๋กœ ์ตœ๋Œ€๊ฐ’์„ ์ทจํ•˜๊ณ  :=(akrun์˜ ๋‹ต๋ณ€ ์ฐธ์กฐ) https://stackoverflow.com/a/54911855/๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ๋„ #971์˜ ์ด๋ฏธ ์™„๋ฃŒ๋œ ๋ถ€๋ถ„๊ณผ ๊ด€๋ จ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๊ธฐ๋Šฅ์„ ํ™œ์„ฑํ™”ํ•˜๋ฉด ์•ฝ๊ฐ„์˜ ์ž‘์—…์ด ํ•„์š”ํ•˜๋”๋ผ๋„ ๋ณต์žกํ•œ ํ‘œํ˜„์— GForce๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ๊ฐ•์กฐํ•˜๊ณ  ์‹ถ์—ˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ์ด ๊ฒŒ์‹œ๋ฌผ์—์„œ ๋‹ค์Œ ์„ ์œ„ํ•ด ํ™œ์„ฑํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„

slope <- function(x, y) {
  x_ux <- x - mean(x)
  uy <- mean(y)
  sum(x_ux * (y - uy)) / sum(x_ux ^ 2)
}

ํ•จ์œผ๋กœ์จ:

DT <- data.table(grp, x, y)
setkey(DT, grp)
DTsum <- DT[, .(ux=mean(x), uy=mean(y)), keyby=grp]
DT[DTsum, `:=`(x_ux=x - ux, y_uy=y - uy)]
DT[, `:=`(x_ux.y_uy=x_ux * y_uy, x_ux2=x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy=sum(x_ux.y_uy), x_ux2=sum(x_ux2)), keyby=grp]
res.slope.dt2 <- DTsum[, .(grp, V1=x_ux.y_uy / x_ux2)]

๋ฐ˜๋ฉด์— GForce๊ฐ€ := ์—์„œ ์ง€์›๋œ๋‹ค๋ฉด ์šฐ๋ฆฌ๋Š” ๋‹ค์Œ์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

DT <- data.table(grp, x, y)
DT[, `:=`(ux=mean(x), uy=mean(y)), keyby=grp]
DT[, `:=`(x_ux=x - ux, y_uy=y - uy)]
DT[, `:=`(x_ux.y_uy=x_ux * y_uy, x_ux2=x_ux^2)]
DTsum <- DT[, .(x_ux.y_uy=sum(x_ux.y_uy), x_ux2=sum(x_ux2)), keyby=grp]
res.slope.dt3 <- DTsum[, .(grp, x_ux.y_uy/x_ux2)]

๋” ๊นจ๋—ํ•ด ๋ณด์ด๊ณ  ๋” ๋นจ๋ผ์•ผ ํ•ฉ๋‹ˆ๋‹ค.

@MichaelChirico์™€์˜ ํ† ๋ก ์„ ํ†ตํ•ด ์ด ๋ฌธ์ œ์™€ ๋งค์šฐ ๊ฐ€๊นŒ์šด ์‚ฌ์ดŒ์ด ๋‹ค์Œ๊ณผ

>   DT <- data.table(x, y, grp)
>   DT[, .(x, mean(x)), keyby=grp]
Detected that j uses these columns: x 
Finding groups using forderv ... 1.049s elapsed (0.946s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.011s elapsed (0.011s cpu) 
lapply optimization is on, j unchanged as 'list(x, mean(x))'
GForce is on, left j unchanged
Old mean optimization changed j from 'list(x, mean(x))' to 'list(x, .External(Cfastmean, x, FALSE))'
Making each group and running j (GForce FALSE) ... 
  collecting discontiguous groups took 1.293s for 999953 groups
  eval(j) took 1.860s for 999953 calls
5.517s elapsed (3.862s cpu) 
              grp         x        V2
       1:       1 0.2151365 0.5512966
       2:       1 0.5358256 0.5512966
       3:       1 0.8496598 0.5512966
       4:       1 0.8480730 0.5512966
       5:       1 0.3464458 0.5512966
      ---                            
 9999996: 1000000 0.2601940 0.5474986
 9999997: 1000000 0.7940921 0.5474986
 9999998: 1000000 0.3825493 0.5474986
 9999999: 1000000 0.1786861 0.5474986
10000000: 1000000 0.9179119 0.5474986

#523๊ณผ์˜ ๊ต์ฐจ ๋งํฌ.

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰