Dplyr: Preserve zero-length groups

Created on March 20, 2014  ·  44 comments  ·  Source: tidyverse/dplyr

http://stackoverflow.com/questions/22523131

I am not sure what the interface to this should be. It should probably default to drop = FALSE.

feature wip

All 44 comments

Thanks for opening this issue, Hadley.

:+1: I ran into the same issue today. drop = FALSE would be a great help for me!

Any idea on the time frame for putting .drop = FALSE into dplyr? I need this for certain rCharts to render correctly.

In the meantime, I got the answer from the linked question to work.
http://stackoverflow.com/questions/22523131

I grouped by two variables.

+1 for an option to not drop empty groups.

This may partially overlap with #486 and #413.

Not dropping empty groups would be very useful. It is often needed when creating summary tables.

+1 - this is a deal-breaker for many analyses.

I agree with all of the above; this would be very useful.

@romainfrancois Currently build_index_cpp() does not respect the drop attribute:

t1 <- data_frame(
  x = runif(10),
  g1 = rep(1:2, each = 5),
  g2 = factor(g1, 1:3)
)
g1 <- grouped_df(t1, list(quote(g2)), drop = FALSE)
attr(g1, "group_size")
# should be c(5L, 5L, 0L)
attr(g1, "indices")
# should be list(0:4, 5:9, integer(0))

The drop attribute only applies when grouping by a factor, in which case we need one group per factor level, regardless of whether the level actually occurs in the data.

This will also affect the single-table verbs in the following ways:

  • select() : no effect
  • arrange() : no effect
  • summarise() : functions applied to zero-row groups should be given zero-length vectors (e.g. integer(0)). n() should return 0, and mean(x) should return NaN.
  • filter() : the set of groups should remain constant, even if some groups no longer have any rows.
  • mutate() : there is no need to evaluate expressions for empty groups.
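
A minimal base-R sketch of the zero-row semantics listed for summarise() above. It does not use dplyr: the vector x and the factor g are made-up illustration data, and split() merely stands in for grouping, since it keeps unused factor levels by default.

```r
# Hypothetical data: the factor g has an unused level "c".
x <- c(1.5, 2.5, 4.0)
g <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))

# split() keeps empty levels by default (drop = FALSE),
# so the group for "c" is a zero-length numeric vector.
groups <- split(x, g)

# Count and mean per group, including the empty one.
counts <- vapply(groups, length, integer(1))
means  <- vapply(groups, mean, numeric(1))

print(counts)  # the count for "c" is 0
print(means)   # the mean for "c" is NaN
```

The empty group behaves exactly as the list asks: a length of 0 and a mean of NaN, with no special casing needed.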

Eventually, drop = FALSE will become the default, and if it is a hassle to write both the drop = FALSE and drop = TRUE branches, I would be happy to drop support for drop = FALSE (since you can always adjust the levels of the factor yourself, or use a character vector instead).

Does that make sense? If it is a lot of work, we can push it off to 0.4.

@statwonk , @wsurles , @jennybc , @slackline , @mcfrank , @eipi10 If you would like to help, the best thing to do would be to work on a set of test cases that exercise all the ways the different verbs might interact with zero-length groups.

Ah. I don't think I was aware of what drop was supposed to do. That makes it clear. I don't think it is a lot of work.

I have opened pull request #833, which tests whether the single-table verbs above handle zero-length groups correctly. Most of the tests are commented out, since dplyr of course currently fails them.

+1, any status update here? I love summarise, but I need to keep empty levels!

@ebergelson , here is my current hack to get zero-length groups. I often need this so that my bar charts will stack.

Here df has 3 columns: name, group and metric.

df2 <- expand.grid(name = unique(df$name), group = unique(df$group)) %>%
    left_join(df, by=c("name","group")) %>%
    mutate(metric = ifelse(is.na(metric),0,metric))

I do something similar: check for missing groups, and if there are any, generate all combinations and left_join.

Unfortunately, this issue does not seem to be getting much love, perhaps because this straightforward workaround exists.

@wsurles , @bpbond Thanks. Yes, I used a workaround similar to what you suggest! I would love to see a built-in fix like .drop.

Just to add to and agree with everyone above: this is a crucial aspect of many analyses. I would love to see an implementation.

Some more details are needed here:

If I have this:

> df <- data_frame( x = c(1,1,1,2,2), f = factor( c(1,2,3,1,1) ) )
> df
Source: local data frame [5 x 2]

  x f
1 1 1
2 1 2
3 1 3
4 2 1
5 2 1

And I group by x then f , I end up with 6 (2x3) groups, where the groups (2, 2) and (2, 3) are empty. That is fine. I think I can manage to implement that.

Now, what if I have this:

> df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )
> df
Source: local data frame [4 x 2]

  f x
1 1 1
2 1 2
3 2 1
4 2 4

I would like to group by f then x . What would the groups be? @hadley

stats::aggregate and plyr::ddply both return 4 groups in this case (1,1; 1,2; 2,1; and 2,4), so I think that is what we should conform to.

Shouldn't it agree with table() instead, i.e. return 9 groups?

> table(df$f, df$x)
  1 2 4
1 1 1 0
2 1 0 1
3 0 0 0

I would expect df %>% group_by(f, x) %>% tally to give basically the same result as with(df, as.data.frame(table(f, x))) and ddply(df, .(f, x), nrow, .drop=FALSE) .
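
For reference, the base-R half of that expectation can be run directly. This sketch uses only table() and as.data.frame() on the df from the example above; the column name Freq is what base R produces, not something dplyr would return.

```r
# The data frame from the example above, built with base R only.
df <- data.frame(f = factor(c(1, 1, 2, 2), levels = 1:3), x = c(1, 2, 1, 4))

# One row per combination of factor level and observed x value.
counts <- with(df, as.data.frame(table(f, x)))
print(counts)

nrow(counts)           # 9: 3 levels of f times 3 observed values of x
sum(counts$Freq == 0)  # 5 of those combinations are empty
```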

I thought the behavior we wanted was to preserve zero-length groups when they come from factors (like .drop in plyr), so I imagine we would want @huftis 's suggestion. I would suggest defaulting to drop = TRUE so that the default behavior does not change, as @bpbond suggests.

Hmm, it is hard to wrap my head around exactly what the behaviour should be. Do these very simple thought experiments look correct?

df <- data_frame(x = 1, y = factor(1, levels = 2))
df %>% group_by(x) %>% summarise(n())
#> x n
#> 1 1  

df %>% group_by(y) %>% summarise(n())
#> y n
#> 1 1
#> 2 0

df %>% group_by(x, y) %>% summarise(n())
#> x y n
#> 1 1 1
#> 1 2 0

But what if there are multiple values in x ? Should it work like this?

df <- data_frame(x = 1:2, y = factor(1, levels = 2))
df %>% group_by(x, y) %>% summarise(n())
#> x y n
#> 1 1 1
#> 2 1 1
#> 1 2 0
#> 2 2 0

Maybe preserving empty groups only makes sense when grouping by a single variable? If we frame it more realistically, e.g. data_frame(age_group = c(40, 60), sex = factor("M", levels = c("F", "M"))) , would you really want the counts for females? I think sometimes you would and sometimes you would not. Expanding all combinations seems like a somewhat different operation to me (and independent of the use of factors).

Maybe group_by needs both drop and expand arguments? drop = FALSE would keep all size-zero groups generated by factor levels that do not appear in the data. expand = TRUE would keep all size-zero groups generated by combinations of values that do not appear in the data.
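
The distinction between the two proposed arguments can be sketched in base R. Neither drop nor expand is an implemented group_by() argument at this point in the thread; the three grids below only illustrate which group keys each proposal would imply, using made-up data.

```r
# Hypothetical data: y has an unused level "c"; only (1, a) and (2, b) occur.
df <- data.frame(
  x = c(1, 2),
  y = factor(c("a", "b"), levels = c("a", "b", "c"))
)

# Current behaviour: only the combinations that occur in the data.
observed <- unique(df[c("x", "y")])

# The "expand" idea: every combination of the values that occur in the data.
expanded <- expand.grid(x = unique(df$x), y = unique(df$y))

# "expand" together with "drop = FALSE": unused factor levels are included too.
expanded_all <- expand.grid(x = unique(df$x), y = levels(df$y))

c(nrow(observed), nrow(expanded), nrow(expanded_all))  # 2, 4, 6
```

Under these assumptions the number of groups grows from 2 to 4 to 6, which is why the two options are worth discussing separately.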

@hadley Your examples look right to me (assuming that you meant levels = 1:2 , not levels = 2 ). And I think preserving empty groups makes sense even when grouping by multiple variables. For example, if the variables were sex ( male and female ) and answer (on a questionnaire, with levels disagree , neutral , agree ), and you wanted to count the frequency of each answer for each sex (e.g. for a table, or for later plotting), you would not want to drop an answer category for females only because no females chose it.

I would also expect the factor variables to remain factor variables in the resulting data_frame (not be converted to strings), with their _original levels_. (So when plotting the data, the answer categories would be in the right order, and not in alphabetical order: agree , disagree , neutral .)

For your last example, it would be natural to drop the sex variable _in some cases_ (e.g. if females were _intentionally_ not surveyed), and _in other cases_ not (e.g. by sex (and perhaps year)). But this could easily be handled _after_ aggregating the data. (Another solution would be to accept a _vector-valued_ .drop argument. That would be nice, but it might complicate matters.)

(Another solution would be to accept a vector-valued .drop argument. That would be nice, but it might complicate matters.)

Yes, probably too complicated. Otherwise I agree with @huftis 's comments.

@hadley
I think:
YES, expand to all combinations of the values present in the data being grouped by.
NO, do not expand on factor levels that do not exist in the data.

My most frequent use case is preparing a summarized data set for a chart (while exploring), and the charts need to have all combinations of values. But they do not need factor levels that are 0 for every group. For example, you cannot stack a bar chart without all the combinations. But you do not need factor values that do not exist in the data; these would just be 0 when stacked and an empty entry in the legend.

I think expanding all values being grouped by should be the default, because it is much easier (and much more intuitive) to filter out the 0 cases after the group-by if needed. I do not think a .drop argument is needed, because it is easy enough to filter out the 0 cases afterwards. We never use an extra argument in any of the other functions, so this would break the mold. The default should just be to show results for all combinations of existing values based on the group_by.

I think this is the right default behavior. Here the unique values expand only over the existing values of the factor, not over all the factor levels. (This is what I run after doing a group_by, which drops the 0 values.)

## Expand data so plot groups works correctly
  df2 <- expand.grid(name = unique(df$name), group = unique(df$group)) %>%
    left_join(df, by=c("name","group")) %>%
    mutate(
      measure = ifelse(is.na(measure),0,measure)
    )

The only case where I can see wanting values even if they are 0 across all groups is time data. There may be days with missing data in the middle. Here an expand and join on a date range would still be needed. The factor-level case does not apply, though. I think it is fair for the data cruncher to handle missing dates on their own.

Thanks for all your great work on this library. 90% of my job is using dplyr. : )

I strongly agree with @huftis.

I do not think the dropping of levels, or of combinations of levels, should be data-dependent. You might be prototyping a function or figure using a small sample. Or, if you are doing a split-apply-combine operation, you need to be guaranteed that the output for each group will be conformable with all the rest.
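
The conformability point can be illustrated with base R's table(), which keeps unused factor levels. The samples s1 and s2 are made up; the level names are borrowed from the questionnaire example earlier in the thread.

```r
lv <- c("disagree", "neutral", "agree")

# Two small hypothetical samples that each miss some answer categories.
s1 <- factor(c("agree", "agree"), levels = lv)
s2 <- factor(c("neutral", "disagree"), levels = lv)

# With levels preserved, the per-sample counts line up and can be stacked.
kept <- rbind(s1 = table(s1), s2 = table(s2))
print(kept)

# With unused levels dropped, the counts have different lengths.
length(table(droplevels(s1)))  # 1
length(table(droplevels(s2)))  # 2
```

Because the set of levels does not depend on which values happen to be sampled, the two count vectors always have the same shape.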

To soften my stance: I think it is worth considering whether the default behavior should differ when the grouping variable is already a proper factor versus when it is coerced to a factor. I can see that the obligation to keep unused levels might be weaker in the case of coercion. But if I have gone to the trouble of setting something up as a factor and taking control of the levels ... I usually have a good reason and do not want to constantly fight to preserve that.

FYI, I would like to see this feature as well. I have a scenario similar to the one described by @huftis and have to jump through hoops to get the results I need.

I came here from SO. Isn't this something that "tidyr"'s complete should help with?

Yes, it does. I only recently learned about 'complete', and it seems to accomplish this in a thoughtful way.
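
A sketch of the complete() approach mentioned here, assuming the dplyr and tidyr packages are installed. The data frame is made up; fill = list(n = 0L) replaces the NA counts that complete() would otherwise leave in the rows it adds.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: level 2 of the factor y never occurs.
df <- tibble(x = 1:2, y = factor(c(1, 1), levels = 1:2))

counts <- df %>%
  count(x, y) %>%                       # only the observed combinations
  complete(x, y, fill = list(n = 0L))   # add the missing ones with n = 0

print(counts)
```

For factor columns, complete() expands over all levels, not just the observed values, which is what makes it usable as a workaround for counts.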

Implementing this for SQL backends seems difficult, because by default they drop all empty groups. Shall we leave it at that, and perhaps implement tidyr::complete() for SQL?

I created issue #3033 without realizing that this issue already existed. Apologies for the duplicate. To add my own humble suggestion, I currently use pull() and forcats::fct_count() as a workaround for this issue.

I am not a fan of this method, though, because fct_count() betrays the tidy principle of always making the output the same type as the input (i.e. this function creates a tibble from a vector), and I have to rename the columns in the output. This makes 3 steps ( pull() %>% fct_count() %>% rename() ) where dplyr::count() was intended to cover one. It would be fantastic if forcats::fct_count() and dplyr::count() could somehow be combined, with forcats::fct_count() deprecated.
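
The three-step workaround described above might look like the following, assuming dplyr and forcats are installed. The data frame and the column name y are made up; fct_count() names its output columns f and n, hence the rename().

```r
library(dplyr)
library(forcats)

# Hypothetical data: level "b" never occurs.
df <- tibble(y = factor(c("a", "a"), levels = c("a", "b")))

counts <- df %>%
  pull(y) %>%        # extract the factor as a plain vector
  fct_count() %>%    # count every level, including unused ones
  rename(y = f)      # restore the original column name

print(counts)
```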

Does tidyr::complete() work for factors?

All factor levels and combinations of factor levels should be preserved by default. This behavior could be controlled by parameters such as drop , expand , etc. Thus the default behavior of dplyr::count() should be like this:

df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
df %>% dplyr::count(x, y)
#>  # A tibble: 4 x 3
#>       x        y       n
#>     <int>   <fct>    <int>
#> 1     1        1       1
#> 2     2        1       1
#> 3     1        2       0
#> 4     2        2       0

Zero-length groups (group combinations) can be filtered out later. But for exploratory analysis we must see the full picture.

  1. Is there any status update on a solution for this issue?
  2. Are there any plans to fully solve this issue?

2: Yes, definitely.
1: There are technical implementation difficulties with this, but I will look into it in the next few weeks.

We might be able to get away with this by expanding the data after the fact, something like this:

library(tidyverse)

truly_group_by <- function(data, ...){
  dots <- quos(...)
  data <- group_by( data, !!!dots )

  labels <- attr( data, "labels" )
  labnames <- names(labels)
  labels <- mutate( labels, ..index.. =  attr(data, "indices") )

  expanded <- labels %>%
    tidyr::expand( !!!dots ) %>%
    left_join( labels, by = labnames ) %>%
    mutate( ..index.. = map(..index.., ~if(is.null(.x)) integer() else .x ) )

  indices <- pull( expanded, ..index..)
  group_sizes <- map_int( indices, length)
  labels <- select( expanded, -..index..)

  attr(data, "labels")  <- labels
  attr(data, "indices") <- indices
  attr(data, "group_sizes") <- group_sizes

  data
}

df  <- data_frame(
  x = 1:2,
  y = factor(c(1, 1), levels = 1:2)
)
tally( truly_group_by(df, x, y) )
#> # A tibble: 4 x 3
#> # Groups:   x [?]
#>       x y         n
#>   <int> <fct> <int>
#> 1     1 1         1
#> 2     1 2         0
#> 3     2 1         1
#> 4     2 2         0
tally( truly_group_by(df, y, x) )
#> # A tibble: 4 x 3
#> # Groups:   y [?]
#>   y         x     n
#>   <fct> <int> <int>
#> 1 1         1     1
#> 2 1         2     1
#> 3 2         1     0
#> 4 2         2     0

Obviously, down the line this would be handled internally.

This seems to take care of the original question:

> df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
> df$b = factor(df$b, levels=1:3)
> df %>%
+   group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 2 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
> df %>%
+   truly_group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 3 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
3 3           0 FALSE

The key here is this:

 tidyr::expand( !!!dots ) %>%

which means expanding all possibilities, regardless of whether the variables are factors or not.

I would say we either:

  • expand everything when drop=FALSE , potentially having lots of zero-length groups
  • do what we do now when drop=TRUE

Maybe have a function to toggle dropping.

This is a relatively cheap operation, because it only involves manipulating the metadata, so maybe it is less risky to do this in R first?

Did you mean crossing() instead of expand() ?

Looking at the internals, do you agree that we "only" need to change build_index_cpp() , specifically the generation of the labels data frame, to make this happen?

Can we perhaps start by expanding only factors with drop = FALSE ? I have considered a "natural" syntax, but this might end up being too confusing (and perhaps not powerful enough):

group_by(data, crossing(col1, col2), col3)

Semantics: use all combinations of col1 and col2 , and the existing combinations with col3 .

Yes, I would say this only affects build_index_cpp and the generation of the labels , indices and group_sizes attributes (to be given a tidy structure as part of #3489).

The "only expanding factors" part of this discussion is what took so long.

What would the results of these be?

library(dplyr)

d <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  f2 = factor( rep( c("d", "e", "f", "g"), each = 2 ), levels = c("d", "e", "f", "g", "h") ),
  x  = 1:8,
  y  = rep( 1:4, each = 2)
)

f <- function(data, ...){
  group_by(data, !!!quos(...))  %>%
    tally()
}
f(d, f1, f2, x)
f(d, x, f1, f2)

f(d, f1, f2, x, y)
f(d, x, f1, f2, y)

I think f(d, f1, f2, x) should give the same results as f(d, x, f1, f2) , if the row order is ignored. The same goes for the other two.

Also interesting:

f(d, f2, x, f1, y)
d %>% sample_frac(0.3) %>% f(...)

I like the idea of implementing full expansion only for factors. For non-character data (including logicals), we could define or use a factor-like class that inherits the respective data type. Perhaps provided by forcats? This makes it harder to shoot yourself in the foot.

Implementation in progress in #3492:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )

( res1 <- tally(group_by(df,f,x, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   f [?]
#>   f         x     n
#>   <fct> <dbl> <int>
#> 1 1        1.     1
#> 2 1        2.     1
#> 3 1        4.     0
#> 4 2        1.     1
#> 5 2        2.     0
#> 6 2        4.     1
#> 7 3        1.     0
#> 8 3        2.     0
#> 9 3        4.     0
( res2 <- tally(group_by(df,x,f, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   x [?]
#>       x f         n
#>   <dbl> <fct> <int>
#> 1    1. 1         1
#> 2    1. 2         1
#> 3    1. 3         0
#> 4    2. 1         1
#> 5    2. 2         0
#> 6    2. 3         0
#> 7    4. 1         0
#> 8    4. 2         1
#> 9    4. 3         0

all.equal( res1, arrange(res2, f, x) )
#> [1] TRUE

all.equal( filter(res1, n>0), tally(group_by(df, f, x)) )
#> [1] TRUE
all.equal( filter(res2, n>0), tally(group_by(df, x, f)) )
#> [1] TRUE

Created on 2018-04-10 by the reprex package (v0.2.0).

As to whether complete() solves the problem: no, not really. Whatever summary is being computed, its behavior on empty vectors needs to be preserved, not patched up after the fact. For example:

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
     group_by(x) %>%
     summarize(min=min(y), sum=sum(y), prod=prod(y))
# Should be:
#> x       min   sum  prod
#> 1         4     9    20
#> 2       Inf     0     1

sum and prod (and to a lesser extent min ) (and various other functions) have very well-defined semantics on empty vectors, and it is not great to have to come along afterwards with complete() and redefine those behaviors.

@kenahoo I am not sure I understand. The only difference seems to be the warning from min() .

library(dplyr)

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
  group_by(x) %>%
  summarize(min=min(y), sum=sum(y), prod=prod(y))
#> # A tibble: 2 x 4
#>   x       min   sum  prod
#>   <fct> <dbl> <int> <dbl>
#> 1 1         4     9    20
#> 2 2       Inf     0     1

min(integer())
#> Warning in min(integer()): no non-missing arguments to min; returning Inf
#> [1] Inf
sum(integer())
#> [1] 0
prod(integer())
#> [1] 1

Created on 2018-05-15 by the reprex package (v0.2.0).

@romainfrancois Oh

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex) and link to this issue. https://reprex.tidyverse.org/
