Dplyr: 需要能够对组进行抽样

创建于 2014-03-28  ·  9评论  ·  资料来源: tidyverse/dplyr

以及团体中的个人

最有用的评论

@drhagen上面的

sample_n_groups = function(tbl, size, replace = FALSE, weight = NULL) {
  # regroup when done
  grps = tbl %>% groups %>% lapply(as.character) %>% unlist
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% ungroup() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(.dots = grps)
}

所有9条评论

species <- iris %.% 
  group_by(Species) %.% 
  summarise(wt = sum(Sepal.Length)) %.%
  sample_n(5, replace = T, weight = wt) %.%
  select(-wt)

inner_join(species, iris)

我想知道为什么这被关闭了? 似乎是一个潜在有用的功能

iris %>%
    group_by(Species) %>%
    sample_n(1)

从随机物种中选择所有数据,例如

我不认为sample_n的行为应该因组而改变,因为组内采样是其直观行为。 然而,能够将组作为一个整体进行采样通常很方便。 这应该是第二个功能。 这是我的实现:

sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
   # regroup when done
   grps = tbl %>% groups %>% unlist %>% as.character
   # check length of groups non-zero
   keep = tbl %>% summarise() %>% sample_n(size, replace, weight)
   # keep only selected groups, regroup because joins change count.
   # regrouping may be unnecessary but joins do something funky to grouping variable
   tbl %>% semi_join(keep) %>% group_by_(grps) 
}

@rcorty的示例工作正常

iris %>% group_by(Species) %>% sample_n_groups(1)

+1

编辑:对dplyr更改破坏了此解决方案;


对于那些通过搜索引擎到达这里寻找此功能的人, @MarcusWalz的实现不会在replace = TRUE时进行替换采样。 实现需要使用right_join (或left_joininner_join )来保留重复项:

sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
  # regroup when done
  grps = tbl %>% groups %>% unlist %>% as.character
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(grps) 
}

集群引导是此功能的广泛用例。

@drhagen ,在您的实现中,您对如何生成新的唯一组 ID 有什么建议吗?

其实,这很简单:

sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
  # regroup when done
  grps = tbl %>% groups %>% unlist %>% as.character
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% sample_n(size, replace, weight) %>% 
    mutate(unique_id = 1:NROW(.))
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(grps) 
}

@drhagen上面的

sample_n_groups = function(tbl, size, replace = FALSE, weight = NULL) {
  # regroup when done
  grps = tbl %>% groups %>% lapply(as.character) %>% unlist
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% ungroup() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(.dots = grps)
}
此页面是否有帮助?
0 / 5 - 0 等级