Dplyr: Need to be able to sample groups

Created on 28 Mar 2014  ·  9 Comments  ·  Source: tidyverse/dplyr

As well as individuals within groups

All 9 comments

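# sample species with replacement, weighted by total sepal length
species <- iris %>%
  group_by(Species) %>%
  summarise(wt = sum(Sepal.Length)) %>%
  sample_n(5, replace = TRUE, weight = wt) %>%
  select(-wt)

# pull in every row of each sampled species; a species drawn twice appears twice
inner_join(species, iris, by = "Species")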

I wonder why this was closed? It seems like a potentially useful feature. For example, I would like to be able to write something like

iris %>%
    group_by(Species) %>%
    sample_n(1)

to pick all the data from a random species.

I don't think that sample_n's behavior should change for groups because sampling within groups is its intuitive behavior. However it's often handy to be able to sample groups as a whole. This should be a second function. Here is my implementation:

sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
   # regroup when done
   grps = tbl %>% groups %>% unlist %>% as.character
   # check length of groups non-zero
   keep = tbl %>% summarise() %>% sample_n(size, replace, weight)
   # keep only selected groups, regroup because joins change count.
   # regrouping may be unnecessary but joins do something funky to grouping variable
   tbl %>% semi_join(keep) %>% group_by_(grps) 
}

The example by @rcorty works just as expected:

iris %>% group_by(Species) %>% sample_n_groups(1)

+1

Edit: A change to dplyr broke this solution; scroll down for an updated version.


For those of you who arrived here via search engine looking for this functionality, the implementation by @MarcusWalz does not sample with replacement when replace = TRUE. The implementation needs to use right_join (or left_join or inner_join) to keep the duplicates:

sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
  # regroup when done
  grps = tbl %>% groups %>% unlist %>% as.character
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(grps) 
}
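
To see why the join type matters, here is a small sketch on toy data. The `tbl` and `keep` tables and their column names are made up for illustration, and I'm assuming a reasonably recent dplyr:

```r
library(dplyr)

# Toy data: two clusters with two rows each (illustrative names only).
tbl <- tibble(cluster = c("a", "a", "b", "b"), value = 1:4)

# Pretend cluster "a" was drawn twice when sampling with replacement.
keep <- tibble(cluster = c("a", "a"))

# semi_join() only filters `tbl`: each matching row is returned once,
# no matter how many times its key occurs in `keep`.
n_semi <- nrow(semi_join(tbl, keep, by = "cluster"))

# right_join() returns one row per match, so the repeated draw repeats
# the cluster's rows. Newer dplyr releases may warn about a many-to-many
# match here; that is the intended behaviour, hence suppressWarnings().
n_right <- suppressWarnings(nrow(right_join(tbl, keep, by = "cluster")))

c(semi = n_semi, right = n_right)
```

With semi_join the repeated draw is silently lost, while right_join keeps both copies of cluster "a".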

Cluster bootstrapping is a common use case for this feature.

@drhagen, in your implementation, do you have any suggestions for how to generate a new unique group id?

Actually, this is quite easy:

sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
  # regroup when done
  grps = tbl %>% groups %>% unlist %>% as.character
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% sample_n(size, replace, weight) %>% 
    mutate(unique_id = 1:NROW(.))
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(grps) 
}
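
One thing to watch for, as far as I can tell: the function above adds unique_id but still regroups by the original grouping variables, so a group that was drawn twice ends up as a single group of doubled size. A rough sketch on toy data (the table and column names are made up) of grouping by the id instead:

```r
library(dplyr)

# Toy data: two clusters with two rows each (illustrative names only).
tbl <- tibble(cluster = c("a", "a", "b", "b"), value = 1:4)

# Suppose cluster "a" was drawn twice; unique_id numbers the draws.
keep <- tibble(cluster = c("a", "a")) %>%
  mutate(unique_id = row_number())

# Newer dplyr releases may warn about the many-to-many match, which is
# intended here, hence suppressWarnings().
resampled <- suppressWarnings(inner_join(tbl, keep, by = "cluster"))

# Grouping by the original variable merges the two draws into one group...
by_cluster <- resampled %>% group_by(cluster) %>% summarise(n = n())

# ...while grouping by unique_id keeps each draw as its own group.
by_draw <- resampled %>% group_by(unique_id) %>% summarise(n = n())
```

Here by_cluster has one row covering all four rows, and by_draw has one row per draw with two rows each.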

The answer above by @drhagen looks like it's out of date. This seems to work now:

sample_n_groups = function(tbl, size, replace = FALSE, weight = NULL) {
  # names of the grouping variables, used for the join and to regroup when done
  grps = tbl %>% groups %>% lapply(as.character) %>% unlist
  # one row per group, then sample whole groups (note: no check that tbl is grouped)
  keep = tbl %>% summarise() %>% ungroup() %>% sample_n(size, replace, weight)
  # keep only the selected groups; right_join repeats a group drawn more than once.
  # regrouping may be unnecessary, but joins can alter the grouping
  tbl %>% right_join(keep, by=grps) %>% group_by_(.dots = grps)
}
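
For anyone landing here later: group_by_() has since been deprecated and sample_n() superseded by slice_sample(). Below is a rough sketch of the same idea, assuming dplyr 1.0 or later. The name `sample_n_groups2` and the `draw_id` column are my own and not part of dplyr or of the versions above, and I've left out the weight argument to keep it short:

```r
library(dplyr)

# Hypothetical rewrite for dplyr >= 1.0; a sketch, not code from dplyr itself.
sample_n_groups2 <- function(tbl, size, replace = FALSE) {
  grps <- group_vars(tbl)
  stopifnot(length(grps) > 0)  # needs a grouped table

  # One row per group.
  keys <- tbl %>% ungroup() %>% select(all_of(grps)) %>% distinct()

  # Sample whole groups; draw_id numbers the draws so that a group
  # drawn twice stays distinguishable.
  keep <- keys %>%
    slice_sample(n = size, replace = replace) %>%
    mutate(draw_id = row_number())

  # Newer dplyr releases may warn about a many-to-many match when a group
  # is drawn more than once; that is intended, hence suppressWarnings().
  joined <- suppressWarnings(inner_join(ungroup(tbl), keep, by = grps))

  joined %>% group_by(across(all_of(c(grps, "draw_id"))))
}

# Usage: two species without replacement, then a cluster-bootstrap style
# resample of five species with replacement.
set.seed(1)
two_species <- iris %>% group_by(Species) %>% sample_n_groups2(2)
boot <- iris %>% group_by(Species) %>% sample_n_groups2(5, replace = TRUE)
```

Grouping by draw_id as well as the original variables means each draw is its own group, which is usually what you want for a cluster bootstrap.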