As well as individuals within groups
species <- iris %.%
  group_by(Species) %.%
  summarise(wt = sum(Sepal.Length)) %.%
  sample_n(5, replace = T, weight = wt) %.%
  select(-wt)
inner_join(species, iris)
I wonder why this was closed? It seems like a potentially useful feature, e.g. to pick all the data from a random species:
iris %>%
  group_by(Species) %>%
  sample_n(1)
I don't think that sample_n's behavior should change for groups, because sampling within groups is its intuitive behavior. However, it's often handy to be able to sample groups as a whole. This should be a second function. Here is my implementation:
sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
  # regroup when done
  grps = tbl %>% groups %>% unlist %>% as.character
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% semi_join(keep) %>% group_by_(grps)
}
The example by @rcorty works just as expected:
iris %>% group_by(Species) %>% sample_n_groups(1)
+1
Edit: A change to dplyr broke this solution; scroll down for an updated version.
For those of you who arrived here via search engine looking for this functionality, the implementation by @MarcusWalz does not sample with replacement when replace = TRUE. The implementation needs to use right_join (or left_join or inner_join) to keep the duplicates:
sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
  # regroup when done
  grps = tbl %>% groups %>% unlist %>% as.character
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(grps)
}
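To illustrate the point about duplicates, here is a minimal sketch (not from the comments above) comparing the two kinds of join when the same group has been drawn twice. The `keep` table is built by hand purely for illustration, and `Species` is converted to character only to sidestep factor/character coercion in the join. On recent dplyr versions the inner join may also emit a many-to-many warning, which is expected here.

```r
library(dplyr)

# Pretend that sampling with replacement picked "setosa" twice
# (hand-built for illustration, not produced by sample_n).
keep <- tibble(Species = c("setosa", "setosa"))
iris_chr <- iris %>% mutate(Species = as.character(Species))

# semi_join only filters: each matching row of iris appears once.
n_semi <- nrow(semi_join(iris_chr, keep, by = "Species"))

# inner_join repeats the matching rows once per duplicate key.
n_inner <- nrow(suppressWarnings(inner_join(iris_chr, keep, by = "Species")))

n_semi   # 50
n_inner  # 100
```

So with a filtering join the second draw of a group is silently lost, while a mutating join keeps it.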
Cluster bootstrapping is a common use case for this feature.
@drhagen, in your implementation, do you have any suggestions for how to generate a new unique group id?
Actually, this is quite easy:
sample_n_groups = function(tbl, size, replace = FALSE, weight=NULL) {
  # regroup when done
  grps = tbl %>% groups %>% unlist %>% as.character
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% sample_n(size, replace, weight) %>%
    mutate(unique_id = 1:NROW(.))
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(grps)
}
The answer above by @drhagen looks like it's out of date. This seems to work now:
sample_n_groups = function(tbl, size, replace = FALSE, weight = NULL) {
  # regroup when done
  grps = tbl %>% groups %>% lapply(as.character) %>% unlist
  # check length of groups non-zero
  keep = tbl %>% summarise() %>% ungroup() %>% sample_n(size, replace, weight)
  # keep only selected groups, regroup because joins change count.
  # regrouping may be unnecessary but joins do something funky to grouping variable
  tbl %>% right_join(keep, by=grps) %>% group_by_(.dots = grps)
}
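For anyone on a newer dplyr, where group_by_ is deprecated and sample_n has been superseded by slice_sample, something along these lines might work. This is only a sketch, assuming dplyr >= 1.0.0: the name `sample_groups` is made up to avoid clashing with the versions above, the `weight` argument is dropped for simplicity, and the result is grouped by a new `unique_id` column so that a group drawn twice stays distinguishable.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): sample whole groups.
sample_groups <- function(tbl, size, replace = FALSE) {
  grps <- group_vars(tbl)
  stopifnot(length(grps) > 0)  # needs a grouped table

  # One row per group, then sample those rows and number the draws.
  keep <- tbl %>%
    group_keys() %>%
    slice_sample(n = size, replace = replace) %>%
    mutate(unique_id = row_number())

  # A mutating join keeps duplicates when a group is drawn twice.
  # With replace = TRUE, dplyr >= 1.1.0 may warn about a
  # many-to-many relationship; that is intended here.
  keep %>%
    inner_join(ungroup(tbl), by = grps) %>%
    group_by(unique_id)
}

set.seed(1)
boot <- suppressWarnings(
  iris %>% group_by(Species) %>% sample_groups(2, replace = TRUE)
)
```

Grouping by `unique_id` rather than the original variables is a design choice aimed at the cluster-bootstrap case, where each draw should count as its own cluster.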