dplyr: Preserve zero-length groups

Created on 20 Mar 2014  ·  44 comments  ·  Source: tidyverse/dplyr

http://stackoverflow.com/questions/22523131

Not sure what the interface to this should be - probably should default to drop = FALSE.

Labels: feature, wip

Most helpful comment

+1 - this is a deal-breaker for many analyses

All 44 comments

Thanks for opening up this issue Hadley.

:+1: ran into the same issue today, drop = FALSE would be a big help for me!

Any idea on the time frame for putting a .drop = FALSE equivalent into dplyr? I need this for certain rCharts to render correctly.

In the meantime I did get the answer in your link to work.
http://stackoverflow.com/questions/22523131

I grouped by two variables.

+1 for option to not drop empty groups

May be some overlap with #486 and #413.

Not dropping empty groups would be very useful. Often needed when creating summary tables.

+1 - this is a deal-breaker for many analyses

I agree with all the above; this would be very useful.

@romainfrancois Currently build_index_cpp() doesn't respect the drop attribute:

t1 <- data_frame(
  x = runif(10),
  g1 = rep(1:2, each = 5),
  g2 = factor(g1, 1:3)
)
g1 <- grouped_df(t1, list(quote(g2)), drop = FALSE)
attr(g1, "group_size")
# should be c(5L, 5L, 0L)
attr(g1, "indices")
# should be list(0:4, 5:9, integer(0))

The drop attribute only applies when grouping by a factor, in which case we need to have one group per factor level, regardless of whether or not the level actually applies to the data.
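
For intuition, base R's split() has the same drop semantics (this is just an illustration using t1 from above, not how dplyr implements it):

split(t1$x, t1$g2)                # $`3` is numeric(0): one group per level
split(t1$x, t1$g2, drop = TRUE)   # the empty level-3 group disappears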

This will also affect the single table verbs in the following ways:

  • select(): no effect
  • arrange(): no effect
  • summarise(): functions applied to zero-row groups should be given zero-length inputs; n() should return 0 and mean(x) should return NaN (see the sketch after this list)
  • filter(): the set of groups should remain constant, even if some groups now have no rows
  • mutate(): don't need to evaluate expressions for empty groups
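
What those zero-length inputs mean in practice is just base R vector semantics, which the summary functions would inherit:

length(numeric(0))   # 0   -> n() on an empty group is 0
mean(numeric(0))     # NaN -> mean(x) on an empty group is NaN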

Eventually, drop = FALSE will be the default, and if it's a hassle to write both drop = FALSE and drop = TRUE branches, I'd happily drop support for drop = TRUE (since you can always re-level the factor yourself, or use a character vector instead).
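
As a sketch of those two escape hatches (using t1 from above):

droplevels(t1$g2)     # re-level: the unused level 3 is removed before grouping
as.character(t1$g2)   # a character vector carries no level set, so nothing to preserve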

Does that make sense? If it's a lot of work, we can push it off to 0.4.

@statwonk, @wsurles, @jennybc, @slackline, @mcfrank, @eipi10 If you'd like to help, the best thing to do would be to work on a set of test cases that exercises all the ways the different verbs might interact with zero-length groups.
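
A minimal sketch of one such test case, assuming testthat conventions and a group_by(..., drop = FALSE) interface:

library(testthat)
library(dplyr)

test_that("summarise() keeps zero-length factor groups when drop = FALSE", {
  df <- data_frame(g = factor(1:2, levels = 1:3), x = c(10, 20))
  res <- df %>% group_by(g, drop = FALSE) %>% summarise(n = n(), m = mean(x))
  expect_equal(nrow(res), 3L)          # one row per level, including empty level 3
  expect_equal(res$n, c(1L, 1L, 0L))   # n() is 0 for the empty group
  expect_true(is.nan(res$m[3]))        # mean() of a zero-length input is NaN
})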

Ah. I think I just did not know what drop was supposed to do. That makes it clear. I don't think it's a lot of work.

I have opened pull request #833 which tests whether the single table verbs above handle zero-length groups correctly. Most of the tests are commented out, because dplyr currently fails them, of course.

+1, any status updates here? Love summarise, need to keep empty levels!

@ebergelson, Here is my current hack to get zero-length groups. I often need this so my bar charts will stack.

Here df has 3 columns: name, group, and metric

df2 <- expand.grid(name = unique(df$name), group = unique(df$group)) %>%
    left_join(df, by=c("name","group")) %>%
    mutate(metric = ifelse(is.na(metric),0,metric))

I do something similar: check for missing groups, then if any are found, generate all combinations and left_join.
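
A sketch of that check-then-expand idea (the name, group, and metric columns are assumed, as in the previous comment):

missing_combos <- anti_join(
  expand.grid(name = unique(df$name), group = unique(df$group),
              stringsAsFactors = FALSE),
  df, by = c("name", "group")
)
if (nrow(missing_combos) > 0) {
  df <- bind_rows(df, mutate(missing_combos, metric = 0))
}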

Unfortunately, it doesn't seem like this issue is getting much love...perhaps because there is this straightforward workaround.

@wsurles, @bpbond Thanks, yes I used a similar workaround to what you suggest! Would love to see a built-in fix like .drop.

Just to add and agree with everyone above - this is a super critical aspect of many analyses. Would love to see an implementation.

Some more details needed here:

If I have this:

> df <- data_frame( x = c(1,1,1,2,2), f = factor( c(1,2,3,1,1) ) )
> df
Source: local data frame [5 x 2]

  x f
1 1 1
2 1 2
3 1 3
4 2 1
5 2 1

And if I group by x then f, I'd end up with 6 (2x3) groups, where the groups (2,2) and (2,3) are empty. That's ok. I can manage to implement that, I think.

now, what if I have this:

> df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )
> df
Source: local data frame [4 x 2]

  f x
1 1 1
2 1 2
3 2 1
4 2 4

and I want to group by f then x. What would the groups be? @hadley

Both stats::aggregate and plyr::ddply return 4 groups in this case (1,1; 1,2; 2,1; and 2,4), so I'd suggest that's the behavior to conform to.
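
For a quick check of both behaviours (assuming plyr is installed):

aggregate(df$x, by = list(f = df$f, x = df$x), FUN = length)   # 4 rows
plyr::ddply(df, c("f", "x"), nrow)                             # 4 rows; .drop = TRUE is the default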

Shouldn’t it agree with table() instead, i.e., return 9 groups?

> table(df$f, df$x)
  1 2 4
1 1 1 0
2 1 0 1
3 0 0 0

I would expect df %>% group_by(f, x) %>% tally() to basically give the same result as with(df, as.data.frame(table(f, x))) and ddply(df, .(f, x), nrow, .drop=FALSE).

I thought our desired behavior was to preserve zero-length groups if they are factors (like .drop in plyr), so I would imagine we'd want @huftis's suggestion. I would suggest that the default be drop = TRUE though, so that the default behavior does not change, re @bpbond's suggestion.

Hmmm, it's hard to wrap my head around exactly what the behaviour should be. Do these very simple thought experiments look correct?

df <- data_frame(x = 1, y = factor(1, levels = 2))
df %>% group_by(x) %>% summarise(n())
#> x n
#> 1 1  

df %>% group_by(y) %>% summarise(n())
#> y n
#> 1 1
#> 2 0

df %>% group_by(x, y) %>% summarise(n())
#> x y n
#> 1 1 1
#> 1 2 0

But what if x has multiple values? Should it work like this?

df <- data_frame(x = 1:2, y = factor(1, levels = 2))
df %>% group_by(x, y) %>% summarise(n())
#> x y n
#> 1 1 1
#> 2 1 1
#> 1 2 0
#> 2 2 0

Maybe preserving empty groups only makes sense when grouping by a single variable? If we frame it more realistically, e.g. data_frame(age_group = c(40, 60), sex = factor("M", levels = c("F", "M"))), would you really want the counts for females? I think sometimes you would and sometimes you wouldn't. Expanding all combinations seems like a somewhat different operation to me (and independent of the use of factors).

Maybe group_by needs both drop and expand arguments? drop = FALSE would keep all size zero groups generated by factor levels that don't appear in the data. expand = TRUE would keep all size zero groups generated by combinations of values that don't appear in the data.
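
Something like this, hypothetically (neither argument exists at this point; df is the example above):

df %>% group_by(y, drop = FALSE)       # keep empty groups from unused factor levels
df %>% group_by(x, y, expand = TRUE)   # keep empty groups from unseen value combinations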

@hadley Your examples look right to me (assuming you meant levels = 1:2, not levels = 2). And I think preserving empty groups makes sense even when grouping by several variables. For example, if the variables were sex (male and female) and answer (on a questionnaire, with levels disagree, neutral, agree), and you wanted to count the frequency of each answer for each sex (e.g. for a table, or for later plotting), you wouldn’t want to drop an answer category just because no females answered it.

I would also expect the factor variables to remain factor variables in the resulting data_frame (not converted to strings), and with the _original levels_. (So when plotting the data, the answer categories would be in the correct order, not the alphabetical agree, disagree, neutral).

For your last example, it would _in some cases_ be natural to drop the sex variable (e.g., if _intentionally_ no females were surveyed), and _in other cases_ not (e.g., when counting the number of birth defects stratified by sex (and perhaps year)). But this can (and should) easily be dealt with _after_ aggregating the data. (A different solution would be to accept a _vector-valued_ .drop argument. That would be nice, but I guess it might complicate things?)

(A different solution would be to accept a vector-valued .drop argument. That would be nice, but I guess it might complicate things?)

Yes, probably too complicated. Otherwise I agree with @huftis 's comments.

@hadley
I think:
YES: expand to all combinations of values in the group_by if they exist in the data.
NO: do not expand to factor levels that don't exist in the data.

My most common use case is preparing a set of summarized data for a chart (during exploration), and the charts need to have all combinations of values. But they do not need factor levels that are 0 for all groups. E.g. you cannot stack a bar chart without all combinations, but you do not need factor levels that don't exist in the data; these would just be 0 when stacked and an empty value in the legend.

I believe expanding on all values in the group_by should be the default, because it's much easier (and much more intuitive) to filter 0 cases after the group_by if needed. I don't think a .drop argument is necessary, because it is easy enough to filter 0 cases afterwards (see the sketch below). We don't use additional arguments for any of the other functions, so this would break the mold. The default should just be to show results for all combos of existing values based on the group_by.
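
Filtering the zero-length groups out afterwards would be a one-liner, e.g. (column names assumed):

df %>% group_by(name, group) %>% tally() %>% filter(n > 0)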

I think this would be the correct default behavior. Here unique() only expands on values present in the factor, not all factor levels. (This is what I run after a group_by that drops 0-value groups.)

## Expand data so plot groups works correctly
df2 <- expand.grid(name = unique(df$name), group = unique(df$group)) %>%
  left_join(df, by = c("name", "group")) %>%
  mutate(
    measure = ifelse(is.na(measure), 0, measure)
  )

The only case I can see where you would want a value even though all groups had zero is with time data. Maybe a day of data is missing somewhere in the middle. Here an expand and join on a date range would still be necessary, and the factor level case would not apply. I think it's fair for the data cruncher to handle the missing dates on their own (see the sketch below).
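
A sketch of that date-range fill (a date column and a metric column are assumed):

all_days <- data.frame(date = seq(min(df$date), max(df$date), by = "day"))
df2 <- all_days %>%
  left_join(df, by = "date") %>%
  mutate(metric = ifelse(is.na(metric), 0, metric))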

Thanks for all your great work on this library. 90% of my job is using dplyr. : )

I strongly agree with @huftis.

I don't think dropping of levels or combinations of levels should have anything to do with the data. You might be prototyping a function or figure using a small sample. Or doing a split-apply-combine operation, in which case you want a guarantee that the output of each group will be conformable with all the rest.

Softening my position: I think it's worth considering whether default behavior should differ when the grouping variable is already a proper factor vs. when it is being coerced to factor. I can see that the obligation to keep unused levels might be less in the coercion case. But if I have gone to the trouble to set something up as a factor and take control of the levels ... there's usually a good reason and I don't want to constantly fight to preserve that.

FWIW, I would like to see this feature, too. I have a similar scenario as described by @huftis and have to jump through hoops to get the results I need.

Came over here from SO. Isn't this what complete from "tidyr" is supposed to help with?

Yes it does. I actually just learned about 'complete' recently and it seems to accomplish this in a thoughtful way.
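
For reference, a sketch of the complete() workaround (column names assumed):

library(tidyr)
df %>%
  count(name, group) %>%
  complete(name, group, fill = list(n = 0))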

Implementing that for SQL backends looks difficult, because they will by default drop all groups. Shall we leave it at that and perhaps implement tidyr::complete() for SQL?

I created issue #3033 not realizing that this issue already existed - apologies for the duplicate. To add my own humble suggestion, I currently use pull() and forcats::fct_count() as a work-around to this issue.

I'm not a fan of this method though because fct_count() betrays the tidyverse principle of making an output that is always the same type as the input (i.e. this function creates a tibble out of a vector), and I have to rename the columns in the output. This creates 3 steps (pull() %>% fct_count() %>% rename()) when dplyr::count() was meant to cover one. It would be fantastic if forcats::fct_count() and dplyr::count() could be amalgamated somehow, and deprecate forcats::fct_count().
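
For concreteness, the three-step version looks something like this (the answer column is assumed; fct_count() returns columns f and n):

df %>%
  pull(answer) %>%
  forcats::fct_count() %>%
  rename(answer = f)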

Does tidyr::complete() work for factors?

All factor levels and combinations of factor levels must be preserved by default. This behavior can be controlled by parameters such as drop, expand, etc. Thus the default behavior of dplyr::count() should be like this:

df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
df %>% dplyr::count(x, y)
#> # A tibble: 4 x 3
#>       x y         n
#>   <int> <fct> <int>
#> 1     1 1         1
#> 2     2 1         1
#> 3     1 2         0
#> 4     2 2         0

Zero-length groups (and combinations of groups) can be filtered out later. But for exploratory analysis we must see the full picture.

  1. Are there any status updates on the solution to this issue?
  2. Are there any plans to completely solve this issue?

2: yes definitely
1: There are some technical implementation difficulties about this issue, but I'll look into it in the next few weeks.

We might get away with this by expanding the data after the fact, something like this:

library(tidyverse)

truly_group_by <- function(data, ...){
  dots <- quos(...)
  data <- group_by( data, !!!dots )

  # labels has one row per observed group; attach the matching row indices
  labels <- attr( data, "labels" )
  labnames <- names(labels)
  labels <- mutate( labels, ..index.. =  attr(data, "indices") )

  # expand to all combinations of the grouping values, join the observed
  # groups back in, and give unobserved combinations an empty index
  expanded <- labels %>%
    tidyr::expand( !!!dots ) %>%
    left_join( labels, by = labnames ) %>%
    mutate( ..index.. = map(..index.., ~if(is.null(.x)) integer() else .x ) )

  # rewrite the grouping metadata with the expanded groups
  indices <- pull( expanded, ..index..)
  group_sizes <- map_int( indices, length)
  labels <- select( expanded, -..index..)

  attr(data, "labels")  <- labels
  attr(data, "indices") <- indices
  attr(data, "group_sizes") <- group_sizes

  data
}

df  <- data_frame(
  x = 1:2,
  y = factor(c(1, 1), levels = 1:2)
)
tally( truly_group_by(df, x, y) )
#> # A tibble: 4 x 3
#> # Groups:   x [?]
#>       x y         n
#>   <int> <fct> <int>
#> 1     1 1         1
#> 2     1 2         0
#> 3     2 1         1
#> 4     2 2         0
tally( truly_group_by(df, y, x) )
#> # A tibble: 4 x 3
#> # Groups:   y [?]
#>   y         x     n
#>   <fct> <int> <int>
#> 1 1         1     1
#> 2 1         2     1
#> 3 2         1     0
#> 4 2         2     0

obviously down the line, this would be handled internally, sans using tidyr or purrr.

This seems to take care of the original question on SO:

> df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
> df$b = factor(df$b, levels=1:3)
> df %>%
+   group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 2 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
> df %>%
+   truly_group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 3 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
3 3           0 FALSE

The key here being this

 tidyr::expand( !!!dots ) %>%

which means expanding all possibilities regardless of the variables being factors or not.

I'd say we either:

  • expand all when drop=FALSE, potentially having lots of 0 length groups
  • do what we do now if drop=TRUE

perhaps have a function to toggle dropness.

This is a relatively cheap operation, I'd say, because it only involves manipulating the metadata, so perhaps it is less risky to do this in R first?

Did you mean crossing() instead of expand()?

Looking at the internals, do you agree that we "only" need to change build_index_cpp(), specifically the generation of the labels data frame, to make this happen?

Can we perhaps start with expanding only factors with drop = FALSE? I considered a "natural" syntax, but this may be too confusing in the end (and perhaps even not powerful enough):

group_by(data, crossing(col1, col2), col3)

Semantics: use all combinations of col1 and col2, and only their existing combinations with col3.

Yes, I'd say this only affects build_index_cpp and the generation of the attributes labels, indices and group_sizes which I'd like to squash in a tidy structure as part of #3489

The "only expanding factors" part of this discussion is what took so long.

What would be the results of these:

library(dplyr)

d <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  f2 = factor( rep( c("d", "e", "f", "g"), each = 2 ), levels = c("d", "e", "f", "g", "h") ),
  x  = 1:8,
  y  = rep( 1:4, each = 2)
)

f <- function(data, ...){
  group_by(data, !!!quos(...))  %>%
    tally()
}
f(d, f1, f2, x)
f(d, x, f1, f2)

f(d, f1, f2, x, y)
f(d, x, f1, f2, y)

I think f(d, f1, f2, x) should give the same results as f(d, x, f1, f2), if row order is ignored. Same for the other two.
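
One way to check that order-invariance claim once this is implemented (a sketch):

r1 <- f(d, f1, f2, x) %>% ungroup() %>% arrange(f1, f2, x)
r2 <- f(d, x, f1, f2) %>% ungroup() %>% select(f1, f2, x, n) %>% arrange(f1, f2, x)
all.equal(as.data.frame(r1), as.data.frame(r2))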

Also interesting:

f(d, f2, x, f1, y)
d %>% sample_frac(0.3) %>% f(...)

I like the idea of implementing full expansion only for factors. For non-character data (including logicals), we could define/use a factor-like class that inherits the respective data type. Perhaps provided by forcats? This makes it more difficult to shoot yourself in the foot.

implementation in progress in #3492

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )

( res1 <- tally(group_by(df,f,x, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   f [?]
#>   f         x     n
#>   <fct> <dbl> <int>
#> 1 1        1.     1
#> 2 1        2.     1
#> 3 1        4.     0
#> 4 2        1.     1
#> 5 2        2.     0
#> 6 2        4.     1
#> 7 3        1.     0
#> 8 3        2.     0
#> 9 3        4.     0
( res2 <- tally(group_by(df,x,f, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   x [?]
#>       x f         n
#>   <dbl> <fct> <int>
#> 1    1. 1         1
#> 2    1. 2         1
#> 3    1. 3         0
#> 4    2. 1         1
#> 5    2. 2         0
#> 6    2. 3         0
#> 7    4. 1         0
#> 8    4. 2         1
#> 9    4. 3         0

all.equal( res1, arrange(res2, f, x) )
#> [1] TRUE

all.equal( filter(res1, n>0), tally(group_by(df, f, x)) )
#> [1] TRUE
all.equal( filter(res2, n>0), tally(group_by(df, x, f)) )
#> [1] TRUE

Created on 2018-04-10 by the reprex package (v0.2.0).

As for whether complete() solves the issue - no, not really. Whatever summaries are being computed, their behaviors on empty vectors need to be preserved, not patched up after the fact. For example:

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
     group_by(x) %>%
     summarize(min=min(y), sum=sum(y), prod=prod(y))
# Should be:
#> x       min   sum  prod
#> 1         4     9    20
#> 2       Inf     0     1

sum and prod (and to a lesser extent, min) (and various other functions) have very well-defined semantics on empty vectors, and it's not great to have to come along afterwards with complete() and re-define those behaviors.

@kenahoo I'm not sure I understand. This is what you get with the current dev version. So the only thing that you don't get is the warning from min()

library(dplyr)

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
  group_by(x) %>%
  summarize(min=min(y), sum=sum(y), prod=prod(y))
#> # A tibble: 2 x 4
#>   x       min   sum  prod
#>   <fct> <dbl> <int> <dbl>
#> 1 1         4     9    20
#> 2 2       Inf     0     1

min(integer())
#> Warning in min(integer()): no non-missing arguments to min; returning Inf
#> [1] Inf
sum(integer())
#> [1] 0
prod(integer())
#> [1] 1

Created on 2018-05-15 by the reprex package (v0.2.0).

@romainfrancois Oh cool, I didn't realize you were already so far along on this implementation. Looks great!

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
