Some keywords: GROUPING SETS, ROLLUP, CUBE, GROUPING
Some references: postgres, Oracle, SQL Server, groupings combined with arbitrary functions
_Grouping sets_ and friends are useful for pre-calculating several aggregation levels at once, which is a common need. The current API for this in data.table is not very friendly; see Aggregating sub totals and grand totals with data.table.
In the case of _rollup_, the aggregations are computed for the provided `by` columns from top to bottom. See the description from the postgres manual and the example code below.
ROLLUP ( e1, e2, e3, ... )
is equivalent to:
GROUPING SETS (
    ( e1, e2, e3, ... ),
    ...
    ( e1, e2 ),
    ( e1 ),
    ( )
)
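To make the expansion concrete, here is a minimal base R sketch that lists the grouping sets a rollup over some columns corresponds to. `rollup_sets` is a hypothetical helper name used only for illustration; it is not part of data.table or any other package.

```r
# Hypothetical helper: list the grouping sets that ROLLUP(by...) expands to,
# from the full column list down to the empty set (the grand total).
rollup_sets <- function(by) {
  lapply(length(by):0, function(n) by[seq_len(n)])
}

sets <- rollup_sets(c("e1", "e2", "e3"))
str(sets)  # four sets: all three columns, two, one, then none
```

Each set is a prefix of the input, which is why the order of the `by` columns matters for a rollup.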
I wonder if there could be a cheap speed-up of that process, since this is potentially a heavy computing task. It would be great to have the computation of _grouping sets_ implemented in C, so that _rollup_, _cube_ and other features could be built on top of _grouping sets_ more easily in R while still running at full speed.
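As a rough illustration of that layering, the sketch below builds rollup and cube purely as lists of grouping sets and feeds them to one generic aggregation loop. It is an assumption-laden sketch, not the proposed C implementation: `naive_groupingsets` is a hypothetical helper, the aggregate is hard-coded to `mean(mpg)` on `mtcars`, and it runs one grouped query per set with no speed-up at all.

```r
library(data.table)

# Hypothetical helper: aggregate once per grouping set and stack the results.
# Columns missing from a set are filled with NA by rbindlist(fill = TRUE).
naive_groupingsets <- function(dt, sets) {
  rbindlist(lapply(sets, function(s) {
    if (length(s)) {
      dt[, .(agg = mean(mpg)), by = s]  # sub total for this set
    } else {
      dt[, .(agg = mean(mpg))]          # empty set: grand total
    }
  }), use.names = TRUE, fill = TRUE)
}

by <- c("vs", "am")

# ROLLUP: every prefix of `by`, down to the empty set.
rollup_sets <- lapply(length(by):0, function(n) by[seq_len(n)])

# CUBE: every subset of `by`, enumerated as keep/drop flags per column.
keep <- expand.grid(rep(list(c(TRUE, FALSE)), length(by)))
cube_sets <- lapply(seq_len(nrow(keep)), function(i) by[unlist(keep[i, ])])

res <- naive_groupingsets(as.data.table(mtcars), cube_sets)
```

Only the list of sets differs between the two features, so a fast grouping sets routine would benefit both.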
Answers to update when closed:
library(plyr)
grp.cols <- c("vs", "am", "gear", "carb", "cyl")
# plyr: one ddply() call per prefix of grp.cols, stacked with rbind.fill()
plyr.r = do.call(
  rbind.fill,
  lapply(seq_along(grp.cols), function(x) ddply(mtcars, grp.cols[1:x], summarize, agg=mean(mpg)))
)
library(data.table) # 1.9.7+
# data.table: the same sub totals, plus a grand total, in one rollup() call
dt.r = rollup(as.data.table(mtcars), j = .(agg=mean(mpg)), by=grp.cols)
all.equal(
  as.data.table(plyr.r),
  dt.r[-.N], # exclude grand total, not present in BrodieG answer
  ignore.row.order = TRUE,
  ignore.col.order = TRUE
)
#[1] TRUE
# install.packages("data.table", type = "source", repos = "https://Rdatatable.github.io/data.table")
library(data.table)
set.seed(1)
DT = data.table(
  group=sample(letters[1:2],100,replace=TRUE),
  year=sample(2010:2012,100,replace=TRUE),
  v=runif(100))
cube(DT, mean(v), by=c("group","year"))
# group year V1
#1: a 2011 0.4176346
#2: b 2010 0.5231845
#3: b 2012 0.4306871
#4: b 2011 0.4997119
#5: a 2012 0.4227796
#6: a 2010 0.2926945
#7: NA 2011 0.4463616
#8: NA 2010 0.4278093
#9: NA 2012 0.4271160
#10: a NA 0.3901875
#11: b NA 0.4835788
#12: NA NA 0.4350153
cube(DT, mean(v), by=c("group","year"), id=TRUE)
# grouping group year V1
#1: 0 a 2011 0.4176346
#2: 0 b 2010 0.5231845
#3: 0 b 2012 0.4306871
#4: 0 b 2011 0.4997119
#5: 0 a 2012 0.4227796
#6: 0 a 2010 0.2926945
#7: 2 NA 2011 0.4463616
#8: 2 NA 2010 0.4278093
#9: 2 NA 2012 0.4271160
#10: 1 a NA 0.3901875
#11: 1 b NA 0.4835788
#12: 3 NA NA 0.4350153
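Reading the output above, the `grouping` column looks like a bitmask in the style of SQL's GROUPING_ID: each `by` column gets one bit, with the first column as the most significant bit, and a bit is set when that column has been aggregated out. The base R sketch below reproduces the values shown under that assumption; `grouping_id` is a hypothetical helper, not a data.table function.

```r
# Hypothetical helper: GROUPING_ID-style bitmask for one grouping set.
# Each `by` column contributes one bit (first column = most significant bit);
# a bit is 1 when that column is absent from the set, i.e. aggregated out.
grouping_id <- function(by, set) {
  weights <- 2^(rev(seq_along(by)) - 1)
  sum(weights[!(by %in% set)])
}

by <- c("group", "year")
ids <- c(
  both       = grouping_id(by, c("group", "year")),  # detail rows
  group_only = grouping_id(by, "group"),             # year aggregated out
  year_only  = grouping_id(by, "year"),              # group aggregated out
  total      = grouping_id(by, character(0))         # grand total
)
```

Such an id is what lets you tell a sub total row apart from a detail row whose grouping column happens to contain a genuine NA.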
Some other questions can get new answers also:
+1
library(data.table) # version 1.10.5 required
dt = data.table(ggplot2::diamonds)
groupingsets(dt, c(lapply(.SD, mean), list(COUNT = .N)),
by = names(dt)[2:4], .SDcols = 5:10, id = FALSE,
sets = as.list(names(dt)[2:4]))
#          cut color clarity    depth    table    price        x        y        z COUNT
# 1:     Ideal    NA      NA 61.70940 55.95167 3457.542 5.507451 5.520080 3.401448 21551
# 2:   Premium    NA      NA 61.26467 58.74610 4584.258 5.973887 5.944879 3.647124 13791
# 3:      Good    NA      NA 62.36588 58.69464 3928.864 5.838785 5.850744 3.639507  4906
# 4: Very Good    NA      NA 61.81828 57.95615 3981.760 5.740696 5.770026 3.559801 12082
# 5:      Fair    NA      NA 64.04168 59.05379 4358.758 6.246894 6.182652 3.982770  1610
# 6:        NA     E      NA 61.66209 57.49120 3076.752 5.411580 5.419029 3.340689  9797
# 7:        NA     I      NA 61.84639 57.57728 5091.875 6.222826 6.222730 3.845411  5422
# 8:        NA     J      NA 61.88722 57.81239 5323.818 6.519338 6.518105 4.033251  2808
# 9:        NA     H      NA 61.83685 57.51781 4486.669 5.983335 5.984815 3.695965  8304
#10:        NA     F      NA 61.69458 57.43354 3724.886 5.614961 5.619456 3.464446  9542
#11:        NA     G      NA 61.75711 57.28863 3999.136 5.677543 5.680192 3.505021 11292
#12:        NA     D      NA 61.69813 57.40459 3169.954 5.417051 5.421128 3.342827  6775
#13:        NA    NA     SI2 61.77217 57.92718 5063.029 6.401370 6.397826 3.948478  9194
#14:        NA    NA     SI1 61.85304 57.66254 3996.001 5.888383 5.888256 3.639845 13065
#15:        NA    NA     VS1 61.66746 57.31515 3839.455 5.572178 5.581828 3.441007  8171
#16:        NA    NA     VS2 61.72442 57.41740 3924.989 5.657709 5.658859 3.491478 12258
#17:        NA    NA    VVS2 61.66378 57.02499 3283.737 5.218454 5.232118 3.221465  5066
#18:        NA    NA    VVS1 61.62465 56.88446 2523.115 4.960364 4.975075 3.061294  3655
#19:        NA    NA      I1 62.73428 58.30378 3924.169 6.761093 6.709379 4.207908   741
#20:        NA    NA      IF 61.51061 56.50721 2864.839 4.968402 4.989827 3.061659  1790
This is just awesome. Makes working with pivot tables in Shiny way easier.