Data.table: rolling functions, rolling aggregates, sliding window, moving average

Created on 21 Apr 2018  ·  39 Comments  ·  Source: Rdatatable/data.table

Creating this issue to gather requirements in a single place and refresh ~4-year-old discussions; it covers the _rolling functions_ feature (also known as _rolling aggregates_, _sliding window_, or _moving average_/_moving aggregates_).

rolling functions

  • [x] rollmean
  • [x] rollsum
  • [ ] rollmin
  • [ ] rollmax
  • [ ] rollmedian
  • [ ] rollprod
  • [ ] rollsd
  • [ ] rollvar
  • [ ] rollrank
  • [x] rollapply (user provided FUN)
  • [ ] rollregression (highly demanded)
  • [ ] rollcor?
  • [ ] rollcov?

features

  • [x] multiple columns at once
  • [x] multiple windows at once
  • [x] multiple columns and multiple windows at once
  • [x] atomic vectors input and single window returns atomic vectors
  • [x] various length list of vectors
  • [x] align: left/center/right
  • [x] handling NAs
  • [x] fill constant
  • [x] long vector support
  • [ ] partial window support (if needed can be found in ea766f2499034cedf6abf872440268b78979147c)
  • [x] adaptive window support
  • [x] use openmp to parallelize calculation of multiple columns/windows
  • [x] rounding error correction
  • [x] timing in verbose mode from parallel region (blocked by ~#3422~, #3423)
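For the "rounding error correction" item, one standard remedy for drifting rolling sums is compensated (Kahan) summation; the C sketch below only illustrates the technique and is not necessarily the correction strategy data.table actually implements.

```c
/* Kahan (compensated) summation: carries a running error term so that
   low-order bits lost in each addition are fed back in, keeping long
   sum chains (as in rolling windows) from drifting. Illustrative only;
   not necessarily what data.table does internally. */
double kahan_sum(const double *x, int n) {
    double s = 0.0, c = 0.0;      /* running sum and compensation */
    for (int i = 0; i < n; i++) {
        double yv = x[i] - c;     /* re-inject previously lost error */
        double t  = s + yv;
        c = (t - s) - yv;         /* error lost in this addition */
        s = t;
    }
    return s;
}
```

A plain left-to-right sum of `{1e16, 1.0, 1.0}` loses both `1.0`s to rounding; the compensated version recovers them.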
feature request


All 39 comments

Proposed rollmean implementation, simplified.

x = data.table(v1=1:5, v2=1:5)
k = c(2, 3)
i - a single column
j - a single window size
m - integer index of the current row
w - running sum of the current rolling window
r - the answer for each i, j






for i in x                       # each column
  for j in k                     # each window size
    r = NA_real_                 # answer vector
    w = 0                        # running window sum
    for m in 1:length(i)
      w = w + i[m]               # add entering value
      if (m > j) w = w - i[m-j]  # drop leaving value
      if (m >= j) r[m] = w / j   # full window: emit mean
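A compilable C version of the pseudocode above; this is a sketch of the single-pass ("online") window sum, not the actual data.table source:

```c
#include <math.h>   /* for NAN */

/* Rolling mean of x (length nx) over a window of k observations.
   One pass: add the entering value, drop the leaving one, and emit
   the mean once a full window is available; earlier positions are
   filled with NAN (fill=NA). Sketch only, not data.table's C code. */
void roll_mean(const double *x, int nx, int k, double *ans) {
    double w = 0.0;                    /* running sum of current window */
    for (int m = 0; m < nx; m++) {
        w += x[m];                     /* entering observation */
        if (m >= k) w -= x[m - k];     /* leaving observation */
        ans[m] = (m >= k - 1) ? w / k : NAN;
    }
}
```

The same structure works for any aggregate with an invertible update: rollsum drops the division, and a rollprod would use `*` and `/` (with care around zeros); rolling min/max have no cheap inverse and need a different algorithm.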

yes, and many more rolling functions follow the same basic idea (including
rolling standard deviation/any expectation-based moment, and any function
like rollproduct that uses the invertible * instead of + to aggregate within
the window).

I always envisioned rolling window functionality as grouping the dataset into multiple overlapping groups (windows). Then the API would look something like this:

DT[i, j,
   by = roll(width=5, align="center")]

Then if j contains, say, mean(A), we can internally replace it with rollmean(A) -- exactly like we are doing with gmean() right now. Or j can contain an arbitrarily complicated functionality (say, run a regression for each window), in which case we'd supply .SD data.table to it -- exactly like we do with groups right now.

This way there's no need to introduce 10+ new functions, just one. And it feels data.table-y in spirit too.

yes, agree


@st-pasha interesting idea, and it looks data.table-y in spirit, but it would impose many limitations and isn't really appropriate for this category of functions.


  • how to perform rollmean by group?
    DT[, rollmean(V1, 3), by=V2]
  • how to calculate different window sizes for different columns?
    DT[, .(rollmean(V1, 3), rollmean(V2, 100))]
  • how to calculate rollmean outside of [.data.table, as we now allow for shift?
    rollmean(rnorm(10), 3)
  • how to support queries like
    DT[, .(rollmean(list(V1, V2), c(5, 20)), rollmean(list(V2, V3), c(10, 30)))]
  • how to call mean and rollmean in the same j call?
    DT[, .(rollmean(V1, 3), mean(V1)), by=V2]

Usually when using by we aggregate data down to a smaller number of rows, while rolling functions always return a vector of the same length as the input. These types of functions in SQL have an API in the following style:

SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename;

You can still combine it with GROUP BY as follows:

SELECT AVG(value) OVER (ROWS BETWEEN 99 PRECEDING AND CURRENT ROW)
FROM tablename
GROUP BY group_columns;

So in SQL those functions stay in SELECT, which corresponds to j in DT.
In DT we could achieve the same with:

DT[, rollmean(value, 100)]
DT[, rollmean(value, 100), group_columns]

Rolling functions fall into the same category of functions as shift, which also returns the same number of rows as its input.
Shift in SQL looks like:

SELECT LAG(value, 1) OVER ()
FROM tablename;

mean and rollmean are not just different functions, they are different categories of functions. One is meant to aggregate according to groups; the other does not aggregate at all. This is easily visible in SQL, where we don't use GROUP BY for rolling-type functions but we do need GROUP BY for aggregates like mean (eventually getting the grand total when the grouping clause is not present).
I don't see strong reasoning to apply the same optimization rules as we do for mean, especially when it doesn't really fit the use case, and all that just for the sake of data.table-y spirit. The current proposal is data.table-y in spirit too; it can easily be combined with :=, same as shift. It just adds a set of new functions currently not available in data.table.

@jangorecki Thanks, these are all valid considerations. Of course different people have different experiences, and different views as to what should be considered "natural".

It is possible to perform rollmean by group: this is just a 2-level grouping: DT[, mean(V1), by=.(V2, roll(3))]. However, I don't see how to specify different window sizes for different columns with my syntax...

I must admit I have never seen SQL syntax for rolling joins before. It's interesting that they use a standard aggregator such as AVG yet apply the windowing specification to it. Looking at the Transact-SQL documentation, there are some interesting ideas there, for example the distinction between logical/physical row selection. They do allow different "OVER" operators on different columns; however, in all the examples they give, it is the same OVER clause repeated multiple times. This suggests that this use case is much more common, and hence using a single roll() group would result in less repetition.

Also, this SO question provides an interesting insight why the OVER syntax was introduced in SQL at all:

You can use GROUP BY SalesOrderID. The difference is, with GROUP BY you can only have the aggregated values for the columns that are not included in GROUP BY. In contrast, using windowed aggregate functions instead of GROUP BY, you can retrieve both aggregated and non-aggregated values. That is, although you are not doing that in your example query, you could retrieve both individual OrderQty values and their sums, counts, averages etc. over groups of same SalesOrderIDs.

So it appears that the syntax is designed to circumvent the limitation of standard SQL where group-by results could not be combined with unaggregated values (i.e. selecting both A and mean(A) in the same expression). However data.table does not have such a limitation, so it has more freedom in its choice of syntax.


Now, if we want to really get ahead of the curve, we need to think in a broader perspective: what are the "rolling" functions, what are they used for, how can they be extended, etc. Here's my take on this, coming from a statistician's point of view:

"Rolling mean" function is used to smooth some noisy input. Say, if you have observations over time and you want to have some notion of "average quantity", which would nevertheless vary over time although very slowly. In this case "rolling mean over last 100 observations" or "rolling mean over all previous observations" can be considered. Similarly, if you observe certain quantity over a range of inputs, you may smooth it out by applying "rolling mean over ±50 observations".

  • So, the first extension is to look at "smooth windows": imagine a mean over past observations where the further an observation in the past, the less its contribution is. Or an average of nearby observations over a Gaussian kernel.
  • Second are adaptive windows. For example, if you have a noisy input defined over an interval [0, 1], then smoothing it using a ±N window produces a biased result near the edges. An unbiased estimator would adapt the shape of the window based on the distance from the edge.
  • Resample smoothing: the restriction to produce exactly as many observations as in the source data is too limiting. If you think of your data as a noisy observations of some underlying function, then it is perfectly reasonable to ask to compute the smoothed values of that function on a mesh that is coarser / finer than the original input. Or perhaps the original data is spaced irregularly and you want to resample it onto a regular grid.
  • Jackknife: for each observation you want to compute average/regression over all observations except the current.
  • K-fold split: view data as multiple groups where each group excludes only a small portion of the observations.

All of these can be implemented as extended grouping operators, with rolling windows being just one of the elements on this list. That being said, I don't see why we can't have it both ways.

I must admit I have never seen SQL syntax for rolling joins before.

I assume you mean rolling functions, issue has nothing to do with rolling joins.

They do allow different "OVER" operators on different columns, however in all examples they give, it is the same OVER clause repeated multiple times. So it suggests that this use-case is much more common, and hence using a single roll() group would result in less repetition.

It is just a matter of use case: if you are calling the same OVER() many times, you may find it more performant to use GROUP BY, build a lookup table, and re-use it in other queries. Whatever the examples show, repeating OVER() is required to retain the feature of locality for each measure provided. My use cases from Data Warehouses were not as simple as those from the Microsoft docs.

In contrast, using windowed aggregate functions instead of GROUP BY, you can retrieve both aggregated and non-aggregated values.

In data.table we do := and by in one query to achieve it.

So it appears that the syntax is designed to circumvent the limitation of standard SQL where group-by results could not be combined with unaggregated values (i.e. selecting both A and mean(A) in the same expression). However data.table does not have such a limitation, so it has more freedom in its choice of syntax.

It isn't so much a limitation of SQL as simply the design of GROUP BY: it aggregates, the same way that our by aggregates. A new API was required to cover the new window functionalities. Grouping for an SQL window function can be provided per function call using FUN() OVER (PARTITION BY ...), where _partition by_ is like local grouping for a single measure. So to achieve the flexibility of SQL we would need to use j = mean(V1, roll=5) or j = over(mean(V1), roll=5), keeping that API in j. Still, this approach would not support all the use cases mentioned above.

you may smooth it out by applying "rolling mean over ±50 observations".

This is what align argument is used for.

So, the first extension is to look at "smooth windows": imagine a mean over past observations where the further an observation in the past, the less its contribution is. Or an average of nearby observations over a Gaussian kernel.

There are many variants (a virtually unlimited number) of moving averages; the most common smoothing window function (other than rollmean/SMA) is the exponential moving average (EMA). Which should be included, and which not, is not trivial to decide; it is best to make that decision according to the feature requests that come from users, and so far none like this has been requested.
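For concreteness, the EMA mentioned above is a one-line recurrence per element. This C sketch uses a hypothetical helper signature for illustration, not a proposed data.table API:

```c
/* Exponential moving average: ans[i] = alpha*x[i] + (1-alpha)*ans[i-1].
   Unlike a fixed-width rolling mean, every past observation contributes,
   with geometrically decaying weight; alpha in (0,1] sets the decay.
   Hypothetical helper for illustration only. */
void ema(const double *x, int nx, double alpha, double *ans) {
    if (nx <= 0) return;
    ans[0] = x[0];                     /* seed with the first observation */
    for (int i = 1; i < nx; i++)
        ans[i] = alpha * x[i] + (1.0 - alpha) * ans[i - 1];
}
```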

All of these can be implemented as extended grouping operators, with rolling windows being just one of the elements on this list.

Surely they can, but if you look at SO, and at the issues created in our repo, you will see that those few rolling functions here are responsible for 95+% of the requests from users. I am happy to work on EMA and other MAs (although I am not sure data.table is the best place for those), but as a separate issue. Some users, me included, have been waiting for just a simple moving average in data.table for 4 years already.

Here's my take on this, coming from a statistician's point of view

My point of view comes from Data Warehousing (where I used window functions at least once a week) and price trend analysis (where I used tens of different moving averages).

A rollmean draft is pushed to the roll branch. I found that most other packages implementing rolling mean cannot deal well with na.rm=FALSE when NAs are present in the input. This implementation handles NA consistently with mean, which imposes some extra overhead because of ISNAN calls. We could expose a faster but less safe version of the API for users who are sure there are no NAs in the input.
PR is in #2795

@mattdowle answering questions from PR

Why are we doing this inside data.table? Why are we integrating it instead of contributing to existing packages and using them from data.table?

  1. There were 3 different issues created asking for that functionality in data.table, plus multiple SO questions tagged data.table. Users expect this to be in the scope of data.table.
  2. data.table fits perfectly for time-series data, and rolling aggregates are a pretty useful statistic there.

my guess is it comes down to syntax (features only possible or convenient if built into data.table; e.g. inside [...] and optimized) and building data.table internals into the rolling function at C level; e.g. froll* should be aware and use data.table indices and key. If so, more specifics on that are needed; e.g. a simple short example.

For me personally it is about speed and the lack of a chain of dependencies, nowadays not easy to achieve.
Key/indices could be useful for frollmin/frollmax, but it is unlikely that a user will create an index on a measure variable, and we haven't made this optimization for (non-rolling) min/max yet. I don't see much sense in GForce optimization because the allocated memory is not released after the roll* call but returned as the answer (as opposed to non-rolling mean, sum, etc.).

If there is no convincing argument for integrating, then we should contribute to the other packages instead.

I listed some above; if you are not convinced, I recommend you put the question to data.table users, ask on twitter, etc. to check the response. This feature has been requested for a long time, and by many users. If the response doesn't convince you, then you can close this issue.

I found that sparklyr can support rolling functions very well on very large datasets.

@harryprince could you shed a bit more light by providing example code of how you do it in sparklyr?
According to the "Window functions" dplyr vignette:

Rolling aggregates operate in a fixed width window. You won’t find them in base R or in dplyr, but there are many implementations in other packages, such as RcppRoll.

AFAIU you use a custom Spark API via sparklyr for which the dplyr interface is not implemented, correct?

This issue is about rolling aggregates; other "types" of window functions have been in data.table for a long time.

Providing some example so we can compare (in-memory) performance vs sparklyr/SparkR would also be helpful.

It just occurred to me that this question:

how to calculate different window sizes for different columns?

has in fact a broader scope, and does not apply to rolling functions only.

For example, it seems perfectly reasonable to ask how to select the average product price by date, then by week, and then maybe by week+category, all within the same query. If we were ever to implement such functionality, a natural syntax for it could be

DT[, .( mean(price, by=date), 
        mean(price, by=week), 
        mean(price, by=c(week, category)) )]

Now, if this functionality was already implemented, then it would have been a simple leap from there to rolling means:

DT[, .( mean(price, roll=5), 
        mean(price, roll=20), 
        mean(price, roll=100) )]

Not saying that this is unequivocally better than rollmean(price, 5) -- just throwing in some alternatives to consider...

@st-pasha

how to select the average product price by date, and then by week, and then maybe by week+category -- all within the same query.

AFAIU this is already possible using ?groupingsets, but not hooked into [.data.table yet.

@jangorecki , @st-pasha , and Co. -- Thanks for all your work on this! I'm curious why partial window support was removed from the scope; is there any potential for that functionality to make it back onto the roadmap? It would come in handy for me sometimes, and would fill a functionality gap that to my knowledge hasn't been filled by either zoo or RcppRoll.

This Stack Overflow Question is a good example of a rolling application that could benefit from a partial = TRUE argument.

@msummersgill Thanks for the feedback. In the first post I explicitly linked the commit sha where the partial window feature code can be found. The implementation that was there was later removed to reduce the complexity of the code; it also imposed a small performance cost even when the feature was not used. This feature can (and probably should) be implemented the other way around: first compute the complete windows as-is, then just fill up the missing partial windows using an extra loop over 1:window_size, so the overhead of the feature is only noticeable when you use it. Nevertheless, we do provide that functionality via the adaptive argument, where the partial feature is just a special case of an adaptive rolling mean. There is an example of how to achieve partial using adaptive in the ?froll manual. Pasting it here:

d = as.data.table(list(1:6/2, 3:8/4))
an = function(n, len) c(seq.int(n), rep(n, len-n))
n = an(3, nrow(d))
frollmean(d, n, adaptive=TRUE)

Of course it will not be as efficient as a non-adaptive rolling function that uses an extra loop to fill up just the partial windows.
AFAIK zoo has a partial feature.
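The "extra loop" idea described above can be sketched as a post-pass: after the standard rolling mean has left the first k-1 positions as NA, overwrite them with means over the observations seen so far. An assumed helper for illustration, not the actual implementation:

```c
/* Fill the partial windows at the start of a rolling-mean answer:
   position m (0-based, m < k-1) gets the mean of x[0..m]. Because this
   runs only when partial=TRUE is requested, the regular code path pays
   no overhead. Sketch under an assumed layout; not data.table source. */
void fill_partial(const double *x, int k, double *ans) {
    double w = 0.0;                    /* running sum of leading values */
    for (int m = 0; m < k - 1; m++) {
        w += x[m];
        ans[m] = w / (m + 1);          /* mean over the first m+1 values */
    }
}
```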

Do you guys have any plan of adding rolling regression functions to data.table?

@waynelapierre if there is demand for that, then yes. You have my +1

thanks, this is great. Just one question though. I only see simple rolling aggregates, like a rolling mean or rolling median. Are you also implementing more refined rolling functions, such as rolling whole DT data.frames? Say, create a rolling DT using the last 10 obs and run an lm regression on it.

Thanks!

@randomgambit I would say it is out of scope, unless there is high demand for it. It wouldn't be very difficult to make it faster than base R/zoo just by handling the nested loop in C. But we should try to implement it using an "online" algorithm, to avoid the nested loop. This is more tricky, and since we could eventually do that for any statistic, we have to cut off those statistics at some point.
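For the rolling lm case specifically, an "online" algorithm exists for simple (one-predictor) regression: maintain window sums of x, y, x*y and x², and the slope falls out in O(1) per row. A hedged C sketch (hypothetical function, not a data.table API):

```c
#include <math.h>   /* for NAN */

/* Rolling slope of y ~ x over a window of k rows, via running sums:
   slope = (k*Sxy - Sx*Sy) / (k*Sxx - Sx*Sx). One pass, no nested loop;
   positions before the first full window get NAN. Note the subtracted
   running sums can accumulate rounding error on long inputs. */
void roll_slope(const double *x, const double *y, int n, int k, double *ans) {
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i]; sxy += x[i] * y[i]; sxx += x[i] * x[i];
        if (i >= k) {                  /* drop the row leaving the window */
            sx -= x[i-k]; sy -= y[i-k];
            sxy -= x[i-k] * y[i-k]; sxx -= x[i-k] * x[i-k];
        }
        ans[i] = (i >= k - 1)
            ? (k * sxy - sx * sy) / (k * sxx - sx * sx)
            : NAN;
    }
}
```

Multiple predictors would instead need rank-one updates/downdates of the X'X matrix, which is where the extra complexity referred to above comes from.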

@jangorecki interesting thanks. That means I will keep using tsibble to embed... DATA.TABLES in a tibble! mind blown :D

Tried to use frollmean to calculate a nonparametric "logistic curve" which shows P[y | x] for binary y using nearest neighbors of x. I was surprised that y stored as logical was not cast automatically to integer:

DT = data.table(x = rnorm(1000), y = runif(1000) > .5)
DT[order(x), .(x, p_y = frollmean(y, 50L))]

Error in froll(fun = "mean", x = x, n = n, fill = fill, algo = algo, align = align, :
x must be of type numeric

An example of how vectorized x/n arguments can impact performance.
https://github.com/AdrianAntico/RemixAutoML/commit/d8370712591323be01d0c66f34a70040e2867636#r34784427
Fewer loops, code easier to read, much faster (10x-36x speedup).

frollapply ready: https://github.com/Rdatatable/data.table/pull/3600

    ### fun             mean     sum  median
    # rollfun          8.815   5.151  60.175
    # zoo::rollapply  34.373  27.837  88.552
    # zoo::roll[fun]   0.215   0.185      NA
    # frollapply       5.404   1.419  56.475
    # froll[fun]       0.003   0.002      NA

hi guys, will the user-defined FUN passed to frollapply be allowed to return an arbitrary R object or data.frame (data.table)? Could x passed to frollapply be a data.table of character columns, not coerced to numeric, so that FUN could operate on labels and frollapply return a list? Then we could do rolling regression, or rolling testing such as Benford's testing or summaries on labels.

It is always useful to provide a reproducible example. To clarify... in such a scenario you would like frollapply(dt, 3, FUN) to return a list of length nrow(dt) where each list element is the data.table returned by FUN(dt[window])?
frollapply(x=dt, n=3, FUN=FUN)[[3]] equals FUN(dt[1:3])
frollapply(x=dt, n=3, FUN=FUN)[[4]] equals FUN(dt[2:4])
is that correct? @jerryfuyu0104

Currently we support multiple columns passed to the first argument, but we process them separately, looping. We would probably need some extra argument, say multi.var=FALSE, which when set to TRUE would not loop over x (as it does now: list(FUN(x[[1]]), FUN(x[[2]]))) but pass all columns at once as FUN(x).

any update for this?

I second that previous request.

Furthermore, would it be possible to support a "partial" argument to allow for partial windows?

@eliocamp could you elaborate on what a partial window is?

@eliocamp it would be possible to support a "partial" argument. You may know this already, but support for this functionality is already there via the adaptive=TRUE argument; see the examples for details.

It would mean computing the function from the beginning through the end, instead of from the half-window point.
For example, for a rolling mean of width 11, the first element returned would be the mean of elements 1 through 6; the second, the mean of the 1st through 7th; and so on.

@jangorecki oh, thanks, I didn't know that! I'll check it out.

Agree, we need a partial argument, not just for convenience but also for speed; adaptive=TRUE adds overhead.
And yes, we also need rolling regression, so supplying multiple variables and rolling over them at once, not each one separately.
There is no update on the status of those.

I'd love to help but my C++ skills are utterly non-existent. :sweat: Do you think it might be suitable for complete newbies?

We don't code in C++ but in C. Yes, it is a good place to start; I did exactly that with frollmean.

I looked at the code and it seems daunting. But I'll update you in any case.

But now, for yet another request: frollmean(.SD) should preserve names. More generally, froll* should preserve names if the input is list-like with names.

As a frequent user of data.table, I find it extremely useful to have "time aware" features, such as those currently offered in the tsibble package. Unfortunately, that package is developed around dplyr. I wonder if a data.table implementation would be possible. The window functions proposed in this issue are a subset of those features.

@ywhcuhk Thanks for the feedback; I was actually thinking this issue was already asking for too much. Most of that is well covered by the still-lightweight package roll, which is very fast. As for the other features, I suggest creating a new issue for each feature you are interested in, so the decision on whether we want to implement/maintain it can be made for each separately. Just from looking at the readme of tsibble I don't see anything new it offers...
Its title is "Tidy Temporal Data Frames" but it doesn't even seem to offer temporal joins.

Thank you @jangorecki for the response. Maybe it's a context-dependent issue. The data structure I deal with most frequently is known as "panel data", with an ID and time. If the program is "aware" of this data feature, a lot of operations, especially time-series operations, are made very easy. For someone who knows STATA, it's the operations based on tsset and xtset, such as lead, lag, fill gap, etc. I think the "index" in data.table could be enhanced in some way to enable such operations.

Of course, these operations can be done with data.table functions like shift and by. I just thought the index in data.table has a lot of potential to be explored. I agree this should belong to a different issue, but I don't know how to move it without losing the above discussions ...
