Pandas: Rolling window with step size

Created on 9 Feb 2017  ·  38 Comments  ·  Source: pandas-dev/pandas

Just a suggestion - extend rolling to support a rolling window with a step size, such as R's rollapply(by=X).

Code Sample

Pandas - inefficient solution (apply function to every window, then slice to get every second result)

import pandas
ts = pandas.Series(range(0, 40, 2))
ts.rolling(5).apply(max).dropna()[::2]

Suggestion:

ts = pandas.Series(range(0, 40, 2))
ts.rolling(window=5, step=2).apply(max).dropna()

Inspired by R (see rollapply docs):

require(zoo)
TS <- zoo(seq(0, 40, 2))
rollapply(TS, 5, FUN=max, by=2)

8 12 16 20 24 28 32 36 40

Labels: Enhancement, Needs Discussion, Numeric Window

Most helpful comment

"this could be done, but i would like to see a usecase where this matters."

Whatever project I worked on using pandas, I almost always missed this feature; it is useful every time you need to compute the apply only once in a while but still need good resolution inside each window.

All 38 comments

If you're using 'standard' functions, these are vectorized, and so v fast (ts.rolling(5).max().dropna()[::2]).

IIUC the saving here would come from only applying the function a fraction of the time (e.g. every nth value). But is there a case where that makes a practical difference?

this could be done, but i would like to see a use case where this matters. This would break the 'return same size as input' API as well. Though I don't think this is actually hard to implement (though it would involve a number of changes in the implementation). We use marginal windows (IOW, compute the window and, as you advance, drop off the points that are leaving and add the points that you are gaining). So we would still have to compute everything, but you just wouldn't output it.

Thanks for your replies!

IIUC the saving here would come from only applying the function a fraction of the time (e.g. every nth value). But is there a case where that makes a practical difference?

My use case is running aggregation functions (not just max) over some large timeseries dataframes - 400 columns, hours of data at 5-25Hz. I've also done a similar thing (feature engineering on sensor data) in the past with data up to 20kHz. Running 30 second windows with a 5 second step saves a big chunk of processing - e.g. at 25Hz with a 5s step it's 1/125th of the work, which makes the difference between it running in 1 minute or 2 hours.

I can obviously fall back to numpy, but it'd be nice if there was a higher level API for doing this. I just thought it was worth the suggestion in case others would find it useful too - I don't expect you to build a feature just for me!

you can try resampling to a coarser interval first, then rolling

something like

df = df.resample('30s')
df.rolling(..).max() (or whatever function)

Hey @jreback, thanks for the suggestion.

This would work if I was just running max on my data (resample needs a reduction function, otherwise it defaults to mean, right?):

df.resample('1s').max().rolling(30).max()

However I'd like to run my reduction function on 30 seconds of data, then move forward 1 second, and run it on the next 30 seconds of data, etc. The method above applies a function on 1 second of data, and then another function on 30 results of the first function.

Here's a quick example - running a peak-to-peak calculation twice obviously doesn't work:

import numpy as np
import pandas

# 10 minutes of data at 5Hz
n = 5 * 60 * 10
rng = pandas.date_range('1/1/2017', periods=n, freq='200ms')
np.random.seed(0)
d = np.cumsum(np.random.randn(n), axis=0)
s = pandas.Series(d, index=rng)

# Peak to peak
def p2p(d):
    return d.max() - d.min()

def p2p_arr(d):
    return d.max(axis=1) - d.min(axis=1)

def rolling_with_step(s, window, step, func):
    # See https://ga7g08.github.io/2015/01/30/Applying-python-functions-in-moving-windows/
    vert_idx_list = np.arange(0, s.size - window, step)
    hori_idx_list = np.arange(window)
    A, B = np.meshgrid(hori_idx_list, vert_idx_list)
    idx_array = A + B
    x_array = s.values[idx_array]
    idx = s.index[vert_idx_list + int(window/2.)]
    d = func(x_array)
    return pandas.Series(d, index=idx)

# Plot data
ax = s.plot(figsize=(12, 8), legend=True, label='Data')

# Plot resample then rolling (obviously does not work)
s.resample('1s').apply(p2p).rolling(window=30, center=True).apply(p2p).plot(ax=ax, label='1s p2p, roll 30 p2p', legend=True)

# Plot rolling window with step
rolling_with_step(s, window=30 * 5, step=5, func=p2p_arr).plot(ax=ax, label='Roll 30, step 1s', legend=True)

(Plot: the original data, the resample-then-rolling result, and the rolling-with-step result.)

@alexlouden from your original description I think something like

df.resample('5s').max().rolling('30s').mean() (or whatever reductions) is more in line with what you want

IOW, take whatever is in a 5s bin, then reduce it to a single point, then roll over those bins. This general idea is that you have lots of data that can be summarized at a short timescale, but you actually want the rolling of this at a higher level.

Hey @jreback, I actually want to run a function over 30 seconds of data, every 5 seconds. See the rolling_with_step function in my previous example. The additional step of max/mean doesn't work for my use case.

@jreback, there is a real need for the step function that hasn't been brought out in this discussion yet. I second everything that @alexlouden has described, but I would like to add more use cases.

Suppose that we are doing time-series analysis with input data sampled approximately every 3 to 10 milliseconds. We are interested in frequency-domain features. The first step in constructing them would be to find the Nyquist frequency. Suppose by domain knowledge we know that it is 10 Hz (once every 100 ms). That means we need the data to have a frequency of at least 20 Hz (once every 50 ms) if the features are to capture the input signal well. We cannot resample to a lower frequency than that. Ultimately, here are the computations we do:

df.resample('50ms').mean().rolling(window=32).aggregate(power_spectrum_coeff)

Here we chose a window size in multiples of 8; choosing 32 makes the window 1.6 seconds. The aggregate function returns the single-sided frequency-domain coefficients, without the first (mean) component (the fft output is symmetric, with the mean value at the 0th element). The following is a sample aggregate function:

def power_spectrum_coeff():
    def power_spectrum_coeff_(x):
        return np.fft.fft(x)[1 : int(len(x) / 2 + 1)]

    power_spectrum_coeff_.__name__ = 'power_spectrum_coeff'
    return power_spectrum_coeff_

Now, we would like to repeat this in a sliding window of, say, every 0.4 seconds or every 0.8 seconds. There is no point wasting computation by calculating the FFT every 50 ms and then slicing later. Further, resampling down to 400 ms is not an option, because 400 ms is just 2.5 Hz, which is much lower than the Nyquist frequency, and doing so would result in all information being lost from the features.
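To illustrate the kind of saving involved, here is a rough numpy-only sketch of that strided FFT computation (not part of the proposal; assumes numpy >= 1.20 for sliding_window_view, and the signal and parameters below are made up):

import numpy as np

fs, window, step = 20, 32, 8        # 20 Hz data, 1.6 s windows, evaluated every 0.4 s
x = np.random.randn(60 * fs)        # one minute of fake resampled data

# One row per evaluated window; the FFT is only computed at every 8th window position
frames = np.lib.stride_tricks.sliding_window_view(x, window)[::step]
coeffs = np.fft.rfft(frames, axis=1)[:, 1:]   # single-sided coefficients, mean (0 Hz) bin dropped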

That covers frequency-domain features, which have applications in many time-series related scientific experiments. However, even simpler time-domain aggregate functions such as standard deviation cannot be supported effectively by resampling.

Though I don't think this is actually hard to implement (though would involve a number of changes in the implementation). We use marginal windows (IOW, compute the window and as you advance, drop off the points that are leaving and add points that you are gaining). So still would have to compute everything, but you just wouldn't output it.

Having the 'step' parameter and being able to reduce actual computations by using it has to be the future goal of Pandas. If the step parameter only returns fewer points, then it's not worth doing, because we can slice the output anyhow. Perhaps given the work involved in doing this, we might just recommend all projects with these needs to use Numpy.

@Murmuria you are welcome to submit a pull-request to do this. It's actually not that difficult.

While I second the request for a step parameter in rolling(), I'd like to point out that it is possible to get the desired result with the base parameter in resample(), if the step size is an integer fraction of the window size. Using @alexlouden 's example:

pandas.concat([
    s.resample('30s', label='left', loffset=pandas.Timedelta(15, unit='s'), base=i).agg(p2p) 
    for i in range(30)
]).sort_index().plot(ax=ax, label='Solution with resample()', legend=True, style='k:')

We get the same result (note that the line extends by 30 sec. on both sides):
(Plot: the resample()-based workaround overlaid on the rolling-with-step result.)

This is still somewhat wasteful, depending on the type of aggregation. For the particular case of peak-to-peak calculation as in @alexlouden 's example, p2p_arr() is almost 200x faster because it rearranges the series to a 2-D matrix and then uses a single call to max() and min().

The step parameter in rolling would also allow using this feature without a datetime index. Is there anyone already working on it?

@alexlouden above said this:

I can obviously fall back to numpy, but it'd be nice if there was a higher level API for doing this.

Can @alexlouden or anyone else who knows please share some insight as to how to do this with numpy? From my research so far, it seems it is not trivial to do this either in numpy. In fact, there's an open issue about it here https://github.com/numpy/numpy/issues/7753

Thanks

Hi @tsando - did the function rolling_with_step I used above not work for you?

@alexlouden thanks, just checked that function and it seems to still depend on pandas (it takes a series as input and also uses the series index). I was wondering if there's a purely numpy approach to this. In the thread I mentioned, https://github.com/numpy/numpy/issues/7753, they propose a function which uses numpy strides, but those are hard to understand and translate into window and step inputs.

@tsando Here's a PDF of the blog post I linked to above - looks like the author has changed his Github username and hasn't put his site up again. (I just ran it locally to convert it to PDF).

My function above was me just converting his last example to work with Pandas - if you wanted to use numpy directly you could do something like this: https://gist.github.com/alexlouden/e42f1d96982f7f005e62ebb737dcd987

Hope this helps!
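For readers who just want the numpy part, here is a minimal sketch of the same index-matrix idea used in rolling_with_step above, without the pandas index handling (the helper name is made up, and this is not necessarily what the gist does):

import numpy as np

def rolling_with_step_np(x, window, step, func):
    # Build an (n_windows, window) index matrix and apply func along axis 1
    x = np.asarray(x)
    starts = np.arange(0, x.size - window + 1, step)
    idx = starts[:, None] + np.arange(window)[None, :]
    return func(x[idx], axis=1)

# e.g. peak-to-peak of 150-sample windows, evaluated every 5 samples
out = rolling_with_step_np(np.random.randn(3000), window=150, step=5, func=np.ptp)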

@alexlouden thanks! I just tried it on an array of shape (13, 1313) but it gave me this error:

(Screenshot of the resulting error message, not preserved here.)

"this could be done, but i would like to see a usecase where this matters."

Whatever project I worked on using pandas, I almost always missed this feature; it is useful every time you need to compute the apply only once in a while but still need good resolution inside each window.

I agree and support this feature too

Need it almost every time when dealing with time series; this feature would give much better control for generating time-series features for both visualization and analysis. Strongly support this idea!

agree and support this feature too

This would be very helpful to reduce computing time while still keeping a good window resolution.

Here is a sample solution, which could be further adjusted according to your particular target.

import numpy as np

def average_smoothing(signal, kernel_size, stride):
    # Mean of each kernel_size-sample window, advancing by stride samples
    sample = []
    start = 0
    end = kernel_size
    while end <= len(signal):
        sample.append(np.mean(signal[start:end]))
        start = start + stride
        end = end + stride
    return np.array(sample)
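For example, average_smoothing(np.arange(100), kernel_size=10, stride=5) yields 19 window means, one for each start position 0, 5, ..., 90.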

I agree and support this feature too. It seems to be stalled right now.

Calculating and then downsampling is not an option when you have TBs of data.

It would be very helpful in what I do as well. I have TBs of data where I need various statistics of non-overlapping windows to understand local conditions. My current "fix" is to just create a generator that slices the data frames and yields statistics. Would be very helpful to have this feature.

This feature is indeed a must have when time series are involved!

Agree, certainly need this feature added in. I'm trying to do running-window correlations between stock prices and have had to create my own function for it.

Can't believe such a basic feature isn't there yet!
When will this issue be solved?
Thanks

To contribute to 'further discussion':
My use case is to compute one min/max/median value per hour for a month of data with a resolution of 1 second. It's energy usage data and there are peaks for 1-2 seconds that I would lose with resampling. Other than that, resampling to e.g. 5 seconds/1 minute wouldn't change the fact that I still have to compute 4k/1k windows per day that need to be thrown away, rather than just being able to compute the needed 24 windows per day.

It would be possible to work around this by using groupby and so on, but that seems to be neither intuitive nor as fast as the rolling implementation (2 seconds for 2.5 million hour-long windows, with sorting). It's impressively fast and useful, but we really need a stride argument to fully utilize its power.
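For reference, a hedged sketch of that groupby-style workaround for non-overlapping hourly windows (the data below is illustrative; assumes a 1-second DatetimeIndex):

import numpy as np
import pandas as pd

idx = pd.date_range('2021-01-01', periods=30 * 24 * 3600, freq='1s')   # one month at 1 s resolution
s = pd.Series(np.random.randn(len(idx)), index=idx)

# One min/max/median per hour, without computing windows that would be thrown away
hourly = s.groupby(s.index.floor('1H')).agg(['min', 'max', 'median'])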

I took a look at the problem. This is relatively trivial; however, given the way the code is implemented, from a cursory look I think it'll require someone to slog through and manually edit all the rolling routines. None of them respect the window boundaries given by the indexer classes. If they did, both this request and #11704 would be easily solvable. In any case, I think it is manageable for anyone who wants to spend some time sprucing things up. I initiated a half-baked PR (expected to be rejected, just as an MVP) to demonstrate how I would tackle the problem.

Running:

import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(100),
    index=pd.date_range('2020/05/12 12:00:00', '2020/05/12 12:00:10', periods=100))

print('1s rolling window every 2s')
print(data.rolling('1s', step='2s').apply(np.mean))

data.sort_index(ascending=False, inplace=True)

print('1s rolling window every 500ms (and reversed)')
print(data.rolling('1s', step='500ms').apply(np.mean))

yields

1s rolling window every 2s
2020-05-12 12:00:00.000000000     4.5
2020-05-12 12:00:02.020202020    24.5
2020-05-12 12:00:04.040404040    44.5
2020-05-12 12:00:06.060606060    64.5
2020-05-12 12:00:08.080808080    84.5
dtype: float64
1s rolling window every 500ms (and reversed)
2020-05-12 12:00:10.000000000    94.5
2020-05-12 12:00:09.494949494    89.5
2020-05-12 12:00:08.989898989    84.5
2020-05-12 12:00:08.484848484    79.5
2020-05-12 12:00:07.979797979    74.5
2020-05-12 12:00:07.474747474    69.5
2020-05-12 12:00:06.969696969    64.5
2020-05-12 12:00:06.464646464    59.5
2020-05-12 12:00:05.959595959    54.5
2020-05-12 12:00:05.454545454    49.5
2020-05-12 12:00:04.949494949    44.5
2020-05-12 12:00:04.444444444    39.5
2020-05-12 12:00:03.939393939    34.5
2020-05-12 12:00:03.434343434    29.5
2020-05-12 12:00:02.929292929    24.5
2020-05-12 12:00:02.424242424    19.5
2020-05-12 12:00:01.919191919    14.5
2020-05-12 12:00:01.414141414     9.5
2020-05-12 12:00:00.909090909     4.5
dtype: float64

For implementation details take a look at the PR (or here: https://github.com/anthonytw/pandas/tree/rolling-window-step)

While I would have liked to spend more time to finish it up, I unfortunately have none left to tackle the grunt work of reworking all the rolling functions. My recommendation for anyone who wants to tackle this would be to enforce the window boundaries generated by the indexer classes and unify the rolling_*_fixed/variable functions. With start and end boundaries I don't see any reason they should be different, unless you have a function which does something special with non-uniformly sampled data (in which case that specific function would be better able to handle the nuance, so maybe set a flag or something).

Will this also work for a custom window using the get_window_bounds() approach?
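As a rough, version-dependent sketch of what the get_window_bounds() route can look like today (not the PR's implementation; the class name and step_size argument are made up), a custom BaseIndexer can emit empty windows for the rows to skip, which avoids the computation for those rows but still leaves NaNs in the output:

import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class SteppedIndexer(BaseIndexer):
    # Trailing fixed-size windows, evaluated only every step_size-th row;
    # skipped rows get an empty window (start == end) and therefore NaN.
    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        end = np.arange(1, num_values + 1, dtype=np.int64)
        start = np.clip(end - self.window_size, 0, None)
        skip = (np.arange(num_values) % self.step_size) != 0
        start[skip] = end[skip]          # empty window -> not evaluated
        return start, end

s = pd.Series(range(10))
print(s.rolling(SteppedIndexer(window_size=3, step_size=2), min_periods=1).max())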

Hi there, I also second the suggestion. This would be a really useful feature.

If you're using 'standard' functions, these are vectorized, and so v fast (ts.rolling(5).max().dropna()[::2]).

IIUC the saving here would come from only applying the function a fraction of the time (e.g. every nth value). But is there a case where that makes a practical difference?

I have just such an example here: https://stackoverflow.com/questions/63729190/pandas-resample-daily-data-to-annual-data-with-overlap-and-offset

Every Nth would be every 365th. The window size is variable over the lifetime of the program and the step is not guaranteed to be an integer fraction of the window size.

I basically need a set window size that steps by "# of days in the year it's looking at" which is impossible with every solution I've found for this issue so far.

I also have a similar use case, with the following context (adapted from a real, professional need):

  • I have a chronological dataframe with a timestamp column and a value column, which represents irregular events, like the timestamp of when a dog passed below my window and how many seconds it took her to pass by. I can have 6 events on a given day and then no event at all for the next 2 days
  • I would like to compute a metric (let's say the mean time spent by dogs in front of my window) over a rolling window of 365 days, which would roll every 30 days

As far as I understand, the dataframe.rolling() API allows me to specify the 365 days duration, but not the need to skip 30 days of values (which is a non-constant number of rows) to compute the next mean over another selection of 365 days of values.

Obviously, the resulting dataframe I expect will have a (much) smaller number of rows than the initial 'dog events' dataframe.
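Absent a native step, one hedged workaround for this case is to compute the full rolling result and then keep only one label every 30 days (wasteful in computation, but the output is small; the column names and data below are illustrative, and timestamps are assumed unique and sorted):

import numpy as np
import pandas as pd

# Illustrative irregular 'dog events': a timestamp and the seconds spent per event
rng = np.random.default_rng(0)
ts = pd.to_datetime('2020-01-01') + pd.to_timedelta(np.sort(rng.uniform(0, 3 * 365, 500)), unit='D')
df = pd.DataFrame({'timestamp': ts, 'value': rng.uniform(1, 30, 500)})

rolled = df.set_index('timestamp')['value'].rolling('365D').mean()    # trailing 365-day mean
marks = pd.date_range(rolled.index[0], rolled.index[-1], freq='30D')
result = rolled.reindex(marks, method='ffill')                         # one row every 30 days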

Just to gain more clarity about this request, here is a simple example.

If we have this Series:

In [1]: s = pd.Series(range(5))

In [2]: s
Out[2]:
0    0
1    1
2    2
3    3
4    4
dtype: int64

and we have a window size of 2 and a step size of 1. The first window at index 0 would be evaluated, the window at index 1 stepped over, the window at index 2 evaluated, etc.?

In [3]: s.rolling(2, step=1, min_periods=0).max()

Out[3]:
0    0.0
1    NaN # step over this observation
2    2.0
3    NaN # step over this observation
4    4.0
dtype: float64

Likewise, if we have this time-based Series

In [1]: s = pd.Series(range(5), index=pd.DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-06', '2020-01-09']))

In [2]: s
Out[2]:
2020-01-01    0
2020-01-02    1
2020-01-03    2
2020-01-06    3
2020-01-09    4
dtype: int64

and we have a window size of '3D' and step size of '3D'. Would this be the correct result?

In [3]: s.rolling('3D', step='3D', min_periods=0).max()

Out[3]:
2020-01-01    0.0       # evaluate this window
2020-01-02    NaN    # step over this observation (2020-01-01 + 3 days > 2020-01-02)
2020-01-03    NaN    # step over this observation (2020-01-01 + 3 days > 2020-01-03)
2020-01-06    3.0      # evaluate this window ("snap back" to this observation)
2020-01-09    4.0      # evaluate this window (2020-01-06 + 3 days = 2020-01-09)
dtype: float64

@mroeschke with regard to the first example ([3]), the results are not what I would expect. I assume this is a trailing window (e.g., at index=0 it would be the max of elements at -1 and 0, so just max([0])); then it should step forward "1" index, to index=0+step=1, and the next computation would be max([0,1]), then max([1,2]), etc. What it looks like you meant to have was a step size of two, so you would move from index=0 to index=0+2=2 (skipping index 1), and continue like that. In that case it's almost correct, but there should be no NaNs. While it may be "only" double the size in this case, in other cases it is substantial. For example, I have about an hour's worth of 500Hz ECG data for a patient; that's 1.8 million samples. If I wanted a 5-minute moving average every two minutes, that would be an array of 1.8 million elements with 30 valid computations and slightly less than 1.8 million NaNs. :-)

For indexing, step size = 1 is the current behavior, i.e., compute the feature of interest using data in the window, shift the window by one, then repeat. In this example, I want to compute the feature of interest using the data in the window, then shift by 60,000 indices, then repeat.

Similar remarks for the time. In this case, there might be some disagreement as to the correct way to implement this type of window, but in my opinion the "best"(TM) way is to start from time t0, find all elements in the range (t0-window, t0], compute the feature, then move by the step size. Throw away any windows that have fewer than the minimum number of elements (can be configurable, default to 1). That example is for a trailing window, but you can modify to fit any window configuration. This has the disadvantage of wasting time in large gaps, but gaps can be handled intelligently and even if you compute the naive way (because you're lazy like me) I've yet to see this matter in practice, since the gaps are usually not large enough to matter in real data. YMMV.

Maybe that's clearer? Take a look at my example + code above, that might explain it better.
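A minimal sketch of that trailing-time-window scheme in plain pandas (an illustrative helper, not the proposed rolling API; it loops in Python, so it only demonstrates the semantics, not the performance):

import pandas as pd

def rolling_time_step(s, window, step, func, min_count=1):
    # Evaluate func over trailing windows (t - window, t], advancing t by step;
    # windows with fewer than min_count observations are dropped.
    window, step = pd.Timedelta(window), pd.Timedelta(step)
    out_idx, out_val = [], []
    t = s.index[0]
    while t <= s.index[-1]:
        chunk = s[(s.index > t - window) & (s.index <= t)]
        if len(chunk) >= min_count:
            out_idx.append(t)
            out_val.append(func(chunk))
        t += step
    return pd.Series(out_val, index=pd.DatetimeIndex(out_idx))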

Thanks for the clarification @anthonytw. Indeed, looks like I needed to interpret step as "step to point".

As for the NaNs, I understand the sentiment to drop the NaNs from the output automatically, but as mentioned in https://github.com/pandas-dev/pandas/issues/15354#issuecomment-278676420 by @jreback, there is an API consistency consideration in having the output be the same length as the input. There may be users that would like to keep the NaNs as well (maybe?), and dropna would still be available after the rolling(..., step=...).func() operation.

@mroeschke I think exceptions should be made. So long as you put an explicit note in the documentation, and the behavior is not the default, no one will be adversely affected by not returning a vector full of junk. Keeping NaNs defeats half the purpose. One objective is to limit the number of times we perform an expensive computation. The other objective is to minimize the feature set to something manageable. That example I gave you is a real one, and not nearly as much data as one really has to process in a patient monitoring application. Is it really necessary to allocate 60000x the necessary space, then search through the array to delete NaNs? For each feature we want to compute?

Note that one computation might produce an array of values. What do I want to do with an ECG waveform? Well, compute the power spectrum, of course! So I then need to allocate enough space for one full PSD vector (150,000 elements) 1.8 million times (2 TB of data), then filter through it to get the pieces I care about (34 MB). For all the series. For all the patients. I guess I need to buy more RAM!

It's also worth mentioning that NaN, for some features, might be a meaningful output, in which case I can no longer tell the difference between a meaningful NaN and the junk NaNs padding the data.

While I understand the desire to maintain the API, this is not a feature that will break any existing code (because it's a new feature that didn't exist before), and given the functionality there is no reason anyone would expect it to yield an output of the same size. And even if they did, a note in the documentation for the step size would be sufficient. The disadvantages far outweigh any benefit of having a "consistent" API (for a feature that didn't previously exist, mind you). Not proceeding this way will cripple the feature, it's almost not even worth implementing in that case (in my experience the space cost is almost always the bigger factor).
