Scikit-learn: Implement WalkForward cross-validator for time series data.

Created on 15 Jul 2019 · 23 Comments · Source: scikit-learn/scikit-learn

Description

Implement the walk forward cv for time series data with gap between the train set and the test set.

Expected Results

Expanding

Needs Decision New Feature

Source

saninstein

👍7

All 23 comments

@saninstein Why the gap?

clstaudt on 15 Jul 2019

@clstaudt The gap is a useful feature for model evaluation in stock trading. In price prediction it is very common for a model to perform very well on the data right after the training set and then degrade over time. So, in some cases it can be useful to skip a small amount of "good" data that occurs right after the training set.
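To make the gap concrete, here is a minimal sketch of a single train/test split in which a few rows immediately after the training set are skipped. The helper `split_with_gap` and its parameter names are hypothetical, chosen for illustration; they are not part of scikit-learn or of the proposal in this issue.

```python
def split_with_gap(n_samples, train_end, gap, test_size):
    """Return (train, test) index lists with `gap` rows skipped between them.

    Hypothetical helper for illustration only; not a scikit-learn API.
    """
    test_start = train_end + gap                       # skip `gap` rows after training
    test_end = min(test_start + test_size, n_samples)  # do not run past the data
    train = list(range(train_end))
    test = list(range(test_start, test_end))
    return train, test


train, test = split_with_gap(n_samples=12, train_end=6, gap=2, test_size=3)
print(train)  # [0, 1, 2, 3, 4, 5]
print(test)   # [8, 9, 10]
```

Rows 6 and 7, the "good" data right after the training set, appear in neither set.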

ksanderer on 22 Jul 2019

👍5

I think our usual position on time-series related features is that it's out of scope for sklearn (at least for now). And to me, it would make sense to revisit the matter once we have sample properties such as timestamp attached to the data.

I'd like to have at least one other opinion from @scikit-learn/core-devs on this, but my vote is a "won't fix" resolution for now.

adrinjalali on 22 Jul 2019

@adrinjalali sklearn already has many good models for timeseries,

Which ones?

GaelVaroquaux on 22 Jul 2019

@adrinjalali there is TimeSeriesSplit, which is already implemented in sklearn :)

TimeSeriesSplit is a special case of the proposed WalkForwardCV (it can be achieved with the gap=0, expanding=True args). In fact, TimeSeriesSplit can easily be replaced with WalkForwardCV (just set the default args as I mentioned before).
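As a rough sketch of that relationship, the generator below yields walk-forward splits with an optional gap and either an expanding or a rolling training window. The function name `walk_forward_splits` and the `test_size` and `train_size` parameters are assumptions made for this example; only `gap` and `expanding` come from the comment above, and this is not the code proposed in the issue.

```python
def walk_forward_splits(n_samples, n_splits, test_size, gap=0,
                        expanding=True, train_size=None):
    """Yield (train, test) index lists for walk-forward validation.

    Hypothetical sketch: the test windows tile the last
    n_splits * test_size samples, and each training window ends
    `gap` rows before its test window starts.
    """
    if not expanding and train_size is None:
        raise ValueError("train_size is required for a rolling window")
    first_test_start = n_samples - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError("not enough samples for the requested splits and gap")
    for test_start in range(first_test_start, n_samples, test_size):
        train_end = test_start - gap
        # Expanding: always train from row 0. Rolling: keep the last train_size rows.
        train_start = 0 if expanding else max(0, train_end - train_size)
        yield (list(range(train_start, train_end)),
               list(range(test_start, test_start + test_size)))


# gap=0 with an expanding window: the training set grows with every split.
expanding_splits = list(walk_forward_splits(10, n_splits=3, test_size=2))

# gap=1 with a rolling window of 3 rows: the training set slides forward.
rolling_splits = list(walk_forward_splits(10, n_splits=3, test_size=2, gap=1,
                                          expanding=False, train_size=3))

for train, test in expanding_splits + rolling_splits:
    print(train, test)
```

With gap=0 and expanding=True on 10 samples, the splits should mirror what TimeSeriesSplit(n_splits=3) produces, which is the special case described above.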

I'm part of a stock trading ML team, and we are successfully using various sklearn features and sklearn-compatible libraries. We lacked a proper CV splitter, so we wrote one and decided to contribute it back to sklearn, but if it does not fit here we will use it internally :)

ksanderer on 22 Jul 2019

TimeSeriesSplit is a special case of the proposed WalkForwardCV (it can be achieved with the gap=0, expanding=True args). In fact, TimeSeriesSplit can easily be replaced with WalkForwardCV (just set the default args as I mentioned before).

Fair, but then I'd probably try to patch TimeSeriesSplit in a backward compatible way to add the feature you need. That also has a higher chance of being accepted by the community.

adrinjalali on 22 Jul 2019

👍1

Timeseries prediction is not the only use case for WalkForwardCV. It is useful whenever the observations in a dataset are ordered in time and validation needs to respect that order.

For example, one of our tasks is a classification problem for filtering out losing trades: we are trying to improve an existing trading strategy with an ML model which will "give permission" to trade.

Fair, but then I'd probably try to patch TimeSeriesSplit in a backward compatible way to add the feature you need. That also has a higher chance of being accepted by the community.

I believe it would be the best solution.

ksanderer on 22 Jul 2019

👍1

I am a data scientist working on time series forecasting models. I assumed time series-specific things are out of scope for scikit-learn, so I started to implement my own validation code, slightly different from the method described in this thread.

Later I learned of the existence of TimeSeriesSplit and wondered whether I have started to reinvent the wheel. I would prefer to contribute the validation code to an existing, established project.

Since there is clearly a demand for this kind of model evaluation, I still wonder where it fits.

clstaudt on 22 Jul 2019

There are many patterns that are needed for prediction on time series. I would think, for instance, that a transformer creating wavelet features would be very useful.

However, these are outside the scope of scikit-learn. It would be useful to create a package that implements all these in a consistent way. It would probably pick up momentum.

GaelVaroquaux on 22 Jul 2019

👍3

It would be useful to create a package that implements all these in a consistent way. It would probably pick up momentum.

Yes. Sign me up.

clstaudt on 22 Jul 2019

We'd be happy to have such a package in https://github.com/scikit-learn-contrib/

Closing this one then :)

adrinjalali on 22 Jul 2019

I think adding a better time-series cross-validation is in scope.

amueller on 23 Jul 2019

👍10 🚀5

also see #13666 #13204 #6322 #13761

amueller on 23 Jul 2019

Re scope, I agree with @amueller that we should be open to extending this to common use-cases. Basically, we generally assume in scikit-learn estimators (i.e. sklearn package) that the model should be more-or-less invariant to sample order and feature order. This excludes time series estimators. However, we do not have this constraint in cross validation splitters where we have long considered sample order something to pay attention to; ultimately, cross validation is where the core assumptions around ML lie.

But as @amueller also points out, really the conversation should be continued in the existing pull requests, moving them towards an agreeable state.

jnothman on 23 Jul 2019

Sure, but I'm wary of cases where the actual timestamp of the data should matter in the split (which IMO it should), and not the mere count of the rows. I don't think we're going to handle the timestamps anytime soon, are we?
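To illustrate the difference between splitting by timestamp and splitting by row count, here is a minimal sketch that defines the train set, the gap, and the test set as time intervals. The helper `split_by_time` and its parameters are hypothetical; scikit-learn splitters operate on row positions, and nothing here is an existing API.

```python
from datetime import datetime, timedelta


def split_by_time(timestamps, cutoff, gap, horizon):
    """Split row indices by timestamp rather than by row count.

    Hypothetical helper: train on rows strictly before `cutoff`, test on
    rows in [cutoff + gap, cutoff + gap + horizon).
    """
    test_start = cutoff + gap
    test_end = test_start + horizon
    train = [i for i, t in enumerate(timestamps) if t < cutoff]
    test = [i for i, t in enumerate(timestamps) if test_start <= t < test_end]
    return train, test


# Irregularly spaced observations in July 2019.
stamps = [datetime(2019, 7, day) for day in (1, 2, 3, 8, 9, 12, 15, 16, 17, 30)]

train, test = split_by_time(
    stamps,
    cutoff=datetime(2019, 7, 10),
    gap=timedelta(days=5),
    horizon=timedelta(days=7),
)
print(train)  # [0, 1, 2, 3, 4]
print(test)   # [6, 7, 8]
```

The observation on 12 July falls inside the five-day gap and the one on 30 July lies beyond the test horizon, so neither is used. A gap expressed as a fixed number of rows would cover a different time span depending on how densely the data is sampled.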

adrinjalali on 23 Jul 2019

In the 5 years since I first proposed this in https://github.com/scikit-learn/scikit-learn/issues/3202, this question has come up at least 50 times in conversations while teaching or applying it. @saninstein, did you make a decision about whether to push for inclusion here or in -contrib? I would love to help if there is anything you need assistance with to get this over the line (somewhere).

mjbommar on 3 Nov 2019

👍3

I would also like to contribute. I wrote about this in another issue and would like to expand on TimeSeriesSplit or collaborate on creating another package for it. I feel this is related to splits in the CV domain and should be in sklearn.
To be honest, though, I am completely confused as to where to go and what to do now that I want to contribute. I am mindful of my time and I would like to use it in the right way for the community.

svenstehle on 23 Dec 2019

@mjbommar As @jnothman and I say above, we are quite open to moving forward. There are some existing PRs, in particular #13761 and #13204, and feedback on the two APIs would be much appreciated.

I think #13204 looks to be the most mature so maybe going from there makes the most sense? I'm not sure if @kykosic is still working on it, given the delay in our response?

amueller on 23 Dec 2019

Hm, though, #13204 doesn't implement WalkForward... Do we want to merge #13204 first and then implement WalkForward later?
Should that be a separate CV object?

amueller on 23 Dec 2019

👍1

@amueller I had forgotten about #13204 until this post came up. I will address the reviews on it over the next week and see if it still fits in.