Scikit-learn: Implement WalkForward cross-validator for time series data.

Created on 15 Jul 2019  ·  23 Comments  ·  Source: scikit-learn/scikit-learn

Description


Implement walk-forward CV for time series data, with a gap between the train set and the test set.

Expected Results


[image]

Expanding:
[image]

Labels: Needs Decision, New Feature

All 23 comments

@saninstein Why the gap?

@clstaudt A gap is a useful feature for model evaluation in stock trading. In price prediction, it is very common for a model to perform very well on the data right after the training set and then degrade over time. So, in some cases it can be useful to skip a small amount of "good" data that occurs right after the training set.
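To make the idea of a gap concrete, here is a minimal sketch of a single train/test split that skips a few rows after the training set. The function name `split_with_gap` and its parameters are hypothetical and chosen only for illustration; they are not part of scikit-learn or of the proposal in this issue.

```python
def split_with_gap(n_samples, train_size, gap, test_size):
    """Return (train, test) index lists with `gap` rows skipped in between.

    Hypothetical helper for illustration only.
    """
    test_start = train_size + gap
    test_end = test_start + test_size
    if test_end > n_samples:
        raise ValueError("not enough samples for train + gap + test")
    train = list(range(train_size))
    test = list(range(test_start, test_end))
    return train, test


train, test = split_with_gap(n_samples=10, train_size=5, gap=2, test_size=3)
print(train)  # [0, 1, 2, 3, 4]
print(test)   # [7, 8, 9]; rows 5 and 6 are skipped
```

Rows 5 and 6, the "good" data right after training, are never used for evaluation.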

I think our usual position on time-series related features is that it's out of scope for sklearn (at least for now). And to me, it would make sense to revisit the matter once we have sample properties such as timestamp attached to the data.

I'd like to have at least one other opinion from @scikit-learn/core-devs on this, but my vote is a "won't fix" resolution for now.

@adrinjalali sklearn already has many good models for time series,

Which ones?

@adrinjalali there is TimeSeriesSplit, which is already implemented in sklearn :)

TimeSeriesSplit is a special case of the proposed WalkForwardCV (it can be achieved with the args gap=0, expanding=True). In fact, TimeSeriesSplit could easily be replaced with WalkForwardCV (just set the default args as I mentioned before).
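The relationship described here can be sketched in plain Python. The generator below is a hypothetical illustration, not the implementation proposed in this issue: the name `walk_forward_splits`, the fold-sizing rule (modeled on how TimeSeriesSplit sizes its test folds), and the choice of rolling window length are all assumptions.

```python
def walk_forward_splits(n_samples, n_splits=5, gap=0, expanding=True):
    """Yield (train, test) index lists in time order.

    Hypothetical sketch. With gap=0 and expanding=True the folds are
    intended to mirror the expanding-train layout of TimeSeriesSplit.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too many splits for the number of samples")
    first_test_start = n_samples - n_splits * test_size
    # Assumption: the rolling window keeps the size of the first train fold.
    window = first_test_start - gap
    if window <= 0:
        raise ValueError("gap leaves no training data in the first fold")
    for test_start in range(first_test_start, n_samples, test_size):
        train_end = test_start - gap
        train_start = 0 if expanding else train_end - window
        train = list(range(train_start, train_end))
        test = list(range(test_start, test_start + test_size))
        yield train, test


expanding_folds = list(walk_forward_splits(8, n_splits=3, gap=0, expanding=True))
rolling_folds = list(walk_forward_splits(8, n_splits=3, gap=0, expanding=False))
for train, test in expanding_folds:
    print("expanding", train, test)
for train, test in rolling_folds:
    print("rolling  ", train, test)
```

With `expanding=True` the train set grows from `[0, 1]` to `[0, 1, 2, 3, 4, 5]`; with `expanding=False` it slides forward as a fixed two-row window.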

I'm part of a stock trading ML team, and we are successfully using various sklearn features and sklearn-compatible libraries. There was a lack of a proper CV splitter, so we decided to contribute ours back to sklearn, but if that is not an option we will use it internally :)

TimeSeriesSplit is a special case of the proposed WalkForwardCV (it can be achieved with the args gap=0, expanding=True). In fact, TimeSeriesSplit could easily be replaced with WalkForwardCV (just set the default args as I mentioned before).

Fair, but then I'd probably try to patch TimeSeriesSplit in a backward compatible way to add the feature you need. That also has a higher chance of being accepted by the community.

Time series prediction is not the only use case for WalkForwardCV. It is useful whenever the observations in a dataset are ordered in time and validation needs to respect that order.

For example, one of our tasks is filtering out losing trades: a classification problem in which we try to improve an existing trading strategy with an ML model that "gives permission" to trade.


Fair, but then I'd probably try to patch TimeSeriesSplit in a backward compatible way to add the feature you need. That also has a higher chance of being accepted by the community.

I believe it would be the best solution.

I am a data scientist working on time series forecasting models. I assumed time series-specific things are out of scope for scikit-learn, so I started to implement my own validation code, slightly different from the method described in this thread.

Later I learned of the existence of TimeSeriesSplit and wondered whether I have started to reinvent the wheel. I would prefer to contribute the validation code to an existing, established project.

Since there is clearly a demand for this kind of model evaluation, I still wonder where it fits.

There are many patterns that are needed for prediction on time series. I
would think, for instance, that a transformer creating wavelet features would
be very useful.

However, these are outside the scope of scikit-learn. It would be useful
to create a package that implements all these in a consistent way. It
would probably pick up momentum.

It would be useful to create a package that implements all these in a consistent way. It would probably pick up momentum.

Yes. Sign me up.

We'd be happy to have such a package in https://github.com/scikit-learn-contrib/

Closing this one then :)

I think adding a better time-series cross-validation is in scope.

Also see #13666, #13204, #6322, #13761.

Re scope, I agree with @amueller that we should be open to extending this to common use-cases. Basically, we generally assume in scikit-learn estimators (i.e. sklearn package) that the model should be more-or-less invariant to sample order and feature order. This excludes time series estimators. However, we do not have this constraint in cross validation splitters where we have long considered sample order something to pay attention to; ultimately, cross validation is where the core assumptions around ML lie.

But as @amueller also points out, really the conversation should be continued in the existing pull requests, moving them towards an agreeable state.

Sure, but I'm wary of cases where the actual timestamp of the data should matter in the split (which IMO it should), and not the mere count of the rows. I don't think we're going to handle the timestamps anytime soon, are we?
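The difference between splitting on timestamps and splitting on row counts can be sketched with the standard library alone. This is a hypothetical illustration: `split_by_time` and its parameters are not an existing or proposed scikit-learn API, and the dates are made up for the example.

```python
from datetime import datetime, timedelta


def split_by_time(timestamps, train_end, gap, test_length):
    """Split row indices by timestamp rather than by row count.

    Hypothetical helper: rows before `train_end` form the train set, and
    rows in [train_end + gap, train_end + gap + test_length) form the test set.
    """
    test_start = train_end + gap
    test_end = test_start + test_length
    train = [i for i, t in enumerate(timestamps) if t < train_end]
    test = [i for i, t in enumerate(timestamps) if test_start <= t < test_end]
    return train, test


# Irregularly sampled data: five daily rows, then a two-week hole.
timestamps = [
    datetime(2019, 7, 1), datetime(2019, 7, 2), datetime(2019, 7, 3),
    datetime(2019, 7, 4), datetime(2019, 7, 5),
    datetime(2019, 7, 20), datetime(2019, 7, 21),
]
train, test = split_by_time(
    timestamps,
    train_end=datetime(2019, 7, 4),
    gap=timedelta(days=2),
    test_length=timedelta(days=20),
)
print(train)  # [0, 1, 2]
print(test)   # [5, 6]
```

Here a two-day gap excludes rows 3 and 4, whereas a gap counted in rows would treat the two-week hole between rows 4 and 5 the same as a single day.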

In the 5 years since I first proposed this in https://github.com/scikit-learn/scikit-learn/issues/3202, this question has come up at least 50 times in conversations while teaching or applying. @saninstein, did you make a decision about whether to push for inclusion here or in -contrib? I would love to help if there is anything you need assistance with to get this over the line (somewhere).

I would also like to contribute. I wrote about this in another issue and would like to expand TimeSeriesSplit or collaborate on creating another package for it. I feel this is related to splits in the CV domain and should be in sklearn.
To be honest, though, I am completely confused about where to go and what to do now that I want to contribute. I am mindful of my time and would like to use it in the right way for the community.

@mjbommar As @jnothman and I say above, we are quite open to moving forward, and there are some existing PRs, in particular #13761 and #13204. Feedback on the two APIs would be much appreciated.

I think #13204 looks to be the most mature so maybe going from there makes the most sense? I'm not sure if @kykosic is still working on it, given the delay in our response?

Hm though #13204 doesn't implement WalkForward... Do we want to merge #13204 first and then implement WalkForward later?
Should that be a separate CV object?

@amueller I had forgotten about #13204 until this post came up. I will address the reviews on it over the next week and see if it still fits in.

@kykosic awesome, thanks!

I'm curious why this issue was closed because #13204 was finished. I thought #13204 was a prerequisite for this one.

#13204 added gap to TimeSeriesSplit, which was the feature requested by this issue.
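For reference, a short usage sketch of the resulting parameter. It assumes scikit-learn 0.24 or later, where `TimeSeriesSplit` accepts `gap`; the toy data is made up for illustration.

```python
from sklearn.model_selection import TimeSeriesSplit

# Ten toy samples, already ordered in time.
X = [[i] for i in range(10)]

# gap=2 drops the two samples just before each test fold from training.
tscv = TimeSeriesSplit(n_splits=3, gap=2)
folds = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train, test in folds:
    print(train, test)
# [0, 1] [4, 5]
# [0, 1, 2, 3] [6, 7]
# [0, 1, 2, 3, 4, 5] [8, 9]
```

The train set still expands from fold to fold; a fixed-size rolling window can be approximated with the existing `max_train_size` parameter.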
