Why Is It Tricky?

Conventional cross validation assumes that data points are independent of one another. Time series data breaks this assumption, because observations that are near in time are related.

Therefore, when a conventional cross validation technique is used to estimate model accuracy on time series data, it fails miserably, because it picks the train and test points at random positions in the data and ignores their order in time.

For example, for the time series 1,2,3,4,5,6,7,8,9,10, a traditional cross validation might yield a split such as

1,10,9,4 as the train set and the rest as the test set.

However, with time series data we want to preserve the order and closeness of the data points. We might want something like

1,2,3,4,5,6 as the train set and 7,8,9,10 as the test set.

This split keeps the time dependence in the data intact: every train point comes before every test point.
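
The difference between the two kinds of split can be sketched in a few lines of plain Python. This is only an illustration on the toy series 1 to 10; the helper names random_split, ordered_split and is_ordered are made up for this example, and the seed is arbitrary.

```python
import random

# Toy series from the text; here each value doubles as its position in time.
series = list(range(1, 11))

def random_split(data, n_train, seed=0):
    """Pick n_train points at random for training; the rest form the test set."""
    rng = random.Random(seed)
    train = rng.sample(data, n_train)
    test = [x for x in data if x not in train]
    return train, test

def ordered_split(data, n_train):
    """Use the first n_train points for training and the remainder for testing."""
    return data[:n_train], data[n_train:]

def is_ordered(train, test):
    """True when every train point comes before every test point."""
    return max(train) < min(test)

rand_train, rand_test = random_split(series, 4)
ord_train, ord_test = ordered_split(series, 6)

print("random :", rand_train, rand_test, "ordered in time:", is_ordered(rand_train, rand_test))
print("ordered:", ord_train, ord_test, "ordered in time:", is_ordered(ord_train, ord_test))
```

The ordered split always passes the is_ordered check, while a random split will usually mix later points into the train set.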

 

Time Series Split 

For this kind of time series data, in Python we can use scikit-learn's TimeSeriesSplit. In the k-th split, it returns the first k folds as the train set and the (k+1)-th fold as the test set.
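
A minimal sketch of how this might look on the toy series 1 to 10, assuming scikit-learn and NumPy are installed. The choice of n_splits=4 and the variable names are only for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series 1..10, shaped as a single feature column.
X = np.arange(1, 11).reshape(-1, 1)

# n_splits=4 is an arbitrary choice for this example.
tscv = TimeSeriesSplit(n_splits=4)

# Collect each split as (train values, test values).
splits = [
    (X[train_idx].ravel().tolist(), X[test_idx].ravel().tolist())
    for train_idx, test_idx in tscv.split(X)
]

for train, test in splits:
    print("train:", train, "test:", test)
```

With ten points and four splits, each test set holds two points, and the train set grows from the first two points to the first eight. The train set always ends right before the test set begins.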

 

Reference:

http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-of-time-series-data

 
