
Add inter_batch_stride param to TimeSeriesGenerator #251

Open · wants to merge 2 commits into master

Conversation

rjmccabe3701

Summary

Grant explicit control over inter-batch sample ordering in TimeSeriesGenerator.
This change is backwards compatible.

Related Issues

See this Stack Overflow question for background.

PR Overview

The original TimeSeriesGenerator emits batches where the first sample is
a stride ahead of the previous batch's last sample. Often more control
is required (especially when feeding a stateful RNN).

The inter_batch_stride option lets the user explicitly specify the
inter-batch stride relationship. If this option is None (or not
supplied), the original behavior is maintained.
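
A minimal sketch of the intended indexing, assuming inter_batch_stride fixes the offset between the starts of consecutive batches (the arithmetic below is illustrative, not the PR's actual implementation):

length, batch_size, stride = 6, 8, 1
start = length  # first usable window ends here

def batch_start(b, inter_batch_stride=None):
    # Original behavior: batch b begins batch_size * stride steps past batch b-1.
    # With inter_batch_stride set, batch b begins exactly that many steps past b-1.
    step = batch_size * stride if inter_batch_stride is None else inter_batch_stride
    return start + b * step

print([batch_start(b) for b in range(3)])                             # [6, 14, 22]
print([batch_start(b, inter_batch_stride=length) for b in range(3)])  # [6, 12, 18]

With inter_batch_stride equal to the window length, sample k of each batch directly continues sample k of the previous batch - the arrangement a stateful RNN expects.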

  • [y] This PR requires new unit tests (tests are included)
  • [y] This PR requires documentation updates (the docs are up-to-date)
  • [y] This PR is backwards compatible
  • [y] This PR changes the current API (all API changes need to be approved by fchollet)

McCabe, Robert J added 2 commits October 10, 2019 19:07
@OverLordGoldDragon

OverLordGoldDragon commented Oct 11, 2019

I suggest steering clear of TimeseriesGenerator altogether; I'm not too familiar with it, as I never used it, but I avoided it for a reason; consider:

import numpy as np

data = np.random.randn(240, 4)  # (timesteps, channels)
length = 6
batch_size = 8
stride = 1
sampling_rate = 1  # was undefined in the original snippet
start_index, end_index, index = 6, 40, 0

# Row indices of the samples in batch `index` (mirrors TimeseriesGenerator's
# internal indexing).
i = start_index + batch_size * stride * index
rows = np.arange(i, min(i + batch_size * stride, end_index + 1), stride)

samples = np.array([data[row - length:row:sampling_rate] for row in rows])
print(samples.shape)
# (8, 6, 4)

The batch dimension and a feature dimension (timesteps) are mixed, which is a big no-no and will obliterate training. The only workaround is to feed data compatible with such a manipulation to begin with - good luck with that. There are other pitfalls I'll spare listing - bottom line, write your own timeseries generator.
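
To make the 'mixing' concrete, here is a small check continuing the snippet above (illustrative, not part of the original comment): with stride = 1, consecutive samples in a batch are the same contiguous sequence shifted by one timestep, yet the model treats them as independent.

import numpy as np

data = np.random.randn(240, 4)
length = 6
rows = np.arange(6, 14)  # stride = 1, batch_size = 8
samples = np.array([data[row - length:row] for row in rows])

# Sample 1 is sample 0 shifted by one timestep: 5 of its 6 rows are identical.
print(np.allclose(samples[1][:-1], samples[0][1:]))  # True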

@rjmccabe3701
Author

@OverLordGoldDragon I understand your concerns, but this is a well-respected library. If you feel strongly about it, help make it more understandable/extensible.

@OverLordGoldDragon

@rjmccabe3701 Keras' PR page is piled with fine improvements and extensions of functionality that are never merged - TensorFlow is a lot less phobic in this regard. If I were to make a PR, I'm pretty sure it'd be rejected or left collecting electron dust, as it'd be a total overhaul of TimeseriesGenerator (which it does appear to need).

For your specific instance, you need not take my word for anything - just compare the data you feed in with what it spits out and see if it makes sense.

@TanguyUrvoy

TanguyUrvoy commented Oct 11, 2019

@OverLordGoldDragon Hello,

"Batch dimension and a feature dimension (timesteps) are mixed"
Indeed, this is called online learning, and it is precisely what time series are about: suppose you want to train a temperature predictor from four sensors on your house; the only data you get is a single (infinite) sequence of time-indexed vectors of dimension 4. It would be quite difficult to train your model without "mixing time and batches".

You could learn purely online in order to keep a potentially unbounded history - I tried something like that with the (experimental) stateful option - but it seems more efficient (especially with GPUs) to generate fixed-length history samples of the series with a sliding/overlapping window.
If the overlap, and the correlation it induces, is a problem for your setting, you may subsample the windows. A simple way to do that is to set the stride parameter to a value larger than the window's length in the new version of the TimeseriesGenerator (see the sketch after this comment). But in my experience it doesn't help much, as it rejects a lot of data.

Regards,

TU
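
A runnable illustration of the subsampling described above, using the actual TimeseriesGenerator API (length and stride are its real parameters): setting stride equal to (or larger than) length yields non-overlapping windows, at the cost of discarding the intermediate ones.

import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

data = np.random.randn(240, 4)   # (timesteps, channels)
targets = data                   # predict the series itself
length = 6

# stride == length -> consecutive windows no longer overlap.
gen = TimeseriesGenerator(data, targets, length=length,
                          stride=length, batch_size=8)

x, y = gen[0]
print(x.shape)  # (8, 6, 4); window b starts where window b-1 ended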

@OverLordGoldDragon

OverLordGoldDragon commented Oct 11, 2019

@TanguyUrvoy Misleading information; mixing batch and feature dimensions is never beneficial.

What you describe for online learning is splitting the sequence into finite parts, then treating them as independent. This no longer qualifies as 'mixing' dimensions, as the sequences are assumed independent to begin with. The problem arises when you have a single long sequence that is not to be treated as independent but still needs to be split up - now mixing dimensions will treat a single, causally contiguous sequence as independent sequences, and will prohibit learning any time dependence.

The one case that differs is one-step prediction, where one sequence is fed with varying lags in a single batch - but that is already accounted for above in setting up the data as independent sequences. OP's case concerns statefulness, for which I see little-to-no use for TimeseriesGenerator - details here

@TanguyUrvoy

TanguyUrvoy commented Oct 14, 2019

@OverLordGoldDragon Well, I am happy that we both agree that splitting long sequences into finite parts is a necessary evil, as it crops the histories :)

The (experimental) stateful option was an attempt to interleave sequences into consecutive batches in order to make stateful learning possible with batch sizes greater than one. It requires the stride and the window size to coincide. But I never use it, and I am not sure it really works as it is (see the sketch after this comment).

TU
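
A minimal sketch of that interleaving, assuming the arrangement described above: split the series into batch_size contiguous chunks and window each chunk in parallel, so that sample k of batch b+1 directly continues sample k of batch b (the function below is illustrative, not the experimental option's actual code):

import numpy as np

def stateful_batches(data, length, batch_size):
    # Split the series into batch_size contiguous chunks; sample k of every
    # batch comes from chunk k, so each batch slot stays causally contiguous
    # across consecutive batches - what a stateful RNN expects.
    chunk = len(data) // batch_size
    for b in range(chunk // length):  # stride == window length
        start = b * length
        yield np.stack([data[k * chunk + start : k * chunk + start + length]
                        for k in range(batch_size)])

data = np.random.randn(240, 4)
for batch in stateful_batches(data, length=6, batch_size=8):
    print(batch.shape)  # (8, 6, 4)
    break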
