
Add inter_batch_stride param to TimeSeriesGenerator #251

Open · wants to merge 2 commits into master

Conversation

rjmccabe3701

Summary

Grant explicit control over inter-batch sample ordering in TimeSeriesGenerator.
This change is backwards compatible.

Related Issues

See this Stack Overflow question for background.

PR Overview

The original TimeSeriesGenerator emits batches where the first sample is
a stride ahead of the previous batch's last sample. Often more control
is required (especially when feeding a stateful RNN).

The inter_batch_stride option lets the user explicitly specify the
inter-batch stride relationship. If this option is None (or not
supplied), the original behavior is maintained.
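
A minimal sketch of the intended indexing, assuming inter_batch_stride fixes the offset between the starts of consecutive batches (the arithmetic below is illustrative, not the PR's actual implementation):

length, batch_size, stride = 6, 8, 1
start = length  # first usable window ends here

def batch_start(b, inter_batch_stride=None):
    # Original behavior: batch b begins batch_size * stride steps past batch b-1.
    # With inter_batch_stride set, batch b begins exactly that many steps past b-1.
    step = batch_size * stride if inter_batch_stride is None else inter_batch_stride
    return start + b * step

print([batch_start(b) for b in range(3)])                             # [6, 14, 22]
print([batch_start(b, inter_batch_stride=length) for b in range(3)])  # [6, 12, 18]

With inter_batch_stride equal to the window length, sample k of each batch directly continues sample k of the previous batch - the arrangement a stateful RNN expects.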

  • [y] This PR requires new unit tests (tests are included)
  • [y] This PR requires documentation updates (the docs are up-to-date)
  • [y] This PR is backwards compatible
  • [y] This PR changes the current API (all API changes need to be approved by fchollet)

McCabe, Robert J added 2 commits October 10, 2019 19:07
@OverLordGoldDragon

OverLordGoldDragon commented Oct 11, 2019

I suggest steering clear of TimeseriesGenerator altogether; I'm not too familiar with it, as I never used it, but I avoided it for a reason; consider:

import numpy as np

data = np.random.randn(240, 4)  # (timesteps, channels)
length = 6
batch_size = 8
stride = 1
sampling_rate = 1  # was undefined in the original snippet
start_index, end_index, index = 6, 40, 0

# Row indices of the samples in batch `index` (mirrors TimeseriesGenerator's
# internal indexing).
i = start_index + batch_size * stride * index
rows = np.arange(i, min(i + batch_size * stride, end_index + 1), stride)

samples = np.array([data[row - length:row:sampling_rate] for row in rows])
print(samples.shape)
# (8, 6, 4)

The batch dimension and a feature dimension (timesteps) are mixed, which is a big no-no and will obliterate training. The only workaround is to feed data compatible with such a manipulation to begin with - good luck with that. There are other pitfalls I'll spare listing - bottom line, write your own timeseries generator.
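
To make the 'mixing' concrete, here is a small check continuing the snippet above (illustrative, not part of the original comment): with stride = 1, consecutive samples in a batch are the same contiguous sequence shifted by one timestep, yet the model treats them as independent.

import numpy as np

data = np.random.randn(240, 4)
length = 6
rows = np.arange(6, 14)  # stride = 1, batch_size = 8
samples = np.array([data[row - length:row] for row in rows])

# Sample 1 is sample 0 shifted by one timestep: 5 of its 6 rows are identical.
print(np.allclose(samples[1][:-1], samples[0][1:]))  # True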

@rjmccabe3701
Author

@OverLordGoldDragon I understand your concerns, but this is a well-respected library. If you feel strongly about it, help make it more understandable/extensible.

@OverLordGoldDragon

@rjmccabe3701 Keras' PR page is piled with fine improvements and extensions of functionality that are never merged - TensorFlow is a lot less phobic in this regard. If I were to make a PR, I'm pretty sure it'd be rejected or left collecting electron dust, as it'd be a total overhaul of TimeseriesGenerator (which it does appear to need).

For your specific instance, you need not take my word for anything - just compare the data you feed in with what it spits out and see if it makes sense.

@TanguyUrvoy

TanguyUrvoy commented Oct 11, 2019

@OverLordGoldDragon Hello,

"Batch dimension and a feature dimension (timesteps) are mixed"
Indeed, this is called online learning, and it is precisely what time series are about: suppose you want to train a temperature predictor from four sensors on your house; the only data you get is a single (infinite) sequence of time-indexed vectors of dimension 4. It would be quite difficult to train your model without "mixing time and batches".

You could learn purely online in order to keep a potentially unbounded history - I tried something like that with the (experimental) stateful option - but it seems more efficient (especially with GPUs) to generate fixed-length history samples of the series with a sliding/overlapping window.
If the overlap, and the correlation it induces, is a problem for your setting, you may subsample the windows. A simple way to do that is to set the stride parameter to a value larger than the window's length in the new version of the TimeseriesGenerator (see the sketch after this comment). But in my experience it doesn't help much, as it rejects a lot of data.

Regards,

TU
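
A runnable illustration of the subsampling described above, using the actual TimeseriesGenerator API (length and stride are its real parameters): setting stride equal to (or larger than) length yields non-overlapping windows, at the cost of discarding the intermediate ones.

import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

data = np.random.randn(240, 4)   # (timesteps, channels)
targets = data                   # predict the series itself
length = 6

# stride == length -> consecutive windows no longer overlap.
gen = TimeseriesGenerator(data, targets, length=length,
                          stride=length, batch_size=8)

x, y = gen[0]
print(x.shape)  # (8, 6, 4); window b starts where window b-1 ended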

@OverLordGoldDragon

OverLordGoldDragon commented Oct 11, 2019

@TanguyUrvoy Misleading information; mixing batch and feature dimensions is never beneficial.

What you describe for online learning is splitting the sequence into finite parts, then treating them as independent. This no longer qualifies as 'mixing' dimensions, as the sequences are assumed independent to begin with. The problem arises when you have a single long sequence that is not to be treated as independent but still needs to be split up - now mixing dimensions will treat a single, causally contiguous sequence as independent sequences, and will prohibit learning any time dependence.

The one case that differs is one-step prediction, where one sequence is fed with varying lags in a single batch - but that is already accounted for above in setting up the data as independent sequences. OP's case concerns statefulness, for which I see little-to-no use for TimeseriesGenerator - details here

@TanguyUrvoy

TanguyUrvoy commented Oct 14, 2019

@OverLordGoldDragon Well, I am happy that we both agree that splitting long sequences into finite parts is a necessary evil, as it crops the histories :)

The (experimental) stateful option was an attempt to interleave sequences into consecutive batches in order to make stateful learning possible with batch sizes greater than one. It requires the stride and the window size to coincide. But I never use it, and I am not sure it really works as it is (see the sketch after this comment).

TU
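
A minimal sketch of that interleaving, assuming the arrangement described above: split the series into batch_size contiguous chunks and window each chunk in parallel, so that sample k of batch b+1 directly continues sample k of batch b (the function below is illustrative, not the experimental option's actual code):

import numpy as np

def stateful_batches(data, length, batch_size):
    # Split the series into batch_size contiguous chunks; sample k of every
    # batch comes from chunk k, so each batch slot stays causally contiguous
    # across consecutive batches - what a stateful RNN expects.
    chunk = len(data) // batch_size
    for b in range(chunk // length):  # stride == window length
        start = b * length
        yield np.stack([data[k * chunk + start : k * chunk + start + length]
                        for k in range(batch_size)])

data = np.random.randn(240, 4)
for batch in stateful_batches(data, length=6, batch_size=8):
    print(batch.shape)  # (8, 6, 4)
    break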
