Describing and serializing objects #23

Open

flukeskywalker opened this issue Aug 10, 2015 · 16 comments
@flukeskywalker (Collaborator)

From what I can see, the mechanism for generating descriptions and serializing networks etc. does not fully work yet. @Qwlouse, you were working on this. Any comments on what else is needed?

I think we should have network.save() and trainer.save() methods for dumping to disk, and load_network() and load_trainer() functions for reading the dumps. It shouldn't need to be more complicated than that. Thoughts?
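A quick sketch of how that would look in use (the method and function names are the proposal above; the file format, arguments, and import locations are placeholders at this point):

```python
# Hypothetical usage of the proposed interface; everything beyond the
# function names themselves is a placeholder.
network.save('model.h5')            # dump description + weights to disk
trainer.save('trainer.h5')

network = load_network('model.h5')  # reconstruct from the dumps
trainer = load_trainer('trainer.h5')
```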

@untom (Collaborator) commented Aug 11, 2015

Why not implement the pickle serialization methods (__getstate__ and __setstate__) instead? That would have the advantage that you could then use a network wherever pickling is needed (e.g., with the multiprocessing or concurrent.futures packages).
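For reference, a minimal sketch of that approach (the attribute names here are made up; a real implementation would also rebuild things like GPU buffers in __setstate__):

```python
class Network(object):
    # ... layers, buffers, forward/backward passes, etc. ...

    def __getstate__(self):
        # Keep only what is serializable: a description of the architecture
        # plus the raw weights; drop unpicklable handles (e.g. GPU memory).
        return {'description': self.description,
                'weights': self.weights}

    def __setstate__(self, state):
        # Rebuild the full object from the description, then restore weights.
        self.description = state['description']
        self.weights = state['weights']
```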

@flukeskywalker (Collaborator, author)

The reason for not using pickle is security, since users will probably share saved networks with others.

I recommend that we use HDF5 as much as possible. Brainstorm already supports Describable objects, so we can generate a description and then save it in an HDF5 file along with the weights. There can be options to save the internal states/gradients and to use compression. Apart from this, we should also save features/outputs as HDF5 files.

@untom (Collaborator) commented Aug 11, 2015

Is pickle insecure?

@flukeskywalker (Collaborator, author)

Indeed: http://www.benfrederickson.com/dont-pickle-your-data/

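The linked post's core point, as a minimal demonstration: unpickling can execute arbitrary code, because pickle reconstructs objects via whatever callable __reduce__ returns.

```python
import os
import pickle

class Exploit:
    def __reduce__(self):
        # pickle will call os.system(...) when this payload is loaded
        return (os.system, ('echo "arbitrary code executed on unpickle"',))

payload = pickle.dumps(Exploit())
pickle.loads(payload)  # runs the shell command -- never unpickle untrusted data
```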

@untom (Collaborator) commented Aug 11, 2015

Thanks for the link!

Qwlouse self-assigned this Aug 17, 2015
@Qwlouse (Collaborator) commented Aug 17, 2015

We can still provide the __getstate__ and __setstate__ functions for the use cases that @untom pointed out. But as a standard format for distributing models, I agree with @flukeskywalker that HDF5 would be nicer.

None of this should be difficult once I've fixed the serialization issue. The network architecture is JSON serializable, and so will be the initializers, weight-modifiers, and gradient-modifiers. All that remains to be done then is to save the weights (and maybe the state), and those are just big arrays.

@Qwlouse (Collaborator) commented Aug 17, 2015

OK, I implemented describing the network and the handlers. From here it should be only a small step to making networks pickleable, and once we've fixed the format for HDF5, that should be rather easy too.

@Qwlouse (Collaborator) commented Aug 18, 2015

Format suggestion for HDF5

The simplest format for storing a network as HDF5 would have two groups:

  • network: a JSON serialized version of the network description
  • parameters: float array with all parameters

We could unfold both of these more, to make the network easier to inspect by just looking at the hdf5 file.
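A rough sketch of this simple layout with h5py (the two entry names come from the list above; the description and parameter contents are placeholders). Note the thread later settles on storing the JSON as a UTF-8 byte array instead of a string; see further down.

```python
import json
import h5py
import numpy as np

description = {'layers': {}}                   # placeholder network description
parameters = np.zeros(1000, dtype=np.float32)  # placeholder flat weight array

with h5py.File('network.h5', 'w') as f:
    f['network'] = json.dumps(description)  # scalar variable-length string dataset
    f['parameters'] = parameters

with h5py.File('network.h5', 'r') as f:
    description = json.loads(f['network'][()])  # reads back as bytes, json.loads accepts it
    parameters = f['parameters'][:]
```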

Buffer structure

HDF5 supports links (hard and soft) to pieces of the data. So we could actually mirror the whole buffer structure in the file:

  • network: a JSON serialized version of the network description
  • buffer
    • parameters: all parameters as float array
    • RnnLayer_1
      • parameters
        • HX
        • HR
        • H_bias
    • FullyConnectedLayer
    • [...]

I do like this idea, and I think it should not be difficult. The only drawback is that it might be confusing if you want to write a network file without using brainstorm.
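One caveat, sketched below: plain HDF5 hard/soft links address whole objects, so pointing a layer's HX entry at a *slice* of the flat parameters array would need region references (or separate per-weight datasets). Roughly, with h5py (the offsets are made up):

```python
import h5py
import numpy as np

with h5py.File('network.h5', 'w') as f:
    params = f.create_dataset('buffer/parameters',
                              data=np.zeros(300, dtype=np.float32))

    # A soft link aliases a whole dataset under a second path:
    f['flat_parameters'] = h5py.SoftLink('/buffer/parameters')

    # To expose just a slice (e.g. HX) under the layer's group, store a
    # region reference instead of a link:
    ref_dtype = h5py.special_dtype(ref=h5py.RegionReference)
    layer = f.create_group('buffer/RnnLayer_1/parameters')
    hx = layer.create_dataset('HX', (1,), dtype=ref_dtype)
    hx[0] = params.regionref[0:200]  # HX spans the first 200 entries (made up)
```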

Network Description

We could unfold the JSON description of the network into the HDF5 file by (mis)using its internal structure: dictionaries would become groups; integers, floats, and strings would become attributes; numerical lists would become datasets; and other lists would have to be shoehorned into something.

This would allow browsing the architecture of the network with just a normal HDF5 viewer. On the con side, it is quite a bit of work, and it feels a bit artificial. I do not think this is worth the effort.

Optional Full State

We could allow (optionally) saving the full internal state of the network to the HDF5 file. Not too much work, but I'm not sure about good use cases.

@untom (Collaborator) commented Aug 18, 2015

Overall that sounds good. But how are you going to store the JSON? I didn't think HDF5 was meant for (or well suited to) storing large strings.

OTOH, I don't like the idea of translating the JSON into HDF5 structure; it sounds artificial/forced to me, too. Also, I'm not sure there's ever a need to store the Full State, since it can usually be recomputed easily afterwards (and who needs the full state for a very specific input data point, anyhow?).

@flukeskywalker (Collaborator, author)

JSON + array in HDF5 is good.

Full state/gradients are probably not needed. What would actually help are helper functions (which can be used as hooks) for saving features/gradients/states/deltas as HDF5 files. These would be used for feature extraction, introspection, etc.
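Something along these lines, perhaps. This is a hypothetical hook; the accessor net.get_layer_outputs(...) is made up rather than brainstorm's real API:

```python
import h5py

class SaveOutputsHook:
    """Hypothetical hook that appends one layer's outputs to an HDF5 file."""

    def __init__(self, layer_name, filename):
        self.layer_name = layer_name
        self.filename = filename
        self.call_nr = 0

    def __call__(self, net):
        outputs = net.get_layer_outputs(self.layer_name)  # made-up accessor
        with h5py.File(self.filename, 'a') as f:
            f.create_dataset('call_{}/{}'.format(self.call_nr, self.layer_name),
                             data=outputs)
        self.call_nr += 1
```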

@Qwlouse (Collaborator) commented Aug 25, 2015

OK, I'll just store the JSON, encoded as UTF-8, as a byte array.
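Concretely, a sketch of the encode/decode round trip with h5py and numpy:

```python
import json
import h5py
import numpy as np

description = {'layers': {}}  # placeholder
raw = json.dumps(description).encode('utf-8')

with h5py.File('network.h5', 'w') as f:
    # A plain uint8 dataset sidesteps any concerns about large strings.
    f.create_dataset('network', data=np.frombuffer(raw, dtype=np.uint8))

with h5py.File('network.h5', 'r') as f:
    description = json.loads(f['network'][:].tobytes().decode('utf-8'))
```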

@Qwlouse (Collaborator) commented Aug 25, 2015

Done for saving and loading the network.

@Qwlouse (Collaborator) commented Oct 15, 2015

Are we doing anything about this for the release? Currently we can:

  • store the network (description + parameters) as an HDF5 file
  • store the trainer logs (via a hook) as an HDF5 file
  • get a description of the trainer, though it neglects any internal state.

So the only thing still missing is a way to continue interrupted training.

@flukeskywalker (Collaborator, author)

We can continue training from a network and trainer description.
If we're shuffling the data, we can't really continue training from exactly the same batch, can we?

@Qwlouse (Collaborator) commented Oct 15, 2015

The trainer description is not enough, because it (currently) discards "fleeting" information like current_epoch_nr and current_update_nr. Also, steppers might have internal state (like the velocity in MomentumStep) which would not be saved.

And you are right of course: data iterators don't really allow for anything but restarting after a completed epoch.

@flukeskywalker (Collaborator, author)

Unless we can save stepper states, epoch/update info, and optionally some info about the iterator state (batch index?), there does not seem to be much point in continuing training. The batch index can actually be deduced from the update number and batch size, so we could just 'skip' data up to that point when restarting; see the sketch below.
The first two are not much work, but this does not seem essential for the initial release.
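A sketch of that skip-ahead idea (all names hypothetical; note it only reproduces the same batches if the per-epoch shuffle is seeded deterministically):

```python
# Hypothetical resume logic: derive the position within the data from the
# counters that would be saved with the trainer.
batches_per_epoch = num_samples // batch_size
epoch_nr = saved_update_nr // batches_per_epoch
batch_index = saved_update_nr % batches_per_epoch

data_iter = make_data_iter(shuffle_seed=epoch_nr)  # deterministic shuffle
for _ in range(batch_index):
    next(data_iter)  # discard batches already consumed before the interruption
```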

Qwlouse removed this from the Initial Public Beta Release 0.5 milestone Oct 19, 2015