
Many different problems #15

Closed

Fred-Erik opened this issue Jul 31, 2017 · 5 comments

Comments


Fred-Erik commented Jul 31, 2017

Hello all,

I run into a lot of very basic problems when I try to use Im2Text. Could anyone please help me? It seems unlikely that I am the only one who encounters these problems, which makes me wonder whether I am the only one trying to use the model via the Quick start section.

  1. First, I installed OpenNMT using luarocks install --local https://raw.githubusercontent.com/OpenNMT/OpenNMT/master/rocks/opennmt-scm-1.rockspec, but then I get the error OpenNMT not found. Please enter the path to OpenNMT. If I then enter the path where opennmt is installed according to luarocks list, it still can't find it. The same goes when I manually clone OpenNMT from GitHub. But if I uninstall opennmt, Im2Text doesn't give any error and everything seems to work! So is this part of the Quick start outdated?

  2. Next, I'm able to train the test-data model without errors, and the validation perplexity goes down. But when I try to run the model provided in the Quick start I get this error:

        /home/frederik/torch/install/bin/luajit: ./src/model.lua:55: attempt to get length of field 'idToVocab' (a nil value)
        stack traceback:
            ./src/model.lua:55: in function 'load'
            src/train.lua:234: in function 'main'
            src/train.lua:288: in main chunk
            [C]: in function 'dofile'
            ...erik/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
            [C]: at 0x00405d50

     So maybe this model was made for an older version of Im2Text and doesn't work anymore?

  3. I then try to run my own trained model, which seems to work but finishes in 1 second and only writes an empty results.txt:

        [07/31/17 14:45:56 INFO] Loading data state from /home/frederik/Documents/dev/experimental/Im2Text/model/model_190-data
        [07/31/17 14:45:56 INFO] Loaded
        [07/31/17 14:45:56 INFO] Running...
        [07/31/17 14:45:56 INFO] Results saved to /home/frederik/Documents/dev/experimental/Im2Text/results/results.txt.

     Looking into train.lua, I see that the problem is that the current epoch is extracted from trainData.epoch, while the maximum number of epochs is set to 1 for testing. So I set trainData.epoch = 1 when phase == test, which works for a couple of images (depending on how many steps I trained my model for), because it then tries to test the training data. When I additionally remove trainData:load(dataPath) in line 272 of train.lua, it does execute on data/test.txt.

  4. But the test speed is very slow. With a batch_size of 1 it takes 13(!) seconds to get the results for one image; 10 images take 2 min 23 sec. When I set the batch_size to 16 (it still only takes 2-4 images at once) it uses only marginally more time, at 15 seconds. The results look like this, but that's probably because of the few training steps:

        [07/31/17 14:56:41 INFO] 55358c150e.png { } { } { } { } { } { } { } { } ... (the tokens "{ }" repeated for the full output length)

     But is it supposed to be this slow? Evaluating the model during training takes 4 seconds for batch_size 10, so something does seem to be wrong, I'd say.

  5. Also, training takes an enormous amount of VRAM. I had to reduce the batch_size to 18 during training for it to fit into the 12 GB I have available. At test time it fares better, with the model "only" using 1 GB of VRAM at batch_size 1. Is the RNN really so memory-intensive? The CNN is quite small, and with Number of parameters: 9382588 (9M parameters) and the default settings, I'd say the whole network should not be too computationally expensive.

  6. Finally, should I be able to train a working model using the data and model provided in the Quick start? The step perplexity doesn't go below ~35, the validation perplexity doesn't go below 45.8, and the results stay the same as described above (exactly the same for every image).
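To make point 3 concrete: my workaround boils down to not letting the test loop terminate on the epoch counter carried over from the training checkpoint. A minimal Python sketch of the control flow (the names and the apply_fix flag are illustrative, not the actual train.lua API):

```python
# Illustrative only: the loop condition reused the training epoch counter in
# the test phase, so a checkpoint saved past max_epochs made testing exit
# immediately. Resetting the counter when phase == "test" avoids that.
def iterate(phase, data, checkpoint_epoch, max_epochs, step, apply_fix=True):
    # Buggy behavior: the test phase inherits the checkpoint's epoch counter,
    # so checkpoint_epoch > max_epochs means the loop body never runs.
    epoch = 1 if (apply_fix and phase == "test") else checkpoint_epoch
    out = []
    while epoch <= max_epochs:
        for example in data:
            out.append(step(example))
        epoch += 1
    return out

decode = lambda name: name + " -> decoded"

# With the reset, a checkpoint at epoch 190 still decodes the test set once:
results = iterate("test", ["a.png", "b.png"], checkpoint_epoch=190,
                  max_epochs=1, step=decode)
print(len(results))  # 2

# Without it, the loop never runs and results.txt stays empty:
buggy = iterate("test", ["a.png", "b.png"], checkpoint_epoch=190,
                max_epochs=1, step=decode, apply_fix=False)
print(len(buggy))  # 0
```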

Fred-Erik changed the title from "Is this actually maintained?" to "Many different problems" on Jul 31, 2017
da03 (Collaborator) commented Jul 31, 2017

Thanks for the detailed feedback, @Fred-Erik! I'm looking into these issues now.

da03 (Collaborator) commented Aug 9, 2017

Okay, most of the problems reported by @Fred-Erik are fixed now.
For 4, the reason testing is even slower than training is that the model is not well trained, so it cannot produce an END-OF-SEQUENCE token reliably; in that case beam search runs until max_num_tokens steps (500 by default) have been reached. For a well-trained model, beam search typically ends within a few dozen steps, which is much faster.
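To illustrate with a toy beam search (this is not our actual decoding code; the scoring functions are made up): decoding can stop early only once every hypothesis on the beam ends in the END token, so a model that never assigns EOS high probability decodes for the full max_num_tokens steps.

```python
# Toy beam search: an untrained model that never prefers EOS runs to
# max_steps; a model that reliably emits EOS stops after a few steps.
import math

EOS = "</s>"

def beam_search(step_logprobs, beam_size=2, max_steps=500):
    """step_logprobs(prefix) -> {token: logprob} (assumed interface)."""
    beams = [([], 0.0)]
    for t in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == EOS:          # finished hypothesis: keep as-is
                candidates.append((seq, score))
                continue
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_size]
        if all(seq and seq[-1] == EOS for seq, _ in beams):
            return beams[0][0], t + 1           # early stop: every beam ended
    return beams[0][0], max_steps

def trained(seq):      # strongly prefers EOS after 3 tokens
    if len(seq) >= 3:
        return {EOS: math.log(0.9), "x": math.log(0.1)}
    return {"x": math.log(0.9), EOS: math.log(0.1)}

def untrained(seq):    # EOS is never likely: output degenerates to "{ } { } ..."
    return {"{": math.log(0.5), "}": math.log(0.5)}

_, steps_trained = beam_search(trained, max_steps=50)
_, steps_untrained = beam_search(untrained, max_steps=50)
print(steps_trained, steps_untrained)  # 4 50
```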
For 5, that is normal behavior: for training we need to keep most decoder hidden states, including the attentions and contexts, which are the size of the image feature map. This was mentioned in our paper, as well as by Bluche, Théodore, Jérôme Louradour, and Ronaldo Messina. "Scan, attend and read: End-to-end handwritten paragraph recognition with MDLSTM attention." arXiv preprint arXiv:1604.03286 (2016).
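The scale of this is easy to check with back-of-envelope arithmetic: the weights themselves are tiny (9.4M parameters ≈ 36 MB in fp32); what dominates is keeping, for backprop, the encoder feature map attended over at every decoder step. With assumed shapes (the feature-map, step, and hidden sizes below are illustrative, not our actual defaults):

```python
# Rough activation-memory estimate; all shapes are assumptions for
# illustration, not the real Im2Text defaults.
bytes_per_float = 4
params = 9_382_588
param_mb = params * bytes_per_float / 2**20   # weight memory in MB

batch = 18                  # the batch size that just fit in 12 GB
steps = 150                 # decoder steps retained for backprop
feat_positions = 60 * 15    # assumed encoder feature-map positions (W' x H')
hidden = 512                # assumed hidden/context width

# Attention contexts kept alive for the backward pass:
context_gb = (batch * steps * feat_positions * hidden
              * bytes_per_float) / 2**30
print(round(param_mb), round(context_gb, 1))  # 36 (MB of weights), 4.6 (GB of contexts)
```

So even under conservative assumptions, the stored attention contexts outweigh the parameters by two orders of magnitude, which is why the batch size, not the parameter count, is what hits the 12 GB limit.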
For 6, the provided sample set is quite small, so you cannot expect a reasonable model from it. As mentioned in our paper (https://arxiv.org/pdf/1609.04938.pdf), at least 20K instances are required for decent performance.

Fred-Erik (Author) commented:

Thanks, everything is working now! I get about 2-3 results per second when evaluating the working model you uploaded.

Now I'm going to try to get it to work with release_model.lua, in the hope that I can deploy this model to an ARM platform. :) You didn't perchance make any progress with it already? #5

da03 (Collaborator) commented Aug 10, 2017

Not yet, since I'm using cudnn in the CNN part. I'll look into making it work with GPU (note that it would be much slower though).

Fred-Erik (Author) commented:

I guess you mean without GPU? Anyway, thank you very much! I'd be glad to hear when you've got something working. :)
