This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

nan cost #37

Open
basant-kumar opened this issue Jun 26, 2017 · 8 comments

@basant-kumar

Hi, I'm getting a NaN cost after resuming training from the pre-trained model (librispeech_16_epochs.prm).
The cost becomes NaN after epoch 16/17, and the test results (reported after each epoch) are null.

OS: Ubuntu 16.04
GPU: Nvidia Titan-X Pascal (12GB RAM)
Neon: version 1.9.0

@tyler-nervana tyler-nervana self-assigned this Jul 10, 2017
@tyler-nervana
Contributor

Could you share a few more details about your setup? We haven't seen this behavior. What command are you running to continue training? Which dataset are you using? Does your data differ in any way from the librispeech dataset?

@gardenia22

gardenia22 commented Jul 24, 2017

I am getting the same problem. My audio data is in wav format rather than flac. Could that be the cause?
Here is my command:
python train.py --manifest train:data/train_1700hour.csv --manifest val:data/dev_1700hour.csv -e 20 -z 12 -s model/ds2_1700hour_20_epochs.prm --model_file model/librispeech_16_epochs.prm

@gardenia22

My transcript files contain '\n' characters, which is what causes the NaN cost problem.

@tyler-nervana
Contributor

Thanks for the quick update. Currently anything in the transcript files is treated as a character, including "\n".
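
Since every character in a transcript file, including "\n", is treated as a label, one possible workaround is to normalize the transcripts before training. The following is only a rough sketch: it assumes each transcript is a plain UTF-8 text file holding a single utterance, and the helper names (clean_transcript, clean_transcript_file) are made up for illustration and are not part of neon or this repository. Whether an embedded newline should become a space or be dropped entirely depends on your data.

```python
from pathlib import Path


def clean_transcript(text):
    """Replace newlines and runs of whitespace with single spaces, and trim the ends."""
    # str.split() with no arguments splits on any whitespace, including "\n" and "\r".
    return " ".join(text.split())


def clean_transcript_file(path):
    """Rewrite one transcript file in place. Returns True if the file was changed."""
    path = Path(path)
    original = path.read_text(encoding="utf-8")
    cleaned = clean_transcript(original)
    if cleaned != original:
        path.write_text(cleaned, encoding="utf-8")
        return True
    return False
```

Running clean_transcript_file over every transcript path listed in the manifest, ideally on a copy of the data first, should leave files with no newline characters in them.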

@pankaj2701

I also get the same problem when .wav files are used. After I converted the files to flac, the NaN problem no longer appeared.
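
For anyone who wants to try the same wav-to-flac workaround, here is a minimal sketch of a batch conversion. It assumes ffmpeg is installed and on the PATH, which is my assumption and not something stated in this thread (any other converter would do), and the function names are hypothetical. The manifest files would also need to be updated to point at the new .flac paths.

```python
import subprocess
from pathlib import Path


def flac_path_for(wav_path):
    """Return the output path: same directory and name, with a .flac extension."""
    return Path(wav_path).with_suffix(".flac")


def build_convert_command(wav_path):
    """Build the ffmpeg argument list that converts one wav file to flac."""
    wav_path = Path(wav_path)
    return ["ffmpeg", "-i", str(wav_path), str(flac_path_for(wav_path))]


def convert_directory(root, dry_run=True):
    """Collect conversion commands for every .wav under root; run them unless dry_run."""
    commands = [build_convert_command(p) for p in sorted(Path(root).rglob("*.wav"))]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if ffmpeg exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

With dry_run=True (the default) nothing is executed, so the returned command list can be inspected before any files are written.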

@tyler-nervana
Contributor

Thanks for noticing the difficulty with .wav files. We'll take a look.

@Drea1989

Hello, I'm posting here because I ran into a NaN cost problem as well.
I am using Neon 2.0 with Python 2.7 on Ubuntu 16.04, with a GTX 1080 as the backend.

In my case I am training on LibriSpeech train-other-500, and the cost becomes NaN after 50-60% of the epoch.
I have tried training the model using only the other LibriSpeech subsets, and it trains as expected.
Any thoughts on this?

@Drea1989

Drea1989 commented Nov 8, 2017

I was able to fix the issue by lowering the learning rate by two orders of magnitude. The problem was apparently an infinite cost, caused by a prediction that was too certain of a very wrong value.
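
To illustrate that failure mode, here is a small pure-Python sketch. It is deliberately simplified: it uses a plain softmax cross-entropy over two classes, not the CTC cost the model actually trains with, and the function names are my own. It only shows how an overconfident wrong prediction can turn into an infinite cost, and how an infinite value then turns into NaN in later arithmetic.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class; inf if that probability is 0."""
    p = softmax(logits)[target]
    # math.log(0.0) raises ValueError, so the infinite case is returned explicitly.
    return math.inf if p == 0.0 else -math.log(p)


# Moderately wrong prediction: the cost is large but finite.
mild = cross_entropy([2.0, 0.0], target=1)

# Extremely confident and wrong: exp(-1000) underflows to 0.0, so the cost is inf.
extreme = cross_entropy([1000.0, 0.0], target=1)

# Once an inf enters the arithmetic, ordinary operations can produce NaN.
poisoned = extreme - extreme
```

A smaller learning rate makes each weight update smaller, so the logits are less likely to reach such extreme values, which fits the fix described above.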
