The notebook can be run top to bottom to train the models, display example outputs inline in Jupyter, and generate metrics.
The neural network architecture comprises a listener feature embedder and a speller. The listener feature embedder processes input of shape [128, 1621, 27] (batch, time frames, features), and the speller generates output of shape [128, 328] (batch, output length). The architecture contains 3954 layers in total.
The listener feature embedder consists of convolutional layers, batch normalization, and activation functions; it takes input of shape [128, 1621, 27] and produces output of shape [128, 256, 1621].
The speller generates the final output by processing the output from the listener feature embedder. It consists of multiple layers, including linear layers and activation functions.
Model summary (selected layers and totals):
- Conv1d embedding layer (listener feature embedder):
  - Kernel Shape: [27, 256, 5]
  - Output Shape: [128, 256, 1621]
  - Parameters: 34.816k
  - Mult-Adds: 56.02176M
- Character-probability Linear layer (speller):
  - Kernel Shape: [512, 31]
  - Output Shape: [128, 31]
  - Parameters: 15.903k
  - Mult-Adds: 0.51118M
- Totals:
  - Trainable parameters: 28.479775M
  - Non-trainable parameters: 0
  - Mult-Adds: 3.36951424G
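For reference, a minimal sketch of a convolutional embedding consistent with the summary above (layer names, padding, and the GELU activation are assumptions, not the repository's exact code; the parameter count 27 * 256 * 5 + 256 = 34,816 matches the listed value):

```python
import torch
import torch.nn as nn

class ListenerEmbedding(nn.Module):
    """Sketch of the convolutional feature embedder: Conv1d -> BatchNorm -> activation."""
    def __init__(self, input_dim=27, embed_dim=256, kernel_size=5):
        super().__init__()
        # Conv1d(27, 256, kernel_size=5) with 'same' padding keeps the time length at 1621.
        self.conv = nn.Conv1d(input_dim, embed_dim, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(embed_dim)
        self.act = nn.GELU()  # assumed activation; the summary only states "activation functions"

    def forward(self, x):
        # x: [batch, time, features] = [128, 1621, 27]
        x = x.transpose(1, 2)                # -> [128, 27, 1621]
        return self.act(self.bn(self.conv(x)))  # -> [128, 256, 1621]

# Quick shape check matching the summary above.
out = ListenerEmbedding()(torch.randn(128, 1621, 27))
print(out.shape)  # torch.Size([128, 256, 1621])
```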
The pBLSTM architecture consists of a bidirectional LSTM layer followed by a truncation and reshape operation. It takes a packed sequence as input and produces a packed sequence as output.
The bidirectional LSTM uses two stacked layers and operates in batch-first mode.
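A minimal sketch of such a pBLSTM block, following the order described above (BLSTM, then truncate to an even number of frames and reshape to halve the time axis); the dimension names and the example at the bottom are assumptions, not the repository's exact code:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class pBLSTM(nn.Module):
    """Pyramidal BLSTM block: packed sequence in, packed sequence out."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Two-layer bidirectional LSTM, batch-first, as described above.
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                             bidirectional=True, batch_first=True)

    def forward(self, packed):
        out, _ = self.blstm(packed)                       # packed in, packed out
        x, lens = pad_packed_sequence(out, batch_first=True)
        # Truncate to an even number of frames, then merge every pair of adjacent
        # frames into one, halving the time axis and doubling the feature dim.
        T = (x.size(1) // 2) * 2
        x = x[:, :T, :].reshape(x.size(0), T // 2, 2 * x.size(2))
        lens = torch.clamp(lens // 2, min=1)
        return pack_padded_sequence(x, lens, batch_first=True, enforce_sorted=False)

# Usage example with arbitrary dimensions.
seqs = torch.randn(4, 100, 64)
lengths = torch.tensor([100, 80, 57, 33])
packed = pack_padded_sequence(seqs, lengths, batch_first=True, enforce_sorted=False)
downsampled = pBLSTM(input_dim=64, hidden_dim=128)(packed)
```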
Trained for close to 200 epochs in total (a minimal sketch of the stepped teacher-forcing schedule follows the list):
- 35 epochs without a scheduler or teacher forcing
- 15 epochs with teacher forcing, starting from 1.0 and reduced by 0.05 every 2 epochs
- 5 epochs with teacher forcing reduced by 0.025 every 2 epochs
- A further 5 epochs with teacher forcing reduced by 0.025 every 2 epochs
- 20 epochs without teacher forcing
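A stepped schedule like the one above can be computed as follows; the function name, the floor value, and the phase boundaries are assumptions for illustration:

```python
def teacher_forcing_rate(epoch_in_phase, start=1.0, step=0.05, every=2, floor=0.0):
    """Reduce the teacher forcing rate by `step` every `every` epochs within a phase."""
    return max(floor, start - step * (epoch_in_phase // every))

# Example: a 15-epoch phase starting at 1.0, reduced by 0.05 every 2 epochs.
rates = [teacher_forcing_rate(e) for e in range(15)]
```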
- Optimizer: AdamW with learning rate=1e-3
- Loss Function: Cross Entropy Loss
- Teacher Forcing Rate: Varies across epochs, starting from 1 and reducing by 0.05 every 10 epochs
- Batch Size: 128
- Locked Dropout in the pBLSTM encoder
- Weight Tying for the character embedding layer and the character probability layer in the decoder (see the sketch after this list)
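A minimal sketch of these two techniques, consistent with the [512, 31] output layer in the summary above; the class names, dropout probability, and dimensions are assumptions, not the repository's exact code:

```python
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Locked (variational) dropout: one dropout mask shared across all time steps."""
    def __init__(self, p=0.3):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: [batch, time, features]
        if not self.training or self.p == 0.0:
            return x
        # Sample the mask once per sequence and broadcast it over the time axis.
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p) / (1 - self.p)
        return x * mask

class SpellerHead(nn.Module):
    """Weight tying: the character embedding and the character-probability
    projection share one weight matrix."""
    def __init__(self, vocab_size=31, embed_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # weight: [31, 512]
        self.char_prob = nn.Linear(embed_dim, vocab_size)      # weight: [31, 512]
        self.char_prob.weight = self.embedding.weight          # tie the two matrices
```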
Time and Frequency Masking transforms were applied as data augmentation to increase the variability of the training data.
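One common way to apply these SpecAugment-style transforms is via torchaudio; a minimal sketch, where the mask widths are placeholder values rather than the ones used in training:

```python
import torch
import torchaudio.transforms as T

# Placeholder mask widths; the values actually used for training are not stated above.
freq_mask = T.FrequencyMasking(freq_mask_param=6)
time_mask = T.TimeMasking(time_mask_param=40)

def augment(features: torch.Tensor) -> torch.Tensor:
    """Apply frequency and time masking to one utterance of shape [time, feat_dim]."""
    x = features.transpose(0, 1).unsqueeze(0)    # -> [1, feat_dim, time], as the transforms expect
    x = time_mask(freq_mask(x))
    return x.squeeze(0).transpose(0, 1)          # back to [time, feat_dim]
```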
The model was initially tested on a Toy Dataset to ensure proper functionality.