Fun little demonstration of encoder-decoder architectures. Try captioning your favorite image!
https://caption-it.projects.yadgaran.net/
This project uses SimpleML to automate modeling and persistence. Try it in your own projects!
Homepage: https://github.com/eyadgaran/SimpleML
Installation: pip install simpleml
Documentation: https://simpleml.readthedocs.io/en/latest
The first thing to remember when manipulating datasets is that the training data must mirror what will be available at prediction time. The COCO dataset is formatted as follows:
index | coco_url | date_captured | file_name | height | id | license | width | caption_count | y_0 | y_1 | y_2 | y_3 | y_4 | y_5 | y_6
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | http://images.cocodataset.org/val2017/00000039... | 2013-11-14 17:02:52 | 000000397133.jpg | 427 | 397133 | 4 | 640 | 5 | A man is in a kitchen making pizzas. | Man in apron standing on front of oven with pa... | A baker is working in the kitchen rolling dough. | A person standing by a stove in a kitchen. | A table with pies being made and a person stan... | NaN | NaN |
1 | http://images.cocodataset.org/val2017/00000003... | 2013-11-14 20:55:31 | 000000037777.jpg | 230 | 37777 | 1 | 352 | 5 | The dining table near the kitchen has a bowl o... | A small kitchen has various appliances and a t... | The kitchen is clean and ready for us to see. | A kitchen and dining area decorated in white. | A kitchen that has a bowl of fruit on the table. | NaN | NaN |
2 | http://images.cocodataset.org/val2017/00000025... | 2013-11-14 22:32:02 | 000000252219.jpg | 428 | 252219 | 4 | 640 | 5 | a person with a shopping cart on a city street | City dwellers walk by as a homeless man begs f... | People walking past a homeless man begging on ... | a homeless man holding a cup and standing next... | People are walking on the street by a homeless... | NaN | NaN |
3 | http://images.cocodataset.org/val2017/00000008... | 2013-11-14 23:11:37 | 000000087038.jpg | 480 | 87038 | 1 | 640 | 5 | A person on a skateboard and bike at a skate p... | A man on a skateboard performs a trick at the ... | A skateboarder jumps into the air as he perfor... | Athletes performing tricks on a BMX bicycle an... | a man falls off his skateboard in a skate park. | NaN | NaN |
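If you want to rebuild a frame like this yourself, here is a rough sketch from the raw COCO captions annotations. The annotation file name and the pivot logic are my assumptions about how the frame above was produced, not the project's exact code:

```python
import json
import pandas as pd

# Build an image-per-row frame with one caption column per annotation,
# matching the layout above (captions_val2017.json comes from cocodataset.org)
with open('annotations/captions_val2017.json') as f:
    coco = json.load(f)

images = pd.DataFrame(coco['images'])         # coco_url, file_name, height, width, id, ...
captions = pd.DataFrame(coco['annotations'])  # image_id, caption

# number each caption within its image, then pivot to y_0 ... y_N columns
captions['rank'] = captions.groupby('image_id').cumcount()
wide = captions.pivot(index='image_id', columns='rank', values='caption')
wide.columns = [f'y_{i}' for i in wide.columns]
wide['caption_count'] = wide.notna().sum(axis=1)

df = images.merge(wide, left_on='id', right_index=True)
```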
Due to the memory and disk overhead, I do not load the images until they are needed, so a URL is interchangeable with an image file for this project.
When predicting later, only the image file and the caption generated so far will be passed in, and the output is expected to be a SINGLE next word, like this:
index | url | current_predicted_y | next_predicted_y
---|---|---|---
0 | some_file.jpg | [Cool, Generated] | Caption |
Additionally, because the label is text, the model will actually predict tokens, so it will really look like this (our text processor handles the vocabulary and the transformation back to words):
index | url | current_predicted_y | next_predicted_y
---|---|---|---
0 | some_file.jpg | [token_123, token_456] | token_789 |
In order to align these forms, we will use a dataset pipeline. (Dataset pipelines are only used to adjust the format of our data; transformations that we want to apply at runtime must go through a traditional pipeline.)
There are only a few transformers we will use in the dataset pipeline (a rough code sketch follows the example output below):
- Drop unnecessary columns
- Stack dataset so there is only 1 caption per row (duplicate images across rows)
- Encode our caption into vectors using the text processor
- Duplicate an offset version of the caption for the recurrent input (previous tokens as features to predict next token)
The final output will look like this:
index | image | caption | y
---|---|---|---
0 | http://images.cocodataset.org/val2017/00000039... | [START_TOKEN, token_643, token_9984, ...] | [token_643, token_9984, END_TOKEN, ...] |
0 | http://images.cocodataset.org/val2017/00000039... | [START_TOKEN, token_423, token_24, ...] | [token_423, token_24, END_TOKEN, ...] |
... | ... | ... | ...
1 | http://images.cocodataset.org/val2017/00000003... | [START_TOKEN, token_94, token_70, ...] | [token_94, token_70, END_TOKEN, ...] |
1 | http://images.cocodataset.org/val2017/00000003... | [START_TOKEN, token_84, token_24, ...] | [token_84, token_24, END_TOKEN, ...] |
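For concreteness, here is a rough sketch of those dataset-pipeline steps using pandas and the Keras Tokenizer. The column names mirror the COCO frame above, while the `startseq`/`endseq` markers and the tokenizer itself are illustrative stand-ins for the project's text processor (which is assumed to have been fit on the caption text already; see the text processor sketch further down):

```python
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

CAPTION_COLUMNS = ['y_0', 'y_1', 'y_2', 'y_3', 'y_4', 'y_5', 'y_6']

def format_dataset(df: pd.DataFrame, tokenizer: Tokenizer) -> pd.DataFrame:
    """Mirror the dataset pipeline: drop columns, stack captions, encode, offset."""
    # 1) Drop unnecessary columns, keeping only the image reference and the captions
    df = df[['coco_url'] + CAPTION_COLUMNS].rename(columns={'coco_url': 'image'})

    # 2) Stack so there is only one caption per row (images duplicated across rows)
    stacked = df.melt(id_vars='image', value_vars=CAPTION_COLUMNS, value_name='text')
    stacked = stacked.dropna(subset=['text']).drop(columns='variable')

    # 3) Encode captions into token sequences, bracketed by start/end markers
    #    ('startseq'/'endseq' stand in for START_TOKEN/END_TOKEN above)
    marked = ('startseq ' + stacked['text'] + ' endseq').tolist()
    stacked['tokens'] = tokenizer.texts_to_sequences(marked)

    # 4) Duplicate an offset version: previous tokens are the recurrent input,
    #    next tokens are the label
    stacked['caption'] = stacked['tokens'].apply(lambda seq: seq[:-1])
    stacked['y'] = stacked['tokens'].apply(lambda seq: seq[1:])
    return stacked[['image', 'caption', 'y']]
```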
Now that we have our formatted dataset, we can configure the pipeline that will transform all images (including future ones) into the proper model input; a code sketch follows the list:
- Read image pixels into ndarray
- Crop image to a square (centered)
- Resize image to matrix dimensions
- Normalize image using the imagenet preprocessor
- Encode the image using a pretrained image model
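A rough sketch of that image pipeline using Keras utilities and a pretrained InceptionV3 encoder. The specific pretrained model is my assumption; any ImageNet classifier with its top removed plays the same role, and this sketch assumes a local file path rather than a URL:

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image

# Pretrained encoder: ImageNet weights, classification head removed,
# average pooling so each image becomes a single embedding vector
encoder = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

def encode_image(path: str, target_size: int = 299) -> np.ndarray:
    # 1) Read image pixels into an ndarray
    pixels = keras_image.img_to_array(keras_image.load_img(path))

    # 2) Crop the image to a centered square
    h, w, _ = pixels.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    pixels = pixels[top:top + side, left:left + side]

    # 3) Resize to the encoder's expected input dimensions
    pixels = np.array(keras_image.array_to_img(pixels).resize((target_size, target_size)))

    # 4) Normalize with the ImageNet preprocessor
    pixels = preprocess_input(pixels.astype('float32'))

    # 5) Encode with the pretrained model (batch of 1 -> single embedding vector)
    return encoder.predict(pixels[np.newaxis, ...])[0]
```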
The encoder-decoder architecture addresses a sequence-to-sequence task, but there are actually a few models at play here that we glossed over:
- Text processor: this model fits over our dataset and learns the vocabulary accessible to our decoder (a small sketch follows this list).
- Encoder: typically this model would have to learn an embedding of our images that we can later map to our tokens. Thankfully we can use an existing model or easily tune one for our purposes with transfer learning (the objective would be classification accuracy on an unrelated image dataset).
- Decoder: this is the main model that we work with here and the one that maps our image encodings to our text encodings. We condition our tokens (captions) on the image embedding and learn the generalized relationship for new captions.
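A minimal sketch of the text processor's role, using the Keras Tokenizer as an illustrative stand-in (the `oov_token` and the `startseq`/`endseq` marker words are assumptions, not the project's exact configuration):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit the vocabulary over every training caption, including the start/end
# markers, so the decoder can emit them as ordinary tokens
text_processor = Tokenizer(oov_token='unk')

def fit_text_processor(captions):
    """captions: iterable of raw caption strings from the stacked dataset."""
    text_processor.fit_on_texts('startseq ' + c + ' endseq' for c in captions)
    return text_processor

# word -> token id and token id -> word, used to turn predictions back into text
# e.g. text_processor.word_index['startseq'], text_processor.index_word[42]
```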
There are a number of subtleties in the decoder architecture that are elegantly abstracted away by the frameworks used (TensorFlow, Keras, SimpleML):
- Recurrent Architecture
  - When you think about it, every training input can be utilized in two different ways (refer back to the dataset input). In an example where there are 20 tokens in the expected prediction, we can:
    - Treat every image as one sample with the objective to produce the ENTIRE expected sequence (1x20 element output array)
    - Use every image as 20 independent samples without the double jeopardy of any one of them being incorrect (20x1 element output array)
  - The second technique is called teacher forcing and is what we use here. It lets our model learn quickly by taking advantage of the augmented dataset size. There are reasons not to use teacher forcing, namely robustness in scenarios where early sequence predictions are off target (without teacher forcing, the model feeds its own predictions back in during training and learns to correct itself, instead of always being corrected by "the teacher").
  - Keras and TensorFlow elegantly do this under the hood via the TimeDistributed wrapper and dynamic_rnn. The input remains the same (image, [offset tokens] -> [expected tokens]) and the loss is computed in accordance with the teacher forcing methodology.
- Inference vs Training
  - The consequence of this training methodology is that the network structure and input will be different at inference time.
  - SimpleML manages this by allowing us to define the shared layers in each network; it automatically transfers the weights when generating predictions (both setups are sketched below).
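To make the teacher-forcing setup concrete, here is a rough Keras sketch of the training-time decoder. The layer sizes and the way the image embedding conditions the recurrent state are my assumptions rather than the project's exact architecture; the important pieces are `return_sequences=True` and the TimeDistributed softmax, which score every next-token position in a single pass:

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000      # from the text processor
EMBED_DIM = 256
UNITS = 512
IMAGE_EMBED_DIM = 2048  # output size of the pretrained image encoder (InceptionV3, avg pooling)

image_embedding = layers.Input(shape=(IMAGE_EMBED_DIM,), name='image')
caption_tokens = layers.Input(shape=(None,), name='caption')  # [START, t1, t2, ...]

# project the image embedding into the recurrent state space (the conditioning step)
initial_state = layers.Dense(UNITS, activation='relu')(image_embedding)

token_vectors = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_tokens)

# return_sequences=True -> one hidden state per input token (teacher forcing)
decoder_states = layers.LSTM(UNITS, return_sequences=True)(
    token_vectors, initial_state=[initial_state, initial_state])

# TimeDistributed applies the same softmax at every timestep
next_token_probs = layers.TimeDistributed(
    layers.Dense(VOCAB_SIZE, activation='softmax'))(decoder_states)

training_model = Model([image_embedding, caption_tokens], next_token_probs)
training_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```

And a simplified sketch of the inference-time loop: tokens are generated one at a time by feeding the model's own output back in (greedy decoding). For brevity this reruns the training-time model on the growing prefix instead of building the separate inference network that SimpleML's weight transfer enables; `start_id`/`end_id` come from the text processor and are illustrative names:

```python
import numpy as np

def generate_caption(image_vector, model, start_id, end_id, max_len=20):
    tokens = [start_id]
    for _ in range(max_len):
        # predict a distribution over the next token at every position;
        # only the last position matters when decoding greedily
        probs = model.predict([image_vector[np.newaxis, :], np.array([tokens])])
        next_id = int(np.argmax(probs[0, -1]))
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token; the text processor maps ids back to words
```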
http://cocodataset.org/#home
https://ai.googleblog.com/2014/11/a-picture-is-worth-thousand-coherent.html
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/ImageCaptionInWild-1.pdf
https://cs.stanford.edu/people/karpathy/