
CaptionIt

Fun little demonstration of encoder-decoder architectures. Try captioning your favorite image!

https://caption-it.projects.yadgaran.net/

How it works

This project uses SimpleML to automate modeling and persistence. Try it in your own projects!

Homepage: https://github.com/eyadgaran/SimpleML
Installation: pip install simpleml
Documentation: https://simpleml.readthedocs.io/en/latest

Modeling

Dataset

The first thing to remember when constructing a dataset is that it must mirror the data that will be available at prediction time. The COCO dataset is formatted as follows:

| | coco_url | date_captured | file_name | height | id | license | width | caption_count | y_0 | y_1 | y_2 | y_3 | y_4 | y_5 | y_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://images.cocodataset.org/val2017/00000039... | 2013-11-14 17:02:52 | 000000397133.jpg | 427 | 397133 | 4 | 640 | 5 | A man is in a kitchen making pizzas. | Man in apron standing on front of oven with pa... | A baker is working in the kitchen rolling dough. | A person standing by a stove in a kitchen. | A table with pies being made and a person stan... | NaN | NaN |
| 1 | http://images.cocodataset.org/val2017/00000003... | 2013-11-14 20:55:31 | 000000037777.jpg | 230 | 37777 | 1 | 352 | 5 | The dining table near the kitchen has a bowl o... | A small kitchen has various appliances and a t... | The kitchen is clean and ready for us to see. | A kitchen and dining area decorated in white. | A kitchen that has a bowl of fruit on the table. | NaN | NaN |
| 2 | http://images.cocodataset.org/val2017/00000025... | 2013-11-14 22:32:02 | 000000252219.jpg | 428 | 252219 | 4 | 640 | 5 | a person with a shopping cart on a city street | City dwellers walk by as a homeless man begs f... | People walking past a homeless man begging on ... | a homeless man holding a cup and standing next... | People are walking on the street by a homeless... | NaN | NaN |
| 3 | http://images.cocodataset.org/val2017/00000008... | 2013-11-14 23:11:37 | 000000087038.jpg | 480 | 87038 | 1 | 640 | 5 | A person on a skateboard and bike at a skate p... | A man on a skateboard performs a trick at the ... | A skateboarder jumps into the air as he perfor... | Athletes performing tricks on a BMX bicycle an... | a man falls off his skateboard in a skate park. | NaN | NaN |

Because of the memory and disk overhead, images are not loaded until they are needed, so a URL is interchangeable with an image file for this project.
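For reference, here is a minimal sketch of how a frame like this could be assembled from the raw COCO captions annotations using plain pandas. The annotation file path and the y_N column layout are assumptions for illustration, not the project's exact dataset code:

```python
import json

import pandas as pd

# Assumed local copy of the COCO captions annotation file (see cocodataset.org)
with open("annotations/captions_val2017.json") as f:
    coco = json.load(f)

images = pd.DataFrame(coco["images"])         # coco_url, date_captured, file_name, height, id, license, width
captions = pd.DataFrame(coco["annotations"])  # image_id, id, caption

# Pivot the captions out into y_0..y_N columns, one row per image
captions["rank"] = captions.groupby("image_id").cumcount()
wide = captions.pivot(index="image_id", columns="rank", values="caption")
wide.columns = [f"y_{i}" for i in wide.columns]
wide["caption_count"] = wide.notna().sum(axis=1)

dataset = images.merge(wide, left_on="id", right_index=True)
print(dataset.head())
```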

When predicting later, only the image file and the caption generated so far will be passed, and the output is expected to be a SINGLE next word, like this:

| | url | current_predicted_y | next_predicted_y |
|---|---|---|---|
| 0 | some_file.jpg | [Cool, Generated] | Caption |

Additionally, because the label is text, the model will actually be predicting tokens, so it will really look like this (the text processor handles the vocabulary and the transformation back into words):

| | url | current_predicted_y | next_predicted_y |
|---|---|---|---|
| 0 | some_file.jpg | [token_123, token_456] | token_789 |
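Prediction is therefore just a loop over single next-token predictions until an end token (or a length limit) is reached. A minimal sketch of that contract, where predict_next_token is a placeholder for whatever the persisted model exposes, and the token names and length limit are illustrative:

```python
START_TOKEN, END_TOKEN = "START_TOKEN", "END_TOKEN"
MAX_CAPTION_LENGTH = 20

def generate_caption(image_file, predict_next_token):
    """Greedily build a caption one token at a time from single next-token predictions."""
    tokens = [START_TOKEN]
    while len(tokens) < MAX_CAPTION_LENGTH:
        next_token = predict_next_token(image_file, tokens)  # (image, current tokens) -> next token
        if next_token == END_TOKEN:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token before mapping back to words
```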

Dataset Pipeline

In order to align these forms, we will use a dataset pipeline. (Dataset pipelines are only used to adjust the format of the data; transformations that also need to be applied at prediction time must go through a traditional pipeline.)

There are only a few transformers we will use in the dataset pipeline (a rough sketch in code follows the example output below):

  1. Drop unnecessary columns
  2. Stack dataset so there is only 1 caption per row (duplicate images across rows)
  3. Encode our caption into vectors using the text processor
  4. Duplicate an offset version of the caption for the recurrent input (previous tokens as features to predict next token)

The final output will look like this:

| | image | caption | y |
|---|---|---|---|
| 0 | http://images.cocodataset.org/val2017/00000039... | [START_TOKEN, token_643, token_9984, ...] | [token_643, token_9984, END_TOKEN, ...] |
| 0 | http://images.cocodataset.org/val2017/00000039... | [START_TOKEN, token_423, token_24, ...] | [token_423, token_24, END_TOKEN, ...] |
| ... | ... | ... | ... |
| 1 | http://images.cocodataset.org/val2017/00000003... | [START_TOKEN, token_94, token_70, ...] | [token_94, token_70, END_TOKEN, ...] |
| 1 | http://images.cocodataset.org/val2017/00000003... | [START_TOKEN, token_84, token_24, ...] | [token_84, token_24, END_TOKEN, ...] |
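A rough sketch of those four transformers in plain pandas. The column names come from the example tables above, and the text_processor is assumed to expose an encode method that wraps a caption in start/end tokens (a sketch of one appears in the Model section below); the real project registers these as SimpleML pipeline transformers:

```python
import pandas as pd

CAPTION_COLUMNS = ["y_0", "y_1", "y_2", "y_3", "y_4", "y_5", "y_6"]

def dataset_pipeline(df: pd.DataFrame, text_processor) -> pd.DataFrame:
    # 1. Drop unnecessary columns
    df = df[["coco_url"] + CAPTION_COLUMNS].rename(columns={"coco_url": "image"})

    # 2. Stack so there is only one caption per row (duplicating images across rows)
    df = df.melt(id_vars=["image"], value_vars=CAPTION_COLUMNS, value_name="text")
    df = df.dropna(subset=["text"]).drop(columns=["variable"])

    # 3. Encode each caption into a token sequence ([START_TOKEN, ..., END_TOKEN])
    tokens = df["text"].apply(text_processor.encode)

    # 4. Duplicate an offset copy: previous tokens are the features, next tokens the label
    df["caption"] = tokens.apply(lambda t: t[:-1])  # [START_TOKEN, token_1, token_2, ...]
    df["y"] = tokens.apply(lambda t: t[1:])         # [token_1, token_2, ..., END_TOKEN]
    return df[["image", "caption", "y"]]
```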

Pipeline

Now that we have a formatted dataset, we can configure the pipeline that will transform all images (future prediction-time images as well) into the proper model input; a rough sketch in code follows this list.

  1. Read image pixels into ndarray
  2. Crop image to a square (centered)
  3. Resize image to matrix dimensions
  4. Normalize image using the imagenet preprocessor
  5. Encode the image using a pretrained image model
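A minimal sketch of those five steps, assuming an InceptionV3 encoder from Keras; the actual project may use a different pretrained model, input size, or library:

```python
import numpy as np
from PIL import Image
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Pretrained encoder with the classification head removed; outputs a fixed-length embedding
encoder = InceptionV3(include_top=False, pooling="avg", weights="imagenet")

def encode_image(path, size=299):
    image = Image.open(path).convert("RGB")               # 1. read image pixels

    width, height = image.size                            # 2. center-crop to a square
    side = min(width, height)
    left, top = (width - side) // 2, (height - side) // 2
    image = image.crop((left, top, left + side, top + side))

    image = image.resize((size, size))                     # 3. resize to the model's input dimensions
    pixels = preprocess_input(np.asarray(image, dtype=np.float32))  # 4. imagenet normalization
    return encoder.predict(pixels[np.newaxis])[0]           # 5. encode with the pretrained image model
```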

Model

The encoder-decoder architecture is a common approach to sequence-to-sequence tasks, but there are actually a few models at play here that we glossed over.

  1. Text processor: this model fits over our dataset and learns the vocabulary accessible to our decoder (a minimal sketch follows this list)

  2. Encoder: typically this model would have to learn an embedding of our images that we can later map to our tokens. Thankfully, we can use an existing model, or easily tune one for our purposes with transfer learning (the pretrained objective being classification accuracy on an unrelated image dataset)

  3. Decoder: this is the main model that we work with here and what maps our image encodings to our text encodings. We condition our tokens (captions) on the image embedding and learn the generalized relationship for new captions
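As an illustration of the first of these, the text processor can be as simple as a fitted tokenizer. Here is a sketch assuming Keras's Tokenizer; the wrapper class, token strings, and vocabulary size are illustrative, with SimpleML handling persistence of the real thing:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

class TextProcessor:
    """Learns the vocabulary and maps captions to and from token ids."""

    def __init__(self, num_words=10000):
        self.tokenizer = Tokenizer(num_words=num_words, oov_token="<unk>")

    def fit(self, captions):
        self.tokenizer.fit_on_texts([f"startseq {c} endseq" for c in captions])

    def encode(self, caption):
        # Wrap in start/end tokens so the decoder learns where captions begin and end
        return self.tokenizer.texts_to_sequences([f"startseq {caption} endseq"])[0]

    def decode(self, token_ids):
        words = [self.tokenizer.index_word.get(i, "<unk>") for i in token_ids]
        return " ".join(w for w in words if w not in ("startseq", "endseq"))
```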

The Decoder

There are a number of subtleties in the decoder architecture that are elegantly abstracted away by the frameworks used (TensorFlow, Keras, SimpleML).

  1. Recurrent Architecture
  • When you think about it, every training input can be utilized in two different ways (refer back to the dataset input). In an example where there are 20 tokens in the expected prediction, we can:

    • Treat every image as one sample with the objective to produce the ENTIRE expected sequence (1x20 element output array)
    • Use every image as 20 independent samples without the double jeopardy of any one of them being incorrect (20x1 element output array)
  • The second technique is called teacher forcing and is what we use here. It lets the model learn quickly by taking advantage of the augmented dataset size. There are reasons not to use teacher forcing, chiefly robustness when early predictions in a sequence are off target (without teacher forcing, the model feeds its own predictions back in during training and learns to correct itself, instead of always being corrected by "the teacher").

  • Keras and TensorFlow handle this elegantly under the hood via the TimeDistributed and dynamic_rnn layers. The input remains the same (image, [offset tokens]: [expected tokens]) and the loss is computed in accordance with the teacher forcing methodology; a sketch of this setup appears at the end of this section.

  2. Inference vs Training
  • The consequence of this training methodology is that the network structure and input will be different at inference time.
  • SimpleML manages this by allowing us to define the shared layers in each network; it automatically transfers the weights when generating predictions.
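To make both points concrete, here is a hedged Keras sketch of the general pattern: the layers are defined once, the training network consumes the full teacher-forced sequence with a TimeDistributed output, and a structurally different inference network reuses the same layer objects to predict one token at a time. Layer sizes and wiring are illustrative, not the project's exact architecture:

```python
from tensorflow.keras.layers import (Concatenate, Dense, Embedding, Input, LSTM,
                                     RepeatVector, TimeDistributed)
from tensorflow.keras.models import Model

VOCAB_SIZE, EMBED_DIM, UNITS, IMAGE_DIM, MAX_LEN = 10000, 256, 512, 2048, 20

# Shared layer objects: defining each layer once means both networks use the same weights
image_proj = Dense(EMBED_DIM, activation="relu")
embed = Embedding(VOCAB_SIZE, EMBED_DIM)
lstm = LSTM(UNITS, return_sequences=True, return_state=True)
classify = TimeDistributed(Dense(VOCAB_SIZE, activation="softmax"))

# Training network (teacher forcing): the offset token sequence goes in, and every
# timestep is scored against the expected next token in a single pass
image_in = Input(shape=(IMAGE_DIM,), name="image_embedding")
tokens_in = Input(shape=(MAX_LEN,), name="offset_tokens")
decoder_in = Concatenate()([RepeatVector(MAX_LEN)(image_proj(image_in)), embed(tokens_in)])
hidden, _, _ = lstm(decoder_in)
training_model = Model([image_in, tokens_in], classify(hidden))
training_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Inference network: structurally different (one token in, recurrent state carried
# explicitly between calls), but it reuses every trained layer object above
token_in = Input(shape=(1,), name="previous_token")
h_in, c_in = Input(shape=(UNITS,)), Input(shape=(UNITS,))
step_in = Concatenate()([RepeatVector(1)(image_proj(image_in)), embed(token_in)])
step, h_out, c_out = lstm(step_in, initial_state=[h_in, c_in])
inference_model = Model([image_in, token_in, h_in, c_in], [classify(step), h_out, c_out])
```

At generation time, a greedy loop (like the one sketched in the Dataset section) would start from zero recurrent states, feed the start token, and then feed each predicted token and returned state back into inference_model until the end token is produced.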

Evaluation

References

http://cocodataset.org/#home
https://ai.googleblog.com/2014/11/a-picture-is-worth-thousand-coherent.html
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/ImageCaptionInWild-1.pdf
https://cs.stanford.edu/people/karpathy/
