Fun little demonstration of encoder-decoder architectures. Try captioning your favorite image!
https://caption-it.projects.yadgaran.net/
This project uses SimpleML to automate modeling and persistence. Try it in your own projects!
Homepage: https://github.com/eyadgaran/SimpleML
Installation: pip install simpleml
Documentation: https://simpleml.readthedocs.io/en/latest
The first thing to remember when manipulating datasets is that the training data must mirror what will be available at prediction time. The COCO dataset is formatted as follows:
index | coco_url | date_captured | file_name | height | id | license | width | caption_count | y_0 | y_1 | y_2 | y_3 | y_4 | y_5 | y_6
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | http://images.cocodataset.org/val2017/00000039... | 2013-11-14 17:02:52 | 000000397133.jpg | 427 | 397133 | 4 | 640 | 5 | A man is in a kitchen making pizzas. | Man in apron standing on front of oven with pa... | A baker is working in the kitchen rolling dough. | A person standing by a stove in a kitchen. | A table with pies being made and a person stan... | NaN | NaN |
1 | http://images.cocodataset.org/val2017/00000003... | 2013-11-14 20:55:31 | 000000037777.jpg | 230 | 37777 | 1 | 352 | 5 | The dining table near the kitchen has a bowl o... | A small kitchen has various appliances and a t... | The kitchen is clean and ready for us to see. | A kitchen and dining area decorated in white. | A kitchen that has a bowl of fruit on the table. | NaN | NaN |
2 | http://images.cocodataset.org/val2017/00000025... | 2013-11-14 22:32:02 | 000000252219.jpg | 428 | 252219 | 4 | 640 | 5 | a person with a shopping cart on a city street | City dwellers walk by as a homeless man begs f... | People walking past a homeless man begging on ... | a homeless man holding a cup and standing next... | People are walking on the street by a homeless... | NaN | NaN |
3 | http://images.cocodataset.org/val2017/00000008... | 2013-11-14 23:11:37 | 000000087038.jpg | 480 | 87038 | 1 | 640 | 5 | A person on a skateboard and bike at a skate p... | A man on a skateboard performs a trick at the ... | A skateboarder jumps into the air as he perfor... | Athletes performing tricks on a BMX bicycle an... | a man falls off his skateboard in a skate park. | NaN | NaN |
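If you want to rebuild a frame like this yourself, here is a rough sketch from the raw COCO captions annotations. The annotation file name and the pivot logic are my assumptions about how the frame above was produced, not the project's exact code:

```python
import json
import pandas as pd

# Build an image-per-row frame with one caption column per annotation,
# matching the layout above (captions_val2017.json comes from cocodataset.org)
with open('annotations/captions_val2017.json') as f:
    coco = json.load(f)

images = pd.DataFrame(coco['images'])         # coco_url, file_name, height, width, id, ...
captions = pd.DataFrame(coco['annotations'])  # image_id, caption

# number each caption within its image, then pivot to y_0 ... y_N columns
captions['rank'] = captions.groupby('image_id').cumcount()
wide = captions.pivot(index='image_id', columns='rank', values='caption')
wide.columns = [f'y_{i}' for i in wide.columns]
wide['caption_count'] = wide.notna().sum(axis=1)

df = images.merge(wide, left_on='id', right_index=True)
```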
Due to the memory and disk overhead, I do not load the images until they are needed, so a URL is interchangeable with an image file for this project.
When predicting later, only the image file and the caption generated so far will be passed in, and the output is expected to be a SINGLE next word, like this:
index | url | current_predicted_y | next_predicted_y
---|---|---|---
0 | some_file.jpg | [Cool, Generated] | Caption |
Additionally, because the label is text, the model will actually predict tokens, so it will really look like this (our text processor handles the vocabulary and the transformation back to words):
index | url | current_predicted_y | next_predicted_y
---|---|---|---
0 | some_file.jpg | [token_123, token_456] | token_789 |
In order to align these forms, we will use a dataset pipeline. (Dataset pipelines are only used to adjust the format of our data; transformations that we want to apply at runtime must go through a traditional pipeline.)
There are only a few transformers we will use in the dataset pipeline (a rough code sketch follows the example output below):
- Drop unnecessary columns
- Stack dataset so there is only 1 caption per row (duplicate images across rows)
- Encode our caption into vectors using the text processor
- Duplicate an offset version of the caption for the recurrent input (previous tokens as features to predict next token)
The final output will look like this:
index | image | caption | y
---|---|---|---
0 | http://images.cocodataset.org/val2017/00000039... | [START_TOKEN, token_643, token_9984, ...] | [token_643, token_9984, END_TOKEN, ...] |
0 | http://images.cocodataset.org/val2017/00000039... | [START_TOKEN, token_423, token_24, ...] | [token_423, token_24, END_TOKEN, ...] |
... | ... | ... | ...
1 | http://images.cocodataset.org/val2017/00000003... | [START_TOKEN, token_94, token_70, ...] | [token_94, token_70, END_TOKEN, ...] |
1 | http://images.cocodataset.org/val2017/00000003... | [START_TOKEN, token_84, token_24, ...] | [token_84, token_24, END_TOKEN, ...] |
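For concreteness, here is a rough sketch of those dataset-pipeline steps using pandas and the Keras Tokenizer. The column names mirror the COCO frame above, while the `startseq`/`endseq` markers and the tokenizer itself are illustrative stand-ins for the project's text processor (which is assumed to have been fit on the caption text already; see the text processor sketch further down):

```python
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

CAPTION_COLUMNS = ['y_0', 'y_1', 'y_2', 'y_3', 'y_4', 'y_5', 'y_6']

def format_dataset(df: pd.DataFrame, tokenizer: Tokenizer) -> pd.DataFrame:
    """Mirror the dataset pipeline: drop columns, stack captions, encode, offset."""
    # 1) Drop unnecessary columns, keeping only the image reference and the captions
    df = df[['coco_url'] + CAPTION_COLUMNS].rename(columns={'coco_url': 'image'})

    # 2) Stack so there is only one caption per row (images duplicated across rows)
    stacked = df.melt(id_vars='image', value_vars=CAPTION_COLUMNS, value_name='text')
    stacked = stacked.dropna(subset=['text']).drop(columns='variable')

    # 3) Encode captions into token sequences, bracketed by start/end markers
    #    ('startseq'/'endseq' stand in for START_TOKEN/END_TOKEN above)
    marked = ('startseq ' + stacked['text'] + ' endseq').tolist()
    stacked['tokens'] = tokenizer.texts_to_sequences(marked)

    # 4) Duplicate an offset version: previous tokens are the recurrent input,
    #    next tokens are the label
    stacked['caption'] = stacked['tokens'].apply(lambda seq: seq[:-1])
    stacked['y'] = stacked['tokens'].apply(lambda seq: seq[1:])
    return stacked[['image', 'caption', 'y']]
```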
Now that we have our formatted dataset, we can configure the pipeline that will transform all images (including future ones) into the proper model input; a code sketch follows the list:
- Read image pixels into ndarray
- Crop image to a square (centered)
- Resize image to matrix dimensions
- Normalize image using the imagenet preprocessor
- Encode the image using a pretrained image model
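A rough sketch of that image pipeline using Keras utilities and a pretrained InceptionV3 encoder. The specific pretrained model is my assumption; any ImageNet classifier with its top removed plays the same role, and this sketch assumes a local file path rather than a URL:

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image

# Pretrained encoder: ImageNet weights, classification head removed,
# average pooling so each image becomes a single embedding vector
encoder = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

def encode_image(path: str, target_size: int = 299) -> np.ndarray:
    # 1) Read image pixels into an ndarray
    pixels = keras_image.img_to_array(keras_image.load_img(path))

    # 2) Crop the image to a centered square
    h, w, _ = pixels.shape
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    pixels = pixels[top:top + side, left:left + side]

    # 3) Resize to the encoder's expected input dimensions
    pixels = np.array(keras_image.array_to_img(pixels).resize((target_size, target_size)))

    # 4) Normalize with the ImageNet preprocessor
    pixels = preprocess_input(pixels.astype('float32'))

    # 5) Encode with the pretrained model (batch of 1 -> single embedding vector)
    return encoder.predict(pixels[np.newaxis, ...])[0]
```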
The encoder-decoder architecture addresses a sequence-to-sequence task, but there are actually a few models at play here that we glossed over:
- Text processor: this model fits over our dataset and learns the vocabulary accessible to our decoder (a small sketch follows this list).
- Encoder: typically this model would have to learn an embedding of our images that we can later map to our tokens. Thankfully we can use an existing model or easily tune one for our purposes with transfer learning (the objective would be classification accuracy on an unrelated image dataset).
- Decoder: this is the main model that we work with here and the one that maps our image encodings to our text encodings. We condition our tokens (captions) on the image embedding and learn the generalized relationship for new captions.
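A minimal sketch of the text processor's role, using the Keras Tokenizer as an illustrative stand-in (the `oov_token` and the `startseq`/`endseq` marker words are assumptions, not the project's exact configuration):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit the vocabulary over every training caption, including the start/end
# markers, so the decoder can emit them as ordinary tokens
text_processor = Tokenizer(oov_token='unk')

def fit_text_processor(captions):
    """captions: iterable of raw caption strings from the stacked dataset."""
    text_processor.fit_on_texts('startseq ' + c + ' endseq' for c in captions)
    return text_processor

# word -> token id and token id -> word, used to turn predictions back into text
# e.g. text_processor.word_index['startseq'], text_processor.index_word[42]
```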
There are a number of subtleties in the decoder architecture that are elegantly abstracted away by the frameworks used (TensorFlow, Keras, SimpleML):
- Recurrent Architecture
  - When you think about it, every training input can be utilized in two different ways (refer back to the dataset input). In an example where there are 20 tokens in the expected prediction, we can:
    - Treat every image as one sample with the objective to produce the ENTIRE expected sequence (1x20 element output array)
    - Use every image as 20 independent samples without the double jeopardy of any one of them being incorrect (20x1 element output array)
  - The second technique is called teacher forcing and is what we use here. It lets our model learn quickly by taking advantage of the augmented dataset size. There are reasons not to use teacher forcing, namely robustness in scenarios where early sequence predictions are off target (without teacher forcing, the model feeds its own predictions back in during training and learns to correct itself, instead of always being corrected by "the teacher").
  - Keras and TensorFlow elegantly do this under the hood via the TimeDistributed wrapper and dynamic_rnn. The input remains the same (image, [offset tokens] -> [expected tokens]) and the loss is computed in accordance with the teacher forcing methodology.
- Inference vs Training
  - The consequence of this training methodology is that the network structure and input will be different at inference time.
  - SimpleML manages this by allowing us to define the shared layers in each network; it automatically transfers the weights when generating predictions (both setups are sketched below).
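To make the teacher-forcing setup concrete, here is a rough Keras sketch of the training-time decoder. The layer sizes and the way the image embedding conditions the recurrent state are my assumptions rather than the project's exact architecture; the important pieces are `return_sequences=True` and the TimeDistributed softmax, which score every next-token position in a single pass:

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000      # from the text processor
EMBED_DIM = 256
UNITS = 512
IMAGE_EMBED_DIM = 2048  # output size of the pretrained image encoder (InceptionV3, avg pooling)

image_embedding = layers.Input(shape=(IMAGE_EMBED_DIM,), name='image')
caption_tokens = layers.Input(shape=(None,), name='caption')  # [START, t1, t2, ...]

# project the image embedding into the recurrent state space (the conditioning step)
initial_state = layers.Dense(UNITS, activation='relu')(image_embedding)

token_vectors = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_tokens)

# return_sequences=True -> one hidden state per input token (teacher forcing)
decoder_states = layers.LSTM(UNITS, return_sequences=True)(
    token_vectors, initial_state=[initial_state, initial_state])

# TimeDistributed applies the same softmax at every timestep
next_token_probs = layers.TimeDistributed(
    layers.Dense(VOCAB_SIZE, activation='softmax'))(decoder_states)

training_model = Model([image_embedding, caption_tokens], next_token_probs)
training_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```

And a simplified sketch of the inference-time loop: tokens are generated one at a time by feeding the model's own output back in (greedy decoding). For brevity this reruns the training-time model on the growing prefix instead of building the separate inference network that SimpleML's weight transfer enables; `start_id`/`end_id` come from the text processor and are illustrative names:

```python
import numpy as np

def generate_caption(image_vector, model, start_id, end_id, max_len=20):
    tokens = [start_id]
    for _ in range(max_len):
        # predict a distribution over the next token at every position;
        # only the last position matters when decoding greedily
        probs = model.predict([image_vector[np.newaxis, :], np.array([tokens])])
        next_id = int(np.argmax(probs[0, -1]))
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token; the text processor maps ids back to words
```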
http://cocodataset.org/#home
https://ai.googleblog.com/2014/11/a-picture-is-worth-thousand-coherent.html
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/ImageCaptionInWild-1.pdf
https://cs.stanford.edu/people/karpathy/