python caption.py [args]
The main script now works with the following (choose one):
--model "THUDM/cogvlm-chat-hf"
--model "THUDM/cogvlm2-llama3-chat-19B"
--model "xtuner/llava-llama-3-8b-v1_1-transformers"
--model "THUDM/glm-4v-9b"
--model "llava-hf/llava-v1.6-vicuna-7b-hf"
Support for all models in Windows is not guaranteed. Consider using the Nvidia-Ubuntu-cuda docker container (see doc/SETUP.md) or WSL2 if you are on Windows and want the best compatibility.
The script uses the CogVLM Vicuna model (the first in the list above) by default if no --model arg is specified.
CogVLM (code) (model) is a very high quality, but slow model for captioning.
The model uses about 13.5GB of VRAM with BNB 4-bit quantization at the default setting of 1 beam, and up to 4 or 5 beams are possible on a 24GB GPU, so it is very capable on consumer hardware. It is slow, ~6-10+ seconds on an RTX 3090, but the quality is worth it over other models.
It is capable of naming and identifying things with proper nouns and has a large vocabulary. It can also readily read text, even in hard-to-read fonts, from oblique angles, or on curved surfaces.
Both the Vicuna-based and Llama3-based models are supported.
Choose between them with one of these two CLI args:
--model THUDM/cogvlm-chat-hf
--model THUDM/cogvlm2-llama3-chat-19B
Yet another option from the THUDM team. Specify it by using this CLI arg:
--model THUDM/glm-4v-9b
Xtuner's Llava Llama3 8b v1.1.
--model "xtuner/llava-llama-3-8b-v1_1-transformers"
When using Xtuner Llava, the script will perform some clean-up operations to remove some less-than-useful language from the caption, because the bad_words part of the Hugging Face Transformers API is not supported by Llava.
Vicuna-based Llava 1.6 7B is also supported and working.
--model "llava-hf/llava-v1.6-vicuna-7b-hf"
Run python caption.py --help to get a list of options.
You can get started just by providing the root path to where all your images are located. The script will create .txt sidecar files for each image in the same directory, and run recursively through subdirectories. The default prompt "Write a description." is used when no prompt is provided.
The simplest possible use:
python caption.py --image_dir /mnt/mydata/training_data/
A command in Windows might look more like this:
python caption.py --image_dir D:\training_data\
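As a rough illustration of the sidecar behavior described above (not the script's actual code), the mapping from images to .txt files can be sketched like this. The helper name `find_sidecar_paths` and the extension list are assumptions made for the example; the real script may recognize a different set of image types.

```python
import os

# Assumed set of extensions; the real script may accept a different list.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def find_sidecar_paths(image_dir):
    """Yield (image_path, txt_path) pairs for every image under image_dir.

    Hypothetical helper: walks subdirectories recursively and places the
    .txt sidecar next to the image, with the same base name.
    """
    for root, _dirs, files in os.walk(image_dir):
        for name in sorted(files):
            base, ext = os.path.splitext(name)
            if ext.lower() in IMAGE_EXTS:
                yield os.path.join(root, name), os.path.join(root, base + ".txt")
```

Calling `list(find_sidecar_paths("/mnt/mydata/training_data/"))` would list each image alongside the caption file that would be written for it.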
The default prompt is "Write a description." if none is provided.
Basic usage for prompt:
--prompt "Write a description that includes [...] "
I've found the longer the prompt the less effective it can be, but it's worth experimenting with this or tailoring it to your data if feasible, to tease out specific details you want in your captions. See Prompt modification plugins for more capability.
Some prompt ideas:
Write a concise, accurate, blunt, and detailed description. Avoid euphemisms, vague wording, or ambiguous expressions. Do not exceed 21 words.
If you know the images are all of a single subject/character, you can ask it to be more specific about the subject:
Write a description. Include pose, outfit, and surroundings. Be concise, accurate, blunt, and detailed. Avoid euphemisms, vague wording, or ambiguous expressions. Do not exceed 26 words.
You can add this somewhere in the prompt to get it to attempt to describe the "style" of the image:
Include the style or medium of the artwork.
You can include hints to help the model understand the context. For example, if you have a folder full of photos from Peru, add this as part of your prompt:
As a hint, this is from Peru. Write a description...
or
Write a description of this photo taken in Peru.
Christoph Schuhmann and Peter Bevan's laion-pop dataset has an example of a very long, detailed prompt for general purpose Cog captioning in its readme. They are effectively using starts_with and remove_starts_with as well, which you can use similarly here (see below).
--starts_with "A photograph of"
will force the caption to begin with the given text.
There are two circumstances where this is extremely useful. If you are captioning images that are all of the same subject, you can provide the subject's proper name and force it to be included, such as --starts_with "A photograph of John Smith". The caption will continue from there.
Another circumstance is to provide a starting phrase such as "An image showcasing" or "An image of", and follow up with the --remove_starts_with option to remove the starting phrase from the caption. Often Cog will add "An image of" on its own, wasting tokens and making the caption less useful. By providing the starting phrase then removing it with --remove_starts_with you can short circuit the model to start in a more concise manner.
--remove_starts_with
will remove the starts_with text from the start of the output caption. Suggested use is to use this if your starts_with is something like "an image of", but not if your starts_with is a proper noun.
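The interplay of the two options can be sketched as below. This is a minimal illustration, not the script's implementation: `finalize_caption` is a hypothetical helper, and it assumes the model returns only the continuation that follows the forced prefix.

```python
def finalize_caption(generated, starts_with="", remove_starts_with=False):
    """Combine a forced prefix with the model's continuation.

    Assumption: `generated` is the text produced after the prefix.
    """
    if starts_with:
        caption = f"{starts_with} {generated}".strip()
    else:
        caption = generated.strip()
    # Optionally strip the prefix again so the caption starts concisely.
    if remove_starts_with and starts_with and caption.startswith(starts_with):
        caption = caption[len(starts_with):].strip()
    return caption
```

With a proper noun you would keep the prefix, for example `finalize_caption("standing in a park.", "A photograph of John Smith")`, while a filler phrase such as "An image of" is a good candidate for removal.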
--append "by Claude Monet."
will add the given text to the end of every caption. It is not fed to the model; it is simply tacked on to the end of the caption. This can be useful for things like artist or collection names that are fixed across all images. This is a "dumb code" string append.
--no_overwrite
will skip captioning the image if a corresponding .txt file already exists, useful for resuming.
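A small sketch of how these two options might behave, under stated assumptions: `should_skip` and `apply_append` are hypothetical helper names, and joining the appended text with a single space is a guess rather than confirmed behavior.

```python
import os

def should_skip(image_path, no_overwrite):
    """Return True when a sidecar .txt already exists and --no_overwrite is set."""
    txt_path = os.path.splitext(image_path)[0] + ".txt"
    return bool(no_overwrite) and os.path.exists(txt_path)

def apply_append(caption, append=None):
    """Plain string append; the model never sees the appended text.

    Assumption: a single space separates the caption and the appended text.
    """
    return f"{caption} {append}" if append else caption
```

Because the skip check only looks for an existing .txt file, an interrupted run can be resumed by re-running the same command with --no_overwrite.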
The script has the ability to execute arbitrary code to alter the prompt before it is sent to the model. This allows users to write their own plugins that execute Python code, opening up any capability you want to program for in-context learning or retrieval augmented techniques.
Injecting special information to the prompt greatly increases the quality and accuracy of the synthetic captions generated. If you are scraping data, I would strongly encourage you try to collect any metadata you can about the images for use with this feature.
Enable a plugin with --prompt_plugin "plugin_key", such as --prompt_plugin "from_leaf_directory"
Here are the working plugins that come with the script:
from_leaf_directory
Adds "hint: folder_name" to the front of your prompt. The leaf directory (immediate directory of image, not roots) of each image is used. Let's assume the --prompt is simply set to "Write a description." and go through an example.
Ex. if your data is structured as:
/mnt/mydata/training_data/Peru/001.jpg
/mnt/mydata/training_data/Argentina/002.jpg
001.jpg will have its prompt adjusted like so:
hint: Peru
Write a description.
and 002.jpg will have the prompt adjusted like so:
hint: Argentina
Write a description.
This is very useful if you can organize your data into folders that are meaningful to the captioning task, either manually, or with a classifier.
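The prompt rewrite in the example above can be sketched in a few lines. This is an illustration only; the function below is a stand-alone approximation, not the plugin's actual code, and its signature is an assumption.

```python
import os

def from_leaf_directory(image_path, prompt):
    """Prefix the prompt with the name of the image's immediate parent folder."""
    leaf = os.path.basename(os.path.dirname(image_path))
    return f"hint: {leaf}\n{prompt}"

print(from_leaf_directory("/mnt/mydata/training_data/Peru/001.jpg", "Write a description."))
```

Only the immediate parent folder is used, so deeper nesting above it does not change the hint.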
title_and_tags_from_metadata_json
Adds the title and tags from a metadata.json file in the same folder as the image to the prompt. This is useful if you have a metadata.json file in each folder with the images that applies to all the images in that folder. The metadata.json file should look like this:
{
"title": "A photograph of John Smith",
"tags": ["portrait", "outdoors", "smiling"]
}
And the prompt will be modified with the information pulled from the metadata.json file. The prompt will look like this after modification:
Hint: title: A photograph of John Smith, tags: portrait, outdoors, smiling
Write a description.
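A rough sketch of how such a hint could be assembled from the per-folder file. The function name `title_and_tags_hint` and its signature are hypothetical, and the output format simply mirrors the example above rather than the plugin's confirmed behavior.

```python
import json
import os

def title_and_tags_hint(image_path, prompt):
    """Build the hinted prompt from metadata.json in the image's folder."""
    metadata_path = os.path.join(os.path.dirname(image_path), "metadata.json")
    with open(metadata_path, encoding="utf-8") as f:
        metadata = json.load(f)
    title = metadata.get("title", "")
    tags = ", ".join(metadata.get("tags", []))
    return f"Hint: title: {title}, tags: {tags}\n{prompt}"
```

Every image in the folder shares the same metadata.json, so every image in that folder receives the same hint.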
title_and_tags_from_image_json
Same as above but looks for a file ending in .json with the same basename and in the same directory as the image (ex. /myfolder/001.png, /myfolder/001.json), enabling per-image metadata instead of a per-folder metadata file.
from_image_json
Inserts the entire contents of the JSON file with the same base name as the image. It also supports an extra CLI arg, --exclude_keys, in which you can pass a CSV of keys you want removed before the contents are added to the prompt. Ex.
--prompt_plugin from_image_json --exclude_keys "date,uploaded by,file size"
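The key filtering can be sketched as follows. This is only an approximation under stated assumptions: `image_json_hint` is a hypothetical name, and the exact way the remaining JSON is formatted into the prompt (here a "hint:" line with the serialized object) is a guess.

```python
import json

def image_json_hint(json_text, prompt, exclude_keys=""):
    """Drop excluded top-level keys, then prepend the remaining JSON to the prompt."""
    data = json.loads(json_text)
    # exclude_keys is a comma-separated string, as passed on the command line.
    excluded = {key.strip() for key in exclude_keys.split(",") if key.strip()}
    kept = {key: value for key, value in data.items() if key not in excluded}
    return f"hint: {json.dumps(kept)}\n{prompt}"
```

Excluding noisy fields such as dates or file sizes keeps the prompt short and focused on information the model can actually use.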
The plugins are all in /plugins/caption_plugins.py and are easy to modify or add to. The plugins are executed in the order they are provided on the command line. To write your own, inherit from the PromptIdentityPlugin class and pass a key for the arg and your function, like super().__init__(key="my_cool_plugin", fn=your_fn). It should be obvious from there for anyone familiar with Python.
ChatGPT should be capable of writing these if you paste in the PromptIdentityPlugin class code and describe what you want it to do.
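To make the shape of a custom plugin concrete, here is a self-contained sketch. The `PromptIdentityPlugin` class below is a stand-in written for this example, not the real class from /plugins/caption_plugins.py; the `args` dictionary with "prompt" and "image_path" keys and the `FilenameHintPlugin` itself are assumptions, so check the real base class before copying this.

```python
import os

class PromptIdentityPlugin:
    """Stand-in base class (assumed interface): returns the prompt unchanged."""
    def __init__(self, key="identity", fn=None, args=None):
        self.key = key
        self.fn = fn or self._identity
        self.args = args

    def _identity(self, args):
        return args["prompt"]

    def __call__(self, args):
        return self.fn(args)

class FilenameHintPlugin(PromptIdentityPlugin):
    """Hypothetical plugin: prepends the image's file name (no extension) as a hint."""
    def __init__(self, args=None):
        super().__init__(key="filename_hint", fn=self._add_hint, args=args)

    def _add_hint(self, args):
        stem = os.path.splitext(os.path.basename(args["image_path"]))[0]
        return f"hint: {stem}\n{args['prompt']}"
```

The key passed to the base class is what you would supply to --prompt_plugin, and the function receives the current prompt and returns the modified one.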
It's worth reading through Hugging Face's tips and blog post as a start for tweaking sampling arguments. The technical documents for the Transformers pipeline will also help explain the parameters. The type of search (beam, greedy, probabilistic, etc.) is set automatically based on your options. Default is greedy search (1 beam, no sampling args set).
I would recommend not setting any of these and leaving the default values until you have time to read all of the above.
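The automatic choice of search type can be summarized with a small decision function. This is a sketch of the general rule used by Transformers-style generation, written for illustration; `pick_decoding_mode` is a hypothetical name and the script's own logic may differ in detail.

```python
def pick_decoding_mode(num_beams=1, temperature=None, top_k=None, top_p=None):
    """Name the decoding strategy implied by the given options.

    Assumption: a sampling arg counts as "set" when it is not None.
    """
    sampling = any(value is not None for value in (temperature, top_k, top_p))
    if num_beams > 1:
        return "beam-search multinomial sampling" if sampling else "beam search"
    return "multinomial sampling" if sampling else "greedy search"
```

With no arguments the function reports greedy search, matching the default described above.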
--num_beams 1
more beams provide extra "opinions" on the next token to choose. Default is 1, but increasing this slightly may improve quality at the cost of significantly higher VRAM and slower processing. Setting this to 2 or higher enables beam search.
--repetition_penalty 1.0
penalizes repeating tokens/words, can adjust up if you see repeated terms. 1.0 does nothing.
--length_penalty 1.0
penalizes longer captions if <0.0 or rewards longer captions if >0.0. Adjusting down may produce somewhat abruptly ending output.
--bad_words "foo,bar"
Attempts to prevent the model from using these words in the output caption. Comma-delimited. Very useful, consider trying "depicts,poses,posing,showcases,appears,suggests"
to get more concise phrasing in captions. This is not a guarantee, due to different tokenizations being possible for a given bad_word.
--force_word "photograph,Spain"
Attempts to force the model to include the words in the output caption. Comma-delimited.
--min_new_tokens 5
Force the model to produce at least n tokens.
--max_new_tokens 120
Truncates output after n tokens. May cut off captions abruptly.
--no_repeat_ngram_size 3
prevents the same n-gram (a sequence of n successive tokens) from being repeated in the output. Default is 0, which means no n-gram is prevented from repeating. Setting this to 2 or 3 can help prevent the model from repeating itself.
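To see what counts as a repeated n-gram, the check below scans a token list for any n-token window that occurs twice. It is an illustrative helper (`has_repeated_ngram` is a hypothetical name) that only detects repeats after the fact; the real option blocks them during generation.

```python
def has_repeated_ngram(tokens, n):
    """Return True if any run of n consecutive tokens appears more than once."""
    if n <= 0:
        return False  # mirrors the default of 0: nothing is restricted
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False
```

For example, "a dog and a dog and" contains the 3-gram "a dog and" twice, which is the kind of output that --no_repeat_ngram_size 3 is meant to prevent.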
The following arguments control and enable multinomial sampling. Setting at least one will turn multinomial sampling on and set the other sampling args to default values if not set.
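--temperature 1.0
controls the randomness used when choosing the next token. Lower values make the output more deterministic; 1.0 leaves the model's probabilities unchanged.
--top_k 50
Number of highest-probability vocabulary tokens kept for filtering.
--top_p 1.0
Cumulative probability mass to be considered; only the most probable tokens whose probabilities add up to this value are kept. 1.0 keeps everything.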
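How the three sampling arguments interact can be shown with a toy example on a handful of tokens. This is a simplified pure-Python sketch of the usual temperature, then top-k, then top-p order; `filter_next_token_probs` is a hypothetical name, and real implementations operate on full logit tensors.

```python
import math

def filter_next_token_probs(logits, temperature=1.0, top_k=50, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits, so values below 1.0 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = ranked[0][1]
    weights = [(tok, math.exp(score - peak)) for tok, score in ranked]
    total = sum(w for _, w in weights)
    probs = [(tok, w / total) for tok, w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}
```

The next token would then be drawn at random from the returned distribution, which is why any of these settings makes captions non-deterministic.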