models openai clip vit large patch14

github-actions[bot] edited this page Dec 21, 2024 · 13 revisions

openai-clip-vit-large-patch14

Overview

OpenAI's CLIP (Contrastive Language–Image Pre-training) model was designed to investigate the factors that contribute to the robustness of computer vision tasks. It can seamlessly adapt to a range of image classification tasks without requiring specific training for each, demonstrating efficiency, flexibility, and generality.

In terms of architecture, this CLIP variant uses a ViT-L/14 Transformer for image encoding and a masked self-attention Transformer for text encoding. The two encoders are trained jointly with a contrastive loss to maximize the similarity of matching (image, text) pairs.
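The contrastive objective described above can be illustrated with a minimal NumPy sketch. It is not the training code used for CLIP: the function names are hypothetical, the embeddings are assumed to be precomputed arrays standing in for encoder outputs, and the temperature is held fixed at 0.07 rather than learned.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of n matching (image, text) pairs.

    Row i of image_emb is assumed to match row i of text_emb.
    """
    # L2-normalise so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (n, n) matrix of scaled pairwise similarities.
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(logits.shape[0])

    # Image-to-text: each row should pick out its own column.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each column should pick out its own row.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

# Random vectors used as stand-ins for real encoder outputs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_matched = clip_contrastive_loss(emb, emb)
loss_shuffled = clip_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

In this toy setup, correctly aligned pairs give a lower loss than pairs whose rows have been shifted out of alignment, which is the behaviour the objective rewards.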

For training purposes, CLIP leverages image-text pairs from the internet and engages in a proxy task: when presented with an image, predict the correct text snippet from a set of 32,768 randomly sampled options. This approach allows CLIP to comprehend visual concepts and establish associations with their textual counterparts, enhancing its performance across various visual classification tasks.
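At inference time, the same image-to-text matching can be used for zero-shot classification: each candidate label is embedded as text, and the image is assigned a probability for each label. The sketch below assumes the embeddings are already computed; the function name and the similarity scale of 100 are illustrative assumptions, not values taken from this page.

```python
import numpy as np

def zero_shot_probs(image_emb, label_embs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and k labels.

    image_emb: shape (d,), label_embs: shape (k, d). `scale` is an assumed
    constant that sharpens the distribution.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)

    logits = scale * (label_embs @ image_emb)
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy embeddings: the image points mostly along the first label's direction.
label_embs = np.eye(3)
image_emb = np.array([0.9, 0.1, 0.0])
probs = zero_shot_probs(image_emb, label_embs)
```

The label with the highest probability is the predicted class; no task-specific training is involved, only a change in the set of label texts.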

The design of CLIP effectively tackles notable challenges, including the dependence on expensive labeled datasets, the need for fine-tuning on new datasets to achieve optimal performance across diverse tasks, and the disparity between benchmark and real-world performance.

The primary intended users of these models are AI researchers for tasks requiring image and/or text embeddings such as text and image retrieval.

For more details on the CLIP model, see the original paper or the original model card.

Training Details

Training Data

The training of the CLIP model involved utilizing publicly accessible image-caption data obtained by crawling several websites and incorporating commonly-used existing image datasets like YFCC100M. Researchers curated a novel dataset comprising 400 million image-text pairs sourced from diverse publicly available internet outlets. This dataset, referred to as WIT (WebImageText), possesses a word count comparable to the WebText dataset employed in training GPT-2.

As a consequence, the data in WIT is reflective of individuals and societies predominantly linked to the internet, often leaning towards more developed nations and a demographic skewed towards younger, male users.

Training Procedure

The CLIP Vision Transformers, including this ViT-L/14 variant, were trained for 32 epochs using the Adam optimizer with decoupled weight decay regularization. The learning rate was decayed using a cosine schedule. The learnable temperature parameter τ was initialized to the equivalent of 0.07. Training used a very large mini-batch size of 32,768, and mixed precision was used to speed up training and save memory. The largest Vision Transformer was trained over a period of 12 days on 256 V100 GPUs. For a more in-depth understanding, refer to sections 2 and 3 of the original paper.
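As a rough illustration of two of the ingredients above, the sketch below shows a plain cosine learning-rate decay and one common way to hold a temperature of 0.07 as a log-scale parameter. The base learning rate, step count, and function name are hypothetical, and warmup is omitted; this is not the original training configuration.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical values chosen only to show the shape of the schedule.
base_lr = 1e-3
total_steps = 100
lr_start = cosine_lr(0, total_steps, base_lr)
lr_middle = cosine_lr(50, total_steps, base_lr)
lr_end = cosine_lr(total_steps, total_steps, base_lr)

# Assumption: the temperature is stored as a log-scale multiplier on the
# similarity logits, so an initial temperature of 0.07 becomes log(1 / 0.07).
initial_temperature = 0.07
logit_scale = math.log(1.0 / initial_temperature)
```

The schedule starts at the base rate, passes through half of it at the midpoint, and reaches the minimum at the final step.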

Evaluation Results

CLIP has been evaluated on a wide range of computer vision benchmarks, with tasks ranging from OCR to texture recognition to fine-grained classification. Sections 3 and 4 of the paper describe model performance on multiple datasets.

Limitations and Biases

CLIP struggles with tasks such as fine-grained classification and object counting, and its performance raises concerns about fairness and bias. The evaluation approach also has a notable limitation: many results rely on linear probes, and there is evidence that linear probes can underestimate a model's true performance.

CLIP's performance and inherent biases vary depending on class design and category choices. Tests on FairFace images revealed significant racial and gender disparities, which were influenced by how the classes were constructed. Evaluations of gender, race, and age classification on the FairFace dataset showed gender classification accuracy above 96%, with variation among races. Racial classification accuracy was approximately 93%, and age classification accuracy was around 63%. These assessments are meant to gauge model performance across demographics and to pinpoint potential risks; they are not intended to endorse or promote such tasks. For more details, refer to sections 6 and 7 of the original paper.

License

MIT License

Inference Samples

| Inference type | Python sample (Notebook) | CLI with YAML |
|---|---|---|
| Real time | zero-shot-image-classification-online-endpoint.ipynb | zero-shot-image-classification-online-endpoint.sh |
| Batch | zero-shot-image-classification-batch-endpoint.ipynb | zero-shot-image-classification-batch-endpoint.sh |

Sample input and output

Sample input

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", "label1, label2, label3"],
         ["image2"]
      ]
   }
}

Note:

  • "image1" and "image2" should be publicly accessible URLs or base64-encoded strings.
  • The text column in the first row determines the labels for image classification. The text column in the other rows is not used and can be blank.
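A request body in the shape shown above could be assembled as in the following sketch. The helper names are hypothetical, the example URL is a placeholder, and the sketch assumes that an empty string is an acceptable "blank" text value for rows after the first.

```python
import base64
import json

def encode_image_bytes(image_bytes):
    """Return raw image bytes as a base64 string."""
    return base64.b64encode(image_bytes).decode("utf-8")

def build_request(images, labels):
    """Build the request dict from image strings (URLs or base64) and labels.

    Only the first row carries the comma-separated label text; the other
    rows get an empty string (an assumption about what counts as blank).
    """
    if not images:
        raise ValueError("at least one image is required")
    label_text = ", ".join(labels)
    data = [[image, label_text if i == 0 else ""] for i, image in enumerate(images)]
    return {
        "input_data": {
            "columns": ["image", "text"],
            "index": list(range(len(images))),
            "data": data,
        }
    }

# Placeholder inputs: one URL and one base64 string made from dummy bytes.
payload = build_request(
    ["https://example.com/a.jpg", encode_image_bytes(b"abc")],
    ["label1", "label2", "label3"],
)
body = json.dumps(payload)
```

The resulting `body` string is what would be sent to the endpoint; how the request is authenticated and posted depends on the deployment and is not covered here.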

Sample output

[
    {
        "probs": [0.95, 0.03, 0.02],
        "labels": ["label1", "label2", "label3"]
    },
    {
        "probs": [0.04, 0.93, 0.03],
        "labels": ["label1", "label2", "label3"]
    }
]
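A response in this shape can be reduced to one predicted label per image by taking the highest probability in each entry. The sketch below is illustrative only; the function name is hypothetical and the input is the sample output above, embedded as a string.

```python
import json

def top_predictions(response_text):
    """Return a (label, probability) pair for each image in the response."""
    results = json.loads(response_text)
    top = []
    for row in results:
        probs = row["probs"]
        # Index of the largest probability for this image.
        best = max(range(len(probs)), key=probs.__getitem__)
        top.append((row["labels"][best], probs[best]))
    return top

sample_response = """
[
    {"probs": [0.95, 0.03, 0.02], "labels": ["label1", "label2", "label3"]},
    {"probs": [0.04, 0.93, 0.03], "labels": ["label1", "label2", "label3"]}
]
"""
predictions = top_predictions(sample_response)
```

Each entry in `predictions` corresponds to one input row, in the same order as the request.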

Visualization of inference result for a sample image

The following visualization shows the inference result for a sample image with the label text "credit card payment, contactless payment, cash payment, mobile order".

[Figure: zero-shot image classification visualization]

Version: 9

Tags

Preview
license : mit
task : zero-shot-image-classification
huggingface_model_id : openai/clip-vit-large-patch14
SharedComputeCapacityEnabled
hiddenlayerscanned
inference_compute_allow_list : ['Standard_DS2_v2', 'Standard_D2a_v4', 'Standard_D2as_v4', 'Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_F4s_v2', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']

View in Studio: https://ml.azure.com/registries/azureml/models/openai-clip-vit-large-patch14/version/9

License: mit

Properties

SharedComputeCapacityEnabled: True

inference-min-sku-spec: 2|0|7|14

inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2