
models OpenAI CLIP Image Text Embeddings vit base patch32

github-actions[bot] edited this page Oct 21, 2023 · 14 revisions

OpenAI-CLIP-Image-Text-Embeddings-vit-base-patch32

Overview

OpenAI researchers developed the CLIP model to study what contributes to robustness in computer vision tasks and to test how well models generalize to arbitrary image classification tasks in a zero-shot manner. The model uses a ViT-B/32 Transformer architecture as its image encoder and a masked self-attention Transformer as its text encoder. It was trained on publicly available image-caption data, gathered in a mostly non-interventionist manner. The model is intended as a research output for research communities, and its primary intended users are AI researchers. It has been evaluated on a wide range of benchmarks across a variety of computer vision datasets, but it currently struggles with certain tasks such as fine-grained classification and counting objects. The model also raises fairness and bias concerns, and the specific biases it exhibits can depend significantly on class design and on which categories one chooses to include and exclude.

The above summary was generated using ChatGPT. Review the original-model-card to understand the data used to train the model, evaluation metrics, license, intended uses, limitations and bias before using the model.

Inference samples

| Inference type | Python sample (Notebook) | CLI with YAML |
| --- | --- | --- |
| Real time | image-text-embeddings-online-endpoint.ipynb | image-text-embeddings-online-endpoint.sh |
| Batch | image-text-embeddings-batch-endpoint.ipynb | image-text-embeddings-batch-endpoint.sh |

Sample inputs and outputs (for real-time inference)

Sample input for image embeddings

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", ""],
         ["image2", ""]
      ]
   }
}

Note: "image1" and "image2" should be publicly accessible URLs or base64-encoded strings
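The request body above can be assembled programmatically. The following is a minimal sketch, not the method used in the linked notebooks: the helper names `encode_image_bytes` and `build_payload`, the placeholder bytes, and the example URL are all hypothetical and only illustrate the "image"/"text" column layout and the base64 option described in the note.

```python
import base64
import json


def encode_image_bytes(raw: bytes) -> str:
    """Return the base64 text form of raw image bytes."""
    return base64.b64encode(raw).decode("utf-8")


def build_payload(rows):
    """Build a request body from (image, text) pairs.

    Each image is a public URL or a base64 string; use "" for the side
    that is not needed (for example, "" as text for image-only rows).
    """
    rows = [list(r) for r in rows]
    return {
        "input_data": {
            "columns": ["image", "text"],
            "index": list(range(len(rows))),
            "data": rows,
        }
    }


# Placeholder bytes stand in for the contents of a real image file.
fake_image = b"\x89PNG placeholder"

payload = build_payload(
    [
        (encode_image_bytes(fake_image), ""),   # base64 image, no text
        ("https://example.com/cat.jpg", ""),    # hypothetical public URL
    ]
)
print(json.dumps(payload, indent=2))
```

In practice the bytes would come from reading an image file in binary mode; the same helper covers text-only rows by passing "" in the image position.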

Sample output

[
    {
        "image_features": [-0.92, -0.13, 0.02, ... , 0.13]
    },
    {
        "image_features": [0.54, -0.83, 0.13, ... , 0.26]
    }
]

Note: returned embeddings have dimension 512 and are not normalized
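Because the returned embeddings are not normalized, callers who want unit-length vectors (for example, before a dot-product comparison) need to scale them themselves. A minimal sketch follows, assuming plain Python lists as returned in the JSON response; the function name `l2_normalize` and the short example vector are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length.

    A zero vector has no direction, so it is returned unchanged.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


# Short made-up vector; real embeddings from this model have 512 entries.
unit = l2_normalize([3.0, 4.0])
print(unit)
```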

Sample input for text embeddings

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["", "sample text 1"],
         ["", "sample text 2"]
      ]
   }
}

Sample output

[
    {
        "text_features": [0.42, -0.13, -0.92, ... , 0.63]
    },
    {
        "text_features": [-0.14, 0.93, -0.15, ... , 0.66]
    }
]

Note: returned embeddings have dimension 512 and are not normalized

Sample input for image and text embeddings

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", "sample text 1"],
         ["image2", "sample text 2"]
      ]
   }
}

Note: "image1" and "image2" should be publicly accessible URLs or base64-encoded strings
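A payload like the one above is sent to the deployed real-time endpoint as a JSON POST. The sketch below only constructs the request using the standard library; it is not taken from the linked samples. The scoring URI and key are placeholders, and bearer-key authentication is an assumption about how the endpoint is configured, so check the deployment's own details before relying on it.

```python
import json
import urllib.request


def build_request(scoring_uri: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Wrap a payload in a JSON POST request (assumes key-based bearer auth)."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


payload = {
    "input_data": {
        "columns": ["image", "text"],
        "index": [0, 1],
        "data": [["image1", "sample text 1"], ["image2", "sample text 2"]],
    }
}

# Hypothetical URI and key; replace with the values of your own deployment.
request = build_request("https://example.invalid/score", "<your-key>", payload)

# Sending is left commented out because it needs a live endpoint:
# with urllib.request.urlopen(request) as resp:
#     result = json.load(resp)
```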

Sample output

[
    {
        "image_features": [0.92, -0.13, 0.02, ... , -0.13],
        "text_features": [0.42, 0.13, -0.92, ... , -0.63]
    },
    {
        "image_features": [-0.54, -0.83, 0.13, ... , -0.26],
        "text_features": [-0.14, -0.93, 0.15, ... , 0.66]
    }
]

Note: returned embeddings have dimension 512 and are not normalized
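When a response carries both "image_features" and "text_features", a common way to compare an image with its paired text is cosine similarity, which also compensates for the embeddings not being normalized. This is a minimal sketch under that assumption; the 3-dimensional vectors are made up for readability and are not real model output, and `cosine_similarity` is an illustrative helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up stand-in for a parsed response; real vectors have 512 entries.
response = [
    {"image_features": [1.0, 0.0, 0.0], "text_features": [2.0, 0.0, 0.0]},
    {"image_features": [0.0, 1.0, 0.0], "text_features": [0.0, 0.0, 5.0]},
]

scores = [
    cosine_similarity(row["image_features"], row["text_features"])
    for row in response
]
print(scores)
```

The first pair points in the same direction and the second pair is orthogonal, so the scores differ even though vector lengths vary.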

Version: 3

Tags

Preview, license: mit, task: embeddings

View in Studio: https://ml.azure.com/registries/azureml/models/OpenAI-CLIP-Image-Text-Embeddings-vit-base-patch32/version/3

License: mit

Properties

inference-min-sku-spec: 2|0|7|14

inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

model_id: openai/clip-vit-base-patch32
