# openai-clip-vit-base-patch32
## Description

The CLIP model was developed by OpenAI researchers to learn about what contributes to robustness in computer vision tasks and to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. The model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. It was trained on publicly available image-caption data, which was gathered in a mostly non-interventionist manner. The model is intended as a research output for research communities, and its primary intended users are AI researchers. It has been evaluated on a wide range of benchmarks across a variety of computer vision datasets, but it currently struggles with certain tasks such as fine-grained classification and counting objects. The model also poses issues with regard to fairness and bias, and the specific biases it exhibits can depend significantly on class design and on the choice of categories to include and exclude.

> The above summary was generated using ChatGPT. Review the original model card to understand the data used to train the model, evaluation metrics, license, intended uses, limitations and bias before using the model.

### Inference samples

Inference type|Python sample (Notebook)|CLI with YAML
|--|--|--|
Real time|zero-shot-image-classification-online-endpoint.ipynb|zero-shot-image-classification-online-endpoint.sh
Batch|zero-shot-image-classification-batch-endpoint.ipynb|zero-shot-image-classification-batch-endpoint.sh

### Sample inputs and outputs (for real-time inference)

#### Sample input

```json
{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", "label1, label2, label3"],
         ["image2"]
      ]
   }
}
```
Note: - "image1" and "image2" should be publicly accessible urls or strings in base64
format. - The text column in the first row determines the labels for image classification. The text column in the other rows is not used and can be blank. #### Sample output json [ { "probs": [0.95, 0.03, 0.02], "labels": ["label1", "label2", "label3"] }, { "probs": [0.04, 0.93, 0.03], "labels": ["label1", "label2", "label3"] } ]
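The following is a minimal sketch of sending the sample request above to a deployed real-time endpoint with `requests`. The scoring URI, API key, and image file names are placeholders for illustration; the linked notebook and CLI samples remain the authoritative reference for deployment and invocation.

```python
# Sketch: call a deployed real-time endpoint with the sample request shape above.
# The scoring URI, API key, and image paths are placeholders.
import base64
import json

import requests

scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
api_key = "<api-key>"  # placeholder


def to_base64(path: str) -> str:
    """Read a local image file and return its base64-encoded contents as a string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


payload = {
    "input_data": {
        "columns": ["image", "text"],
        "index": [0, 1],
        "data": [
            # Labels in the first row drive the classification for all images.
            [to_base64("image1.jpg"), "label1, label2, label3"],
            # Text in subsequent rows is not used and can be blank.
            [to_base64("image2.jpg"), ""],
        ],
    }
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}
response = requests.post(scoring_uri, data=json.dumps(payload), headers=headers)
print(response.json())  # list of {"probs": [...], "labels": [...]}, one entry per image
```
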
#### Model inference - visualization

For a sample image and label text "credit card payment, contactless payment, cash payment, mobile order".
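To illustrate what this visualization represents, the sketch below runs the same zero-shot classification locally with the Hugging Face checkpoint (`openai/clip-vit-base-patch32`, the `model_id` listed below) rather than through the Azure ML endpoint. The image URL is a placeholder; requires `transformers`, `torch`, and `pillow`.

```python
# Sketch: local zero-shot classification with the Hugging Face CLIP checkpoint.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image URL used only for illustration.
image = Image.open(requests.get("https://example.com/payment.jpg", stream=True).raw)
labels = ["credit card payment", "contactless payment", "cash payment", "mobile order"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the image-text similarity scores gives per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=1).squeeze().tolist()
for label, prob in sorted(zip(labels, probs), key=lambda x: -x[1]):
    print(f"{label}: {prob:.3f}")
```
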
Version: 3
Preview
license : mit
task : zero-shot-image-classification
View in Studio: https://ml.azure.com/registries/azureml/models/openai-clip-vit-base-patch32/version/3
License: mit
SHA: e6a30b603a447e251fdaca1c3056b2a16cdfebeb
inference-min-sku-spec: 2|0|7|14
inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
model_id: openai/clip-vit-base-patch32
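As a rough sketch of how the registry path, version, and recommended SKUs above fit together, the following deploys this model to a managed online endpoint with the `azure-ai-ml` SDK. It assumes an existing Azure ML workspace; subscription, resource group, workspace, and endpoint names are placeholders, and the linked notebooks cover this flow in full.

```python
# Sketch: deploy the registry model to a managed online endpoint (azure-ai-ml SDK).
# Subscription, resource group, workspace, and endpoint names are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment, ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Reference the model directly from the azureml registry (version 3, as above).
model_ref = "azureml://registries/azureml/models/openai-clip-vit-base-patch32/versions/3"

endpoint = ManagedOnlineEndpoint(name="clip-vit-b32-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="default",
    endpoint_name=endpoint.name,
    model=model_ref,
    instance_type="Standard_DS3_v2",  # one of the recommended SKUs listed above
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```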