
models mmeft

github-actions[bot] edited this page Oct 25, 2023 · 10 revisions

mmeft

Overview

Multimodal Early Fusion Transformer (MMEFT) is a transformer-based model tailored for processing both structured and unstructured data.

It can be used for multi-class and multi-label multimodal classification tasks, and can handle datasets with features from diverse modes, including categorical, numerical, image, and text.

The MMEFT architecture is composed of embedding, fusion, aggregation, and output layers. The embedding layer produces independent non-contextual embeddings for features of varying modes. The fusion layer then integrates the non-contextual embeddings to yield contextual multimodal embeddings. The aggregation layer consolidates these contextual multimodal embeddings into a single multimodal embedding vector. Lastly, the output layer processes the final multimodal embedding to generate the model's prediction for the task at hand.

MMEFT uses BertTokenizer for text data embeddings and the 'openai/clip-vit-base-patch32' model from Hugging Face for image data embeddings. The model is designed as a single, comprehensive approach to multimodal data, aiming for accurate and efficient classification across varied datasets.

NOTE: We highly recommend finetuning the model on your dataset before deploying it.
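The four-stage flow described above (embedding, fusion, aggregation, output) can be illustrated with a toy sketch. This is not the MMEFT implementation. Every choice here is an assumption made for illustration: the `embed` function is a hypothetical deterministic stand-in (the real model relies on BertTokenizer and a CLIP model), fusion is a single self-attention step with identity projections, aggregation is mean pooling, and the output weights are made-up numbers.

```python
import math

DIM = 4  # illustrative embedding width, not the model's real size


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def embed(features, dim=DIM):
    """Embedding stage (stand-in): one non-contextual vector per feature."""
    tokens = []
    for value in features:
        # Hypothetical embedding: derive a deterministic vector from the value's text.
        seed = sum(ord(ch) for ch in str(value))
        tokens.append([math.sin(seed * (i + 1)) for i in range(dim)])
    return tokens


def fuse(tokens):
    """Fusion stage (assumed): each token attends to every token, whatever its mode."""
    scale = math.sqrt(len(tokens[0]))
    fused = []
    for query in tokens:
        weights = softmax([dot(query, key) / scale for key in tokens])
        fused.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(len(query))]
        )
    return fused


def aggregate(tokens):
    """Aggregation stage (assumed): mean-pool contextual embeddings into one vector."""
    n = len(tokens)
    return [sum(tok[i] for tok in tokens) / n for i in range(len(tokens[0]))]


def output(vector, class_weights):
    """Output stage (assumed): linear layer followed by softmax for multi-class."""
    return softmax([dot(vector, w) for w in class_weights])


# One row with numerical, text, categorical, and boolean features.
row = [22, 11.2, "It was a great experience!", "Categorical value", True]
class_weights = [  # made-up weights for three labels
    [0.5, -0.2, 0.1, 0.3],
    [-0.4, 0.6, 0.2, -0.1],
    [0.1, 0.1, -0.5, 0.2],
]

tokens = embed(row)
probs = output(aggregate(fuse(tokens)), class_weights)
print(probs)
```

The point of "early fusion" in this sketch is that features from all modes are mixed in the `fuse` step, before any pooling, so the pooled vector already reflects cross-mode interactions.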

Inference samples

Inference type | Python sample (Notebook) | CLI with YAML
Real time | multimodal-classification-online-endpoint.ipynb | multimodal-classification-online-endpoint.sh
Batch | multimodal-classification-batch-endpoint.ipynb | multimodal-classification-batch-endpoint.sh

Finetuning samples

Task | Dataset | Python sample (Notebook) | CLI with YAML
Multimodal multi-class classification | Airbnb listings dataset | multimodal-multiclass-classification.ipynb | multimodal-multiclass-classification.sh
Multimodal multi-label classification | Hateful memes dataset | multimodal-multilabel-classification.ipynb | multimodal-multilabel-classification.sh

Sample inputs and outputs (for real-time inference)

Sample input

{
  "input_data": {
    "columns": ["column1", "column2", "column3", "column4", "column5", "column6"],
    "data": [
      [22, 11.2, "It was a great experience!", "image1", "Categorical value", true],
      [111, 8.2, "I may not consider this option again.", "image2", "Categorical value", false]
    ]
  }
}

Note:

  • "image1" and "image2" are placeholders for image strings in base64 format.
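The request body shown above can be assembled programmatically. The following is a minimal sketch using only the Python standard library. The helper names `encode_image` and `build_request` are hypothetical, and the image bytes are a placeholder rather than a real image. How the body is sent to the endpoint is not covered here.

```python
import base64
import json


def encode_image(image_bytes: bytes) -> str:
    """Return the base64 text form of raw image bytes."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_request(columns, rows) -> str:
    """Serialize columns and rows into the "input_data" JSON layout."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("each row must have one value per column")
    payload = {"input_data": {"columns": list(columns), "data": [list(r) for r in rows]}}
    return json.dumps(payload)


# Placeholder bytes; in practice these would come from reading an image file in binary mode.
raw_image = b"\x89PNG placeholder bytes"
image1 = encode_image(raw_image)

columns = ["column1", "column2", "column3", "column4", "column5", "column6"]
rows = [[22, 11.2, "It was a great experience!", image1, "Categorical value", True]]

body = build_request(columns, rows)
print(body)
```

Note that `json.dumps` writes Python's `True` and `False` as the JSON literals `true` and `false`, so the serialized body is valid JSON.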

Sample output

[
  {
    "label1": 0.1,
    "label2": 0.7,
    "label3": 0.2
  },
  {
    "label1": 0.3,
    "label2": 0.3,
    "label3": 0.4
  }
]
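A response in this shape holds one object of label scores per input row. The sketch below shows one way to turn those scores into predictions: taking the highest-scoring label for multi-class tasks, and keeping every label at or above a threshold for multi-label tasks. The helper names and the 0.3 threshold are assumptions for illustration, not values prescribed by the model.

```python
import json

# The sample output from above, as a JSON string.
sample_output = """
[
  {"label1": 0.1, "label2": 0.7, "label3": 0.2},
  {"label1": 0.3, "label2": 0.3, "label3": 0.4}
]
"""


def top_label(scores: dict) -> str:
    """Multi-class reading: the single label with the highest score."""
    return max(scores, key=scores.get)


def labels_above(scores: dict, threshold: float) -> list:
    """Multi-label reading: every label whose score reaches the threshold."""
    return [label for label, score in scores.items() if score >= threshold]


predictions = json.loads(sample_output)
top = [top_label(row) for row in predictions]
multi = [labels_above(row, 0.3) for row in predictions]  # 0.3 is an arbitrary example threshold

print(top)    # ['label2', 'label3']
print(multi)  # [['label2'], ['label1', 'label2', 'label3']]
```

A suitable threshold for multi-label use depends on the dataset and is usually chosen on validation data after finetuning.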
  

Version: 1

Tags

Preview

license : mit

task : multimodal-classification

View in Studio: https://ml.azure.com/registries/azureml/models/mmeft/version/1

License: mit

Properties

evaluation-min-sku-spec: 4|1|28|176,

evaluation-recommended-sku: Standard_NC6s_v3,

finetune-min-sku-spec: 4|1|28|176,

finetune-recommended-sku: Standard_NC6s_v3,

finetuning-tasks: multimodal-classification,

inference-min-sku-spec: 2|0|7|14,

inference-recommended-sku: Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

model_id: mmeft
