This is the official repository of the paper "MappIng MemeS to WordS for MUltimodal Hateful MEme ClaSsification" (ISSUES).
Multimodal image-text memes are prevalent on the internet, serving as a unique form of communication that combines visual and textual elements to convey humor, ideas, or emotions. However, some memes take a malicious turn, promoting hateful content and perpetuating discrimination. Detecting hateful memes within this multimodal context is a challenging task that requires understanding the intertwined meaning of text and images. In this work, we address this issue by proposing a novel approach named ISSUES for multimodal hateful meme classification. ISSUES leverages a pre-trained CLIP vision-language model and the textual inversion technique to effectively capture the multimodal semantic content of the memes. The experiments show that our method achieves state-of-the-art results on the Hateful Memes Challenge and HarMeme datasets.
Overview of the proposed approach. We disentangle the CLIP common embedding space via linear projections. We employ textual inversion to make the textual representation multimodal. We fuse the textual and visual features with a Combiner architecture.
We recommend using the Anaconda package manager to avoid dependency/reproducibility problems. For Linux systems, you can find a conda installation guide here.
- Clone the repository
git clone https://github.com/miccunifi/ISSUES.git
- Install Python dependencies
Navigate to the root folder of the repository and run the following commands:
conda config --add channels conda-forge
conda create -n issues -y python=3.9.16
conda activate issues
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
conda install --file requirements.txt
pip install git+https://github.com/openai/CLIP.git
- Log in to your WandB account
wandb login
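After the installation steps, a quick check can confirm that the main packages are importable from the `issues` environment. The snippet below is a minimal sketch and not part of the repository: the helper name `missing_modules` is hypothetical, and the module list (`torch`, `torchvision`, `clip`, `wandb`) is an assumption based on the install commands above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed top-level module names for the packages installed above.
    required = ["torch", "torchvision", "clip", "wandb"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the `issues` environment activated; an empty result suggests the environment is ready, although it does not verify package versions or CUDA support.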
We do not hold the rights to the original Hateful Memes Challenge (HMC) and HarMeme datasets. To download the full original datasets, use the following links:
Download the files in the release and place the resources folder in the root folder:
project_base_path
└─── resources
    ...
└─── src
    | combiner.py
    | datasets.py
    | engine.py
    ...
...
Ensure the HMC and HarMeme datasets match the following structure:
project_base_path
└─── resources
    └─── datasets
        └─── harmeme
            └─── clip_embds
                | test_no-proj_output.pt
                | train_no-proj_output.pt
                | val_no-proj_output.pt
            └─── img
                | covid_memes_2.png
                | covid_memes_3.png
                | covid_memes_4.png
                ....
            └─── labels
                | info.csv
        └─── hmc
            └─── clip_embds
                | dev_seen_no-proj_output.pt
                | dev_unseen_no-proj_output.pt
                | test_seen_no-proj_output.pt
                | test_unseen_no-proj_output.pt
                | train_no-proj_output.pt
            └─── img
                | 01235.png
                | 01236.png
                | 01243.png
                ....
            └─── labels
                | info.csv
    ...
└─── src
    | combiner.py
    | datasets.py
    | engine.py
    ...
...
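Before launching a run, it can help to verify that the dataset folders match this structure. The following is a minimal sketch rather than a script shipped with the repository: the function name `check_dataset_layout` and the `EXPECTED_EMBEDDINGS` table are hypothetical, and the check assumes that `harmeme` and `hmc` sit under `resources/datasets` as shown above. It only tests that the paths exist, not that the files are valid.

```python
from pathlib import Path

# Embedding file prefixes per dataset, copied from the directory structure above.
EXPECTED_EMBEDDINGS = {
    "harmeme": ["test", "train", "val"],
    "hmc": ["dev_seen", "dev_unseen", "test_seen", "test_unseen", "train"],
}


def check_dataset_layout(root, dataset):
    """Return the expected paths (as strings) that are missing under `root`."""
    base = Path(root) / "resources" / "datasets" / dataset
    expected = [base / "img", base / "labels" / "info.csv"]
    expected += [
        base / "clip_embds" / f"{split}_no-proj_output.pt"
        for split in EXPECTED_EMBEDDINGS[dataset]
    ]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    # Assumes the script is started from the project root folder.
    for name in EXPECTED_EMBEDDINGS:
        missing = check_dataset_layout(".", name)
        print(name, "OK" if not missing else f"missing: {missing}")
```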
We provide the pre-trained models in the release. Ensure that the weights match the following structure:
project_base_path
└─── resources
    └─── datasets
        ...
    └─── pretrained_models
        | hmc_text-inv-comb_best.ckpt
        | harmeme_text-inv-comb_best.ckpt
    └─── pretrained_weights
        | hmc
        | harmeme
        | phi
└─── src
    | combiner.py
    | datasets.py
    | engine.py
    ...
...
We provide scripts for training and testing our approach on the HMC and HarMeme datasets.
project_base_path
└─── resources
    ...
└─── src
    ...
run_harmeme_text-inv-comb.sh
run_hmc_text-inv-comb.sh
...
To use a script, navigate to the root folder and use the following commands:
chmod +x <filename>.sh
./<filename>.sh
where:
- <filename> = run_harmeme_text-inv-comb is related to the HarMeme dataset
- <filename> = run_hmc_text-inv-comb is related to the HMC dataset
To train the model from scratch and then evaluate its performance, disable the --reproduce flag of the script.
To test the pre-trained models and reproduce our results, enable the --reproduce flag of the script.
In the following, we describe each argument of the scripts.
- dataset: dataset name, [hmc or harmeme]
- num_mapping_layers: number of projection layers to map CLIP features in a task-oriented latent space
- num_pre_output_layers: number of MLP hidden layers for performing the final classification
- max_epochs: maximum number of epochs
- lr: learning rate
- batch_size: batch size
- fast_process: flag to indicate whether to use pre-computed CLIP features as the input of the model instead of computing them during the training process
- name: name of the model
- pretrained_model: name of the checkpoint of the pretrained model in the 'pretrained_models' folder
- reproduce: flag to indicate whether to perform the training process followed by the evaluation phase (False) or directly evaluate a pre-trained model on the test data (True)
- map_dim: output dimension of the projected feature vectors
- fusion: fusion method between the textual and visual modalities (when applicable), [concat or align]
- pretrained_proj_weights: flag to indicate whether to use pre-trained projection weights (when applicable)
- freeze_proj_layers: flag to indicate whether to freeze the pre-trained weights
- comb_proj: flag to indicate whether to project the input features of the Combiner
- comb_fusion: fusion method to use to combine the input features of the Combiner
- convex_tensor: flag to indicate whether to compute a tensor or a scalar as the output of the convex combination
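To illustrate the distinction behind convex_tensor, the sketch below contrasts a scalar coefficient with a per-dimension (tensor) coefficient in a convex combination of two feature vectors. It is a plain-Python illustration under stated assumptions, not the repository's Combiner: the function name `convex_combination`, its arguments, and the example values are hypothetical, and in the actual model the coefficient is presumably predicted by a network rather than passed in by hand.

```python
def convex_combination(text_feats, image_feats, coeff):
    """Blend two equally sized feature vectors as c * text + (1 - c) * image.

    `coeff` is either a single number in [0, 1] (scalar case) or a list with
    one value in [0, 1] per feature dimension (tensor case).
    """
    if len(text_feats) != len(image_feats):
        raise ValueError("feature vectors must have the same length")
    if isinstance(coeff, (int, float)):
        coeffs = [float(coeff)] * len(text_feats)  # broadcast the scalar
    else:
        coeffs = [float(c) for c in coeff]
        if len(coeffs) != len(text_feats):
            raise ValueError("need one coefficient per feature dimension")
    if any(c < 0.0 or c > 1.0 for c in coeffs):
        raise ValueError("coefficients must lie in [0, 1]")
    return [c * t + (1.0 - c) * v for c, t, v in zip(coeffs, text_feats, image_feats)]


# Hypothetical toy features, chosen only to make the arithmetic easy to follow.
text = [1.0, 0.0, 2.0]
image = [0.0, 1.0, 4.0]
scalar_mix = convex_combination(text, image, 0.5)
tensor_mix = convex_combination(text, image, [1.0, 0.0, 0.25])
print(scalar_mix)  # [0.5, 0.5, 3.0]
print(tensor_mix)  # [1.0, 1.0, 3.5]
```

With a scalar, every dimension is mixed in the same proportion; with a per-dimension coefficient, each feature can lean toward a different modality.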
- text_inv_proj: flag to indicate whether to use CLIP textual encoder projection
- phi_inv_proj: flag to indicate whether to project the output of phi network
- post_inv_proj: flag to indicate whether to project the CLIP textual encoder output features
- enh_text: flag to indicate whether to use a prompt with only the pseudo-word or concatenate the meme text
- phi_freeze: flag to indicate whether to freeze the pre-trained phi network
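As a rough picture of how some of these arguments might be exposed on the command line, here is a small `argparse` sketch. Only the argument names come from the descriptions above; the types, the default values, the `str2bool` helper, the use of `argparse`, and the convention that flags take an explicit True/False value are all assumptions. The provided .sh scripts remain the reference for the real settings.

```python
import argparse


def str2bool(value):
    """Parse common textual spellings of a boolean flag (hypothetical helper)."""
    text = str(value).strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Names follow the argument descriptions; types and defaults are placeholders.
    parser = argparse.ArgumentParser(description="Sketch of the script arguments")
    parser.add_argument("--dataset", choices=["hmc", "harmeme"], required=True)
    parser.add_argument("--num_mapping_layers", type=int, default=1)
    parser.add_argument("--num_pre_output_layers", type=int, default=1)
    parser.add_argument("--max_epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--map_dim", type=int, default=1024)
    parser.add_argument("--fusion", choices=["concat", "align"], default="align")
    parser.add_argument("--fast_process", type=str2bool, default=False)
    parser.add_argument("--reproduce", type=str2bool, default=False)
    return parser


# Example: evaluate a pre-trained model on HMC (reproduce=True).
args = build_parser().parse_args(["--dataset", "hmc", "--reproduce", "True", "--lr", "1e-4"])
print(args.dataset, args.reproduce, args.lr)
```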
Our code is based on SEARLE and Hate-CLIPper.
This work was partially supported by the European Commission under the European Horizon 2020 Programme, grant number 101004545 - ReInHerit.