- [2024-10] 📰 We have released both the LIME dataset and the data curation pipeline!
- [2024-09] 🍋 We have open-sourced the evaluation data and corresponding evaluation code for LIME. The data curation pipeline for LIME will be open-sourced within two weeks.
We use a general data processing pipeline to curate LIME, which contains 9,403 samples refined across 10 tasks within 6 domains. We select six major tasks in the multimodal domain and use 9 MLLMs to refine those 10 benchmarks within the corresponding domains.
First, download the repo to your local machine. To get started with LIME quickly, we recommend following the lmms-eval tutorial to deploy the evaluation environment. Alternatively, you can install it with the following steps:
cd lmms-eval
pip install -e .
Download all datasets from here.
You can run all the subtasks included in LIME-M with the following command:
accelerate launch --num_processes=8 -m lmms_eval --model internvl2 --model_args pretrained="OpenGVLab/InternVL2-8B" --tasks textcaps_suit,ok_vqa_suit,coco_cap_suit,textvqa_suit,chartqa_suit,pope_suit,infovqa_suit,ai2d_suit,ocrbench_suit,scienceqa_img_suit --batch_size 1 --log_samples --log_samples_suffix internvl2_suits --summary True --output_path output_path
We utilize vLLM for text-only evaluation:
python lmms_eval/__main__.py --model llama --model_args pretrained="meta-llama/Meta-Llama-3-8B-Instruct" --tasks ai2d_suit,scienceqa_img_suit --batch_size 1 --log_samples --log_samples_suffix llama3_8b_text_only --summary True --output_path output_path
Here, `pretrained` refers to the local storage path (or Hugging Face ID) of the model, and `output_path` refers to the location where the final logs are stored.
The data curation pipeline consists of three parts: (1) using open-source models as judges, (2) a semi-automated screening process, and (3) eliminating answer leakage.
You can reproduce the process through the following steps:
By running this step, you can collect the results of all models:
python data_curation_pipeline/Models_Judges.py
Now we need to classify the difficulty level of each sample. We define N as the number of models that answer the sample correctly. If N ≥ 6, the question is assigned to the easy set; if 3 ≤ N ≤ 5, to the middle set; and if N ≤ 2, to the hard set.
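As a rough illustration of this split (not the official script), the sketch below assumes a `results` mapping from sample ID to per-model correctness flags, e.g. collected from the judging step above:

```python
# Minimal sketch of the easy/middle/hard split described above.
# Assumption: `results` maps sample_id -> list of booleans, one per judging model,
# where True means that model answered the sample correctly.
from collections import defaultdict

def split_by_difficulty(results):
    buckets = defaultdict(list)
    for sample_id, correct_flags in results.items():
        n = sum(correct_flags)      # N = number of models that answered correctly
        if n >= 6:
            buckets["easy"].append(sample_id)
        elif 3 <= n <= 5:
            buckets["middle"].append(sample_id)
        else:                       # N <= 2
            buckets["hard"].append(sample_id)
    return buckets

# Example with 9 judging models per sample:
results = {
    "q1": [True] * 7 + [False] * 2,   # N = 7 -> easy
    "q2": [True] * 4 + [False] * 5,   # N = 4 -> middle
    "q3": [True] * 1 + [False] * 8,   # N = 1 -> hard
}
print(split_by_difficulty(results))
```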
To mitigate these potential errors and filter out completely incorrect questions, we use a GPT double check followed by human review: run `data_curation_pipeline/gpt_double_check.py` and `data_curation_pipeline/Human_double_check.ipynb`.
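The actual prompts and filtering logic live in `data_curation_pipeline/gpt_double_check.py`; the sketch below only illustrates the general idea of asking a GPT judge to re-verify a sample. The model name, prompt, and text-only format are placeholders, not the project's exact setup.

```python
# Hypothetical sketch of a GPT double check (see gpt_double_check.py for the real logic).
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def gpt_double_check(question: str, reference_answer: str) -> bool:
    """Return True if the GPT judge thinks the sample should be kept."""
    prompt = (
        "You are reviewing a benchmark sample.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        "Reply 'keep' if the question is answerable and the reference answer is correct; "
        "otherwise reply 'remove'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("keep")
```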
To eliminate answer leakage, we use text-only models for evaluation; the remaining steps are similar to those described above.
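As a minimal sketch of the leakage check (assumed data format, not the official script): if text-only models, which never see the image, still answer a sample correctly, the answer likely leaks from the question text alone and the sample is flagged.

```python
# Minimal sketch of answer-leakage filtering.
# Assumption: `text_only_results` maps sample_id -> list of booleans, one per
# text-only model, where True means the model answered correctly WITHOUT the image.

def find_leaked_samples(text_only_results, min_correct=1):
    leaked = []
    for sample_id, correct_flags in text_only_results.items():
        if sum(correct_flags) >= min_correct:   # answerable without the image
            leaked.append(sample_id)
    return leaked

text_only_results = {
    "q1": [True, False],    # answered without the image -> flagged as leakage
    "q2": [False, False],   # requires the image -> kept
}
print(find_leaked_samples(text_only_results))   # ['q1']
```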