The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types.


MMInstruct

The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity".

The dataset is available on Hugging Face at 🤗 yuecao0119/MMInstruct.

📣 News

  • [Oct 14, 2024] Our paper is accepted by SCIENCE CHINA Information Sciences!
  • [Aug 6, 2024] The dataset is now available on Hugging Face at 🤗 yuecao0119/MMInstruct.
  • [Jul 22, 2024] The paper has been released on arXiv!
  • [Jul 22, 2024] Code has been released.

Todo List

  • Release the data engine.
  • Open-source the datasets.
  • Release the checkpoint.

Introduction

Vision-language supervised fine-tuning effectively enhances VLLM performance, but existing visual instruction tuning datasets have limitations:

  1. Instruction Annotation Quality: Despite strong performance, advanced VLLMs may generate instructions with inaccuracies, such as hallucinations.
  2. Instruction and Image Diversity: Limited instruction types and lack of diverse image data impact the model's ability to generate varied and realistic outputs.

MMInstruct Dataset

To address these challenges, we created the MMInstruct dataset, featuring:

  • 973K instructions from 24 domains
  • Four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering.

The open source datasets on Hugging Face 🤗 yuecao0119/MMInstruct include:

  • caption_cn: 144K detailed Chinese image captions generated with gpt-4-vision-preview.
  • caption_en: 18.2K detailed English image captions generated with gpt-4-vision-preview.
  • qa_en: 216K instruction-data samples generated with GPT-3.5-turbo, comprising 161K multi-round long-form question-answer pairs and 55K manually corrected instructions covering 23 domains.

We also expand MMInstruct with other open-source data, including:

Domain              Dataset
mathematics         GEOS; UniGeo; GeoQA+; Geometry3K; CLEVR-Math; Super-CLEVR; TabMWP
charts and plots    DVQA (100K); FigureQA
scientific figures  TQA
map charts          MapQA

Data Engine

We developed an instruction generation data engine leveraging GPT-4V, GPT-3.5, and manual correction. This engine allows semi-automatic, low-cost, multi-domain instruction generation at 1/6 the cost of manual construction.


As described in our paper, we propose a semi-automatic, low-cost instruction generation data engine that combines GPT-4V, GPT-3.5, and manual correction. The engine consists of six steps: (a) image collection, (b) image caption generation, (c) seed question collection, (d) automatic instruction generation, (e) dataset expansion, and (f) manual correction.

(a) First, we collect a large number of diverse images from various sources: starting from a set of selected seed images, we retrieve additional images with web crawlers and CLIP-based similarity filtering, as shown in image_retrieval_bing_spider.py and image_retrieval_clip.py.
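The CLIP-based retrieval step can be sketched as a cosine-similarity filter over image embeddings. The function below is a minimal illustration, not the code from image_retrieval_clip.py: it assumes the embeddings have already been computed and L2-normalized by a CLIP model, and the 0.25 threshold is an arbitrary stand-in value.

```python
import numpy as np

def filter_by_similarity(seed_emb, candidate_embs, threshold=0.25):
    """Keep candidate images whose embedding is close to a seed image.

    Both inputs are assumed to be L2-normalized CLIP image embeddings,
    so the dot product equals cosine similarity. Returns the indices of
    candidates at or above `threshold`.
    """
    sims = candidate_embs @ seed_emb
    return [i for i, s in enumerate(sims) if s >= threshold]

# Toy demo with 3-D stand-in "embeddings":
seed = np.array([1.0, 0.0, 0.0])
cands = np.array([
    [0.9, 0.1, 0.0],   # visually similar to the seed
    [0.0, 1.0, 0.0],   # unrelated
])
cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
print(filter_by_similarity(seed, cands))  # → [0]
```

In practice one would batch the dot products and deduplicate near-identical hits, but the filtering logic stays this simple.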

(b) Next, we use GPT-4V to generate detailed image captions, as shown in gpt4v_caption.py.
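Captioning with gpt-4-vision-preview amounts to sending the image (e.g. as a base64 data URL) together with a captioning instruction through the OpenAI chat-completions API. The helper below only builds the request payload; the prompt wording is illustrative and not the exact prompt used in gpt4v_caption.py.

```python
import base64

def build_caption_request(image_path, model="gpt-4-vision-preview"):
    """Build an OpenAI chat-completions payload that asks for a detailed
    caption of a local image, embedded as a base64 data URL."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail, covering objects, "
                         "attributes, layout, and any visible text."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 512,
    }
```

The returned dict can be passed to `openai.OpenAI().chat.completions.create(**payload)`; the caption is read from the first choice's message content.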

(c) Then, domain experts design seed questions for each field.

(d) We use the image captions and seed questions to automatically generate rich and diverse instruction data with GPT-3.5, as shown in gpt35_qa.py.
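The core of this step is assembling a text-only prompt that pairs a caption with the domain's seed questions and asks GPT-3.5 for new QA pairs in the same style. The builder below is a sketch of that idea; its wording is not the exact prompt from gpt35_qa.py.

```python
def build_qa_prompt(caption, seed_questions, n_pairs=5):
    """Combine an image caption and domain seed questions into a prompt
    asking GPT-3.5 to write new instruction-response pairs."""
    seeds = "\n".join(f"- {q}" for q in seed_questions)
    return (
        f"Image description:\n{caption}\n\n"
        f"Example questions for this domain:\n{seeds}\n\n"
        f"Based only on the description above, write {n_pairs} new, "
        "diverse question-answer pairs in the same style as the "
        "examples. Do not refer to the description itself."
    )

prompt = build_qa_prompt(
    "A bar chart of monthly sales, with July as the tallest bar.",
    ["Which category has the highest value?",
     "What is the difference between the first and last bars?"],
)
```

Because GPT-3.5 never sees the image, the caption from step (b) is the model's only source of visual grounding, which is why caption quality matters so much upstream.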

(e) In addition, we expand the dataset with various augmentation methods, and (f) finally apply manual correction to ensure data quality and accuracy.

Performance


Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{liu2024mminstruct,
  title={MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity},
  author={Liu, Yangzhou and Cao, Yue and Gao, Zhangwei and Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Tian, Hao and Lu, Lewei and Zhu, Xizhou and Lu, Tong and others},
  journal={arXiv preprint arXiv:2407.15838},
  year={2024}
}
