add TextOCR dataset converter (open-mmlab#293)
* textocr converter for text recog

* textocr converter for text detection

* update documentation

* remove unnecessary garbage collection lines

* multi-processing textocr converter

* json->mmcv, fix documentation
gaotongxiao authored Jun 21, 2021
1 parent 8befec3 commit 7b072b0
Showing 3 changed files with 228 additions and 0 deletions.
47 changes: 47 additions & 0 deletions docs/datasets.md
@@ -34,6 +34,10 @@ The structure of the text detection dataset directory is organized as follows.
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
├── totaltext
│   ├── imgs
│   ├── instances_test.json
@@ -47,6 +51,7 @@ The structure of the text detection dataset directory is organized as follows.
| ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [renamed_imgs](https://download.openmmlab.com/mmocr/data/icdar2017/renamed_imgs.tar) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| Synthtext | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | | [instances_training.lmdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb) | - |
| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | | - | - | - |
| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | | - | - | - |

- For `icdar2015`:
@@ -96,6 +101,24 @@ The structure of the text detection dataset directory is organized as follows.
```bash
python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
```
- For `TextOCR`:
- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`.
```bash
mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/data/textdet/textocr_converter.py /path/to/textocr
```
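The converter's output follows the COCO detection layout. A minimal sketch of the expected structure (field names are assumed from the COCO convention that `convert_annotations` targets; file names and values below are invented for illustration):

```python
import json

# Hypothetical miniature of instances_training.json; field names follow
# the COCO detection convention, values are made up.
instances = {
    'images': [{'id': 0, 'file_name': 'train/0140fd.jpg',
                'height': 1024, 'width': 768}],
    'categories': [{'id': 1, 'name': 'text'}],
    'annotations': [{
        'id': 0,
        'image_id': 0,
        'category_id': 1,
        'bbox': [10, 20, 30, 15],  # x, y, w, h after integer rounding
        'area': 450.0,
        'segmentation': [[10, 20, 40, 20, 40, 35, 10, 35]],
        'iscrowd': 0,  # 1 flags illegible words (annotated as '.')
    }],
}

# Round-trip through JSON to confirm the layout is serializable.
roundtrip = json.loads(json.dumps(instances))
print(len(roundtrip['annotations']))
```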
- For `Totaltext`:
  - Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text) (we recommend downloading the `.mat` ground truth, since `totaltext_converter.py` only supports the `.mat` format).
```bash
@@ -171,6 +194,10 @@ The structure of the text detection dataset directory is organized as follows.
│ │ ├── label.txt
│ │ ├── label.lmdb
│ │ ├── SynthText_Add
│   ├── TextOCR
│ │ ├── image
│ │ ├── train_label.txt
│ │ ├── val_label.txt
│   ├── Totaltext
│ │ ├── imgs
│ │ ├── annotations
@@ -192,6 +219,7 @@ The structure of the text detection dataset directory is organized as follows.
| Syn90k | [homepage](https://www.robots.ox.ac.uk/~vgg/data/text/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/label.txt) | - | |
| SynthText | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt) \| [instances_train.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/label.txt) | - | |
| SynthAdd | [SynthText_Add.zip](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg) (code:627x) | [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt) | - | |
| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | - | - | |
| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - | |

- For `icdar_2013`:
@@ -284,6 +312,25 @@ For example,
```bash
python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
```
- For `TextOCR`:
- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`.
```bash
mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
```
- Step2: Generate `train_label.txt`, `val_label.txt` and crop images using 4 processes with the following command:
```bash
python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
```
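Each line of the generated label files pairs a cropped image path with its transcription. A sketch of how one line is composed, mirroring the string formatting in `process_img` (the sample values are invented):

```python
import os.path as osp

# Hypothetical values standing in for one TextOCR annotation.
dst_image_root = '/path/to/textocr/image'
img_idx, ann_idx, text_label = 0, 3, 'STOP'

dst_img_name = f'img_{img_idx}_{ann_idx}.jpg'
# Label format: "<image dir>/<crop name> <transcription>"
label = f'{osp.basename(dst_image_root)}/{dst_img_name} {text_label}'
print(label)  # image/img_0_3.jpg STOP
```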


- For `Totaltext`:
  - Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text) (we recommend downloading the `.mat` ground truth, since `totaltext_converter.py` only supports the `.mat` format).
74 changes: 74 additions & 0 deletions tools/data/textdet/textocr_converter.py
@@ -0,0 +1,74 @@
import argparse
import math
import os.path as osp

import mmcv

from mmocr.utils import convert_annotations


def parse_args():
parser = argparse.ArgumentParser(
description='Generate training and validation set of TextOCR ')
parser.add_argument('root_path', help='Root dir path of TextOCR')
args = parser.parse_args()
return args


def collect_textocr_info(root_path, annotation_filename, print_every=1000):

annotation_path = osp.join(root_path, annotation_filename)
if not osp.exists(annotation_path):
        raise FileNotFoundError(
            f'{annotation_path} does not exist, please check and try again.')

annotation = mmcv.load(annotation_path)

img_infos = []
for i, img_info in enumerate(annotation['imgs'].values()):
if i > 0 and i % print_every == 0:
print(f'{i}/{len(annotation["imgs"].values())}')

img_info['segm_file'] = annotation_path
ann_ids = annotation['imgToAnns'][img_info['id']]
anno_info = []
for ann_id in ann_ids:
ann = annotation['anns'][ann_id]

            # Illegible or non-English words are annotated as '.';
            # keep them but mark them as crowd regions
text_label = ann['utf8_string']
iscrowd = 1 if text_label == '.' else 0

x, y, w, h = ann['bbox']
x, y = max(0, math.floor(x)), max(0, math.floor(y))
w, h = math.ceil(w), math.ceil(h)
bbox = [x, y, w, h]
segmentation = [max(0, int(x)) for x in ann['points']]
anno = dict(
iscrowd=iscrowd,
category_id=1,
bbox=bbox,
area=ann['area'],
segmentation=[segmentation])
anno_info.append(anno)
img_info.update(anno_info=anno_info)
img_infos.append(img_info)
return img_infos


def main():
args = parse_args()
root_path = args.root_path
print('Processing training set...')
training_infos = collect_textocr_info(root_path, 'TextOCR_0.1_train.json')
convert_annotations(training_infos,
osp.join(root_path, 'instances_training.json'))
print('Processing validation set...')
val_infos = collect_textocr_info(root_path, 'TextOCR_0.1_val.json')
convert_annotations(val_infos, osp.join(root_path, 'instances_val.json'))
print('Finish')


if __name__ == '__main__':
main()
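The box rounding used by the converter can be isolated into a small sketch: the origin is floored (and clamped at zero) while the size is ceiled, so the integer box always covers the original float box. The sample coordinates are made up:

```python
import math

def snap_bbox(x, y, w, h):
    # Floor the top-left corner, clamping negative coordinates to 0,
    # and ceil the width/height so no annotated pixels are cut off.
    x, y = max(0, math.floor(x)), max(0, math.floor(y))
    w, h = math.ceil(w), math.ceil(h)
    return [x, y, w, h]

print(snap_bbox(12.7, -0.3, 30.2, 15.9))  # [12, 0, 31, 16]
```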
107 changes: 107 additions & 0 deletions tools/data/textrecog/textocr_converter.py
@@ -0,0 +1,107 @@
import argparse
import math
import os
import os.path as osp
from functools import partial

import mmcv

from mmocr.utils.fileio import list_to_file


def parse_args():
parser = argparse.ArgumentParser(
description='Generate training and validation set of TextOCR '
'by cropping box image.')
parser.add_argument('root_path', help='Root dir path of TextOCR')
parser.add_argument(
'n_proc', default=1, type=int, help='Number of processes to run')
args = parser.parse_args()
return args


def process_img(args, src_image_root, dst_image_root):
    # Accept a single tuple argument so this function can be dispatched
    # through mmcv.track_parallel_progress
img_idx, img_info, anns = args
src_img = mmcv.imread(osp.join(src_image_root, img_info['file_name']))
labels = []
for ann_idx, ann in enumerate(anns):
text_label = ann['utf8_string']

# Ignore illegible or non-English words
if text_label == '.':
continue

x, y, w, h = ann['bbox']
x, y = max(0, math.floor(x)), max(0, math.floor(y))
w, h = math.ceil(w), math.ceil(h)
dst_img = src_img[y:y + h, x:x + w]
dst_img_name = f'img_{img_idx}_{ann_idx}.jpg'
dst_img_path = osp.join(dst_image_root, dst_img_name)
mmcv.imwrite(dst_img, dst_img_path)
labels.append(f'{osp.basename(dst_image_root)}/{dst_img_name}'
f' {text_label}')
return labels


def convert_textocr(root_path,
dst_image_path,
dst_label_filename,
annotation_filename,
img_start_idx=0,
nproc=1):

annotation_path = osp.join(root_path, annotation_filename)
if not osp.exists(annotation_path):
        raise FileNotFoundError(
            f'{annotation_path} does not exist, please check and try again.')
src_image_root = root_path

# outputs
dst_label_file = osp.join(root_path, dst_label_filename)
dst_image_root = osp.join(root_path, dst_image_path)
os.makedirs(dst_image_root, exist_ok=True)

annotation = mmcv.load(annotation_path)

process_img_with_path = partial(
process_img,
src_image_root=src_image_root,
dst_image_root=dst_image_root)
tasks = []
for img_idx, img_info in enumerate(annotation['imgs'].values()):
ann_ids = annotation['imgToAnns'][img_info['id']]
anns = [annotation['anns'][ann_id] for ann_id in ann_ids]
tasks.append((img_idx + img_start_idx, img_info, anns))
labels_list = mmcv.track_parallel_progress(
process_img_with_path, tasks, keep_order=True, nproc=nproc)
final_labels = []
for label_list in labels_list:
final_labels += label_list
list_to_file(dst_label_file, final_labels)
return len(annotation['imgs'])


def main():
args = parse_args()
root_path = args.root_path
print('Processing training set...')
num_train_imgs = convert_textocr(
root_path=root_path,
dst_image_path='image',
dst_label_filename='train_label.txt',
annotation_filename='TextOCR_0.1_train.json',
nproc=args.n_proc)
print('Processing validation set...')
convert_textocr(
root_path=root_path,
dst_image_path='image',
dst_label_filename='val_label.txt',
annotation_filename='TextOCR_0.1_val.json',
img_start_idx=num_train_imgs,
nproc=args.n_proc)
print('Finish')


if __name__ == '__main__':
main()
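The crop in `process_img` is a plain NumPy slice on the `(H, W, C)` array that `mmcv.imread` returns. A self-contained sketch with a dummy array standing in for a real image:

```python
import numpy as np

# A dummy 100x200 BGR image in place of the real mmcv.imread result.
src_img = np.zeros((100, 200, 3), dtype=np.uint8)

# Integer box (x, y, w, h) after the floor/ceil rounding step.
x, y, w, h = 40, 10, 25, 30
dst_img = src_img[y:y + h, x:x + w]  # rows are y, columns are x
print(dst_img.shape)  # (30, 25, 3)
```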
