add TextOCR dataset converter (open-mmlab#293)
* textocr converter for text recog

* textocr converter for text detection

* update documentation

* remove unnecessary garbage collection lines

* multi-processing textocr converter

* json->mmcv, fix documentation
gaotongxiao authored Jun 21, 2021
1 parent 8befec3 commit 7b072b0
Showing 3 changed files with 228 additions and 0 deletions.
47 changes: 47 additions & 0 deletions docs/datasets.md
@@ -34,6 +34,10 @@ The structure of the text detection dataset directory is organized as follows.
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
├── totaltext
│   ├── imgs
│   ├── instances_test.json
@@ -47,6 +51,7 @@ The structure of the text detection dataset directory is organized as follows.
| ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [renamed_imgs](https://download.openmmlab.com/mmocr/data/icdar2017/renamed_imgs.tar) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| Synthtext | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | | [instances_training.lmdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb) | - |
| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | | - | - | - |
| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | | - | - | - |

- For `icdar2015`:
@@ -96,6 +101,24 @@ The structure of the text detection dataset directory is organized as follows.
```bash
python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
```
- For `TextOCR`:
- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`.
```bash
mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/data/textdet/textocr_converter.py /path/to/textocr
```
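The converter's output follows the COCO detection layout. A minimal sketch of the expected structure (field names are assumed from the COCO convention that `convert_annotations` targets; file names and values below are invented for illustration):

```python
import json

# Hypothetical miniature of instances_training.json; field names follow
# the COCO detection convention, values are made up.
instances = {
    'images': [{'id': 0, 'file_name': 'train/0140fd.jpg',
                'height': 1024, 'width': 768}],
    'categories': [{'id': 1, 'name': 'text'}],
    'annotations': [{
        'id': 0,
        'image_id': 0,
        'category_id': 1,
        'bbox': [10, 20, 30, 15],  # x, y, w, h after integer rounding
        'area': 450.0,
        'segmentation': [[10, 20, 40, 20, 40, 35, 10, 35]],
        'iscrowd': 0,  # 1 flags illegible words (annotated as '.')
    }],
}

# Round-trip through JSON to confirm the layout is serializable.
roundtrip = json.loads(json.dumps(instances))
print(len(roundtrip['annotations']))
```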
- For `Totaltext`:
  - Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text) (we recommend downloading the `.mat` ground truth, since `totaltext_converter.py` only supports the `.mat` format).
```bash
@@ -171,6 +194,10 @@ The structure of the text detection dataset directory is organized as follows.
│ │ ├── label.txt
│ │ ├── label.lmdb
│ │ ├── SynthText_Add
│   ├── TextOCR
│ │ ├── image
│ │ ├── train_label.txt
│ │ ├── val_label.txt
│   ├── Totaltext
│ │ ├── imgs
│ │ ├── annotations
@@ -192,6 +219,7 @@ The structure of the text detection dataset directory is organized as follows.
| Syn90k | [homepage](https://www.robots.ox.ac.uk/~vgg/data/text/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/label.txt) | - | |
| SynthText | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt) \| [instances_train.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/label.txt) | - | |
| SynthAdd | [SynthText_Add.zip](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg) (code:627x) | [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt) | - | |
| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | - | - | |
| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - | |

- For `icdar_2013`:
@@ -284,6 +312,25 @@ For example,
```bash
python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
```
- For `TextOCR`:
- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`.
```bash
mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
```
- Step2: Generate `train_label.txt`, `val_label.txt` and crop images using 4 processes with the following command:
```bash
python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
```
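Each line of the generated label files pairs a cropped image path with its transcription. A sketch of how one line is composed, mirroring the string formatting in `process_img` (the sample values are invented):

```python
import os.path as osp

# Hypothetical values standing in for one TextOCR annotation.
dst_image_root = '/path/to/textocr/image'
img_idx, ann_idx, text_label = 0, 3, 'STOP'

dst_img_name = f'img_{img_idx}_{ann_idx}.jpg'
# Label format: "<image dir>/<crop name> <transcription>"
label = f'{osp.basename(dst_image_root)}/{dst_img_name} {text_label}'
print(label)  # image/img_0_3.jpg STOP
```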


- For `Totaltext`:
  - Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text) (we recommend downloading the `.mat` ground truth, since `totaltext_converter.py` only supports the `.mat` format).
74 changes: 74 additions & 0 deletions tools/data/textdet/textocr_converter.py
@@ -0,0 +1,74 @@
import argparse
import math
import os.path as osp

import mmcv

from mmocr.utils import convert_annotations


def parse_args():
parser = argparse.ArgumentParser(
description='Generate training and validation set of TextOCR ')
parser.add_argument('root_path', help='Root dir path of TextOCR')
args = parser.parse_args()
return args


def collect_textocr_info(root_path, annotation_filename, print_every=1000):

annotation_path = osp.join(root_path, annotation_filename)
if not osp.exists(annotation_path):
        raise FileNotFoundError(
            f'{annotation_path} does not exist, please check and try again.')

annotation = mmcv.load(annotation_path)

img_infos = []
for i, img_info in enumerate(annotation['imgs'].values()):
if i > 0 and i % print_every == 0:
print(f'{i}/{len(annotation["imgs"].values())}')

img_info['segm_file'] = annotation_path
ann_ids = annotation['imgToAnns'][img_info['id']]
anno_info = []
for ann_id in ann_ids:
ann = annotation['anns'][ann_id]

            # Illegible or non-English words are annotated as '.';
            # keep them but mark them as crowd regions
text_label = ann['utf8_string']
iscrowd = 1 if text_label == '.' else 0

x, y, w, h = ann['bbox']
x, y = max(0, math.floor(x)), max(0, math.floor(y))
w, h = math.ceil(w), math.ceil(h)
bbox = [x, y, w, h]
segmentation = [max(0, int(x)) for x in ann['points']]
anno = dict(
iscrowd=iscrowd,
category_id=1,
bbox=bbox,
area=ann['area'],
segmentation=[segmentation])
anno_info.append(anno)
img_info.update(anno_info=anno_info)
img_infos.append(img_info)
return img_infos


def main():
args = parse_args()
root_path = args.root_path
print('Processing training set...')
training_infos = collect_textocr_info(root_path, 'TextOCR_0.1_train.json')
convert_annotations(training_infos,
osp.join(root_path, 'instances_training.json'))
print('Processing validation set...')
val_infos = collect_textocr_info(root_path, 'TextOCR_0.1_val.json')
convert_annotations(val_infos, osp.join(root_path, 'instances_val.json'))
print('Finish')


if __name__ == '__main__':
main()
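The box rounding used by the converter can be isolated into a small sketch: the origin is floored (and clamped at zero) while the size is ceiled, so the integer box always covers the original float box. The sample coordinates are made up:

```python
import math

def snap_bbox(x, y, w, h):
    # Floor the top-left corner, clamping negative coordinates to 0,
    # and ceil the width/height so no annotated pixels are cut off.
    x, y = max(0, math.floor(x)), max(0, math.floor(y))
    w, h = math.ceil(w), math.ceil(h)
    return [x, y, w, h]

print(snap_bbox(12.7, -0.3, 30.2, 15.9))  # [12, 0, 31, 16]
```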
107 changes: 107 additions & 0 deletions tools/data/textrecog/textocr_converter.py
@@ -0,0 +1,107 @@
import argparse
import math
import os
import os.path as osp
from functools import partial

import mmcv

from mmocr.utils.fileio import list_to_file


def parse_args():
parser = argparse.ArgumentParser(
description='Generate training and validation set of TextOCR '
'by cropping box image.')
parser.add_argument('root_path', help='Root dir path of TextOCR')
parser.add_argument(
'n_proc', default=1, type=int, help='Number of processes to run')
args = parser.parse_args()
return args


def process_img(args, src_image_root, dst_image_root):
    # Accept a single tuple argument so this function can be dispatched
    # through mmcv.track_parallel_progress
img_idx, img_info, anns = args
src_img = mmcv.imread(osp.join(src_image_root, img_info['file_name']))
labels = []
for ann_idx, ann in enumerate(anns):
text_label = ann['utf8_string']

# Ignore illegible or non-English words
if text_label == '.':
continue

x, y, w, h = ann['bbox']
x, y = max(0, math.floor(x)), max(0, math.floor(y))
w, h = math.ceil(w), math.ceil(h)
dst_img = src_img[y:y + h, x:x + w]
dst_img_name = f'img_{img_idx}_{ann_idx}.jpg'
dst_img_path = osp.join(dst_image_root, dst_img_name)
mmcv.imwrite(dst_img, dst_img_path)
labels.append(f'{osp.basename(dst_image_root)}/{dst_img_name}'
f' {text_label}')
return labels


def convert_textocr(root_path,
dst_image_path,
dst_label_filename,
annotation_filename,
img_start_idx=0,
nproc=1):

annotation_path = osp.join(root_path, annotation_filename)
if not osp.exists(annotation_path):
        raise FileNotFoundError(
            f'{annotation_path} does not exist, please check and try again.')
src_image_root = root_path

# outputs
dst_label_file = osp.join(root_path, dst_label_filename)
dst_image_root = osp.join(root_path, dst_image_path)
os.makedirs(dst_image_root, exist_ok=True)

annotation = mmcv.load(annotation_path)

process_img_with_path = partial(
process_img,
src_image_root=src_image_root,
dst_image_root=dst_image_root)
tasks = []
for img_idx, img_info in enumerate(annotation['imgs'].values()):
ann_ids = annotation['imgToAnns'][img_info['id']]
anns = [annotation['anns'][ann_id] for ann_id in ann_ids]
tasks.append((img_idx + img_start_idx, img_info, anns))
labels_list = mmcv.track_parallel_progress(
process_img_with_path, tasks, keep_order=True, nproc=nproc)
final_labels = []
for label_list in labels_list:
final_labels += label_list
list_to_file(dst_label_file, final_labels)
return len(annotation['imgs'])


def main():
args = parse_args()
root_path = args.root_path
print('Processing training set...')
num_train_imgs = convert_textocr(
root_path=root_path,
dst_image_path='image',
dst_label_filename='train_label.txt',
annotation_filename='TextOCR_0.1_train.json',
nproc=args.n_proc)
print('Processing validation set...')
convert_textocr(
root_path=root_path,
dst_image_path='image',
dst_label_filename='val_label.txt',
annotation_filename='TextOCR_0.1_val.json',
img_start_idx=num_train_imgs,
nproc=args.n_proc)
print('Finish')


if __name__ == '__main__':
main()
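The crop in `process_img` is a plain NumPy slice on the `(H, W, C)` array that `mmcv.imread` returns. A self-contained sketch with a dummy array standing in for a real image:

```python
import numpy as np

# A dummy 100x200 BGR image in place of the real mmcv.imread result.
src_img = np.zeros((100, 200, 3), dtype=np.uint8)

# Integer box (x, y, w, h) after the floor/ceil rounding step.
x, y, w, h = 40, 10, 25, 30
dst_img = src_img[y:y + h, x:x + w]  # rows are y, columns are x
print(dst_img.shape)  # (30, 25, 3)
```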
