Workspace Bulk Add

Adding files to workspaces in bulk

Basic recipe

Assume you have a directory path/to/files/PAGE containing PAGE files and a directory path/to/files/IMG containing images, with file names such as page_0001.xml, page_0001.tif, etc.

ocrd workspace bulk-add \
    --regex '^.*/(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.(?P<ext>[^\.]*)$' \
    --file-id 'FILE_{{ fileGrp }}_{{ pageid }}' \
    --page-id 'PHYS_{{ pageid }}' \
    --file-grp "{{ fileGrp }}" \
    --url '{{ fileGrp }}/FILE_{{ pageid }}.{{ ext }}' \
    'path/to/files/*/*.*'

This will first expand the glob to get filenames and resolve them to absolute paths.

Every path is then matched against --regex with re.match. The named groups of the regex become template variables derived from the syntax of the path, and these variables can be used in all file-specific options. The expanded --url is used as the file's path relative to the workspace directory; the file is copied there if it is not already present. After all template variables have been expanded, the file is added with Workspace.add_file.

In this case:

path/to/files/PAGE/page_0001.xml ->
- url: PAGE/FILE_0001.xml (will be copied because file name is different)
- fileGrp: PAGE
- ID: FILE_PAGE_0001
- pageId: PHYS_0001
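
The matching and expansion steps can be tried out in plain Python before running the command. The following is a minimal sketch, not the tool's implementation: the helper names expand and preview are made up for illustration, and the placeholder substitution is a simple stand-in that assumes the {{ name }} syntax shown above.

```python
import re

# Regex and templates copied from the bulk-add command above.
REGEX = r'^.*/(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.(?P<ext>[^\.]*)$'
TEMPLATES = {
    'file_id': 'FILE_{{ fileGrp }}_{{ pageid }}',
    'page_id': 'PHYS_{{ pageid }}',
    'file_grp': '{{ fileGrp }}',
    'url': '{{ fileGrp }}/FILE_{{ pageid }}.{{ ext }}',
}


def expand(template, variables):
    """Replace every '{{ name }}' placeholder with its captured value (hypothetical helper)."""
    return re.sub(r'\{\{\s*(\w+)\s*\}\}', lambda m: variables[m.group(1)], template)


def preview(path):
    """Return the expanded options for one path, or None if the regex does not match."""
    match = re.match(REGEX, path)
    if match is None:
        return None
    variables = match.groupdict()
    return {option: expand(template, variables) for option, template in TEMPLATES.items()}


if __name__ == '__main__':
    for option, value in preview('path/to/files/PAGE/page_0001.xml').items():
        print(f'{option}: {value}')
```

Running such a preview over a handful of real paths is a cheap way to catch a regex that silently captures the wrong part of the path.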

--mimetype, if not provided, is mapped from the file extension.
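
To illustrate the fallback, here is a small sketch of an extension lookup with an explicit override. The table ASSUMED_EXT_TO_MIME and the function guess_mimetype are hypothetical and only serve the example; the tool uses its own mapping, which may differ.

```python
from pathlib import Path

# Hypothetical extension table for illustration only; not the tool's actual mapping.
ASSUMED_EXT_TO_MIME = {
    '.tif': 'image/tiff',
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.xml': 'application/vnd.prima.page+xml',
}


def guess_mimetype(filename, explicit=None):
    """Prefer an explicitly given media type, else look up the file extension."""
    if explicit is not None:
        return explicit
    return ASSUMED_EXT_TO_MIME.get(Path(filename).suffix.lower())


if __name__ == '__main__':
    print(guess_mimetype('page_0001.tif'))
    print(guess_mimetype('page_0001.xml'))
    print(guess_mimetype('page_0001.xml', explicit='application/alto+xml'))
```

An extension such as .xml cannot distinguish PAGE from ALTO, so a lookup of this kind has to pick one; passing --mimetype explicitly avoids the ambiguity.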

--ignore disables the check for existing files with the same @ID, which makes large imports much faster.

Adding legacy OCR-D GT data in bulk

For example, to import the old (first-generation zip-file) OCR-D GT directories, one could run:

# in a directory where all zip-files have been extracted already:
for book in */; do

pushd "$book"
ocrd workspace init
# strip the trailing slash left by the */ glob
ocrd workspace set-id "${book%/}"

# only images, no copying
ocrd workspace bulk-add \
  --skip \
  --regex '^(?P<dispname>[^/]*)/(?P=dispname)_(?P<pageid>[0-9]*)\.tif$' \
  --file-id 'FILE_ORIG_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-IMG \
  --url '{{ dispname }}_{{ pageid }}.tif' \
  $(find . -name "*.tif")

# only PAGE, no copying
ocrd workspace bulk-add \
  --skip \
  --regex '^(?P<dispname>[^/]*)/page/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$' \
  --file-id 'FILE_GT_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-GT-SEG-PAGE \
  --url 'page/{{ dispname }}_{{ pageid }}.xml' \
  $(find . -name "*.xml")

# only ALTO, no copying
ocrd workspace bulk-add \
  --skip \
  --regex '^(?P<dispname>[^/]*)/alto/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$' \
  --file-id 'FILE_GT-ALTO_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-GT-ALTO-SEG-PAGE \
  --mimetype application/alto+xml \
  --url 'alto/{{ dispname }}_{{ pageid }}.xml' \
  $(find . -name "*.xml")

popd
done

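(In this convention the images sit directly in the book directory, so there is no image subdirectory that a regex could capture as fileGrp. Splitting the import into one call per file type works around this and also allows a basic form of string transformation, for example from the page subdirectory to the fileGrp OCR-D-GT-SEG-PAGE.)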
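
The way the three regexes divide the files among themselves can be checked in isolation. The sketch below copies the patterns from the calls above and tries them against made-up paths; the directory name book_a and the helper classify are hypothetical, and the paths are assumed to reach the regex without a leading ./ prefix.

```python
import re

# Regexes copied from the three bulk-add calls above, keyed by their fileGrp.
PATTERNS = {
    'OCR-D-IMG': r'^(?P<dispname>[^/]*)/(?P=dispname)_(?P<pageid>[0-9]*)\.tif$',
    'OCR-D-GT-SEG-PAGE': r'^(?P<dispname>[^/]*)/page/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$',
    'OCR-D-GT-ALTO-SEG-PAGE': r'^(?P<dispname>[^/]*)/alto/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$',
}


def classify(path):
    """Return (fileGrp, pageid) for the first pattern that matches, else None."""
    for file_grp, pattern in PATTERNS.items():
        match = re.match(pattern, path)
        if match:
            return file_grp, match.group('pageid')
    return None


if __name__ == '__main__':
    # Hypothetical paths following the legacy layout.
    for path in [
        'book_a/book_a_0001.tif',
        'book_a/page/book_a_0001.xml',
        'book_a/alto/book_a_0001.xml',
        'book_a/other_0001.tif',
    ]:
        print(path, '->', classify(path))
```

The backreference (?P=dispname) requires the file name to repeat the directory name, so a stray file such as other_0001.tif matches none of the patterns.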

Adding flat directory hierarchies

A common case is that images and annotations reside in per-document directories, with image files alongside PAGE-XML files of the same basename (as in the old LAREX bookpath convention, or in various GT collections). The following imports such books into (OCR-D conforming) METS without copying the files into new (OCR-D conforming) paths:

# in the bookpath/library directory:
for book in */; do

pushd "$book"
ocrd workspace init
# strip the trailing slash left by the */ glob
ocrd workspace set-id "${book%/}"

ocrd workspace bulk-add \
  --regex '^(?P<pageid>.*)\.xml$' \
  --file-id 'OCR-D-GT-SEG-LINE_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-GT-SEG-LINE \
  --url '{{ pageid }}.xml' \
  $(find . -name "*.xml" -not -name mets.xml)

ocrd workspace bulk-add \
  --regex '^(?P<pageid>.*)\.(?P<ext>[^.]*)$' \
  --file-id 'OCR-D-IMG_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-IMG \
  --url '{{ pageid }}.{{ ext }}' \
  $(find . -type f -not -name "*.xml")

popd
done
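
How the two regexes split a file name can be checked with a short sketch. The file names and the helper split_name are made up for illustration, the image regex is written with the named group spelled (?P<ext>...) as Python's re module requires, and the names are assumed to reach the regex without a leading ./ prefix.

```python
import re

# Patterns corresponding to the two bulk-add calls above.
XML_REGEX = r'^(?P<pageid>.*)\.xml$'
IMG_REGEX = r'^(?P<pageid>.*)\.(?P<ext>[^.]*)$'


def split_name(regex, filename):
    """Return the named groups captured from filename, or None if there is no match."""
    match = re.match(regex, filename)
    return match.groupdict() if match else None


if __name__ == '__main__':
    print(split_name(XML_REGEX, '0001.xml'))
    print(split_name(IMG_REGEX, '0001.png'))
    print(split_name(IMG_REGEX, '0001.bin.png'))
```

An image and a PAGE-XML file with the same basename yield the same pageid and therefore end up on the same physical page. Because the pageid group is greedy, a name with two extensions such as 0001.bin.png keeps the inner extension in its pageid and would be registered as a separate page.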
