Robert Sachunsky edited this page Feb 9, 2022 · 4 revisions

Adding files to workspaces in bulk

Basic recipe

Let's assume you have directories path/to/files/PAGE containing PAGE files and path/to/files/IMG containing images. The files are named page_0001.xml, page_0001.tif, etc.

ocrd workspace bulk-add \
    --regex '^.*/(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.(?P<ext>[^\.]*)$' \
    --file-id 'FILE_{{ fileGrp }}_{{ pageid }}' \
    --page-id 'PHYS_{{ pageid }}' \
    --file-grp "{{ fileGrp }}" \
    --url '{{ fileGrp }}/FILE_{{ pageid }}.{{ ext }}' \
    'path/to/files/*/*.*'

This will first expand the glob to get filenames and resolve them to absolute paths.

Every path is then matched against --regex with re.match, yielding template variables (the named groups) derived from the syntax of the path. These template variables can be used in all file-specific options. The expanded --url is used as the file's path relative to the workspace directory; the source file is copied there if it is not already present. After all template variables have been expanded, the file is added with Workspace.add_file.

In this case:

  • path/to/files/PAGE/page_0001.xml ->
    • url: PAGE/FILE_0001.xml (will be copied because file name is different)
    • fileGrp: PAGE
    • ID: FILE_PAGE_0001
    • pageId: PHYS_0001
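
The matching and expansion steps can be tried out in isolation. The following is only a minimal sketch, not the code used by ocrd: the expand helper, the plain string replacement of '{{ name }}' placeholders and the absolute path /data/path/to/files/PAGE/page_0001.xml are assumptions made for illustration. The regex and templates are the ones from the call above.

```python
import re

# Regex and templates copied from the bulk-add call above.
REGEX = r'^.*/(?P<fileGrp>[^/]+)/page_(?P<pageid>.*)\.(?P<ext>[^\.]*)$'
TEMPLATES = {
    'file_id': 'FILE_{{ fileGrp }}_{{ pageid }}',
    'page_id': 'PHYS_{{ pageid }}',
    'file_grp': '{{ fileGrp }}',
    'url': '{{ fileGrp }}/FILE_{{ pageid }}.{{ ext }}',
}

def expand(path, regex=REGEX, templates=TEMPLATES):
    """Match one path and fill every template with the named groups.

    Hypothetical helper: literal '{{ name }}' replacement is an assumption
    about how the placeholders are substituted.
    """
    match = re.match(regex, path)
    if match is None:
        return None  # path does not fit the naming scheme
    groups = match.groupdict()
    expanded = {}
    for option, template in templates.items():
        for name, value in groups.items():
            template = template.replace('{{ %s }}' % name, value)
        expanded[option] = template
    return expanded

# Hypothetical absolute path, as it might look after resolving the glob.
result = expand('/data/path/to/files/PAGE/page_0001.xml')
for option, value in result.items():
    print(option, '=', value)
```

Running the sketch on a handful of real paths before the actual import is a cheap way to check that the regex captures what the templates expect.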

--mimetype, if not provided, is mapped from the file extension.
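
The extension fallback can be pictured as a simple lookup. The table and the guess_mimetype helper below are illustrative assumptions, not the mapping shipped with ocrd. They only reflect what the examples on this page suggest: .xml files are treated as PAGE unless told otherwise, which is why the ALTO import further down passes --mimetype explicitly.

```python
import os

# Hypothetical extension table, for illustration only.
EXT_TO_MIME = {
    '.tif': 'image/tiff',
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.xml': 'application/vnd.prima.page+xml',  # assumed default for XML: PAGE
}

def guess_mimetype(url, explicit=None):
    """Return the explicit --mimetype if given, else look up the extension."""
    if explicit:
        return explicit
    ext = os.path.splitext(url)[1].lower()
    return EXT_TO_MIME.get(ext)  # None if the extension is unknown

print(guess_mimetype('PAGE/FILE_0001.xml'))
print(guess_mimetype('alto/FILE_0001.xml', explicit='application/alto+xml'))
```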

--ignore will disable the check for existing files with the same @ID and is a huge performance boost.

Adding legacy OCR-D GT data in bulk

For example, to import the old (first-generation zip-file) OCR-D GT directories, one could then do:

# in a directory where all zip-files have been extracted already:
for book in */; do

pushd "$book"
ocrd workspace init
# strip the trailing slash left by the */ glob before using the name as ID
ocrd workspace set-id "${book%/}"

# only images, no copying
ocrd workspace bulk-add \
  --skip \
  --regex '^(?P<dispname>[^/]*)/(?P=dispname)_(?P<pageid>[0-9]*)\.tif$' \
  --file-id 'FILE_ORIG_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-IMG \
  --url '{{ dispname }}_{{ pageid }}.tif' \
  $(find . -name "*.tif")

# only PAGE, no copying
ocrd workspace bulk-add \
  --skip \
  --regex '^(?P<dispname>[^/]*)/page/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$' \
  --file-id 'FILE_GT_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-GT-SEG-PAGE \
  --url 'page/{{ dispname }}_{{ pageid }}.xml' \
  $(find . -name "*.xml")

# only ALTO, no copying
ocrd workspace bulk-add \
  --skip \
  --regex '^(?P<dispname>[^/]*)/alto/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$' \
  --file-id 'FILE_GT-ALTO_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-GT-ALTO-SEG-PAGE \
  --mimetype application/alto+xml \
  --url 'alto/{{ dispname }}_{{ pageid }}.xml' \
  $(find . -name "*.xml")

popd
done

(In this convention the images have no subdirectory of their own, so fileGrp cannot be captured from the path directly. Splitting the import into one call per file type works around that and allows a basic form of string transformation.)
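
The regexes in this recipe rely on the named backreference (?P=dispname), which only matches when the file name starts with the name of its directory. The sketch below shows how the image and PAGE patterns sort a few paths. The sample paths (book_a/...) and the classify helper are made up for illustration, and the sketch assumes the paths reach the regex in the relative form the patterns are written for.

```python
import re

# Patterns copied from the image and PAGE calls above.
IMG_REGEX = r'^(?P<dispname>[^/]*)/(?P=dispname)_(?P<pageid>[0-9]*)\.tif$'
PAGE_REGEX = r'^(?P<dispname>[^/]*)/page/(?P=dispname)_(?P<pageid>[0-9]*)\.xml$'

def classify(path):
    """Return (fileGrp, pageid) for the first pattern that matches, else None."""
    for file_grp, regex in (('OCR-D-IMG', IMG_REGEX),
                            ('OCR-D-GT-SEG-PAGE', PAGE_REGEX)):
        match = re.match(regex, path)
        if match:
            return file_grp, match.group('pageid')
    return None  # with --skip, such a path would simply be left out

# Hypothetical sample paths.
for path in ['book_a/book_a_0001.tif',
             'book_a/page/book_a_0001.xml',
             'book_a/alto/book_a_0001.xml',   # fits neither pattern
             'book_a/book_b_0001.tif']:       # prefix differs from directory
    print(path, '->', classify(path))
```

Because find returns every .xml file, each call sees paths meant for the other calls as well; --skip is what keeps those non-matching paths from aborting the import.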

Adding flat directory hierarchies

In the common case where images and annotations reside in per-document directories, with image files alongside PAGE-XML files of the same basename (as in the old LAREX bookpath convention, or in various GT collections), the following imports such books into (OCR-D conforming) METS without copying the files into new (OCR-D conforming) paths:

# in the bookpath/library directory:
for book in */; do

pushd "$book"
ocrd workspace init
# strip the trailing slash left by the */ glob before using the name as ID
ocrd workspace set-id "${book%/}"

ocrd workspace bulk-add \
  --regex '^(?P<pageid>.*)\.xml$' \
  --file-id 'OCR-D-GT-SEG-LINE_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-GT-SEG-LINE \
  --url '{{ pageid }}.xml' \
  $(find . -name "*.xml" -not -name mets.xml)

ocrd workspace bulk-add \
  --regex '^(?P<pageid>.*)\.(?P<ext>[^.]*)$' \
  --file-id 'OCR-D-IMG_{{ pageid }}' \
  --page-id 'PHYS_{{ pageid }}' \
  --file-grp OCR-D-IMG \
  --url '{{ pageid }}.{{ ext }}' \
  $(find . -type f -not -name "*.xml")

popd
done
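
Both calls derive the page ID from the basename, which is what ties an image to its PAGE file. The short sketch below checks this with Python's re module, using the named-group syntax (?P<name>...) for both patterns. The file names and the page_id helper are hypothetical, and the sketch assumes the paths reach the regex as plain names relative to the book directory.

```python
import re

XML_REGEX = r'^(?P<pageid>.*)\.xml$'
IMG_REGEX = r'^(?P<pageid>.*)\.(?P<ext>[^.]*)$'

def page_id(regex, path):
    """Expand the --page-id template 'PHYS_{{ pageid }}' for one path."""
    match = re.match(regex, path)
    return 'PHYS_' + match.group('pageid') if match else None

# Hypothetical file names from one book directory.
print(page_id(XML_REGEX, '0001.xml'))      # PHYS_0001
print(page_id(IMG_REGEX, '0001.png'))      # PHYS_0001
print(page_id(IMG_REGEX, '0001.bin.png'))  # PHYS_0001.bin
```

The last line shows a limitation of this simple pattern: the greedy pageid group takes everything up to the last dot, so derived images with an extra suffix end up on a page ID of their own instead of joining their PAGE file.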
