GT guidelines: implement and open for community contributions #226

bertsky · 2021-06-07T22:17:31Z

In their current form, the OCR-D transcription guidelines are often of little use to annotators looking for answers or guidance. They are written as top-down intellectual accounts, but they are not formal (i.e. runnable/verifiable), not searchable and, well, quite incomplete. Although many examples are already given, they are not nearly enough for the diverse set of materials and peculiarities that annotators face (esp. those without a bibliological / humanities background).

How can we improve that?

I propose attacking this on multiple levels:

first fixing GT guidelines: fix formatting #225 and toc sidebar for GT guidelines #207 (and perhaps google custom search for documentation #102)
finally starting a software implementation (which can normalize arbitrary text input at each GT level or canonicalize to the next lower level)
opening up the repository for comments and amendments by users/practitioners (perhaps in the same way that the workflow guide was mirrored to the wiki and gets synchronized back every now and then)
supplementing https://ocr-d.de/en/gt-guidelines/trans/ocr_d_koordinationsgremium_codierung.html and https://ocr-d.de/en/gt-guidelines/trans/trFremdsprache.html with data columns for all GT levels (for quick lookup)
starting a public glyph repository by aggregating diverse textual GT, enriching it with glyph coordinates via OCR (e.g. Tesseract 3 segmenter) forced alignment, and extracting glyph image-text file pairs
tying the website to the glyph repo with a dedicated search interface: text→image search and image→text search (via image similarity like in Newspaper Navigator)
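
To make the glyph repository step more concrete, here is a minimal sketch of extracting glyph image-text pairs once glyph coordinates are available. It assumes the alignment output has already been converted into per-glyph bounding boxes; `GlyphBox` and `extract_glyph_pairs` are hypothetical names, and the image is a plain nested list of pixel values rather than a real image object.

```python
from typing import List, NamedTuple, Tuple

# A grayscale image as rows of pixel values (stand-in for a real image type).
Image = List[List[int]]


class GlyphBox(NamedTuple):
    """Hypothetical record: one aligned glyph and its pixel bounding box."""
    text: str
    x0: int  # left, inclusive
    y0: int  # top, inclusive
    x1: int  # right, exclusive
    y1: int  # bottom, exclusive


def crop(image: Image, box: GlyphBox) -> Image:
    """Cut the rectangle described by `box` out of `image`."""
    return [row[box.x0:box.x1] for row in image[box.y0:box.y1]]


def extract_glyph_pairs(image: Image, boxes: List[GlyphBox]) -> List[Tuple[str, Image]]:
    """Return (text, glyph image) pairs, skipping empty or out-of-bounds boxes."""
    height = len(image)
    width = len(image[0]) if image else 0
    pairs = []
    for box in boxes:
        inside = 0 <= box.x0 < box.x1 <= width and 0 <= box.y0 < box.y1 <= height
        if not inside:
            continue  # misaligned glyphs are dropped rather than guessed at
        pairs.append((box.text, crop(image, box)))
    return pairs


if __name__ == "__main__":
    page = [[1, 2, 3, 4], [5, 6, 7, 8]]
    boxes = [GlyphBox("a", 0, 0, 2, 2), GlyphBox("b", 2, 0, 4, 2)]
    for text, glyph in extract_glyph_pairs(page, boxes):
        print(text, glyph)
```

A real pipeline would read the coordinates from the OCR output and write each pair to disk; the sketch only shows the pairing logic.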

bertsky · 2021-06-09T08:58:12Z

opening up the repository for comments and amendments by users/practitioners (perhaps in the same way that the workflow guide was mirrored to the wiki and gets synchronized back every now and then)

It's not as convenient as a wiki (with direct preview), nor as convenient as editing Markdown files on GitHub (with direct preview), but perhaps users can just fork/edit the gt-guidelines repo?

finally starting a software implementation (which can normalize arbitrary text input at each GT level or canonicalize to the next lower level)

Existing places to look (just off the top of my head):

Calamari text regularizer rules
cor-asv-ann-evaluate character equivalences
GT4HistOCR/tools/regularize.pl (with care!)
IMPACT ocrevalUAtion character equivalences
Ocropus1 replacements and Ocropus1 whitespace rules
Tesseract string normalization and Tesseract grapheme/language validation
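
The proposed implementation (normalizing text at a given GT level, or canonicalizing it to the next lower level) could look roughly like the sketch below. It assumes three levels, with 3 the most faithful and 1 the most normalized, and that each step down is a plain string replacement table. The tables and the names `LEVEL_RULES`, `lower_one_level` and `canonicalize` are illustrative placeholders; the actual rules would have to come from the guideline tables and the sources listed above.

```python
import unicodedata

# Hypothetical, illustrative rules only (not taken from the guidelines).
# LEVEL_RULES[n] maps level-n forms to their level-(n-1) equivalents.
LEVEL_RULES = {
    3: {
        "\ufb05": "\u017ft",  # ligature "long s + t" -> two characters
        "\ufb00": "ff",       # ligature "ff" -> two characters
    },
    2: {
        "\u017f": "s",         # long s -> round s
        "a\u0364": "\u00e4",  # a + combining small e above -> a-umlaut
        "o\u0364": "\u00f6",  # o + combining small e above -> o-umlaut
        "u\u0364": "\u00fc",  # u + combining small e above -> u-umlaut
    },
}


def lower_one_level(text: str, level: int) -> str:
    """Canonicalize `text` from `level` to `level - 1`."""
    if level not in LEVEL_RULES:
        raise ValueError(f"no rules for lowering from level {level}")
    rules = LEVEL_RULES[level]
    # Apply longer source strings first so multi-character rules win.
    for source in sorted(rules, key=len, reverse=True):
        text = text.replace(source, rules[source])
    return text


def canonicalize(text: str, source_level: int, target_level: int) -> str:
    """Bring `text` from `source_level` down to `target_level`."""
    if not 1 <= target_level <= source_level <= 3:
        raise ValueError("levels must satisfy 1 <= target <= source <= 3")
    # NFC keeps ligatures and long s intact (unlike NFKC), so it is safe here.
    text = unicodedata.normalize("NFC", text)
    for level in range(source_level, target_level, -1):
        text = lower_one_level(text, level)
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(canonicalize("Wa\u017f\u017fer", 2, 1))
```

Keeping one table per level step means the same data could also drive the quick-lookup columns on the website, but that is a design suggestion, not something the existing tools do.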

kba · 2023-04-25T12:58:20Z

ping @tboenig once he's back from vacation

bertsky mentioned this issue Jul 22, 2021

text regularizer: missing replacements, new group combinations Calamari-OCR/calamari#264

Open

krvoigt added the Epic label Feb 9, 2022

krvoigt assigned Boenig and tboenig Feb 9, 2022

tboenig added the gt label Feb 14, 2022

kba mentioned this issue Apr 25, 2023

GT Guidelines and OCR-D Wiki #294

Closed

3 tasks
