text regularizer: missing replacements, new group combinations #264

Open
bertsky opened this issue Jun 12, 2021 · 3 comments
Labels
accuracy Concerns quality of (some model's) predictions enhancement New feature or request

Comments

@bertsky
Collaborator

bertsky commented Jun 12, 2021

While trying to compare GT4HistOCR model performance between Tesseract and Calamari, I stumbled over a few peculiarities of Calamari's (superb!) text pre/postprocessor.

First off, when using vanilla Levenshtein (unweighted, without character equivalences or normalization beyond NFC), CER is 0.8% and thus already pretty low on the Calamari model trained by the Qurator team, which seems to have used the default regularizer group, extended.

But on closer inspection, many of the remaining errors can be attributed to the quotes and various subgroups of the extended regularizer. Together they amount to about half of the CER (0.4%).
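For reference, the kind of measurement meant here can be sketched as follows. This is only a minimal illustration of unweighted Levenshtein CER with NFC as the sole normalization; the function names are made up for this sketch, and it is not the evaluation code actually used for the numbers above.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance: insertion, deletion and substitution all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(gt: str, pred: str) -> float:
    """Character error rate relative to GT length, after NFC normalization only."""
    gt = unicodedata.normalize("NFC", gt)
    pred = unicodedata.normalize("NFC", pred)
    return levenshtein(gt, pred) / max(len(gt), 1)
```

With this sketch, a GT line `"Ja"` against a prediction `''Ja''` yields a CER of 1.0, because each rewritten quote costs one substitution plus one insertion. That is how a regularizer rule baked into the model can show up as error against level 2 GT.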

Now, GT4HistOCR more or less corresponds to the DTA / OCR-D GT transcription guidelines at level 2. To me that seems like a plausible compromise for OCR models: level 2 transcriptions are not as expensive to produce as level 3 (while still preserving many semantically important distinctions) and can still be reduced to level 1 automatically afterwards (if strong normalization is needed, e.g. for search indexing).

The Calamari equivalent of GT level 2 would be ['spaces', 'roman_digits', 'ligatures-consonantal'] IINM. I therefore suggest giving that combination a predefined group under a suitable name, say conservative (and probably even making that the new default). Also, it would not hurt to map all official GT levels to aliases in Calamari:

  • gtlevel1: ['spaces', 'punctuation', 'quotes', 'various', 'roman_digits', 'ligatures-vocal', 'ligatures-consonantal', 'uvius']
  • gtlevel2: ['spaces', 'roman_digits', 'ligatures-consonantal'] (or 'conservative')
  • gtlevel3: ['spaces']
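To make the alias idea concrete, here is a rough sketch of how such names could be expanded into the existing groups. The table `GT_LEVEL_GROUPS` and the helper `expand_groups` are hypothetical names for illustration, not existing Calamari code; only the group names are taken from the list above.

```python
# Hypothetical alias table; the member group names are those listed above.
GT_LEVEL_GROUPS = {
    "gtlevel1": ["spaces", "punctuation", "quotes", "various", "roman_digits",
                 "ligatures-vocal", "ligatures-consonantal", "uvius"],
    "gtlevel2": ["spaces", "roman_digits", "ligatures-consonantal"],
    "gtlevel3": ["spaces"],
}
# "conservative" is just a second name for the level 2 combination.
GT_LEVEL_GROUPS["conservative"] = GT_LEVEL_GROUPS["gtlevel2"]


def expand_groups(names):
    """Replace aliases by their member groups, keeping order and dropping duplicates."""
    expanded = []
    for name in names:
        # Names that are not aliases are passed through unchanged.
        for group in GT_LEVEL_GROUPS.get(name, [name]):
            if group not in expanded:
                expanded.append(group)
    return expanded
```

Passing plain group names through unchanged would allow mixing an alias with extra groups, e.g. `expand_groups(["gtlevel3", "quotes"])`.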

Furthermore, I believe these regularizers should be made available prominently on the CLIs:

  • by describing and advertising these options in calamari-train --data.post_proc.processors.5.replacement_groups
  • by adding and describing these options to calamari-predict and calamari-eval (for additional postprocessing beyond what's already in the model)
  • by creating a separate CLI merely for text post-processing (to re-use elsewhere), say calamari-textproc
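As a rough idea of what a standalone calamari-textproc could look like: the sketch below is purely hypothetical. No such CLI exists yet, the option name merely mirrors replacement_groups from calamari-train, and the two rule entries are placeholders rather than Calamari's actual tables.

```python
import argparse
import re
import sys

# Placeholder rules for illustration only; NOT Calamari's actual rule tables.
RULES = {
    "spaces": [(r"\s+", " ")],
    "quotes": [("[\u201c\u201d\u201e]", '"')],
}


def regularize(text, groups):
    """Apply the regex rules of each selected group, in the given order."""
    for group in groups:
        for pattern, replacement in RULES[group]:
            text = re.sub(pattern, replacement, text)
    return text


def build_parser():
    """Command line for the hypothetical calamari-textproc tool."""
    parser = argparse.ArgumentParser(prog="calamari-textproc")
    parser.add_argument("--replacement_groups", nargs="+", default=["spaces"],
                        choices=sorted(RULES),
                        help="rule groups to apply, in order")
    return parser


def main(argv=None):
    """Filter stdin to stdout line by line."""
    args = build_parser().parse_args(argv)
    for line in sys.stdin:
        result = regularize(line.rstrip("\n"), args.replacement_groups)
        sys.stdout.write(result.strip() + "\n")
```

Keeping `regularize` separate from `main` would let calamari-predict and calamari-eval reuse the same function for additional postprocessing.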

And second, within the existing groups, I believe a few individual rule changes are worth considering:

  1. The quotes group does not contain the following characters yet: ‚ ‛ ‟ « » ‹ › 〟 〞‟ (low-9, high-reversed-9, angular and historical variants)
  2. In the quotes group, IMHO " (ASCII dq) to '' (double ASCII sq) normalization is not adequate under most circumstances and thus should be an extra.
  3. It's probably useful to also have rules for regularizing footnote numerals ⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ꝰ to ASCII.
  4. In the various group,
    • add 𝛑 → π (U+1D6D1 MATHEMATICAL BOLD SMALL PI)
    • add 𝜋 → π (U+1D70B MATHEMATICAL ITALIC SMALL PI)
    • add 𝝅 → π (U+1D745 MATHEMATICAL BOLD ITALIC SMALL PI)
    • add 𝝿 → π (U+1D77F MATHEMATICAL SANS-SERIF BOLD SMALL PI)
    • add 𝞹 → π (U+1D7B9 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PI)
    • add − → - (U+2212 MINUS SIGN)
    • add ‐ → - (U+2010 HYPHEN)
    • add ‑ → - (U+2011 NON-BREAKING HYPHEN)
    • add ‒ → - (U+2012 FIGURE DASH)
    • add ― → - (U+2015 QUOTATION DASH)
    • add ⁃ → - (U+2043 HYPHEN BULLET)
    • add ﹘ → - (U+FE58 SMALL EM DASH)
    • add ─ → - (U+2500 FORMS LIGHT HORIZONTAL)
    • add ∼ → ~ (U+223C TILDE OPERATOR)
    • add ˜ → ~ (U+02DC SMALL TILDE)
    • add ⁓ → ~ or - (U+2053 SWUNG DASH)
    • add ⟨ → ( (U+27E8 MATHEMATICAL LEFT ANGLE BRACKET)
    • add ⟩ → ) (U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET)
    • add ⁽ → ( (U+207D SUPERSCRIPT LEFT PARENTHESIS)
    • add ⁾ → ) (U+207E SUPERSCRIPT RIGHT PARENTHESIS)
    • add ⁄ → / (U+2044 FRACTION SLASH) – but perhaps we instead should encourage this representation for fractions (even where precomposed codepoints exist)
    • add ∕ → / (U+2215 DIVISION SLASH)
    • add ∖ → \ (U+2216 SET MINUS)
  5. In the uvius group,
    • add ĳ → ij (U+0133 LATIN SMALL LIGATURE IJ)
    • add ⧸ → / (U+29F8 BIG SOLIDUS)
    • add ⧹ → \ (U+29F9 BIG REVERSE SOLIDUS)
    • add ⧵ → \ (U+29F5 REVERSE SOLIDUS OPERATOR)
    • use JI instead of IJ – because i is more common than j, is more conventional among canonicalizations (e.g. ietzt ietzo), and avoids additional misrepresentation of roman numerals
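Some of the single-character rules proposed above (footnote numerals from item 3 and part of the additions to the various group from item 4) can be sketched as a plain translation table. This is only an illustration with made-up names, not how Calamari stores its rules. It leaves out ꝰ, U+2053 and U+2044, because their targets are either not spelled out or still up for discussion.

```python
# Superscript digits used as footnote marks, mapped to ASCII digits (item 3).
FOOTNOTE_DIGITS = {
    "\u2070": "0", "\u00b9": "1", "\u00b2": "2", "\u00b3": "3", "\u2074": "4",
    "\u2075": "5", "\u2076": "6", "\u2077": "7", "\u2078": "8", "\u2079": "9",
}

# Subset of the proposed additions to the "various" group (item 4).
VARIOUS_ADDITIONS = {
    # mathematical pi variants -> GREEK SMALL LETTER PI
    "\U0001d6d1": "\u03c0", "\U0001d70b": "\u03c0", "\U0001d745": "\u03c0",
    "\U0001d77f": "\u03c0", "\U0001d7b9": "\u03c0",
    # minus sign, hyphens and dashes -> ASCII hyphen-minus
    "\u2212": "-", "\u2010": "-", "\u2011": "-", "\u2012": "-",
    "\u2015": "-", "\u2043": "-", "\ufe58": "-", "\u2500": "-",
    # tildes -> ASCII tilde
    "\u223c": "~", "\u02dc": "~",
    # angle and superscript brackets -> ASCII parentheses
    "\u27e8": "(", "\u27e9": ")", "\u207d": "(", "\u207e": ")",
    # division slash and set minus -> ASCII slash and backslash
    "\u2215": "/", "\u2216": "\\",
}

# str.maketrans accepts a dict whose keys are single characters.
PROPOSED_TABLE = str.maketrans({**FOOTNOTE_DIGITS, **VARIOUS_ADDITIONS})


def apply_proposed_rules(text: str) -> str:
    """Apply the single-character replacements sketched above."""
    return text.translate(PROPOSED_TABLE)
```

A translation table only covers one-to-one character rules. Multi-character rules such as ĳ to ij would fit as well (a `str.translate` target may be a longer string), but context-dependent ones like the J/I handling in uvius would need something more than a lookup.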

EDITED for correct CER measurement and its interpretation.

@chreul
Member

chreul commented Jun 16, 2021

i fully agree with defining and adding further predefined groups like (ocr-d/dta level 1-3). however, in my opinion the default simply has to be "spaces". i think christoph already changed this a few days ago.

is level 2 really supposed to preserve all random PUA codes? i know it is not explicitly stated in the rules but this looks like an odd decision. it does not seem to matter now because uwe wiped them all out when creating the GT4HistOCR corpus but when adding more GT there will be further cases... the punctuation rules would have to be added as well, right?

i also fully agree on the CLI things. in addition, i think it would be helpful to have the rules and profiles available in an external data format, like for example in the PAGETools. @ChWick ?

regarding the proposed individual rule changes i will have to take a closer look but it all looks reasonable to me.

@bertsky
Collaborator Author

bertsky commented Jul 22, 2021

> i fully agree with defining and adding further predefined groups like (ocr-d/dta level 1-3). however, in my opinion the default simply has to be "spaces". i think christoph already changed this a few days ago.

Yes, it certainly looks so:

> is level 2 really supposed to preserve all random PUA codes? i know it is not explicitly stated in the rules but this looks like an odd decision.

No, IIUC level 2 should replace them all with some regularized form. @tboenig is currently working on a new, more readable version of the guidelines – that should make it clearer. AFAICT we are only beginning to collect these cases/rules.

> it does not seem to matter now because uwe wiped them all out when creating the GT4HistOCR corpus but when adding more GT there will be further cases... the punctuation rules would have to be added as well, right?

Yes. And punctuation in particular will be a likely candidate for debate (and refinement, esp. virgula and old punctuation characters).

> i also fully agree on the CLI things. in addition, i think it would be helpful to have the rules and profiles available in an external data format, like for example in the PAGETools. @ChWick ?

Oh, @ChWick has already implemented this in a3668d9 – with pkg resources for rules. We could easily replace that with an external Python package (either from Calamari-OCR or managed by OCR-D or any of the other existing regularization libraries) in the future. Great work!

Am I right in assuming I will then somehow be able to reference these JSON file names in calamari-train --data.post_proc.processors.5.replacement_groups?

> regarding the proposed individual rule changes i will have to take a closer look but it all looks reasonable to me.

We should coordinate with @tboenig, @stweil, @mikegerber et alii.

@ChWick
Member

ChWick commented Jul 22, 2021

@bertsky

> Am I right in assuming I will then somehow be able to reference these JSON file names in calamari-train --data.post_proc.processors.5.replacement_groups?

Yes, you can modify the groups. However, the new names are rulesets and rulegroups: rulegroups must be present in the resources of Calamari, whereas rulesets can be a list of Calamari's predefined rule sets (in the resources) or point to an arbitrary JSON path.
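To illustrate the idea of pointing at an arbitrary JSON path, here is a small sketch. Note that the file layout assumed here (a flat JSON array of [source, target] pairs) and the function names are hypothetical; the actual schema of Calamari's rule files may well differ and should be looked up in its resources.

```python
import json


def parse_rules(json_text):
    """Parse rules from JSON text; ASSUMES a flat array of [source, target] pairs."""
    return [(str(src), str(dst)) for src, dst in json.loads(json_text)]


def load_rules(path):
    """Read such a rule file from an arbitrary path."""
    with open(path, encoding="utf-8") as f:
        return parse_rules(f.read())


def apply_rules(text, rules):
    """Apply plain string replacements in file order."""
    for src, dst in rules:
        text = text.replace(src, dst)
    return text
```

Under that assumption, a user-maintained file could sit next to the GT and be applied with `apply_rules(line, load_rules("my_rules.json"))`, where `my_rules.json` is a placeholder name.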

@bertsky bertsky added enhancement New feature or request accuracy Concerns quality of (some model's) predictions labels Oct 2, 2024