text regularizer: missing replacements, new group combinations #264
Comments
i fully agree with defining and adding further predefined groups like (ocr-d/dta level 1-3). however, in my opinion the default simply has to be "spaces". i think christoph already changed this a few days ago. is level 2 really supposed to preserve all random PUA codes? i know it is not explicitly stated in the rules but this looks like an odd decision. it does not seem to matter now because uwe wiped them all out when creating the GT4HistOCR corpus, but when adding more GT there will be further cases... the punctuation rules would have to be added as well, right? i also fully agree on the CLI things. in addition, i think it would be helpful to have the rules and profiles available in an external data format, like for example in the PAGETools. @ChWick ? regarding the proposed individual rule changes i will have to take a closer look but it all looks reasonable to me.
Yes, it certainly looks so:
No, IIUC level 2 should replace them all with some regularized form. @tboenig is currently working on a new version of the guidelines that is more readable – that should make it clearer. AFAICT we are only beginning to collect these cases/rules.
Yes. And punctuation in particular will be a likely candidate for debate (and refinement, esp. virgula and old punctuation characters).
Oh, @ChWick has already implemented this in a3668d9 – with pkg resources for rules. We could easily replace that with an external Python package (either from Calamari-OCR or managed by OCR-D or any of the other existing regularization libraries) in the future. Great work! Am I right in assuming I will then somehow be able to reference these JSON file names in
We should coordinate with @tboenig, @stweil, @mikegerber et alii.
Yes, you can modify the groups, however the new names are
While trying to compare GT4HistOCR model performance between Tesseract and Calamari, I stumbled over a few peculiarities of Calamari's (superb!) text pre/postprocessor.
First off, when using vanilla Levenshtein (unweighted, without character equivalences or normalization beyond NFC), CER is 0.8% and thus already pretty low on the Calamari model trained by the Qurator team, which seem to have used the default regularizer `extended`.

But when taking a closer look, it appears a lot of its remaining errors can be attributed to the `quotes` and `various` subgroups of the `extended` regularizer, amounting to about half of CER (0.4%).

Now, GT4HistOCR more or less corresponds to DTA / OCR-D GT transcription level 2 guidelines. To me that seems like a plausible compromise for OCR models: they are not as expensive to transcribe as level 3 (while still preserving many semantically important distinctions) and can still be reduced to level 1 automatically afterwards (if strong normalization is needed, e.g. for search indexing).
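For reference, the kind of measurement meant by "vanilla Levenshtein" above can be sketched as follows. This is a minimal illustration, not the evaluation code actually used for the numbers quoted; the function names `levenshtein` and `cer` are chosen here for the example only.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance: insertion, deletion and substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if characters match)
            ))
        prev = curr
    return prev[-1]


def cer(gt_lines, ocr_lines) -> float:
    """Character error rate: summed edit distance over summed GT length, after NFC only."""
    total_dist = 0
    total_len = 0
    for gt, ocr in zip(gt_lines, ocr_lines):
        gt = unicodedata.normalize("NFC", gt)
        ocr = unicodedata.normalize("NFC", ocr)
        total_dist += levenshtein(gt, ocr)
        total_len += len(gt)
    return total_dist / total_len if total_len else 0.0
```

Note that this counts code points rather than grapheme clusters, so a model that emits `''` where the GT has `"` is charged for the mismatch in full. That is how regularizer differences end up in the CER.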
The Calamari equivalent of GT level 2 would be `['spaces', 'roman_digits', 'ligatures-consonantal']` IINM. I therefore suggest giving that combination a predefined group under a suitable name, say `conservative` (and probably even make that the new default). Also, it would not hurt mapping all official GT levels with aliases in Calamari:

- `gtlevel1`: `['spaces', 'punctuation', 'quotes', 'various', 'roman_digits', 'ligatures-vocal', 'ligatures-consonantal', 'uvius']`
- `gtlevel2`: `['spaces', 'roman_digits', 'ligatures-consonantal']` (or `'conservative'`)
- `gtlevel3`: `['spaces']`
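A rough sketch of how such aliases could be resolved into basic groups. The table `GT_LEVEL_ALIASES` and the helper `expand_groups` are hypothetical names for illustration and do not exist in Calamari as far as this issue is concerned.

```python
# Hypothetical alias table following the proposal above; not Calamari's shipped groups.
GT_LEVEL_ALIASES = {
    "gtlevel1": ["spaces", "punctuation", "quotes", "various", "roman_digits",
                 "ligatures-vocal", "ligatures-consonantal", "uvius"],
    "gtlevel2": ["spaces", "roman_digits", "ligatures-consonantal"],
    "gtlevel3": ["spaces"],
}
# 'conservative' is proposed as another name for the level 2 combination.
GT_LEVEL_ALIASES["conservative"] = GT_LEVEL_ALIASES["gtlevel2"]


def expand_groups(names):
    """Expand aliases into basic group names, keeping order and dropping duplicates."""
    expanded = []
    for name in names:
        # Names that are not aliases are passed through unchanged.
        for group in GT_LEVEL_ALIASES.get(name, [name]):
            if group not in expanded:
                expanded.append(group)
    return expanded
```

With this, a user could combine an alias with extra groups, e.g. `expand_groups(["conservative", "quotes"])`, without having to spell out the level 2 list each time.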
Furthermore, I believe these regularizers should be made available prominently on the CLIs:

- `calamari-train --data.post_proc.processors.5.replacement_groups`
- `calamari-predict` and `calamari-eval` (for additional postprocessing beyond what's already in the model)
- `calamari-textproc`
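To illustrate what "prominently available" could look like, here is a stand-alone `argparse` sketch. The option name `--replacement-groups` and its default are assumptions made up for this example; this is not how any of the Calamari CLIs currently parse their arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-alone script; NOT Calamari's actual argument parser.
    parser = argparse.ArgumentParser(
        description="Regularize text with named replacement groups")
    parser.add_argument(
        "--replacement-groups",
        nargs="+",
        default=["spaces"],  # assumed default for the sketch only
        metavar="GROUP",
        help="names of replacement groups (or aliases) to apply, in order",
    )
    return parser


args = build_parser().parse_args(["--replacement-groups", "spaces", "roman_digits"])
print(args.replacement_groups)
```

A short top-level option like this would be easier to discover than a path into the processor list such as `--data.post_proc.processors.5.replacement_groups`.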
And second, within the existing groups, I believe a few individual rule changes are worth considering:

- `quotes` group does not contain the following characters yet: `‚ ‛ ‟ « » ‹ › 〟 〞‟` (low-9, high-reversed-9, angular and historical variants)
- `quotes` group, IMHO `"` (ASCII dq) to `''` (double ASCII sq) normalization is not adequate under most circumstances and thus should be an extra.
- `⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ꝰ` to ASCII.
- `various` group:
  - `𝛑` → `π` (U+1D6D1 MATHEMATICAL BOLD SMALL PI)
  - `𝜋` → `π` (U+1D70B MATHEMATICAL ITALIC SMALL PI)
  - `𝝅` → `π` (U+1D745 MATHEMATICAL BOLD ITALIC SMALL PI)
  - `𝝿` → `π` (U+1D77F MATHEMATICAL SANS-SERIF BOLD SMALL PI)
  - `𝞹` → `π` (U+1D7B9 MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL PI)
  - `−` → `-` (U+2212 MINUS SIGN)
  - `‐` → `-` (U+2010 HYPHEN)
  - `‑` → `-` (U+2011 NON-BREAKING HYPHEN)
  - `‒` → `-` (U+2012 FIGURE DASH)
  - `―` → `-` (U+2015 QUOTATION DASH)
  - `⁃` → `-` (U+2043 HYPHEN BULLET)
  - `﹘` → `-` (U+FE58 SMALL EM DASH)
  - `─` → `-` (U+2500 FORMS LIGHT HORIZONTAL)
  - `∼` → `~` (U+223C TILDE OPERATOR)
  - `˜` → `~` (U+02DC SMALL TILDE)
  - `⁓` → `~` or `-` (U+2053 SWUNG DASH)
  - `⟨` → `(` (U+27E8 MATHEMATICAL LEFT ANGLE BRACKET)
  - `⟩` → `)` (U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET)
  - `⁽` → `(` (U+207D SUPERSCRIPT LEFT PARENTHESIS)
  - `⁾` → `)` (U+207E SUPERSCRIPT RIGHT PARENTHESIS)
  - `⁄` → `/` (U+2044 FRACTION SLASH) – but perhaps we instead should encourage this representation for fractions (even where precomposed codepoints exist)
  - `∕` → `/` (U+2215 DIVISION SLASH)
  - `∖` → `\` (U+2216 SET MINUS)
- `uvius` group:
  - `ĳ` → `ij` (U+0133 LATIN SMALL LIGATURE IJ)
  - `⧸` → `/` (U+29F8 BIG SOLIDUS)
  - `⧹` → `\` (U+29F9 BIG REVERSE SOLIDUS)
  - `⧵` → `\` (U+29F5 REVERSE SOLIDUS OPERATOR)
  - `J` → `I` instead of `I` → `J` – because `i` is more common than `j`, is more conventional among canonicalizations (e.g. `ietzt`, `ietzo`), and avoids additional misrepresentation of roman numerals

EDITED for correct CER measurement and its interpretation.
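To make the proposal concrete, here is a small sketch that applies a subset of the replacements listed above as plain string substitutions. The names `PROPOSED_RULES`, `regularize` and `describe` are invented for this example, and the rule format is an assumption; Calamari's own rule files may be structured differently.

```python
import unicodedata

# Subset of the replacements proposed above, as (source, target) pairs.
PROPOSED_RULES = [
    ("\U0001D6D1", "π"),  # MATHEMATICAL BOLD SMALL PI
    ("\U0001D70B", "π"),  # MATHEMATICAL ITALIC SMALL PI
    ("\u2212", "-"),      # MINUS SIGN
    ("\u2010", "-"),      # HYPHEN
    ("\u2011", "-"),      # NON-BREAKING HYPHEN
    ("\u2012", "-"),      # FIGURE DASH
    ("\u223C", "~"),      # TILDE OPERATOR
    ("\u27E8", "("),      # MATHEMATICAL LEFT ANGLE BRACKET
    ("\u27E9", ")"),      # MATHEMATICAL RIGHT ANGLE BRACKET
    ("\u2215", "/"),      # DIVISION SLASH
    ("\u2216", "\\"),     # SET MINUS
    ("\u0133", "ij"),     # LATIN SMALL LIGATURE IJ
]


def regularize(text: str, rules=PROPOSED_RULES) -> str:
    """Normalize to NFC, then apply each (source, target) replacement in order."""
    text = unicodedata.normalize("NFC", text)
    for source, target in rules:
        text = text.replace(source, target)
    return text


def describe(rules=PROPOSED_RULES):
    """Yield 'U+XXXX NAME -> target' lines to cross-check code points against names."""
    for source, target in rules:
        yield f"U+{ord(source):04X} {unicodedata.name(source)} -> {target}"
```

Running `describe()` is a cheap way to catch a rule whose code point does not match the character name it was filed under. Note that `unicodedata.name` reports current Unicode names, which can differ from older aliases such as QUOTATION DASH for U+2015.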