greek confused with italic #5

Open
bertsky opened this issue Mar 4, 2023 · 3 comments

@bertsky (Contributor) commented Mar 4, 2023

I have some material with alternating lines of Latin in Antiqua and Old Greek (interlinear gloss) – the perfect test case IOW.

Unfortunately, the provided model systematically detects italic (with 100% confidence) where Greek should be.

So adaptive will always resort to the SelOCR result, which is wrong half of the time. And of course, when forcing COCR globally, the results are not usable either, because that OCR model has not been trained on Greek.

@GemCarr (Collaborator) commented Mar 13, 2023

We don't have enough training data on some classes, specifically 'greek', 'hebrew' and 'manuscript'. I will explain everything in the README.
For the classifier outputting fonts, we will train it again with more data to solve the problem (then it should output 'greek' instead of the incorrect 'italic').
However, the transcription results will probably not be ideal. The COCR can in theory handle these classes, but it produces poor results due to the lack of data (as you have seen). SelOCR, when confronted with those under-represented classes, will basically fall back to an OCR model trained on all the data we have available (all classes), which may be better than the specialized italic model in the case you presented.
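
To illustrate the selection logic, here is a minimal pseudo-Python sketch (the function and model names are hypothetical, not the actual implementation):

```python
# Illustrative sketch only; function and model names are hypothetical.

FONT_MODELS = {
    "antiqua": "antiqua.mlmodel",
    "italic": "italic.mlmodel",
    # 'greek', 'hebrew' and 'manuscript' have no dedicated model yet
}
COMBINED_MODEL = "all_classes.mlmodel"  # the "COCR" model, trained on all classes

def recognize_with(model, line_image):
    # placeholder for the actual recognition call
    raise NotImplementedError

def sel_ocr(line_image, font_class):
    """SelOCR: specialized model for the detected class, falling back to the
    combined model for classes that lack training data."""
    return recognize_with(FONT_MODELS.get(font_class, COMBINED_MODEL), line_image)

def adaptive_ocr(line_image, font_class, confidence, threshold=0.5):
    """Adaptive: follow the classifier (SelOCR branch) when it is confident,
    otherwise use the combined model (COCR). A spurious 100% 'italic'
    prediction therefore always ends up in the SelOCR branch, as reported above."""
    if confidence >= threshold:
        return sel_ocr(line_image, font_class)
    return recognize_with(COMBINED_MODEL, line_image)
```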

@bertsky (Contributor, Author) commented Mar 13, 2023

> We don't have enough training data on some classes, specifically 'greek', 'hebrew' and 'manuscript'.

Understood. So could #7 help here? (Even with better training data, there might always be cases where the user observes systematic suboptimal detection and has a priori knowledge to throw in...)

> However, the transcription results will probably not be ideal.

Yes, in general we might need to use ocrd-typegroups-classifier and combine that dynamically (in the workflow) with dedicated models from other OCR processors.
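
For example (just a sketch with made-up class and model names, not an existing OCR-D API), the workflow could group the classified lines and send each group to a dedicated recognizer:

```python
# Sketch only: group classified text lines so that each group can be recognized
# by a dedicated model in a later workflow step. Class and model names are made up.
from collections import defaultdict

MODELS_BY_CLASS = {
    "antiqua": "latin_antiqua",
    "greek": "grc_polytonic",
}
FALLBACK_MODEL = "mixed_script"

def group_lines_by_model(classified_lines):
    """classified_lines: iterable of (line_id, font_class) pairs, e.g. from a
    script/font classifier such as ocrd-typegroups-classifier."""
    groups = defaultdict(list)
    for line_id, font_class in classified_lines:
        groups[MODELS_BY_CLASS.get(font_class, FALLBACK_MODEL)].append(line_id)
    return groups

if __name__ == "__main__":
    lines = [("l_01", "antiqua"), ("l_02", "greek"), ("l_03", "antiqua")]
    for model, ids in group_lines_by_model(lines).items():
        print(f"recognize lines {ids} with model '{model}'")
```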

@GemCarr (Collaborator) commented Mar 13, 2023

#7 would definitely help, I will look into that.

It is possible that we will retrain the OCR models if we obtain more data for the lacking classes; if that is the case, I will update the processor.
