greek confused with italic #5

Open
bertsky opened this issue Mar 4, 2023 · 3 comments

@bertsky (Contributor) commented Mar 4, 2023

I have some material with alternating lines of Latin in Antiqua and Old Greek (interlinear gloss) – the perfect test case IOW.

Unfortunately, the provided model systematically detects italic (with 100% confidence) where Greek should be.

So adaptive will always resort to the SelOCR result, which is wrong half of the time. And of course, when forcing COCR globally, the results are not usable either, because that OCR model has not been trained on Greek.

@GemCarr (Collaborator) commented Mar 13, 2023

We don't have enough training data on some classes, specifically 'greek', 'hebrew' and 'manuscript'. I will explain everything in the README.
For the classifier outputting fonts, we will train it again with more data to solve the problem (then it should output 'greek' instead of the incorrect 'italic').
However, the transcription results will probably not be ideal. The COCR can in theory handle these classes, but it produces poor results due to the lack of data (as you have seen). SelOCR, when confronted with those under-represented classes, will basically fall back to an OCR model trained on all the data we have available (all classes), which may be better than the specialized italic model in the case you presented.
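
To illustrate the selection logic, here is a minimal pseudo-Python sketch (the function and model names are hypothetical, not the actual implementation):

```python
# Illustrative sketch only; function and model names are hypothetical.

FONT_MODELS = {
    "antiqua": "antiqua.mlmodel",
    "italic": "italic.mlmodel",
    # 'greek', 'hebrew' and 'manuscript' have no dedicated model yet
}
COMBINED_MODEL = "all_classes.mlmodel"  # the "COCR" model, trained on all classes

def recognize_with(model, line_image):
    # placeholder for the actual recognition call
    raise NotImplementedError

def sel_ocr(line_image, font_class):
    """SelOCR: specialized model for the detected class, falling back to the
    combined model for classes that lack training data."""
    return recognize_with(FONT_MODELS.get(font_class, COMBINED_MODEL), line_image)

def adaptive_ocr(line_image, font_class, confidence, threshold=0.5):
    """Adaptive: follow the classifier (SelOCR branch) when it is confident,
    otherwise use the combined model (COCR). A spurious 100% 'italic'
    prediction therefore always ends up in the SelOCR branch, as reported above."""
    if confidence >= threshold:
        return sel_ocr(line_image, font_class)
    return recognize_with(COMBINED_MODEL, line_image)
```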

@bertsky (Contributor, Author) commented Mar 13, 2023

> We don't have enough training data on some classes, specifically 'greek', 'hebrew' and 'manuscript'.

Understood. So could #7 help here? (Even with better training data, there might always be cases where the user observes systematic suboptimal detection and has a priori knowledge to throw in...)

> However, the transcription results will probably not be ideal.

Yes, in general we might need to use ocrd-typegroups-classifier and combine that dynamically (in the workflow) with dedicated models from other OCR processors.
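
For example (just a sketch with made-up class and model names, not an existing OCR-D API), the workflow could group the classified lines and send each group to a dedicated recognizer:

```python
# Sketch only: group classified text lines so that each group can be recognized
# by a dedicated model in a later workflow step. Class and model names are made up.
from collections import defaultdict

MODELS_BY_CLASS = {
    "antiqua": "latin_antiqua",
    "greek": "grc_polytonic",
}
FALLBACK_MODEL = "mixed_script"

def group_lines_by_model(classified_lines):
    """classified_lines: iterable of (line_id, font_class) pairs, e.g. from a
    script/font classifier such as ocrd-typegroups-classifier."""
    groups = defaultdict(list)
    for line_id, font_class in classified_lines:
        groups[MODELS_BY_CLASS.get(font_class, FALLBACK_MODEL)].append(line_id)
    return groups

if __name__ == "__main__":
    lines = [("l_01", "antiqua"), ("l_02", "greek"), ("l_03", "antiqua")]
    for model, ids in group_lines_by_model(lines).items():
        print(f"recognize lines {ids} with model '{model}'")
```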

@GemCarr (Collaborator) commented Mar 13, 2023

#7 would definitely help, I will look into that.

It is possible that we will retrain the OCR models if we obtain more data for the lacking classes; if that is the case, I will update the processor.
