Trouble with "separator lines" made of **** or ----- or ======= #301

Open
callegar opened this issue May 14, 2023 · 1 comment


@callegar

I have noticed that Tesseract performs poorly on scanned documents that follow the old practice (going back to typewriters) of using rows of asterisks or equal signs as text separators.

Example

          Some document 

Line one

**************************

Line two

===  ===  === === ===  ===  ===

Line three

On input like this, Tesseract tries to match the lines of asterisks or equal signs to text, producing output such as EERKKRKKERKKREAKREKKAKRKKKAK or RRR RRR NETT RRR RRR, which is typically not the desired outcome.
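As a stopgap until a model handles these lines, the garbled output could be flagged after OCR. The following is only a minimal sketch of one possible heuristic, not a Tesseract feature: the function name and both thresholds are assumptions chosen to fit the two sample outputs above. It flags long, all-uppercase lines built from very few distinct letters.

```python
def looks_like_misread_separator(line, min_len=10, max_ratio=0.25):
    """Heuristic guess: is this OCR line a misread separator row?

    Hypothetical helper; min_len and max_ratio are assumed values,
    tuned only against the two sample outputs quoted in this issue.
    """
    # Ignore whitespace so grouped separators ("=== === ===") count too.
    chars = [c for c in line if not c.isspace()]
    if len(chars) < min_len:
        return False
    # The sample outputs consist solely of uppercase letters.
    if not all(c.isalpha() and c.isupper() for c in chars):
        return False
    # Few distinct letters relative to length suggests a repeated glyph.
    return len(set(chars)) / len(chars) <= max_ratio


ocr_lines = [
    "Line one",
    "EERKKRKKERKKREAKREKKAKRKKKAK",
    "Line two",
    "RRR RRR NETT RRR RRR",
    "Line three",
]
kept = [l for l in ocr_lines if not looks_like_misread_separator(l)]
```

A filter like this will misfire on genuine text with little letter variety, so it would need checking against real documents before use.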

My understanding is that the issue likely comes from the training data rather than from the engine itself. If so, I wonder whether the training sets could be augmented to cover these cases.

@stweil
Contributor

stweil commented May 14, 2023

I know from personal experience that Tesseract can be trained to recognize sequences like "........." (often used in tables) if the correct number of dots was part of the training data. Therefore I am fairly sure that your examples could also be recognized with the right model. For real-world documents, human transcribers typically don't like counting dots or other runs of similar characters and omit them from the transcription. And evidently the generated training data did not contain such sequences.
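To illustrate what augmenting the generated training text might look like, here is a rough sketch that interleaves separator rows of varying length into a list of text lines. Everything in it is an assumption for illustration: the helper names, the character set, the length range, and the insertion rate are hypothetical, and it only produces text lines. It does not render images or run any Tesseract training step.

```python
import random

# Assumed set of separator glyphs, taken from the examples in this issue.
SEPARATOR_CHARS = "*-=."


def make_separator_line(rng, min_len=5, max_len=40):
    """Return one separator row, either solid or in space-separated groups."""
    ch = rng.choice(SEPARATOR_CHARS)
    n = rng.randint(min_len, max_len)
    if rng.random() < 0.3:
        # Grouped variant, e.g. "=== === ===".
        group = rng.randint(2, 5)
        groups = max(1, n // (group + 1))
        return " ".join(ch * group for _ in range(groups))
    return ch * n


def augment_training_text(lines, rng, every=3):
    """Insert a separator row after every `every`-th input line."""
    out = []
    for i, line in enumerate(lines, start=1):
        out.append(line)
        if i % every == 0:
            out.append(make_separator_line(rng))
    return out


rng = random.Random(0)  # fixed seed so the output is reproducible
augmented = augment_training_text(
    ["Line one", "Line two", "Line three", "Line four"], rng
)
```

Because the exact number of repeated characters matters for the ground truth, generating these rows programmatically avoids the counting problem that comes with manual transcription.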
