All block_type Text just says "text" #468

Closed
pranav270-create opened this issue Jan 6, 2025 · 1 comment


@pranav270-create

I used this 270-page PDF, which requires OCR, from another GitHub issue: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nbsspecialpublication340.pdf

The processing speed is fine (using an M3 Pro).

However, with use_llm: True and force_ocr: True, all text blocks just say Text and none of the actual text is captured.
The layout seems solid; see the attached image (page_39_annotated).

"id": "/page/19/Text/3",
"block_type": "Text",
"html": "<p block-type="Text">

",

I'm not sure where the text is getting dropped. The OCR step takes 40 minutes, so it is definitely doing something.

@VikParuchuri
Owner

Hi, I ran python convert_single.py FILEPATH --page_range 19-21 --force_ocr --use_llm --output_format json as a test, and the output looked fine to me:

        {
          "id": "/page/19/Text/3",
          "block_type": "Text",
          "html": "<p block-type=\"Text\">One of the major applications of measurement is the development of guidelines which help to standardize product design and engineering practice.</p>",
          "polygon": [
            [
              20.221757322175733,
              274.2795180722892
            ],
            [
              340.169921875,

What command are you running when you get the issues?
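
To see how widespread the problem is in a given JSON output, a small check like the following may help. It is only a sketch: it assumes nothing about the output beyond the `id`, `block_type`, and `html` keys shown above, and walks the whole JSON tree looking for Text blocks whose `html` contains no visible text. The `children` key, the second block's id, and the helper names are hypothetical, chosen for illustration.

```python
import re

# Matches any HTML tag, e.g. <p block-type="Text"> or </p>.
TAG_RE = re.compile(r"<[^>]+>")


def visible_text(html_fragment):
    """Return the fragment with tags removed and whitespace trimmed."""
    return TAG_RE.sub("", html_fragment).strip()


def find_empty_text_blocks(node):
    """Recursively yield the ids of Text blocks that have no visible text."""
    if isinstance(node, dict):
        if node.get("block_type") == "Text" and not visible_text(node.get("html") or ""):
            yield node.get("id")
        # Descend into nested values regardless of the key name.
        for value in node.values():
            yield from find_empty_text_blocks(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_empty_text_blocks(item)


# Hypothetical sample: the nesting under "children" is an assumption.
sample = {
    "children": [
        {
            "id": "/page/19/Text/3",
            "block_type": "Text",
            "html": "<p block-type=\"Text\">\n\n</p>",
        },
        {
            "id": "/page/19/Text/4",  # made-up id for illustration
            "block_type": "Text",
            "html": "<p block-type=\"Text\">One of the major applications of measurement ...</p>",
        },
    ]
}

print(list(find_empty_text_blocks(sample)))  # ['/page/19/Text/3']
```

For a real run, load the output file with `json.load` and pass the result to `find_empty_text_blocks`. If every Text block shows up, the text is being lost for the whole document rather than on specific pages.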
