All block_type Text just says "text" #468

pranav270-create · 2025-01-06T18:13:55Z

I used this 270-page PDF, which needs OCR, from another GitHub issue: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nbsspecialpublication340.pdf

The processing speed is fine (using an M3 Pro).

However, with use_llm: True and force_ocr: True, every text block just says Text, and none of the actual text is captured.
The layout looks solid (image attached).

"id": "/page/19/Text/3",
"block_type": "Text",
"html": "<p block-type="Text">

",

I'm not sure where the text is getting dropped, since the OCR takes 40 minutes and is definitely doing something.

VikParuchuri · 2025-01-07T18:45:51Z

Hi, I ran python convert_single.py FILEPATH --page_range 19-21 --force_ocr --use_llm --output_format json as a test, and the output looked fine to me:

        {
          "id": "/page/19/Text/3",
          "block_type": "Text",
          "html": "<p block-type=\"Text\">One of the major applications of measurement is the development of guidelines which help to standardize product design and engineering practice.</p>",
          "polygon": [
            [
              20.221757322175733,
              274.2795180722892
            ],
            [
              340.169921875,

What command are you running when you get the issues?
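
To compare the two runs, it may help to list which Text blocks actually come back empty. The following is a minimal sketch, not part of the project: it assumes only the "id", "block_type", and "html" keys shown in the snippets above, walks whatever nesting the JSON happens to have, and treats a block as empty when its html has no visible text once tags are stripped. The helper names and the "children" key in the sample are hypothetical.

```python
import re
from typing import Any, Iterator

# Crude tag stripper; good enough for spotting blocks with no visible text.
TAG_RE = re.compile(r"<[^>]+>")


def iter_blocks(node: Any) -> Iterator[dict]:
    """Yield every dict that has a "block_type" key, at any nesting depth."""
    if isinstance(node, dict):
        if "block_type" in node:
            yield node
        for value in node.values():
            yield from iter_blocks(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_blocks(item)


def empty_text_blocks(document: Any) -> list[str]:
    """Return the ids of Text blocks whose html contains no visible text."""
    empty = []
    for block in iter_blocks(document):
        if block.get("block_type") != "Text":
            continue
        visible = TAG_RE.sub("", block.get("html") or "").strip()
        if not visible:
            empty.append(block.get("id", "<no id>"))
    return empty


# Illustrative data only; the "children" key and the second id are assumptions.
sample = {
    "children": [
        {"id": "/page/19/Text/3", "block_type": "Text",
         "html": "<p block-type=\"Text\">\n\n</p>"},
        {"id": "/page/19/Text/4", "block_type": "Text",
         "html": "<p block-type=\"Text\">Some recognized text.</p>"},
    ]
}
print(empty_text_blocks(sample))
```

For a real run, load the JSON output with json.load and pass the result to empty_text_blocks. If every Text id on a page shows up in the list, the text is probably being lost after layout detection rather than on a few isolated blocks.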

pranav270-create closed this as completed Jan 8, 2025
