Maintaining the original order of text and images when extracting content from a PDF file #827
Unanswered
akashchevli asked this question in Q&A
I need to extract text and images from PDF files and write them to a Markdown file. However, I'm facing an issue with maintaining the original order of text and images as they appear in the PDF file.
My approach is to iterate through the text blocks and then the images, writing each to the Markdown file as I encounter it. As a result, all the text blocks are written first, followed by all the images, so the output no longer matches the order of the content in the PDF file.
I've attempted to sort the content by the Y-coordinate of the text blocks' bounding boxes, but the resulting order is still incorrect, because the images affect where the text blocks are positioned on the page.
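To make the ordering problem concrete, here is a minimal sketch of the idea of putting text blocks and images into one list before sorting, rather than handling them in two separate passes. It is not tied to any PDF library: the `Element` class, its field names, and the `reading_order` and `to_markdown` helpers are hypothetical names made up for illustration. It assumes each item has already been reduced to a page number and the top-left corner of its bounding box, that Y grows downward, and that the page has a single column.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical container for one extracted item (not a library type)."""
    kind: str       # "text" or "image"
    page: int       # zero-based page index
    x0: float       # left edge of the bounding box
    top: float      # top edge of the bounding box (assumed to grow downward)
    content: str    # the text itself, or a path to the saved image


def reading_order(elements):
    """Sort text and images together: by page, then top edge, then left edge."""
    return sorted(elements, key=lambda e: (e.page, e.top, e.x0))


def to_markdown(elements):
    """Emit Markdown in sorted order; images become image links."""
    parts = []
    for e in reading_order(elements):
        if e.kind == "image":
            parts.append(f"![]({e.content})")
        else:
            parts.append(e.content)
    return "\n\n".join(parts)
```

With a text block at top=100, an image at top=200, and another text block at top=400 on the same page, the image is emitted between the two text blocks regardless of the order in which the items were collected. Multi-column pages, or libraries that measure Y from the bottom of the page, would need a different sort key.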
How can I modify my approach so that the text and images are extracted and written to the Markdown file in the same order as they appear in the PDF file, or at least printed to the console in that order?
Here's a simplified version of my current code:
Any insights or suggestions would be greatly appreciated. Thank you!