Maintaining the original order of text and images when extracting content from a PDF file #827

akashchevli · 2024-05-03T11:45:12Z


I need to extract text and images from PDF files and write them to a Markdown file. However, I'm facing an issue with maintaining the original order of text and images as they appear in the PDF file.

My current approach iterates over the text blocks and then over the images, writing each one to the Markdown file as it is encountered. As a result, all text blocks on a page are written first, followed by all images, so the original order of the content in the PDF file is lost.

I've attempted to sort the content by the Y-coordinate of the text blocks' bounding boxes, but the resulting order is still incorrect, apparently because the images affect where the text blocks are positioned.

How can I modify my approach so that the text and images are extracted and written to the Markdown file in the same order as they appear in the PDF file, or at least printed to the console in that order?

Here's a simplified version of my current code:

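using System;
using System.Collections.Generic;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

using (PdfDocument document = PdfDocument.Open(pdfFilePath))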
{
    using (StreamWriter writer = new StreamWriter(mdFilePath))
    {
        int pageIndex = 0; // Initialize the page index counter

        foreach (Page page in document.GetPages())
        {
            writer.WriteLine($"## Page {pageIndex + 1}");
            writer.WriteLine(); // Add a blank line after page header

            // Gather text blocks
            IEnumerable<Word> words = page.GetWords(NearestNeighbourWordExtractor.Instance);
            IReadOnlyList<TextBlock> textBlocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

            // Gather images
            IEnumerable<IPdfImage> images = page.GetImages();

            foreach(var block in textBlocks)
            {
                Console.WriteLine($"{block.Text} : {block.BoundingBox.Bottom}");
            }

            foreach (var image in images)
            {
                Console.WriteLine(image);
            }

            pageIndex++; // Advance the counter so the next page header is numbered correctly
        }
    }
}
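
One possible direction, offered as a minimal sketch rather than a confirmed solution: put the text blocks and the images of each page into a single list and sort that list by the top edge of each bounding box. PDF coordinates normally have their origin at the bottom-left of the page, so a larger Y value is higher up. The `PageItem` record, the `ReadingOrder` class and the image file names are hypothetical names made up for this illustration; the sketch also assumes a single-column, unrotated page and does not save the image bytes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// Hypothetical helper type: one piece of page content plus the position used for sorting.
public sealed record PageItem(double Top, double Left, string Markdown);

public static class ReadingOrder
{
    // Collect the text blocks and images of one page into a single list.
    public static List<PageItem> Collect(Page page)
    {
        var items = new List<PageItem>();

        var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
        foreach (var block in DocstrumBoundingBoxes.Instance.GetBlocks(words))
        {
            items.Add(new PageItem(block.BoundingBox.Top, block.BoundingBox.Left, block.Text));
        }

        int imageIndex = 0;
        foreach (var image in page.GetImages())
        {
            imageIndex++;
            // Placeholder link only; the file name is made up and the image is not written to disk here.
            string link = $"![image](page{page.Number}_image{imageIndex}.png)";
            items.Add(new PageItem(image.Bounds.Top, image.Bounds.Left, link));
        }

        return items;
    }

    // Larger Top means higher on the page, so sort descending; ties go left to right.
    public static IEnumerable<PageItem> Order(IEnumerable<PageItem> items) =>
        items.OrderByDescending(i => i.Top).ThenBy(i => i.Left);
}

public static class Program
{
    public static void Main(string[] args)
    {
        // args[0] is expected to be the path of the PDF file.
        using PdfDocument document = PdfDocument.Open(args[0]);
        foreach (Page page in document.GetPages())
        {
            Console.WriteLine($"## Page {page.Number}");
            Console.WriteLine();
            foreach (PageItem item in ReadingOrder.Order(ReadingOrder.Collect(page)))
            {
                Console.WriteLine(item.Markdown);
                Console.WriteLine();
            }
        }
    }
}
```

Sorting purely by the top edge will interleave columns on multi-column pages, and it can misplace an image that sits beside a paragraph instead of between two paragraphs, so the ordering rule would likely need refining for such layouts.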

Any insights or suggestions would be greatly appreciated. Thank you!
