Fix markitdown.convert_stream handling of leading blanks #223

doggy8088 · 2024-12-27T17:19:08Z

Fixes #222

Address issue with `markitdown.convert_stream` crashing on input with leading blank characters or line breaks.

* Modify `convert_stream` function in `src/markitdown/_markitdown.py` to strip leading blank characters or line breaks from the input stream using a new helper function `_strip_leading_blanks`.
* Add a test case in `tests/test_markitdown.py` to verify that `markitdown.convert_stream` handles input with leading blank characters or line breaks correctly.

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/markitdown/issues/222?shareId=XXXX-XXXX-XXXX-XXXX).

afourney · 2025-01-03T21:22:34Z

src/markitdown/_markitdown.py

@@ -1344,7 +1344,7 @@ def convert_stream(
        result = None
        try:
            # Write to the temporary file
-            content = stream.read()
+            content = self._strip_leading_blanks(stream.read())


The stream might not be text -- in which case stripping characters could be very problematic. Suggest we move this to the inside of the if statement below.
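
To illustrate the concern in this comment, here is a minimal sketch, assuming a made-up binary payload whose first bytes happen to fall in the ASCII whitespace range. With no argument, `bytes.lstrip()` removes leading ASCII whitespace bytes (space, `\t`, `\n`, `\r`, `\x0b`, `\x0c`), so it cannot tell padding from data. The payload layout is hypothetical and not taken from any real file format.

```python
import struct

# Hypothetical binary payload: a big-endian uint32 (0x0A0D2020) followed by
# four data bytes. Every byte of the integer is an ASCII whitespace byte.
payload = struct.pack(">I", 0x0A0D2020) + b"DATA"

# Stripping "leading blanks" silently removes the whole integer field.
stripped = payload.lstrip()

print(len(payload), len(stripped))
print(stripped)
```

In this sketch the four-byte integer disappears entirely, which is the kind of corruption a blanket strip on a non-text stream could cause.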

afourney · 2025-01-03T21:25:30Z

src/markitdown/_markitdown.py

@@ -1367,6 +1367,10 @@ def convert_stream(

        return result

+    def _strip_leading_blanks(self, content: bytes) -> bytes:


Do we need a helper function for lstrip?
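
For reference, a minimal sketch showing that the built-in `bytes.lstrip()` already handles the input used in this PR's test case, which suggests a dedicated helper may not be needed. The variable names are for illustration only.

```python
# Same leading-blank input as the test case added in this PR.
input_data = b"   \n\n\n<html><body><h1>Test</h1></body></html>"

# With no argument, bytes.lstrip() removes leading ASCII whitespace bytes.
content = input_data.lstrip()

print(content)
```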

afourney · 2025-01-03T21:28:41Z

Thanks for the PR. Before we accept this, I would like to better understand why leading spaces are causing a crash. I suspect that the issue lies deeper in the logic for guessing the file format, and it will be triggered in other conditions as well.

afourney · 2025-01-03T22:33:11Z

tests/test_markitdown.py

+    # Test input with leading blank characters
+    input_data = b"   \n\n\n<html><body><h1>Test</h1></body></html>"
+    result = markitdown.convert_stream(io.BytesIO(input_data), file_extension=".html")
+    assert "<h1>Test</h1>" in result.text_content


This test will fail. The output will be Markdown, not HTML.

afourney · 2025-01-03T23:03:08Z

I've determined the problem, and am investigating other fixes that don't involve truncating the file. See #222 (comment) for more.

afourney · 2025-01-04T00:03:51Z

Fixed in #260 without modifying the file.

afourney reviewed Jan 3, 2025

afourney mentioned this pull request Jan 3, 2025

Added a test for leading spaces. #258

Merged

afourney reviewed Jan 3, 2025

afourney closed this Jan 3, 2025

afourney mentioned this pull request Jan 4, 2025

If puremagic has no guesses, try again after ltrim. #260

Merged
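
The approach named in the title of #260 ("if puremagic has no guesses, try again after ltrim") can be sketched roughly as follows. This is not the code from #260: `guess_with_retry` and `toy_guesser` are hypothetical names, and `toy_guesser` is a stand-in for a magic-number library that only matches signatures at offset 0. The point of the sketch is that the trimmed copy is used only for format guessing, so the original content is left unmodified.

```python
import io
from typing import BinaryIO, Callable, List


def guess_with_retry(
    stream: BinaryIO, guesser: Callable[[BinaryIO], List[str]]
) -> List[str]:
    """Return extension guesses; retry on a left-stripped copy if none are found."""
    start = stream.tell()
    try:
        guesses = guesser(stream)
        if guesses:
            return guesses
        # No guesses: re-read the content and retry without leading whitespace.
        stream.seek(start)
        trimmed = stream.read().lstrip()
        return guesser(io.BytesIO(trimmed))
    finally:
        # Restore the caller's position; the original bytes are never altered.
        stream.seek(start)


def toy_guesser(stream: BinaryIO) -> List[str]:
    """Hypothetical stand-in for a magic-number lookup that matches at offset 0."""
    head = stream.read(5)
    return [".html"] if head.lower() == b"<html" else []


stream = io.BytesIO(b"   \n\n\n<html><body><h1>Test</h1></body></html>")
guesses = guess_with_retry(stream, toy_guesser)
print(guesses)
```

Passing the guesser in as a parameter keeps the sketch self-contained; in practice the real detection call would take its place.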
