Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix markitdown.convert_stream handling of leading blanks #223

Closed
wants to merge 1 commit into from

Conversation

doggy8088
Copy link

@doggy8088 doggy8088 commented Dec 27, 2024

Fixes #222

Address issue with markitdown.convert_stream crashing on input with leading blank characters or line breaks.

  • Modify convert_stream function in src/markitdown/_markitdown.py to strip leading blank characters or line breaks from the input stream using a new helper function _strip_leading_blanks.
  • Add a test case in tests/test_markitdown.py to verify that markitdown.convert_stream handles input with leading blank characters or line breaks correctly.

For more details, open the Copilot Workspace session.

Fixes microsoft#222

Address issue with `markitdown.convert_stream` crashing on input with leading blank characters or line breaks.

* Modify `convert_stream` function in `src/markitdown/_markitdown.py` to strip leading blank characters or line breaks from the input stream using a new helper function `_strip_leading_blanks`.
* Add a test case in `tests/test_markitdown.py` to verify that `markitdown.convert_stream` handles input with leading blank characters or line breaks correctly.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/markitdown/issues/222?shareId=XXXX-XXXX-XXXX-XXXX).
@@ -1344,7 +1344,7 @@ def convert_stream(
result = None
try:
# Write to the temporary file
content = stream.read()
content = self._strip_leading_blanks(stream.read())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stream might not be text -- in which case stripping characters could be very problematic. Suggest we move this to the inside of the if statement below.

@@ -1367,6 +1367,10 @@ def convert_stream(

return result

def _strip_leading_blanks(self, content: bytes) -> bytes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a helper function for lstrip?

@afourney
Copy link
Member

afourney commented Jan 3, 2025

Thanks for the PR. Before we accept this, I would like to better understand why leading spaces are causing a crash. I suspect that the issue lies deeper in the logic for guessing the file format, and it will be triggered in other conditions as well.

# Test input with leading blank characters
input_data = b" \n\n\n<html><body><h1>Test</h1></body></html>"
result = markitdown.convert_stream(io.BytesIO(input_data), file_extension=".html")
assert "<h1>Test</h1>" in result.text_content
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test will fail. The output will be Markdown, not HTML

@afourney
Copy link
Member

afourney commented Jan 3, 2025

I've determine the problem, and am investigating other fixes that don't involve truncating the file. See #222 (comment) for more.

@afourney
Copy link
Member

afourney commented Jan 4, 2025

Fixed in #260 without modifying file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bug] markitdown._markitdown.UnsupportedFormatException
2 participants