[webcrawler] Support emitting non HTML documents (like PDFs...) #739

eolivelli · 2023-11-30T09:05:24Z

Summary:

Add a new option, "allow-non-html-contents", to the webcrawler-source agent.

Please note that in LangStream the source is expected only to emit the records; it is up to the next agent in the pipeline to extract the text or otherwise manipulate the contents.

The "text-extractor" agent already handles PDF documents quite well, thanks to Apache Tika.
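
As a rough illustration of the setup described above, a pipeline could enable the new option on the crawler and pass the emitted records to "text-extractor". This is a minimal sketch, not taken from the PR: only "allow-non-html-contents" and the two agent types come from the summary. The agent names, the example URLs, and the "seed-urls" / "allowed-domains" settings are assumptions for illustration, and other crawler settings (state storage, rate limits, and so on) are omitted.

```yaml
# Hypothetical pipeline fragment. Only "allow-non-html-contents" and the
# agent types are taken from this PR; everything else is a placeholder.
pipeline:
  - name: "crawl-site"            # placeholder agent name
    type: "webcrawler-source"
    configuration:
      seed-urls: ["https://example.com/docs/"]    # assumed setting, example URL
      allowed-domains: ["https://example.com"]    # assumed setting, example URL
      # Also emit non-HTML documents (for example PDFs) as records.
      allow-non-html-contents: true
  - name: "extract-text"          # placeholder agent name
    # The source only emits the raw record; text extraction happens here.
    type: "text-extractor"
```

With a layout like this, the crawler stays unaware of document formats, and any format-specific handling lives in the downstream agent.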


[webcrawler] Support emitting non HTML documents (like PDFs...)

967003a

eolivelli merged commit 850b3ca into main Nov 30, 2023
10 checks passed

eolivelli deleted the impl/webcralwer-non-html branch November 30, 2023 10:00

benfrank241 pushed a commit to vectorize-io/langstream that referenced this pull request May 2, 2024

[webcrawler] Support emitting non HTML documents (like PDFs...) (Lang…

2dfb6d1

…Stream#739)
