This repository has been archived by the owner on Aug 25, 2024. It is now read-only.
forked from LangStream/langstream
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat(webcrawler): improve unchanged pages sourcing (#131)
New flags: - `only-main-content`: default to false, if enabled it will remove script, style (and others) tags from the emitted document. This is particalury helpful in order to verify actual semantic changes to the pages, not related to sldf (script versioning, cache busting, etc) - `emit-content-diff`: list, default to all the content diff. You can filter the content diff you want the source to emit, if available. For example, to not emit content_unchanged, you can set `emit-content-diff: ['new', 'content_diff']`
- Loading branch information
1 parent
241d16e
commit a90c902
Showing
5 changed files
with
287 additions
and
78 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.