Keep h1 and other headings #75
I think this sort of clean-up has some merit. Maybe we could only decide to clean out the |
@jtojnar done
Pull Request Test Coverage Report for Build 2583510249
💛 - Coveralls |
Looks great, you can squash 👍
Even though using h1 tags for sections inside an article is semantically wrong, a lot of websites are doing it anyway. So the idea here is to stop stripping headings, including h1 on Readability's side.
Fixes wallabag/wallabag#5805
Signed-off-by: Kevin Decherf <[email protected]>
I'm currently having second thoughts about this cleanup. Take this link: https://interestingengineering.com/innovation/china-plans-to-build-the-worlds-first-waterless-nuclear-reactor What should we do? Maybe we could remove the length condition and replace it with something like "if it is similar to the article's title"?
Similarity would make sense but then we would need to decide on the precise metric. Another possible heuristic would be checking if the heading is the first element in the content. Then it would spuriously preserve the heading in the case of |
Just checking if the first child of the content is h*, right?
Right, that is what I meant.
I'm trying to resume work on this PR. I've made a small check on entries hosted on my instance, using a query like It seems that there are several cases where the content begins with a legitimate heading entity, for example:
What should we do?
Would not those articles still begin with The third one would not be matched by the first heuristic. Ideally, we would combine both. We should add a test suite for all these cases so that we can better discuss what is happening. |
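To make the combined heuristic easier to discuss, here is a minimal sketch in Python (not the library's PHP code, and not what this PR implements). It strips a heading only when it is the first child of the content and its text is similar to the article's title. The function names, the use of `difflib.SequenceMatcher` as the similarity metric, and the 0.8 threshold are all assumptions for illustration; the thread has not settled on a metric.

```python
import difflib
import xml.etree.ElementTree as ET

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def normalize(text):
    """Lowercase and collapse whitespace so cosmetic differences do not matter."""
    return " ".join(text.lower().split())


def is_similar(heading, title, threshold=0.8):
    """Hypothetical metric: difflib ratio on normalized strings (assumed threshold)."""
    ratio = difflib.SequenceMatcher(None, normalize(heading), normalize(title)).ratio()
    return ratio >= threshold


def strip_leading_title_heading(content, title, threshold=0.8):
    """Remove the first child of `content` only if it is a heading that
    resembles `title`. Returns True when a heading was removed."""
    children = list(content)
    if not children:
        return False
    first = children[0]
    # Heuristic 1: the heading must be the very first thing in the content.
    if first.tag not in HEADING_TAGS or (content.text or "").strip():
        return False
    # Heuristic 2: the heading must look like the article's title.
    if not is_similar("".join(first.itertext()), title, threshold):
        return False
    # Keep any text that followed the heading before dropping the element.
    content.text = (content.text or "") + (first.tail or "")
    content.remove(first)
    return True


# A leading h1 that repeats the title is removed; the in-article h1 is kept.
article = ET.fromstring(
    "<div><h1>Keep h1 and other headings</h1><p>Body</p><h1>Section</h1></div>"
)
removed = strip_leading_title_heading(article, "Keep H1 and other headings")

# A leading heading unrelated to the title is preserved.
other = ET.fromstring("<div><h2>Introduction</h2><p>Body</p></div>")
kept = not strip_leading_title_heading(other, "A completely different title")
```

With both conditions required, a legitimate opening heading such as "Introduction" survives, while a repeated title is dropped. A test suite over real pages would still be needed to pick the metric and threshold.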