Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep h1 and other headings #75

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Kdecherf
Copy link
Contributor

Even though using h1 tags for sections inside an article is semantically
wrong, a lot of websites are doing it anyway. So the idea here is to
stop stripping headings, including h1 on Readability's side.

Fixes wallabag/wallabag#5805

@Kdecherf
Copy link
Contributor Author

Request for comment @j0k3r @jtojnar

@jtojnar
Copy link
Contributor

jtojnar commented Jun 11, 2022

I think this sort of clean-up has some merit. Maybe we could only decide to clean out the h1 if there is only a single one, like we do with h2. And only clean up h2 if there is no h1?

@Kdecherf
Copy link
Contributor Author

@jtojnar done

@coveralls
Copy link

coveralls commented Jun 28, 2022

Pull Request Test Coverage Report for Build 2583510249

  • 4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 84.669%

Totals Coverage Status
Change from base Build 2486672970: 0.2%
Covered Lines: 602
Relevant Lines: 711

💛 - Coveralls

src/Readability.php Outdated Show resolved Hide resolved
Copy link
Owner

@j0k3r j0k3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, you can squash 👍

Even though using h1 tags for sections inside an article is semantically
wrong, a lot of websites are doing it anyway. So the idea here is to
stop stripping headings, including h1 on Readability's side.

Fixes wallabag/wallabag#5805

Signed-off-by: Kevin Decherf <[email protected]>
@Kdecherf
Copy link
Contributor Author

Kdecherf commented Aug 15, 2022

I'm currently having a second though about this cleanup.

Take this link: https://interestingengineering.com/innovation/china-plans-to-build-the-worlds-first-waterless-nuclear-reactor
The article only contains one h2 entity and is less than 100 characters (« China's plans to curb its carbon emissions »), thus it is removed by the routine. However this heading has its importance in this context.

What should we do? Maybe we could remove the length condition and replace it with something like "if it is similar to the article's title"?

@Kdecherf Kdecherf requested review from jtojnar and j0k3r and removed request for jtojnar August 15, 2022 12:33
@jtojnar
Copy link
Contributor

jtojnar commented Aug 15, 2022

Similarity would make sense but then we would need to decide on the precise metric.

Another possible heuristic would be checking if the heading is the first element in the content. Then it would spuriously preserve the heading in the case of <summary> <heading> <actual content> but that would resolve itself if we drop the summary (which is orthogonal to this). And under-removal is probably better than over-removal anyway.

@j0k3r
Copy link
Owner

j0k3r commented Sep 29, 2022

Another possible heuristic would be checking if the heading is the first element in the content

Just checking if the first child of the content is h*, right?

@jtojnar
Copy link
Contributor

jtojnar commented Sep 29, 2022

Right, that is what I meant.

@Kdecherf
Copy link
Contributor Author

Kdecherf commented Apr 7, 2023

I'm trying to resume work on this PR.

I've made a small check on entries hosted on my instance, using a query like select id, substring(content FOR 250) from wallabag_entry where substring(content FOR 15) ilike '<h_>%' limit 5;.

It seems that there are several cases where the content begins with a legitimate heading entity, for example:

What should we do?

@jtojnar
Copy link
Contributor

jtojnar commented Apr 12, 2023

It seems that there are several cases where the content begins with a legitimate heading entity, for example:

Would not those articles still begin with h1 pre-cleanup? In that case, the second heuristic would only remove the h1 since it is the first heading in the content.

The third one would not be matched by the first heuristic.

Ideally, we would combine both.

We should add a test suite for all these cases so that we can better discuss what is happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong display in wallabag (tangledweb.xyz)
4 participants