Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Exclude some domains from link checker #320

Merged
merged 27 commits into from
Apr 24, 2024

Conversation

erget
Copy link
Member

@erget erget commented Jan 6, 2023

Relates to #318

@erget erget changed the title Exclude some domains from link checker WIP: Exclude some domains from link checker Jan 6, 2023
@erget
Copy link
Member Author

erget commented Jan 6, 2023

erget added 6 commits January 6, 2023 13:52
Just to see if I've restored the state as expected. If this works, I can fail from the link checker once we have the link errors gone.
@erget
Copy link
Member Author

erget commented Jan 6, 2023

Success! At least a little bit. With the exclusions we get this now #323

@erget
Copy link
Member Author

erget commented Jan 6, 2023

Finally works! Now to get rid of those pesky weekly issues...

@erget erget mentioned this pull request Jan 6, 2023
@erget
Copy link
Member Author

erget commented Jan 6, 2023

@sadielbartholomew @DocOtak would either of you have an idea of how to fix this? It's almost there. However:

  • When I run it and it has timed out links, it doesn't fail
  • Since it never fails, the follow-on jobs (upload report, create issue from file) are never triggered

So currently it gives us a false sense of having right links. I'm at the end of my wits here. Any ideas?

The only thing that occurs to me is that we're using an old lychee version. However, if I update to latest, it fails due to some bad characters late in the execution of the Check links job. I'd be grateful if there's any help you could provide here... Or forwarding to others who are more knowledgeable than I ;)

@sadielbartholomew
Copy link
Member

Hi Daniel, sorry for the late response (I am a bit snowed under with various tasks at the moment). I'll have a look-see sometime today, though I suspect I won't be able to deduce anything you haven't already about this! I will however try...

@sadielbartholomew
Copy link
Member

I'm starting to push some commits to try to fix the issue, please bare with me (with apologies for directly committing to the branch PR, it's simply a lot easier to do that and you can always revert any changes I make to the workflow in question!).

@erget
Copy link
Member Author

erget commented Jan 13, 2023

I'm starting to push some commits to try to fix the issue, please bare with me (with apologies for directly committing to the branch PR, it's simply a lot easier to do that and you can always revert any changes I make to the workflow in question!).

No need to apologise, in my view that is a good practice. Thank you!

@sadielbartholomew
Copy link
Member

Clearly the external actions workflow version isn't the issue. Trying some other approaches...

@sadielbartholomew
Copy link
Member

Not having much luck but I reckon I can get it to work, despite what my mess of a commit history here suggests 😄 I blame it being a Friday afternoon pre-pub time... I'll have another go later this evening.

@erget
Copy link
Member Author

erget commented Jan 16, 2023

No worries, this one has me stumped too. Regarding the commit history - in the end we'll rebase it so we can understand what happened, so there's no problem there! That's one of the reasons I love git - as granular as you like, and as messy as you are, but it knows how to dress up when it has to.

@sadielbartholomew
Copy link
Member

sadielbartholomew commented Jan 16, 2023

Thanks Daniel.

Regarding the commit history - in the end we'll rebase it so we can understand what happened, so there's no problem there!

Yes, we'll definitely want to squash down the commits I've made.

That's one of the reasons I love git - as granular as you like, and as messy as you are, but it knows how to dress up when it has to.

Same, I also love git for these reasons, though I think its flexibility is a blessing as well as a curse, in that it can allow you to do pretty much anything and sometimes you really shouldn't be doing something! But in this case, it's a simple case of squashing things so hopefully we can't go wrong...

I'm at a conference today and until Wednesday but will continue a bit on this. At this stage, I think I can get something to work with the newer and latest version v1.5.4, which it looks like you tried but gave up on, I am assuming because of the same reason I first thought it wasn't any good, namely that it (as-was in the workflow) returns:

...
Error: Cannot read input content from file `Data/cf-conventions/cf-conventions-1.2/build/apa.html`

Caused by:
    stream did not contain valid UTF-8
...

but I think that's just because Actions or otherwise doesn't like our output format. I will try updating that to use a different format option to avoid the error, and then hopefully the fail: True key will be recognised in v1.5.4. Otherwise, an idea I have is that if conditional steps in Actions workflows seem to be ignored in terms of exit code, rather than passing and failing as steps, so I can somehow get a non-conditional step which takes the relevant exit code to pass or fail based on the link-checking output. Let's see how it goes...

@sadielbartholomew
Copy link
Member

sadielbartholomew commented Jan 16, 2023

Oh, I see GitHub Actions is being marked as 'Degraded' for performance on the GitHub Status site right now, so I might wait until everything is working to avoid the complication of internal issues...

@sadielbartholomew sadielbartholomew force-pushed the exclude-known-domains-that-timeout branch from 652331d to 7a5dace Compare January 20, 2023 16:09
@sadielbartholomew
Copy link
Member

Hi @erget, I believe I have now fixed the relevant workflow steps such that it finally works (hallelujah!), although I felt this too on the journey to getting there 😬 :

I'm at the end of my wits here

As for the fix, and RE your tip which was a good clue (thanks):

The only thing that occurs to me is that we're using an old lychee version. However, if I update to latest, it fails due to some bad characters late in the execution of the Check links job. I'd be grateful if there's any help you could provide here... Or forwarding to others who are more knowledgeable than I ;)

ultimately the lack of the job step failing was due to using the older version of lycheeverse/lychee-action, but with updating to v1.5.4 or (pointing to) master (I chose the latter to keep up to date automatically, but either works, feel free to adapt it) there was a change in behaviour across the versions so that HTML files were also checked by default as well as markdown, and since the HTML files under Data/cf-conventions/cf-conventions-1.2/build/ and Data/cf-conventions/cf-conventions-1.3/build/ had some characters it didn't like (as you noticed) it would then fail. So what worked was to explicitly set the args key with './*.md' as the input (we might want to extend that though, see my next comment).

See the job run for my final commit as a confirmation of the behaviour (failure for the right reason!). The output of that step is now as it was before (at least, the output nature, the results change a bit but with link timeout results that can be expected), but crucially the non-zero exit code fails the job step as you wanted:

Run lycheeverse/lychee-action@master
Run curl -sLO 'https://github.com/lycheeverse/lychee/releases/download/v0.10.3/lychee-v0.10.3-x86_64-unknown-linux-gnu.tar.gz'
lychee
Run /home/runner/work/_actions/lycheeverse/lychee-action/master/entrypoint.sh
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/053657.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2012/055875.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/05[23](https://github.com/cf-convention/cf-convention.github.io/actions/runs/3970131343/jobs/6805495779#step:3:25)34.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052705.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/048064.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2009/047768.html | Failed: Network error: Not Found
## Summary

| Status        | Count |
|---------------|-------|
| 🔍 Total      | 350   |
| ✅ Successful | 337   |
| ⏳ Timeouts   | 0     |
| 🔀 Redirected | 0     |
| 👻 Excluded   | 7     |
| ❓ Unknown    | 0     |
| 🚫 Errors     | 6     |

## Errors per input

### Errors in faq.md

* [404] [https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2012/055875.html](https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2012/055875.html) | Failed: Network error: Not Found
* [404] [https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2009/047768.html](https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2009/047768.html) | Failed: Network error: Not Found
* [404] [https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052705.html](https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052705.html) | Failed: Network error: Not Found
* [404] [https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052334.html](https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052334.html) | Failed: Network error: Not Found
* [404] [https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/048064.html](https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/048064.html) | Failed: Network error: Not Found
* [404] [https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/05[36](https://github.com/cf-convention/cf-convention.github.io/actions/runs/3970131343/jobs/6805495779#step:3:39)57.html](https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/053657.html) | Failed: Network error: Not Found


Error: Process completed with exit code 2.

(The final message is not the most informative but it's clear enough from the report that 2, as would 1, represents a link checking failure overall.)

I've done a self-review but please can you take a quick look and confirm that it is behaving as expected now i.e. failing for the right reason rather than the wrong one! Notably, I don't see a quick way to test that it passes the job step with a successful run finding no bad links i.e. 0 exit code, so if you know a way to test that it would be especially good to verify.

Otherwise, I've tidied up by squashing all of my fix commits into a small number of self-contained blocks. If you could do that with yours (I didn't want to edit your work by squashing them myself) that would be great.

@sadielbartholomew
Copy link
Member

Also, the point (quoting myself in my previous comment):

So what worked was to explicitly set the args key with './*.md' as the input

raises the question of what files exactly we want to check with regards to links. Notably, do we want to check all markdown files (i.e. the glob './**/*.md'), or just those in the top-level directory (glob './*.md'), as currently done (both before and after this PR, at least in the state I have left it with my previous commit)?

FYI, if you want to see the results of the latter, you can check the Actions outputs for my commit e37d569, noting that it checks many more links and finds more errors, but whether they are relevant I am not sure:

...
## Summary

| Status        | Count |
|---------------|-------|
| 🔍 Total      | 499   |
| ✅ Successful | 473   |
| ⏳ Timeouts   | 0     |
| 🔀 Redirected | 0     |
| 👻 Excluded   | 7     |
| ❓ Unknown    | 0     |
| 🚫 Errors     | 19    |
...

Comment on lines +1 to +3
cfeditor.ceda.ac.uk
wps-web1.ceda.ac.uk
http://kitt.llnl.gov/trac/wiki/SatelliteData
Copy link
Member

@sadielbartholomew sadielbartholomew Jan 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move these link exclusions to the workflow file (check_links.yml) now via using --exclude <link> in the args, perhaps, just to keep everything together and since there are only a small number so that it won't bulk up the arg command too much?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a good idea, then everything's in one place 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Daniel - shall I make that conversion now? Or would you like to review first?

Copy link
Member Author

@erget erget Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, go ahead - it'll be transparent in the commit history. I'll cross-check, but with the intent of helping you have peace of mind, not because I feel the need to control.

@erget
Copy link
Member Author

erget commented Jan 23, 2023

@sadielbartholomew thanks for this - prima fascia I think this is great, I'd like to give it the review it deserves so will come back to this in more detail later this week. Hopefully this lets us work on the real issues from here out rather than false positives!

@cofinoa
Copy link
Collaborator

cofinoa commented Apr 23, 2024

@erget and @sadielbartholomew can you update the status of this PR? Currently, has some conflicts, I think minor, but they can grow with more commits.

@erget
Copy link
Member Author

erget commented Apr 24, 2024

@cofinoa thanks for prompting, I was way behind here. I've resolved the conflicts. @sadielbartholomew I'm happy to merge this, are you?

@erget
Copy link
Member Author

erget commented Apr 24, 2024

Looks like the domain exclusions aren't sufficient and there are still a few other problems to solve but it might be good to merge this now so that we're getting fewer errors and continue weeding by keeping this PR alive post-merge. And of course squash when merging.

@JonathanGregory JonathanGregory linked an issue Apr 24, 2024 that may be closed by this pull request
@cofinoa cofinoa merged commit 91e136f into main Apr 24, 2024
1 of 2 checks passed
@cofinoa cofinoa deleted the exclude-known-domains-that-timeout branch April 24, 2024 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Broken web links
3 participants