
Broken web links #318

Closed
github-actions bot opened this issue Jan 2, 2023 · 27 comments · Fixed by #320
Labels
defect Content mistake on the website

Comments

@github-actions

github-actions bot commented Jan 2, 2023

This issue was opened automatically but led to a discussion by humans

Errors were reported while checking the availability of links:
Issues found in 6 inputs. Find details below.

[faq.md]:
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2009/047768.html | Failed: Network error: Not Found
⧖ [TIMEOUT] http://coastwatch.pfeg.noaa.gov/erddap/convert/units.html | Timeout
✗ [ERR] http://kitt.llnl.gov/trac/wiki/SatelliteData | Failed: Network error: dns error: no record found for name: kitt.llnl.gov.coi3uxiffnlergb4vem53tdisf.gx.internal.cloudapp.net. type: AAAA class: IN
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2012/055875.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/053657.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052705.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/048064.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052334.html | Failed: Network error: Not Found

[standard_name_rules.md]:
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=active&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout

[discussion.md]:
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=active&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=inactive&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout

[software.md]:
✗ [ERR] http://wps-web1.ceda.ac.uk/submit/form?proc_id=CFChecker | Failed: Network error: dns error: no record found for name: wps-web1.ceda.ac.uk.coi3uxiffnlergb4vem53tdisf.gx.internal.cloudapp.net. type: AAAA class: IN

[vocabularies.md]:
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=inactive&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=active&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout

[constitution.md]:
✗ [ERR] file:///github/workspace/(https:/github.com/cf-convention/cf-conventions/blob/master/CODE_OF_CONDUCT.md) | Failed: Cannot find file

🔍 350 Total ✅ 335 OK 🚫 9 Errors (HTTP:9|Timeouts:6)

@erget
Member

erget commented Jan 6, 2023

Starting to follow this up a bit. Some questions:

  • @JonathanGregory , I remember you'd done some work importing the mailman archives - can you refresh me on how we concluded there? Did we end up merging that archive or continuing to point to mailman.cgd.ucar.edu? IIRC we never did merge it, but I could be wrong. It looks to me like the archive isn't reachable so it might be worth discussing our stance here again.
  • @JonathanGregory @davidhassell regarding the failing link under software.md, do you know what the status of that server is? I can't get to it. Since it's also available from NCAS via GitHub, we could consider just deleting the link to the CEDA page. What are your thoughts?
  • @JonathanGregory @davidhassell same thing for all pages under cfeditor.ceda.ac.uk - I understand why they're timing out, I haven't been able to load either of them from my browser! But the connection's alive. Is this to be expected? Would it be possible to get that server more responsive, or should we exclude those links from checking? Whilst writing this comment I haven't managed to load the timed out links, but your mileage may vary...

I see there are two options of interest: exclude and timeout. The default timeout is 20 s, which is already fairly long, and I find it reasonable given what I believe to be normal human user behaviour. For the moment I'll try excluding the domains in question. Then, beyond the question of what stance we adopt towards those domains, I think the only remaining "real" issue would be the link in constitution.md.
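For reference, both options can be expressed in lychee's configuration file. This is only a sketch: the key names follow lychee's documented config format (worth double-checking against the lychee docs for the version in use), and the domains listed are simply the ones discussed above.

```toml
# Sketch of a lychee configuration fragment (assumed file name: lychee.toml).

# Seconds to wait before a request is considered timed out (20 is the default).
timeout = 20

# Regex patterns for URLs to skip during checking.
exclude = [
    "cfeditor.ceda.ac.uk",
    "wps-web1.ceda.ac.uk",
]
```

Running `lychee --config lychee.toml '**/*.md'` would then apply both settings in one place rather than on the command line.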

@JonathanGregory
Contributor

Dear Daniel @erget

Thanks for following this up. Yes, we imported the mailman archive. Our copy is linked on the discussion page on the same line, and in front of, the UCAR original. We can fix those missing links by pointing to the appropriate place in our copy, but unfortunately this will take a bit of work because our copy is not grouped into years.

I don't know why the link to the CEDA cf-checker isn't working, but @RosalynHatcher probably could advise.

I don't know why the CEDA editor isn't working. It has worked in the past, although it's always rather slow to answer that query. Maybe Alison @japamment could comment?

Best wishes for 2023

Jonathan

@JonathanGregory
Contributor

I agree that the weekly repetition of the same broken links has been a nuisance and thanks for stopping it. But we mustn't forget to fix them! We should keep this issue open until it's done.

@erget
Member

erget commented Jan 6, 2023

@JonathanGregory I agree in principle. I've now fixed what I think is everything in #320 . What this does:

  • Runs on pull requests to main and periodically via crontab.
  • Opens issues if and only if problems are found that aren't already noted.
  • Describes the ignored links in the GitHub job and implements them in a .lycheeignore file.

So I propose we merge that PR, close this issue plus the other one related to link checking, and open a new one to address the longer-term issues of migrating the mailman archive and working out what's wrong with the domains that are timing out. What do you think?
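As a concrete illustration of the .lycheeignore mechanism: the file lists one URL regex per line, and any matching link is skipped during checking. The patterns below are examples derived from the domains discussed in this thread, not necessarily the file's actual contents.

```
https://mailman\.cgd\.ucar\.edu/pipermail/cf-metadata.*
http://cfeditor\.ceda\.ac\.uk/proposals.*
```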

@erget
Member

erget commented Jan 6, 2023

Never mind... The job isn't failing when I expect it to, and I'm not sure why. I'll need some help to finalise this. Will request that in the requisite PR.

@JonathanGregory
Contributor

I've closed #330 because it reported the same errors as this one.

@JonathanGregory
Contributor

In issue 345, which is the most recent output of the cron job, @DocOtak wrote

@JonathanGregory @erget Do we want to disable the cron tasks for this?

I've been closing the new ones (like 345) every Monday morning, as a human cron daemon. I don't mind doing that, but equally I don't think it helps to have a new one every week until we've fixed the missing links identified by this edition of the report.

@erget
Member

erget commented Mar 21, 2023

@JonathanGregory @DocOtak I agree; we could disable this until we get it fixed - we've made progress on it, but slowly ;) @DocOtak, do you have the rights to disable the cron task, and could it be executed manually in that case?

@JonathanGregory JonathanGregory changed the title Link Checker Report Broken web links Jul 3, 2023
@JonathanGregory JonathanGregory added defect Content mistake on the website and removed report automated issue labels Jul 3, 2023
@JonathanGregory
Contributor

As we have probably fixed all the recurrent broken links on the website, we don't need to disable the link-checker, as discussed in this issue. I will therefore close this issue, and we will see what the link-checker has to say when it next executes.

@JonathanGregory
Contributor

I'm reopening this because Antonio @cofinoa has reinitiated work on it in #320. Thanks, Antonio and others. It'll be good not to have to close an issue every Monday morning. 😄 Also, I'm closing #447, which deals with the same issue.

@cofinoa
Collaborator

cofinoa commented Apr 24, 2024

@erget @JonathanGregory and @sadielbartholomew I have come back to the link checker errors and problems. Apologies if I've missed a related issue that's already open.

I will merge #320, but that doesn't fix all of the missing, timed-out, or vanished (broken) links.

I'm preparing a PR to fix and improve the link checker, silence some permanently broken links, and handle old documents with invalid UTF-8 encoding where the link checker fails with an error.

@JonathanGregory
Contributor

Thanks for working on this, @cofinoa.

@JonathanGregory
Contributor

For reference I am copying here various comments from #486

@cofinoa #486 (comment)

Dear @cf-convention/info-mgmt team, this PR relates to a long-standing issue with the link checker; see #318 and #320.

This is a first step towards fixing the issues with the link checker when PRs are made.

The action is triggered when a PR is opened or reopened and:

  1. first checks that no **/*.md file in the repo has broken links; if any are found, the action fails. The resulting output can be seen in the summary of the action run, including when "everything" is "green". From https://github.com/cf-convention/cf-convention.github.io/actions/runs/8821086583?pr=486 :

Summary

Status Count
🔍 Total 593
✅ Successful 495
⏳ Timeouts 0
🔀 Redirected 0
👻 Excluded 98
❓ Unknown 0
🚫 Errors 0
  • I had to fix some existing links to the rendered .html files and change them to the actual .md files, which is what should be referenced from .md files; see commit 468dece
  • the change to vocabularies.md was tricky because the md file uses HTML tags.
  2. If the .md files are OK, the action checks that building the site with Jekyll works and, if so, uploads the artifact.

Please take a moment to review and let me know if this fits. If so, I will continue with the PR to incorporate link checking of the site on a regular basis (i.e. a cron job every Monday), or we can just merge this PR and open a new one for that.

PS: Annotating the PR with a comment containing the link-check report is a challenge, due to security restrictions on PRs from forks. If a PR comes from the same repo (not a fork), commenting on PRs and issues is possible.

PS2: Checking links to GitHub may run into GitHub's HTTP request rate limits.

@larsbarring #486 (comment)

Hi Antonio,

  • the change at vocabularies.md has been tricky because the md file uses HTML tags.

I was recently adding some minor changes to this file, and noticed that there is actually very little Markdown and a lot of repetitive HTML links. I thought it might be possible to generate this file dynamically during the build process: something like a small [python] script that looks through the relevant ../Data/ directories to see which versions exist, and then assembles the file from that plus md text fragments, either read from file(s) or stored within the script. Could this be something to look further into (I'm afraid it's beyond my skill set)?
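That idea can be sketched in a few lines of Python. Everything here is illustrative: the function name, the skip list, and the link template are assumptions for the sake of example, not the repository's actual layout.

```python
from pathlib import Path


def version_links(base_dir, link_template):
    """List Markdown links for each version subdirectory of base_dir.

    Assumes version directories (e.g. v83, v84) sit directly under
    base_dir alongside non-version entries such as 'docs' or 'current',
    which are skipped. The link target is purely illustrative.
    """
    skip = {"docs", "current"}
    versions = sorted(
        p.name for p in Path(base_dir).iterdir()
        if p.is_dir() and p.name not in skip
    )
    return [link_template.format(v=v) for v in versions]
```

A pre-build script (or Jekyll hook) could then write the returned lines into vocabularies.md between fixed text fragments, so that publishing a new table version requires no manual edit of the page.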

@cofinoa #486 (comment)

@larsbarring I have made a new PR at #487 with your suggestion to refactor vocabularies.md

@JonathanGregory
Contributor

Thanks for #487, Antonio. I don't fully understand this. Is the problem for the link-checker caused by a link to a markdown page from HTML, which is itself wrapped up as a markdown page? That seems rather convoluted. If the whole page is put in markdown instead, does that resolve it?

@JonathanGregory
Contributor

JonathanGregory commented Apr 25, 2024

From @cofinoa #487

@larsbarring, as you suggested in PR #486, I have refactored vocabularies.md into actual Markdown instead of HTML, and I have also generated some of the lists automatically.

Jekyll is quite limited at managing data and/or strings; a Jekyll plugin would be needed to improve that, but my Ruby skills are also quite limited.

From @larsbarring #487 (comment)

The lists of links will be automatically generated from the existing version subdirectories under the ../Data/ directory. I think this will simplify the workflow for publishing a new version of the Standard Name Table and Area Type Table (the versions of the Standardized Regions List are hardcoded?).

Hence pinging @japamment, @efisher008

@cofinoa
Collaborator

cofinoa commented Apr 25, 2024

@JonathanGregory

Is this problem to the link-checker caused by a link to a markdown page from HTML, which is itself wrapped up as a markdown page?

It's a problem to link to an .html page which is built from a .md page. But because the content is HTML, we cannot link to the .md page, as we do from other .md pages with Markdown content.

I have rewritten the HTML content of vocabularies.md as Markdown in commit f7f145a, along with commit 468dece, which fixes links to .html pages that are really .md pages.

@cofinoa
Collaborator

cofinoa commented Apr 25, 2024

@larsbarring

The lists of links will be automatically generated from existing version subdirectories under the ../Data/ directory I think that the workflow will be simplified for publishing a new version of the Standard Name Table, Area Type Table (the versions of the Standardized Regions List is hardcoded?).

I have closed PR #487 because automatic generation of links to ./Data content has some pitfalls that need to be overcome by refactoring some directories in Data/. For example, the directory with v84 of standard-names also contains a current directory, which I think is a bug that needs to be fixed.

@cofinoa
Collaborator

cofinoa commented Apr 25, 2024

@JonathanGregory

As we have probably fixed all the recurrent broken links on the website, we don't need to disable the link-checker, as discussed in this issue. I will therefore close this issue, and we will see what the link-checker has to say when it next executes.

There are still some temporary issues with some links; for example, I have had to exclude
https://mmisw.org/ont

in the lychee.toml configuration file (see 65da6b2).

@larsbarring
Contributor

@cofinoa OK, I agree that having a closer look at the directory structure under ./Data is worthwhile. I came to the same conclusion when working on the old versions of the standard name table. It also has a bearing on this discussion. So, let's come back to this idea in a while.

@JonathanGregory
Contributor

There was no new broken-links report this morning, I am very pleased to see. Thanks for suppressing it, @cofinoa! Shall we close this issue, or is it still a work in progress?

@cofinoa
Collaborator

cofinoa commented Apr 29, 2024

@JonathanGregory, it's still in progress.

If it's OK, I would like to merge PR #486, which is an intermediate step towards solving this issue.

@JonathanGregory
Contributor

That's fine. Let's leave it open then. Thanks.

@cofinoa
Collaborator

cofinoa commented Apr 29, 2024

I have created 2 workflows/actions:

  1. check_jekyll_build.yml: an action with two main jobs, triggered when a PR is created:
    A. check links in the Markdown files (./**/*.md)
    B. check that Jekyll can build the website
  2. check_links_cron.yml: another action that runs on Mondays, also with two main jobs:
    C. check that Jekyll can build the website
    D. check links on the site built in job C; if this fails, a new issue is opened: Broken links detected in CF Website #490

The exclusion rules are in .lychee/config.toml, which is used by both actions (1 and 2), but we can create separate configurations for each action if needed.

Currently, I have excluded the following URLs:

exclude = [
    # Data/cf-standard-names/
    "http://glossary.ametsoc.org/wiki",                   
    "https://www.unidata.ucar.edu/software/udunits/udunits-current/doc/udunits",
    "https://www.unidata.ucar.edu/software/udunits/udunits-2.2.28/udunits2.html", 
    "https://www.sciencedirect.com/science/article/pii/0967063793901018",
    "https://www.ipcc.ch/ipccreports/tar/wg1/273.htm",
    "http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata",
    "http://gcmd.nasa.gov/Resources/valids",
    #
    "cfeditor.ceda.ac.uk",                                # standard_name_rule, vocabularies, discussion
    "https://mailman.cgd.ucar.edu/pipermail/cf-metadata", # discussion, governance
    "http://mmisw.org/ont", # faq (TIMEOUT)
    "https://mmisw.org/ont", # faq (TIMEOUT)
    "http://www.cgd.ucar.edu/cms/eaton/cf-metadata/clivar_article.pdf", # Data/cf-documents/cf-governance/cf2_whitepaper_final.html
    "http://www.cgd.ucar.edu/cms/eaton/cf-metadata/CF-current.html", # Data/cf-documents/requirements-recommendations
    "https://www.usbr.gov/lc/socal/reports/SMappend_C.pdf", # Data/area-type-table/**/build/area-type-table.html
    "https://cf-trac.llnl.gov/trac/",                     # 2018-Workshop, 2019-Workshop
    "http://mailman.cgd.ucar.edu/pipermail/cf-metadata",  # 2019-Workshop
    "https://www.wonder.me",                              # 2021-Workshop
    "https://figshare.com/account/articles/24633939",     # 2023-Workshop
    "https://figshare.com/account/articles/24633894",     # 2023-Workshop
]

Some of the excluded URLs are spurious broken links, i.e. only temporarily broken.

Others are permanently broken, and we need to decide what to do with them [1].

Also, I have excluded some paths from checking, mainly because they contain documents with invalid encoding or many broken relative links (e.g. the Trac tickets):

exclude_path = [
    "_site/Data/cf-standard-names/docs/guidelines.html",
    "_site/Data/cf-conventions/",
    "_site/Data/Trac-tickets/",
]

regards

[1]
For example, for
https://www.ipcc.ch/ipccreports/tar/wg1/273.htm

we could link to a capture from the Wayback Machine:
https://web.archive.org/web/20181104000136/http://www.ipcc.ch/ipccreports/tar/wg1/273.htm

@cofinoa
Collaborator

cofinoa commented May 7, 2024

@JonathanGregory

I have improved the weekly cron workflow for the link checker (check_links_cron.yml). Now, if an issue with the labels
defect, link-checker, report, automated issue
is open, the action inserts a new comment into it. If no issue with those labels is open, a new issue is opened.

You can see a sample at issue #493

@JonathanGregory
Contributor

That's a very useful improvement. Thanks, Antonio.

@cofinoa
Collaborator

cofinoa commented May 13, 2024

@JonathanGregory et al., issue #493 with the broken-link report has been updated, and a new comment has been added to it by today's checker cron job:
#493 (comment)

I have re-run the checker manually; "new" errors appear and others disappear. The issue has been updated with the report from this manual check:
#493 (comment)

IMO, there are two pending actions we need to discuss:

  1. How should we proceed with these spurious check errors? How long must an error persist, and/or how many times must it appear, before we act on it (i.e. exclude it)?
  2. Some existing links and exclusions need to be reviewed, similarly to the previous point.

It may be useful to add this to the agenda of the next Information Management Team meeting @cf-convention/info-mgmt

@cofinoa
Collaborator

cofinoa commented May 21, 2024

I'm closing this to continue discussion at https://github.com/orgs/cf-convention/discussions/320

@cofinoa cofinoa closed this as completed May 21, 2024