
Broken web links #318

Closed
github-actions bot opened this issue Jan 2, 2023 · 27 comments · Fixed by #320
Labels
defect Content mistake on the website

Comments

@github-actions

github-actions bot commented Jan 2, 2023

This issue was opened automatically but led to a discussion by humans

Errors were reported while checking the availability of links:
Issues found in 6 inputs. Find details below.

[faq.md]:
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2009/047768.html | Failed: Network error: Not Found
⧖ [TIMEOUT] http://coastwatch.pfeg.noaa.gov/erddap/convert/units.html | Timeout
✗ [ERR] http://kitt.llnl.gov/trac/wiki/SatelliteData | Failed: Network error: dns error: no record found for name: kitt.llnl.gov.coi3uxiffnlergb4vem53tdisf.gx.internal.cloudapp.net. type: AAAA class: IN
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2012/055875.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/053657.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052705.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/048064.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052334.html | Failed: Network error: Not Found

[standard_name_rules.md]:
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=active&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout

[discussion.md]:
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=active&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=inactive&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout

[software.md]:
✗ [ERR] http://wps-web1.ceda.ac.uk/submit/form?proc_id=CFChecker | Failed: Network error: dns error: no record found for name: wps-web1.ceda.ac.uk.coi3uxiffnlergb4vem53tdisf.gx.internal.cloudapp.net. type: AAAA class: IN

[vocabularies.md]:
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=inactive&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=active&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout

[constitution.md]:
✗ [ERR] file:///github/workspace/(https:/github.com/cf-convention/cf-conventions/blob/master/CODE_OF_CONDUCT.md) | Failed: Cannot find file

🔍 350 Total ✅ 335 OK 🚫 9 Errors (HTTP:9|Timeouts:6)

@erget
Member

erget commented Jan 6, 2023

Starting to follow this up a bit. Some questions:

  • @JonathanGregory , I remember you'd done some work importing the mailman archives - can you refresh me on how we concluded there? Did we end up merging that archive or continuing to point to mailman.cgd.ucar.edu? IIRC we never did merge it, but I could be wrong. It looks to me like the archive isn't reachable so it might be worth discussing our stance here again.
  • @JonathanGregory @davidhassell regarding the failing link under software.md, do you know what the status of that server is? I can't get to it. Since it's also available from NCAS via GitHub, we could consider just deleting the link to the CEDA page. What are your thoughts?
  • @JonathanGregory @davidhassell same thing for all pages under cfeditor.ceda.ac.uk - I understand why they're timing out, I haven't been able to load either of them from my browser! But the connection's alive. Is this to be expected? Would it be possible to get that server more responsive, or should we exclude those links from checking? Whilst writing this comment I haven't managed to load the timed out links, but your mileage may vary...

I see there are two options of interest: exclude and timeout. The default timeout is 20 s, which is already fairly long, and I find it reasonable given what I believe to be normal human user behaviour. For the moment I'll try excluding the domains in question. Then, beyond the question of what stance we adopt towards those domains, I think the only remaining "real" issue would be the link in constitution.md.
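For reference, both options can be expressed in lychee's configuration file. This is only a sketch: the key names follow lychee's documented config format (worth double-checking against the lychee docs for the version in use), and the domains listed are simply the ones discussed above.

```toml
# Sketch of a lychee configuration fragment (assumed file name: lychee.toml).

# Seconds to wait before a request is considered timed out (20 is the default).
timeout = 20

# Regex patterns for URLs to skip during checking.
exclude = [
    "cfeditor.ceda.ac.uk",
    "wps-web1.ceda.ac.uk",
]
```

Running `lychee --config lychee.toml '**/*.md'` would then apply both settings in one place rather than on the command line.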

@JonathanGregory
Contributor

Dear Daniel @erget

Thanks for following this up. Yes, we imported the mailman archive. Our copy is linked on the discussion page on the same line, and in front of, the UCAR original. We can fix those missing links by pointing to the appropriate place in our copy, but unfortunately this will take a bit of work because our copy is not grouped into years.

I don't know why the link to the CEDA cf-checker isn't working, but @RosalynHatcher probably could advise.

I don't know why the CEDA editor isn't working. It has worked in the past, although it's always rather slow to answer that query. Maybe Alison @japamment could comment?

Best wishes for 2023

Jonathan

@JonathanGregory
Contributor

I agree that the weekly repetition of the same broken links has been a nuisance and thanks for stopping it. But we mustn't forget to fix them! We should keep this issue open until it's done.

@erget
Member

erget commented Jan 6, 2023

@JonathanGregory I agree in principle. I've now fixed what I think is everything in #320 . What this does:

  • Runs on pull requests to main and periodically via crontab.
  • Opens issues if and only if problems are found that aren't already noted.
  • Describes the ignored links in the GitHub job and implements them in a .lycheeignore file.

So I propose we merge that PR, close this issue plus the other one related to link checking, and open a new one to address the longer-term issues of migrating the mailman archive and working out what's wrong with the domains that are timing out. What do you think?
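As a concrete illustration of the .lycheeignore mechanism: the file lists one URL regex per line, and any matching link is skipped during checking. The patterns below are examples derived from the domains discussed in this thread, not necessarily the file's actual contents.

```
https://mailman\.cgd\.ucar\.edu/pipermail/cf-metadata.*
http://cfeditor\.ceda\.ac\.uk/proposals.*
```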

@erget
Member

erget commented Jan 6, 2023

Never mind... The job isn't failing when I expect it to, and I'm not sure why. I'll need some help to finalise this. Will request that in the requisite PR.

@JonathanGregory
Contributor

I've closed #330 because it reported the same errors as this one.

@JonathanGregory
Contributor

In issue 345, which is the most recent output of the cron job, @DocOtak wrote

@JonathanGregory @erget Do we want to disable the cron tasks for this?

I've been closing the new ones (like 345) every Monday morning, as a human cron daemon. I don't mind doing that, but equally I don't think it helps to have a new one every week until we've fixed the missing links identified by this edition of the report.

@erget
Member

erget commented Mar 21, 2023

@JonathanGregory @DocOtak I agree; we could disable this until we get it fixed - we've made progress on it, but slowly ;) @DocOtak, do you have the rights to disable the cron task, and could it be executed manually in that case?

@JonathanGregory JonathanGregory changed the title Link Checker Report Broken web links Jul 3, 2023
@JonathanGregory JonathanGregory added defect Content mistake on the website and removed report automated issue labels Jul 3, 2023
@JonathanGregory
Contributor

As we have probably fixed all the recurrent broken links on the website, we don't need to disable the link-checker, as discussed in this issue. I will therefore close this issue, and we will see what the link-checker has to say when it next executes.

@JonathanGregory
Contributor

I'm reopening this because Antonio @cofinoa has reinitiated work on it in #320. Thanks, Antonio and others. It'll be good not to have to close an issue every Monday morning. 😄 Also, I'm closing #447, which deals with the same issue.

@cofinoa
Collaborator

cofinoa commented Apr 24, 2024

@erget @JonathanGregory and @sadielbartholomew I have come back to the link checker errors and problems. Apologies if I've missed a related issue that's already open.

I will merge #320, but that doesn't fix all of the missing, timed-out, or vanished (broken) links.

I'm preparing a PR to fix and improve the link checker, silence some permanently broken links, and handle old documents with invalid UTF-8 encoding where the link checker fails with an error.

@JonathanGregory
Contributor

Thanks for working on this, @cofinoa.

@JonathanGregory
Contributor

For reference I am copying here various comments from #486

@cofinoa #486 (comment)

Dear @cf-convention/info-mgmt team, this PR relates to a long-standing issue with the link checker; see #318 and #320.

This is a first step towards fixing the issues with the link checker when PRs are made.

The action is triggered when a PR is opened or reopened and:

  1. first checks that no **/*.md file in the repo has broken links; if any are found, the action fails. The resulting output can be seen in the summary of the action run, including when "everything" is "green". From https://github.com/cf-convention/cf-convention.github.io/actions/runs/8821086583?pr=486 :

Summary

Status Count
🔍 Total 593
✅ Successful 495
⏳ Timeouts 0
🔀 Redirected 0
👻 Excluded 98
❓ Unknown 0
🚫 Errors 0
  • I had to fix some existing links to the rendered .html files and change them to the actual .md files, which is what should be referenced from .md files; see commit 468dece
  • the change to vocabularies.md was tricky because the md file uses HTML tags.
  2. If the .md files are OK, the action checks that building the site with Jekyll works and, if so, uploads the artifact.

Please take a moment to review and let me know if this fits. If so, I will continue with the PR to incorporate link checking of the site on a regular basis (i.e. a cron job every Monday), or we can just merge this PR and open a new one for that.

PS: Annotating the PR with a comment containing the link-check report is a challenge, due to security restrictions on PRs from forks. If a PR comes from the same repo (not a fork), commenting on PRs and issues is possible.

PS2: Checking links to GitHub may run into GitHub's HTTP request rate limits.

@larsbarring #486 (comment)

Hi Antonio,

  • the change at vocabularies.md has been tricky because the md file uses HTML tags.

I was recently adding some minor changes to this file, and noticed that there is actually very little Markdown and a lot of repetitive HTML links. I thought it might be possible to generate this file dynamically during the build process: something like a small [python] script that looks through the relevant ../Data/ directories to see which versions exist, and then assembles the file from that plus md text fragments, either read from file(s) or stored within the script. Could this be something to look further into (I'm afraid it's beyond my skill set)?
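That idea can be sketched in a few lines of Python. Everything here is illustrative: the function name, the skip list, and the link template are assumptions for the sake of example, not the repository's actual layout.

```python
from pathlib import Path


def version_links(base_dir, link_template):
    """List Markdown links for each version subdirectory of base_dir.

    Assumes version directories (e.g. v83, v84) sit directly under
    base_dir alongside non-version entries such as 'docs' or 'current',
    which are skipped. The link target is purely illustrative.
    """
    skip = {"docs", "current"}
    versions = sorted(
        p.name for p in Path(base_dir).iterdir()
        if p.is_dir() and p.name not in skip
    )
    return [link_template.format(v=v) for v in versions]
```

A pre-build script (or Jekyll hook) could then write the returned lines into vocabularies.md between fixed text fragments, so that publishing a new table version requires no manual edit of the page.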

@cofinoa #486 (comment)

@larsbarring I have made a new PR at #487 with your suggestion to refactor vocabularies.md

@JonathanGregory
Contributor

Thanks for #487, Antonio. I don't fully understand this. Is the problem for the link-checker caused by a link to a markdown page from HTML, which is itself wrapped up as a markdown page? That seems rather convoluted. If the whole page is put in markdown instead, does that resolve it?

@JonathanGregory
Contributor

JonathanGregory commented Apr 25, 2024

From @cofinoa #487

@larsbarring, as you suggested in PR #486, I have refactored vocabularies.md into actual Markdown instead of HTML, and I have also generated some of the lists automatically.

Jekyll is quite limited at managing data and/or strings; a Jekyll plugin would be needed to improve that, but my Ruby skills are also quite limited.

From @larsbarring #487 (comment)

The lists of links will be automatically generated from the existing version subdirectories under the ../Data/ directory. I think this will simplify the workflow for publishing a new version of the Standard Name Table and Area Type Table (the versions of the Standardized Regions List are hardcoded?).

Hence pinging @japamment, @efisher008

@cofinoa
Collaborator

cofinoa commented Apr 25, 2024

@JonathanGregory

Is this problem to the link-checker caused by a link to a markdown page from HTML, which is itself wrapped up as a markdown page?

It's a problem to link to an .html page which is built from a .md page. But because the content is HTML, we cannot link to the .md page, as we do from other .md pages with Markdown content.

I have rewritten the HTML content of vocabularies.md as Markdown in commit f7f145a, along with commit 468dece, which fixes links to .html pages that are really .md pages.

@cofinoa
Collaborator

cofinoa commented Apr 25, 2024

@larsbarring

The lists of links will be automatically generated from existing version subdirectories under the ../Data/ directory I think that the workflow will be simplified for publishing a new version of the Standard Name Table, Area Type Table (the versions of the Standardized Regions List is hardcoded?).

I have closed PR #487 because automatic generation of links to ./Data content has some pitfalls that need to be overcome by refactoring some directories in Data/. For example, the directory with v84 of standard-names also contains a current directory, which I think is a bug that needs to be fixed.

@cofinoa
Collaborator

cofinoa commented Apr 25, 2024

@JonathanGregory

As we have probably fixed all the recurrent broken links on the website, we don't need to disable the link-checker, as discussed in this issue. I will therefore close this issue, and we will see what the link-checker has to say when it next executes.

There are still some temporary issues with some links; for example, I have had to exclude
https://mmisw.org/ont

in the lychee.toml configuration file (see 65da6b2).

@larsbarring
Contributor

@cofinoa OK, I agree that having a closer look at the directory structure under ./Data is worthwhile. I came to the same conclusion when working on the old versions of the standard name table. It also has a bearing on this discussion. So, let's come back to this idea in a while.

@JonathanGregory
Contributor

There was no new broken-links report this morning, I am very pleased to see. Thanks for suppressing it, @cofinoa! Shall we close this issue, or is it still a work in progress?

@cofinoa
Collaborator

cofinoa commented Apr 29, 2024

@JonathanGregory, it's still in progress.

If it's OK, I would like to merge PR #486, which is an intermediate step towards solving this issue.

@JonathanGregory
Contributor

That's fine. Let's leave it open then. Thanks.

@cofinoa
Collaborator

cofinoa commented Apr 29, 2024

I have created 2 workflows/actions:

  1. check_jekyll_build.yml: an action with two main jobs, triggered when a PR is created:
    A. check links in the Markdown files (./**/*.md)
    B. check that Jekyll can build the website
  2. check_links_cron.yml: another action that runs on Mondays, also with two main jobs:
    C. check that Jekyll can build the website
    D. check links on the site built in job C; if this fails, a new issue is opened: Broken links detected in CF Website #490

The exclusion rules are in .lychee/config.toml, which is used by both actions (1 and 2), but we can create separate configurations for each action if needed.

Currently, I have excluded the following URLs:

exclude = [
    # Data/cf-standard-names/
    "http://glossary.ametsoc.org/wiki",                   
    "https://www.unidata.ucar.edu/software/udunits/udunits-current/doc/udunits",
    "https://www.unidata.ucar.edu/software/udunits/udunits-2.2.28/udunits2.html", 
    "https://www.sciencedirect.com/science/article/pii/0967063793901018",
    "https://www.ipcc.ch/ipccreports/tar/wg1/273.htm",
    "http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata",
    "http://gcmd.nasa.gov/Resources/valids",
    #
    "cfeditor.ceda.ac.uk",                                # standard_name_rule, vocabularies, discussion
    "https://mailman.cgd.ucar.edu/pipermail/cf-metadata", # discussion, governance
    "http://mmisw.org/ont", # faq (TIMEOUT)
    "https://mmisw.org/ont", # faq (TIMEOUT)
    "http://www.cgd.ucar.edu/cms/eaton/cf-metadata/clivar_article.pdf", # Data/cf-documents/cf-governance/cf2_whitepaper_final.html
    "http://www.cgd.ucar.edu/cms/eaton/cf-metadata/CF-current.html", # Data/cf-documents/requirements-recommendations
    "https://www.usbr.gov/lc/socal/reports/SMappend_C.pdf", # Data/area-type-table/**/build/area-type-table.html
    "https://cf-trac.llnl.gov/trac/",                     # 2018-Workshop, 2019-Workshop
    "http://mailman.cgd.ucar.edu/pipermail/cf-metadata",  # 2019-Workshop
    "https://www.wonder.me",                              # 2021-Workshop
    "https://figshare.com/account/articles/24633939",     # 2023-Workshop
    "https://figshare.com/account/articles/24633894",     # 2023-Workshop
]

Some of the excluded URLs are spurious broken links, i.e. only temporarily broken.

Others are permanently broken, and we need to decide what to do with them [1].

Also, I have excluded some paths from checking, mainly because they contain documents with invalid encoding or many broken relative links (e.g. the Trac tickets):

exclude_path = [
    "_site/Data/cf-standard-names/docs/guidelines.html",
    "_site/Data/cf-conventions/",
    "_site/Data/Trac-tickets/",
]

regards

[1]
For example, for
https://www.ipcc.ch/ipccreports/tar/wg1/273.htm

we could link to a capture from the Wayback Machine:
https://web.archive.org/web/20181104000136/http://www.ipcc.ch/ipccreports/tar/wg1/273.htm

@cofinoa
Collaborator

cofinoa commented May 7, 2024

@JonathanGregory

I have improved the weekly cron workflow for the link checker (check_links_cron.yml). Now, if an issue with the labels
defect, link-checker, report, automated issue
is open, the action inserts a new comment into it. If no issue with those labels is open, a new issue is opened.

You can see a sample at issue #493

@JonathanGregory
Contributor

That's a very useful improvement. Thanks, Antonio.

@cofinoa
Collaborator

cofinoa commented May 13, 2024

@JonathanGregory et al., issue #493 with the broken-link report has been updated, and a new comment has been added to it by today's checker cron job:
#493 (comment)

I have re-run the checker manually; "new" errors appear and others disappear. The issue has been updated with the report from this manual check:
#493 (comment)

IMO, there are two pending actions we need to discuss:

  1. How should we proceed with these spurious check errors? How long must an error persist, and/or how many times must it appear, before we act on it (i.e. exclude it)?
  2. Some existing links and exclusions need to be reviewed, similarly to the previous point.

It may be useful to add this to the agenda of the next Information Management Team meeting @cf-convention/info-mgmt

@cofinoa
Collaborator

cofinoa commented May 21, 2024

I'm closing this to continue discussion at https://github.com/orgs/cf-convention/discussions/320

@cofinoa cofinoa closed this as completed May 21, 2024