Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Public regex #77

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

RiccardoM
Copy link

This PR adds a new Public function that returns a regex that matches public URLs. Such URLs are defined as:

  • optionally using either http or https as their protocol (all other protocols will not be matched)
  • having a domain that uses a valid TLD
  • not being a direct IP address

Probably the Public name is not the best one there could be (as IP addresses are also public), so if anyone has any suggestion please feel free to chime in.

@mvdan
Copy link
Owner

mvdan commented Feb 11, 2024

Probably the Public name is not the best one there could be (as IP addresses are also public), so if anyone has any suggestion please feel free to chime in.

Is this concept of "public URL" well understood or defined somewhere, or did you come up with it here?

Personally, I think the better approach would be to expose more information with the existing regular expressions, like we did in 09d66fb, and then you can use Relaxed and do any filtering that you see fit. You could, for example, discard any matches with a scheme other than http or https, discard any hosts which use an IP address, or discard any TLD not part of an allowlist.

This approach wouldn't be significantly faster or slower I think, but what matters is that it would be more configurable to one's needs. Unless "public URL" is a very well defined and understood concept, I think that would be the way to go.

@orn1983
Copy link

orn1983 commented Mar 12, 2024

I was considering forking and doing a PR for the a very similar thing.

I would agree that there might be a lacking concensus of what a Public URL actually is, but the function could be renamed to be WebURL or something like that to filter out all other shcemas other than http(s). Then you could have a RelaxedWebURL that would allow for either http(s) or schemaless.

Alternatively, it would be nice to have a function that would accept a slice of schemas as its input, along with a relaxed boolean to accept schemaless as well.

I can see in the commit you referenced that schemeless URLs were suggested to be filtered in the same manner. Maybe this is the right way, but it seems a bit user unfriendly.

@cspeidel
Copy link
Contributor

The public suffix list initiative from Mozilla has defined a public suffix (aka effective TLD or eTLD). A good description here:
https://pkg.go.dev/golang.org/x/net/publicsuffix. Often people care about eTLD+1.

A "public suffix" is one under which Internet users can (or historically could) directly register names. Some examples of public suffixes are .com, .co.uk and pvt.k12.ma.us. The Public Suffix List is a list of all known public suffixes.

@mvdan
Copy link
Owner

mvdan commented Mar 30, 2024

@cspeidel we already use the publicsuffix list for TLDs:

fromURL("https://publicsuffix.org/list/effective_tld_names.dat", `^[^/.]+$`)

It also occurs to me that this is almost exactly xurls.StrictMatchingScheme(`https?://`) as shown in the existing example. The only difference from what @RiccardoM lists in the original three requirements is that IP addresses are also matched. If we expose a bit more information about what parts of a regular expression were matched, like in 09d66fb, then one could do something like:

rx, err := xurls.StrictMatchingScheme(`https?://`)
if err != nil {
	panic(err)
}
idxTLD := rx.SubexpIndex("tld")
for _, match := range rx.FindAllStringSubmatch(s, -1) {
	if match[idxTLD] == "" {
		continue // skip matches without a TLD, e.g. ipv4 or ipv6
	}
	fmt.Println(match[0])
}

This is slightly more code, but it gives the end user a lot more flexibility in choosing what schemes, TLDs, or hostnames are acceptable. For example, I would argue that github.com/foo is as much of a "web URL" as https://github.com/foo, and both are linkified by many apps the same way, so I'm not sure we can agree on a single definition for "public web URL". Similarly, the example I share above matches punycode TLDs as well - if you wanted to not include those, the addition of the tld subexpression index allows you to filter the results however you want.

All the above said, I agree that there should be top-level funcs for common patterns, and that's why I added xurls.StrictMatchingScheme with the example in the first place. I'm just not convinced that we all want the same out of a "public web URL" feature, so the only option that I think is reasonable is to offer more flexibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants