Create a standard email field verification Regular Expression (or find and verify one) #39
Comments
This is a promising solution if we could figure out a way to standardize it for our projects. Plus lots of examples and resources here: |
A Stack Overflow answer has a pretty awesome regexp pattern with a state machine diagram: https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression But considering that different languages have different regexp syntaxes, it might be better to designate a validation library for each language we use. For nodejs |
I would like to humbly propose a solution which performs as well as the RFC5322 Official Standard (in my particular test set) but is much easier to understand and verify.
This regex can be tested here: https://regex101.com/r/A9jZZ4/4 The following email addresses expectedly pass this validation:
The following email addresses expectedly fail this validation:
I would appreciate if others would throw some other test cases against this regex and try to break it. For reference, here is the RFC 5322 Standard I am comparing against.
Found at http://emailregex.com/ |
Running through the validation examples from
Most notable is the lack of UTF8 support and hyphen handling. False positives:
False negatives:
|
I'm less concerned about hyphen support, in terms of hitting that balance of simplicity vs. complete accuracy to RFC 5322. A false positive is not a big deal, whereas a false negative would stop a valid user from accessing a service. With that in mind, the false negatives do seem like a problem. How common is UTF-8 support among the major email providers? And what percentage of users would hit that use case? If we're talking < 1%, I would rather just tell a user to use a different email address. Let me know what you guys think. |
Personally, I've known people from many groups in various parts of the world, and as far as I recall, almost all of them used plain ASCII characters in their email addresses, web addresses, and IMs. So I don't think UTF-8 support is a big deal. |
Frankly, I think it is more important to adopt a library for this concern than to bless a regex to be copied into all projects. Having a small clever regex pattern to stamp out is cool, but it runs afoul of the DRY principle: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself. The argument for simplicity makes more sense if we're the ones maintaining the code, which, for something as common as email validation, can we not? Emoji is another reason to support UTF-8: https://medium.com/@zackbloom/i-have-a-unicode-email-address-fbecd630ec12 If we're good at our jobs, our software should live to see a day when UTF-8 is more common in email addresses. Since we're here to address email validation, let's do it so we don't have to again. |
I don't disagree. My goal in this particular task was to find a good front-end validation for email that gives a user immediate feedback to avoid typos, not necessarily to vet and validate all possible correct email addresses (we can leave that to the 3rd party email service). The issue I see with using someone else's library for this is that we support and develop for many frontend frameworks (ionic, react, .net mvc, nativescript, xamarin, ...). One library would not work across all of those. A regex line would. I imagine this being the beginning of a Shift3 internal library of common functions, which we could build out for all of our primary development. If these things were rolled into our own libraries, we'd be respecting DRY way more than we do nowadays (across projects, not necessarily per individual project). @zbyte64 I'm definitely open to other suggestions as well. Let me know if there was a particular library you had in mind, or if there is something you're already doing on your projects that you really like. |
@michaelachrisco 3 years later and this is still a recurring issue in projects. Now that we have boilerplates to implement a standard, I think this is a good time to resurface this. Now that we're supporting locale translation in the boilerplates, I think the UTF-8 argument has some more strength behind it. I suspect for client-side validation we will still be served best by simple and permissive validation, as opposed to strict and technical. What do you think? |
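As a rough illustration of what "simple and permissive" could mean on the client, here is a minimal sketch. It assumes the only goal is catching obvious typos, with real verification left to the confirmation email. The name `looksLikeEmail` and the pattern itself are hypothetical, not an agreed standard.

```typescript
// Hypothetical permissive check: exactly one "@", at least one character on
// each side, and no whitespace. Deliberately NOT RFC 5322 compliant.
// Non-ASCII (UTF-8) characters are accepted because only "@" and whitespace
// are excluded.
const PERMISSIVE_EMAIL = /^[^\s@]+@[^\s@]+$/;

function looksLikeEmail(value: string): boolean {
  // Trim first so stray leading or trailing spaces from copy-paste are ignored.
  return PERMISSIVE_EMAIL.test(value.trim());
}

console.log(looksLikeEmail('admin@mailserver1'));
console.log(looksLikeEmail('Abc.example.com'));
```

A check this loose will let through many addresses that are not deliverable, which is the trade-off being argued for: a false positive costs a bounced confirmation email, while a false negative locks out a real user.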
Adding that I agree with Justin Schiff's assessment here:
|
I noticed we do have an example documented in best practices here: For this to be a completed standard, we should include a definition for our goal on what should and shouldn't pass this validation. It should also include a set of unit tests to verify that goal. |
The current RegEx in the Angular boilerplate is the following:
/^[a-z0-9!#$%&'*+\/=?^_\`{|}~.-]+@[a-z0-9]([a-z0-9-])+(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$/i
For the test sets you provided above, all of the addresses that should match do. Of the ones that should fail, the commented-out entries below pass instead.
const failingValues: string[] = [
// '[email protected]', //
'@test.com',
// 'admin@mailserver1', //
`"()<>[]:,;@\\\"!#$%&'-/=?^_\`{}| ~.a"@example.org`,
'user@[2001:DB8::1]',
'" "@example.org',
'[email protected]',
'"very.(),:;<>[]".VERY."very@\\ "very".unusual"@strange.example.com',
'Abc.example.com ',
'A@b@[email protected]',
'a"b(c)d,e:f;g<h>i[jk][email protected]',
'just"not"[email protected]',
'this is"[email protected]',
'this still"not\\[email protected]',
// '[email protected]', //
'[email protected]',
'[email protected].',
];
I do have unit tests for the validator using the regular expression, but I can add the test sets as follows:
describe('[Unit] EmailValidation validEmail() Required', () => {
const urlValidator = EmailValidation.validEmail(true);
const emailControl = new FormControl('');
const matchingValues: string[] = [
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'[email protected]',
'1234567890123456789012345678901234567890123456789012345678901234+x@example.com',
];
const failingValues: string[] = [
'@test.com',
`"()<>[]:,;@\\\"!#$%&'-/=?^_\`{}| ~.a"@example.org`,
'user@[2001:DB8::1]',
'" "@example.org',
'[email protected]',
'"very.(),:;<>[]".VERY."very@\\ "very".unusual"@strange.example.com',
'Abc.example.com ',
'A@b@[email protected]',
'a"b(c)d,e:f;g<h>i[jk][email protected]',
'just"not"[email protected]',
'this is"[email protected]',
'this still"not\\[email protected]',
'[email protected]',
'[email protected].',
];
it(`should return null if value matches a list of values that should work`, () => {
matchingValues.forEach((value) => {
emailControl.setValue(value);
expect(urlValidator(emailControl)).toEqual(null);
});
});
it(`should return { invalidEmail: 'Please enter a valid email.' } if value matches a list of values that should fail`, () => {
failingValues.forEach((value) => {
emailControl.setValue(value);
const expectedValue = {
invalidEmail: 'Please enter a valid email.',
};
expect(urlValidator(emailControl)).toEqual(expectedValue);
});
});
});
We can decide whether to keep the current RegEx or change it, and whether to add the above test values. Either way, the boilerplate also follows the recommendations that @DropsOfSerenity posted above: it requires confirming the email address and sends an activation email to that account. |
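To try the expression outside Angular, a framework-free sketch like the following may help. The pattern is the one quoted above, with the backslash before the backtick dropped (it is not needed in a regex literal and does not change what is matched). The candidate list uses only addresses that are fully legible in this thread, plus one illustrative consecutive-dot address that is my own assumption.

```typescript
// Expression from the Angular boilerplate as quoted above (case-insensitive).
const BOILERPLATE_EMAIL =
  /^[a-z0-9!#$%&'*+\/=?^_`{|}~.-]+@[a-z0-9]([a-z0-9-])+(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$/i;

const candidates: string[] = [
  '@test.com',
  'admin@mailserver1',
  'user@[2001:DB8::1]',
  '" "@example.org',
  'john..doe@example.com', // illustrative, not taken from the lists above
];

// Print whether the expression accepts or rejects each candidate.
for (const address of candidates) {
  const verdict = BOILERPLATE_EMAIL.test(address) ? 'accepted' : 'rejected';
  console.log(`${verdict}  ${address}`);
}
```

Running it shows the same permissive behavior described above: the dotless `admin@mailserver1` and the consecutive-dot local part are accepted, while the empty local part, the bracketed IPv6 literal, and the quoted space are rejected.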
@coreyshuman I agree with making validation simple and permissive as you stated. If we get too strict with the REGEX/standard, we may wrongly reject quite a few valid addresses (I remember a few horror projects I worked on in the EDI world). Emojis are now valid in email addresses. It's a strange world we live in. I also like the example @Karvel shows of adding real email addresses to the unit tests for each of the valid/invalid emails. As time goes on, this list will naturally expand as we find a user with some strange valid email address that we need to accommodate, and we can just add that to the unit test/fix. Most of the projects I have worked on in the past have stolen or used the default MDN example here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/email and called it a day.
This, of course, leaves in bugs (like https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489) but it does seem to be "good enough" for most. I feel like we could add unit tests to the examples https://github.com/Shift3/standards-and-practices/tree/main/best-practices/development-tools/validation#code-example but a better place would probably be in the boilerplate projects. |
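For comparison, here is a sketch using the pattern that, as far as I know, the HTML Standard gives for a "valid e-mail address" and that the MDN page documents. The pattern is reproduced from that specification rather than from this thread, so double-check it against the source before adopting it. `matchesHtmlEmail` is a hypothetical name.

```typescript
// Pattern for a "valid e-mail address" as I understand the HTML Standard to
// define it. ASCII only; each domain label is limited to 63 characters.
const HTML_EMAIL =
  /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

function matchesHtmlEmail(value: string): boolean {
  return HTML_EMAIL.test(value);
}

console.log(matchesHtmlEmail('admin@mailserver1'));
console.log(matchesHtmlEmail('user@example..com'));
```

Like the boilerplate expression, this one accepts dotless domains and consecutive dots in the local part, and it rejects non-ASCII characters, so it does not settle the UTF-8 question raised earlier.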
FWIW, I also agree with making validation simple and permissive, and with requiring confirmation emails. I think something like @Karvel's regex or the MDN one @michaelachrisco mentioned would probably work well. |
https://en.wikipedia.org/wiki/Email_address
There are some crazy email addresses allowed in RFC 5321 and RFC 5322. Here is the above article's set of rules, along with examples of valid and invalid addresses.
Local-part
The local-part of the email address may use any of these ASCII characters:
uppercase and lowercase Latin letters A to Z and a to z;
digits 0 to 9;
special characters !#$%&'*+-/=?^_`{|}~;
dot ., provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. [email protected] is not allowed but "John..Doe"@example.com is allowed);
"(),:;<>@[] characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash);
john.smith(comment)@example.com and (comment)[email protected] are both equivalent to [email protected].
Note that some mail servers wildcard local parts, typically the characters following a plus and less often the characters following a minus, so fred+bah@domain and fred+foo@domain might end up in the same inbox as fred+@domain or even as fred@domain. This can be useful for tagging emails for sorting, see below, and for spam control. Braces { and } are also used in that fashion, although less often.
In addition to the above ASCII characters, international characters above U+007F, encoded as UTF-8, are permitted by RFC 6531, though even mail systems that support SMTPUTF8 and 8BITMIME may restrict which characters to use when assigning local-parts.
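The dot rule for the local-part can be sketched as follows. This covers only the unquoted ASCII case from the list above (no quoted strings, no comments, no UTF-8), and the name `isValidUnquotedLocalPart` is hypothetical.

```typescript
// Characters allowed in an unquoted local-part per the list above:
// letters, digits, and the special characters !#$%&'*+-/=?^_`{|}~
// The dot is handled separately below.
const LOCAL_SEGMENT = /^[A-Za-z0-9!#$%&'*+\-\/=?^_`{|}~]+$/;

function isValidUnquotedLocalPart(localPart: string): boolean {
  if (localPart.length === 0) {
    return false;
  }
  // Splitting on "." produces an empty segment for a leading, trailing, or
  // doubled dot, and an empty segment never matches LOCAL_SEGMENT.
  return localPart.split('.').every((segment) => LOCAL_SEGMENT.test(segment));
}

console.log(isValidUnquotedLocalPart('john.doe'));
console.log(isValidUnquotedLocalPart('john..doe'));
```

Splitting on the dot keeps the character class simple and makes the three dot restrictions (not first, not last, not consecutive) fall out of a single check.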
Domain
The domain name part of an email address has to conform to strict guidelines: it must match the requirements for a hostname, a list of dot-separated DNS labels, each label being limited to a length of 63 characters and consisting of:
letters A to Z and a to z;
digits 0 to 9, provided that top-level domain names are not all-numeric;
hyphen -, provided that it is not the first or last character.
This rule is known as the LDH rule (letters, digits, hyphen). In addition, the domain may be an IP address literal, surrounded by square brackets [], such as jsmith@[192.168.2.1] or jsmith@[IPv6:2001:db8::1], although this is rarely seen except in email spam. Internationalized domain names (which are encoded to comply with the requirements for a hostname) allow for presentation of non-ASCII domains. In mail systems compliant with RFC 6531 and RFC 6532 an email address may be encoded as UTF-8, both a local-part as well as a domain name.
Comments are allowed in the domain as well as in the local-part; for example, john.smith@(comment)example.com and [email protected](comment) are equivalent to [email protected].
Examples
Valid email addresses
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected] (may go to [email protected] inbox depending on mail server)
[email protected] (one-letter local-part)
"very.(),:;<>[]".VERY."very@\ "very".unusual"@strange.example.com
[email protected]
admin@mailserver1 (local domain name with no TLD, although ICANN highly discourages dotless email addresses: https://www.icann.org/news/announcement-2013-08-30-en)
#!$%&'*+-/=?^_`{}|[email protected]
"()<>[]:,;@\"!#$%&'-/=?^_`{}| ~.a"@example.org
[email protected] (see the List of Internet top-level domains)
user@[2001:DB8::1]
" "@example.org (space between the quotes)
Invalid email addresses
Abc.example.com (no @ character)
A@b@[email protected] (only one @ is allowed outside quotation marks)
a"b(c)d,e:f;g<h>i[j\k][email protected] (none of the special characters in this local-part are allowed outside quotation marks)
just"not"[email protected] (quoted strings must be dot separated or the only element making up the local-part)
this is"not\[email protected] (spaces, quotes, and backslashes may only exist when within quoted strings and preceded by a backslash)
this\ still"not\[email protected] (even if escaped (preceded by a backslash), spaces, quotes, and backslashes must still be contained by quotes)
1234567890123456789012345678901234567890123456789012345678901234+x@example.com (local part is longer than 64 characters)
[email protected] (double dot before @)
[email protected] (double dot after @)
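The "double dot after @" case and the LDH rule from the Domain section can be sketched as a hostname check. This is a minimal sketch under stated assumptions: it handles plain ASCII hostnames only (no IP literals, comments, or internationalized names), and `isLdhHostname` is a hypothetical name.

```typescript
// One DNS label: 1 to 63 characters of letters, digits, and hyphens, where a
// hyphen is neither the first nor the last character.
const LDH_LABEL = /^[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?$/;

function isLdhHostname(domain: string): boolean {
  const labels = domain.split('.');
  const topLevel = labels[labels.length - 1];
  // Top-level domain names must not be all-numeric.
  if (/^[0-9]+$/.test(topLevel)) {
    return false;
  }
  // An empty label (from a leading, trailing, or doubled dot) fails LDH_LABEL.
  return labels.every((label) => LDH_LABEL.test(label));
}

console.log(isLdhHostname('strange.example.com'));
console.log(isLdhHostname('example..com'));
```

A check like this is stricter than the permissive client-side validation favored in the discussion, so it is probably better suited to unit-test helpers or server-side checks than to blocking form input.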