Create a standard email field verification Regular Expression (or find and verify one) #39

Open
coreyshuman opened this issue Jul 25, 2018 · 15 comments
Labels
Good First Contribution If you're new to contributing, these are PERFECT for you help wanted

Comments

@coreyshuman
Member

https://en.wikipedia.org/wiki/Email_address

There are some crazy email addresses allowed in RFC 5321 and RFC 5322. Here is the above article's set of rules, with examples of valid and invalid addresses.

Local-part

The local-part of the email address may use any of these ASCII characters:

  • uppercase and lowercase Latin letters A to Z and a to z;

  • digits 0 to 9;

  • special characters !#$%&'*+-/=?^_`{|}~;

  • dot ., provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. John..Doe@example.com is not allowed but "John..Doe"@example.com is allowed);

Note that some mail servers wildcard local parts, typically the characters following a plus and less often the characters following a minus, so fred+bah@domain and fred+foo@domain might end up in the same inbox as fred+@Domain or even as fred@domain. This can be useful for tagging emails for sorting, see below, and for spam control. Braces { and } are also used in that fashion, although less often.

  • space and "(),:;<>@[] characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash);
  • comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)john.smith@example.com are both equivalent to john.smith@example.com.

In addition to the above ASCII characters, international characters above U+007F, encoded as UTF-8, are permitted by RFC 6531, though even mail systems that support SMTPUTF8 and 8BITMIME may restrict which characters to use when assigning local-parts.

Domain

The domain name part of an email address has to conform to strict guidelines: it must match the requirements for a hostname, a list of dot-separated DNS labels, each label being limited to a length of 63 characters and consisting of:

  • uppercase and lowercase Latin letters A to Z and a to z;
  • digits 0 to 9, provided that top-level domain names are not all-numeric;
  • hyphen -, provided that it is not the first or last character.

This rule is known as the LDH rule (letters, digits, hyphen). In addition, the domain may be an IP address literal, surrounded by square brackets [], such as jsmith@[192.168.2.1] or jsmith@[IPv6:2001:db8::1], although this is rarely seen except in email spam. Internationalized domain names (which are encoded to comply with the requirements for a hostname) allow for presentation of non-ASCII domains. In mail systems compliant with RFC 6531 and RFC 6532 an email address may be encoded as UTF-8, both a local-part as well as a domain name.

Comments are allowed in the domain as well as in the local-part; for example, john.smith@(comment)example.com and john.smith@example.com(comment) are equivalent to john.smith@example.com.
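
The LDH rule above can be sketched as a small check. This is only an illustration of the rules as quoted from the article (labels of 1 to 63 characters, letters/digits/hyphen, no leading or trailing hyphen, top-level label not all-numeric). The function name `isLdhHostname` is made up for this sketch, and it deliberately ignores IP-address literals, comments, and internationalized domains.

```typescript
// Hypothetical helper, not part of any library mentioned in this thread.
// One LDH label: letters, digits, hyphen; hyphen not first or last.
const LDH_LABEL = /^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$/;

function isLdhHostname(domain: string): boolean {
  const labels = domain.split('.');
  // Every dot-separated label must be 1..63 characters and follow the LDH rule.
  const allLabelsOk = labels.every(
    (label) => label.length >= 1 && label.length <= 63 && LDH_LABEL.test(label)
  );
  // The top-level label must not consist of digits only.
  const tld = labels[labels.length - 1] ?? '';
  return allLabelsOk && !/^[0-9]+$/.test(tld);
}

console.log(isLdhHostname('example.com'));  // true
console.log(isLdhHostname('-example.com')); // false (leading hyphen)
console.log(isLdhHostname('example.123'));  // false (all-numeric top-level label)
```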

Examples

Valid email addresses

Invalid email addresses

@coreyshuman
Member Author

coreyshuman commented Jul 25, 2018

This is a promising solution if we could figure out a way to standardize it for our projects.
https://github.com/django/django/blob/master/django/core/validators.py#L164

Plus lots of examples and resources here:
http://emailregex.com/

@zbyte64
Contributor

zbyte64 commented Jul 26, 2018

Stackoverflow answer has a pretty awesome regexp pattern with a state machine diagram: https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression

But considering that different languages have different regexp syntaxes it might be better to designate a validation library for each language we use. For nodejs isemail looks pretty robust: https://github.com/hapijs/isemail/blob/master/test/tests.json

@coreyshuman
Member Author

coreyshuman commented Jul 27, 2018

I would like to humbly propose a solution which performs as well as the RFC5322 Official Standard (in my particular test set) but is much easier to understand and verify.

^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$
  • ^ - start of line
  • (?!\.) - don't allow the line to start with .
  • (?!.*?\.(\.|@)) - don't allow consecutive periods, ex. ([email protected]). Also don't allow a period at the end of the local part, ex ([email protected])
  • [\w\d.!#$%&'*+\-\/=?^_`{|}~]+ - match one or more letters, numbers, and these special characters: .!#$%&'*+-/=?^_`{|}~
  • @ - match the literal character @
  • [\w\d.-]+ - match one or more letter, digit, period (.), or hyphen (-)
  • \. - match a period (.)
  • [\w\d]{2,} - match 2 or more letters and numbers
  • $ - end of line

This regex can be tested here: https://regex101.com/r/A9jZZ4/4
This is not meant to be a perfect solution, but it should cover 99% of the email addresses Shift3 would expect to deal with, while catching some basic mistakes for user convenience. It does NOT handle extended ASCII / international characters, which RFC 6531 permits (RFC 5322 itself is ASCII-only).
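
For anyone who wants to poke at the pattern outside regex101, here is a minimal TypeScript sketch that wraps the regex from this comment and runs a few inputs through it. The constant name `proposedEmail` and the addresses at `example.com` are placeholders invented for the sketch; the other inputs are taken from the failing list in this thread.

```typescript
// The pattern proposed in this comment, as a JavaScript/TypeScript regex literal.
const proposedEmail =
  /^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$/;

// Placeholder address for illustration: expected to pass.
console.log(proposedEmail.test('jane.doe+tag@example.com')); // true

// Basic mistakes the lookaheads are meant to catch (placeholder addresses).
console.log(proposedEmail.test('.jane@example.com'));     // false (leading period)
console.log(proposedEmail.test('jane..doe@example.com')); // false (consecutive periods)
console.log(proposedEmail.test('jane.@example.com'));     // false (period before @)

// Inputs from the failing list in this thread.
console.log(proposedEmail.test('@test.com'));          // false (empty local part)
console.log(proposedEmail.test('admin@mailserver1'));  // false (no dot in domain)
console.log(proposedEmail.test('user@[2001:DB8::1]')); // false (IP literal unsupported)
```

One thing worth knowing when reading the breakdown above: in JavaScript `\w` already covers digits and the underscore, so `[\w\d]` also accepts `_`.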

The following email addresses expectedly pass this validation:

The following email addresses expectedly fail this validation:

[email protected]
@test.com
admin@mailserver1
"()<>[]:,;@\\\"!#$%&'-/=?^_`{}| ~.a"@example.org
user@[2001:DB8::1]
" "@example.org
[email protected]
"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com
Abc.example.com 
A@b@[email protected]
a"b(c)d,e:f;g<h>i[j\k][email protected]
just"not"[email protected]
this is"not\[email protected]
this\ still\"not\\[email protected]
[email protected]
[email protected]
[email protected].

I would appreciate if others would throw some other test cases against this regex and try to break it.

For reference, here is the RFC 5322 Standard I am comparing against.

^(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])$

Found at http://emailregex.com/

@ggoforth

[Image: screen shot 2018-07-27 at 3 13 12 pm]

But for reals, reading this issue is fantastic. I like the solution proposed at the end, and the amount of testing done against it. 👍

@zbyte64
Contributor

zbyte64 commented Aug 1, 2018

Running through the validation examples from isemail against

^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$

Most notable is the lack of UTF8 support and hyphen handling.

False positives:

[email protected]
a@abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefg.abcdefghijkl.hijk
[email protected]
[email protected]
[email protected]

False negatives:

ê[email protected]
ñoñó[email protected]
test@\uD800\uD800ñoñó郵件ñoñó郵件.ñoñó郵件ñoñoñó郵件ñoñó.郵件ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.ñoñó郵件ñoñó郵件.oñó郵件ñoñó郵件ñoñó郵件.商務
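
A short sketch of where both gaps come from, assuming the regex is run as a JavaScript/TypeScript literal. `\w` is ASCII-only there, so any non-ASCII letter in the local part is rejected, and the domain class `[\w\d.-]` accepts a hyphen anywhere in a label. The addresses below are placeholders made up for the sketch, and `permissive` is just one possible looser pattern, not something agreed on in this thread.

```typescript
const proposedEmail =
  /^(?!\.)(?!.*?\.(\.|@))[\w\d.!#$%&'*+\-\/=?^_`{|}~]+@[\w\d.-]+\.[\w\d]{2,}$/;

// \w matches only [A-Za-z0-9_], so accented letters never match.
console.log(/^\w+$/.test('ñoñó'));                   // false
console.log(proposedEmail.test('ñoñó@example.com')); // false (false negative)

// A hyphen at the start of a domain label is accepted (false positive).
console.log(proposedEmail.test('user@-example.com')); // true

// A looser alternative that makes no assumption about character sets:
// anything without whitespace or "@" on both sides of a single "@".
const permissive = /^[^\s@]+@[^\s@]+$/;
console.log(permissive.test('ñoñó@example.com'));    // true
```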

@coreyshuman
Member Author

Hyphen support I'm not as concerned with, in terms of hitting that balance of simplicity vs. complete accuracy to RFC 5322. A false positive is not a big deal, vs a false negative which would stop a valid user from accessing a service. With that in mind, the false negatives do seem like a problem. How common is UTF8 support with the major email providers? And what percentage of users would hit that use case? If we're talking < 1 %, I would rather just tell a user to use a different email address.

Let me know what you guys think.

@stephengtuggy
Contributor

Personally, I've known people from multiple people groups in various parts of the world, and as far as I recall, almost all of them used plain ASCII characters in their email addresses, web addresses, and IM'ing. So I don't think UTF-8 support is a big deal.

@zbyte64
Contributor

zbyte64 commented Dec 21, 2018

Frankly, I think it is more important to adopt a library for this concern than to bless a regex to be copied for all projects. Having a small clever regex pattern to stamp out is cool but it runs afoul of the DRY principle: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself. The argument for simplicity makes more sense if we're the ones maintaining the code, which for something as common as email validation, can we not?

Emoji is another reason to support UTF8: https://medium.com/@zackbloom/i-have-a-unicode-email-address-fbecd630ec12

If we're good at our jobs, our software should live to see a day when UTF-8 is more common in email addresses. Since we're here to address email validation, let's do it so we don't have to again.

@coreyshuman
Member Author

I don't disagree. My goal in this particular task was to discover a good front-end validation for email which gives a user immediate feedback to avoid typos, not necessarily to vet and validate all possible correct email addresses (we can leave that to the 3rd party email service).

The issue I see with using someone else's library for this is that we support and develop for many frontend frameworks (ionic, react, .net mvc, nativescript, xamarin..... ) One library would not work across all of those. A regex line would.

I imagine this being the beginning of a Shift3 internal library of common functions, which we could build out for all of our primary development. If these things were rolled into our own libraries, we'd be respecting DRY way more than we do nowadays (across projects, not necessarily per individual project).

@zbyte64 I'm definitely open for other suggestions as well. Let me know if there was a particular library you had in mind, or if there is something you're already doing on your projects that you really like.

@coreyshuman
Member Author

@michaelachrisco 3 years later and this is still a recurring issue in projects. Now that we have boilerplates to implement a standard, I think this is a good time to resurface this.

Now that we're supporting locale translation in the boilerplates, I think the UTF-8 argument has some more strength behind it.

I suspect for client-side validation we will still be served best by simple and permissive validation, as opposed to strict and technical. What do you think?

@coreyshuman
Member Author

Adding that I agree with Justin Schiff's assessment here:

@coreyshuman I would normally agree, but what I'm trying to make clear here is that a complicated email regex is not the preferred pattern for signup or email validation anyway. Attempting to send an email to the address specified is. Providing a permissive regex, or none at all (or just asking the user to enter their email twice), while sending a confirmation email is a 100% method to ensure you end up with a valid email address, and a 100% method to make sure you have no false negatives.

When you run into an "edge case" in your complicated regular expression you have to do the following -> find the fix, hope you don't introduce a regression in other, possibly untested, parts of the regex -> backport to all running applications using the old regex -> make sure all old versions of applications are updated -> etc. etc. etc.

I think that having an email regex may be valuable for things other than sign up fields, but I want it to be clear that in my opinion for sign in/sign up this is not the preferred pattern of validation, nor does it enhance security.

Originally posted by @DropsOfSerenity in #130 (comment)
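
The flow described in that quote could look roughly like the sketch below: a deliberately permissive shape check for instant feedback, with the confirmation email as the real validation. Every name here (`looksLikeEmail`, `requestSignup`, `SendMail`) is hypothetical and not taken from any Shift3 boilerplate; token generation and mail delivery are left to the caller.

```typescript
// Hypothetical mail-sending callback supplied by the application.
type SendMail = (to: string, body: string) => void;

// Permissive shape check: one "@" with something non-blank on each side.
function looksLikeEmail(input: string): boolean {
  return /^[^\s@]+@[^\s@]+$/.test(input);
}

// Returns false for obvious typos, otherwise sends the confirmation message.
// The address only counts as verified once the user acts on that message.
function requestSignup(email: string, token: string, send: SendMail): boolean {
  if (!looksLikeEmail(email)) {
    return false;
  }
  send(email, 'Confirm your address with this code: ' + token);
  return true;
}

// Usage with a stand-in sender that just records what would be sent.
const outbox: string[] = [];
const record: SendMail = (to) => { outbox.push(to); };
console.log(requestSignup('user@example.com', 'abc123', record)); // true
console.log(requestSignup('not-an-email', 'abc123', record));     // false
```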

@coreyshuman
Member Author

I noticed we do have an example documented in best practices here:
https://github.com/Shift3/standards-and-practices/tree/main/best-practices/development-tools/validation#code-example

For this to be a completed standard, we should include a definition for our goal on what should and shouldn't pass this validation. It should also include a set of unit tests to verify that goal.

@Karvel
Contributor

Karvel commented Apr 8, 2021

The current RegEx in the Angular boilerplate is the following:

/^[a-z0-9!#$%&'*+\/=?^_\`{|}~.-]+@[a-z0-9]([a-z0-9-])+(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$/i

For the test sets you provided above, all of the ones that should match do. Of the ones that should fail, the commented-out entries below pass instead.

        const failingValues: string[] = [
          // '[email protected]', //
          '@test.com',
          // 'admin@mailserver1', //
          `"()<>[]:,;@\\\"!#$%&'-/=?^_\`{}| ~.a"@example.org`,
          'user@[2001:DB8::1]',
          '" "@example.org',
          '[email protected]',
          '"very.(),:;<>[]".VERY."very@\\ "very".unusual"@strange.example.com',
          'Abc.example.com ',
          'A@b@[email protected]',
          'a"b(c)d,e:f;g<h>i[jk][email protected]',
          'just"not"[email protected]',
          'this is"[email protected]',
          'this still"not\\[email protected]',
          // '[email protected]', //
          '[email protected]',
          '[email protected].',
        ];
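
A quick way to reproduce this outside the Angular test bed is to run the same pattern directly. The sketch below copies the regex quoted above; the constant name `boilerplateEmail` and the address `user@x.org` are made up for illustration. The last line is an extra observation rather than something from the test sets: because the first domain label needs at least two characters, a single-character first label appears to be rejected.

```typescript
// The regex quoted above from the Angular boilerplate (case-insensitive).
const boilerplateEmail =
  /^[a-z0-9!#$%&'*+\/=?^_\`{|}~.-]+@[a-z0-9]([a-z0-9-])+(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$/i;

// Expected to fail, and does.
console.log(boilerplateEmail.test('@test.com'));          // false
console.log(boilerplateEmail.test('user@[2001:DB8::1]')); // false

// Expected to fail in the earlier test set, but passes: no dot is required.
console.log(boilerplateEmail.test('admin@mailserver1'));  // true

// Placeholder address: the first domain label must be 2+ characters long.
console.log(boilerplateEmail.test('user@x.org'));         // false
```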

I do have unit tests for the validator using the regular expression, but I can add the test sets as follows:

      describe('[Unit] EmailValidation validEmail() Required', () => {
        const urlValidator = EmailValidation.validEmail(true);
        const emailControl = new FormControl('');
        const matchingValues: string[] = [
          '[email protected]',
          '[email protected]',
          '[email protected]',
          '[email protected]',
          '[email protected]',
          '[email protected]',
          '[email protected]',
          '[email protected]',
          '[email protected]',
          '[email protected]',
          '1234567890123456789012345678901234567890123456789012345678901234+x@example.com',
        ];

        const failingValues: string[] = [
          '@test.com',
         `"()<>[]:,;@\\\"!#$%&'-/=?^_\`{}| ~.a"@example.org`,
          'user@[2001:DB8::1]',
          '" "@example.org',
          '[email protected]',
          '"very.(),:;<>[]".VERY."very@\\ "very".unusual"@strange.example.com',
          'Abc.example.com ',
          'A@b@[email protected]',
          'a"b(c)d,e:f;g<h>i[jk][email protected]',
          'just"not"[email protected]',
          'this is"[email protected]',
          'this still"not\\[email protected]',
          '[email protected]',
          '[email protected].',
        ];

        it(`should return null if value matches a list of values that should work`, () => {
          matchingValues.forEach((value) => {
            emailControl.setValue(value);
            expect(urlValidator(emailControl)).toEqual(null);
          });
        });

        it(`should return { invalidEmail: 'Please enter a valid email.' } if value matches a list of values that should fail`, () => {
          failingValues.forEach((value) => {
            emailControl.setValue(value);
            const expectedValue = {
              invalidEmail: 'Please enter a valid email.',
            };
            expect(urlValidator(emailControl)).toEqual(expectedValue);
          });
        });
      });

We can decide whether to keep the current RegEx or change it, and whether to add the above test values.

Either way, the boilerplate also follows the recommendations that @DropsOfSerenity posted above: it requires confirming the email address and sends an activation email to that account.

@michaelachrisco
Contributor

michaelachrisco commented Apr 8, 2021

@michaelachrisco 3 years later and this is still a recurring issue in projects. Now that we have boilerplates to implement a standard, I think this is a good time to resurface this.

Now that we're supporting locale translation in the boilerplates, I think the UTF-8 argument has some more strength behind it.

I suspect for client-side validation we will still be served best by simple and permissive validation, as opposed to strict and technical. What do you think?

@coreyshuman I agree with making validation simple and permissive as you stated. If we get too strict with the regex/standard, we may reject quite a few valid addresses (I remember a few horror projects I worked on in the EDI world). Emojis are now valid in email addresses. It's a strange world we live in.

I also like the example @Karvel shows by adding real email addresses to the unit tests for each of the valid/invalid emails. As time goes on, this list will naturally expand as we find a user with some strange valid email address that we will need to accommodate and we can just add that to the unit test/fix.

Most of the projects I have worked on in the past have stolen or used the default MDN example here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/email and called it a day.

/^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

This, of course, leaves in bugs (like https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489) but it does seem to be "good enough" for most.
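
To give a feel for what "good enough" means here, this sketch runs the MDN pattern against a few inputs. The constant name `mdnEmail` and the addresses at `example.com` and `x.org` are placeholders for illustration; `admin@mailserver1` and `user@[2001:DB8::1]` come from the earlier test lists in this thread.

```typescript
// The MDN example pattern quoted above.
const mdnEmail =
  /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Permissive side: dotless domains and consecutive dots in the local part pass.
console.log(mdnEmail.test('admin@mailserver1'));     // true
console.log(mdnEmail.test('jane..doe@example.com')); // true
console.log(mdnEmail.test('user@x.org'));            // true

// Still rejected: hyphen at the start of a label, IP literals, empty local part.
console.log(mdnEmail.test('user@-example.com'));     // false
console.log(mdnEmail.test('user@[2001:DB8::1]'));    // false
console.log(mdnEmail.test('@test.com'));             // false
```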

I feel like we could add unit tests to the examples https://github.com/Shift3/standards-and-practices/tree/main/best-practices/development-tools/validation#code-example but a better place would probably be in the boilerplate projects.

@stephengtuggy
Contributor

FWIW, I also agree with making validation simple and permissive, and with requiring confirmation emails. I think something like @Karvel's regex or the MDN one @michaelachrisco mentioned would probably work well.

6 participants