Code Butchering: Regular Expressions: learning with an email regex

Tuesday, February 26, 2008

Regular Expressions: learning with an email regex

In computing, regular expressions provide a concise and flexible means for identifying text of interest, such as particular characters, words, or patterns of characters. Regular expressions are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Wikipedia

So a regular expression is a way to describe a specific pattern of characters, useful for searching inside a long text or for validating user input.
The "simplest" (are we sure??) example is checking an email address entered by a user in, e.g., a registration form.
Email addresses are composed in this way:

alphanumeric characters, optionally mixed with . and/or - and/or _, which cannot appear at the start or at the end

the @ character

alphanumeric characters, optionally mixed with . and/or - and/or _, which cannot appear at the start or at the end

a . followed by 2 to 4 letters

A valid representation of this structure as a regex looks like this:

^[a-z0-9]+([\._-]*[a-z0-9]+)*@[a-z0-9]+([\._-]*[a-z0-9]+)*\.[a-z]{2,4}$
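To see the pattern at work, here is a minimal sketch in Python using the standard re module. The helper name is_valid_email and the sample addresses are made up for illustration; only the pattern itself comes from this post.

```python
import re

# The pattern from the post, copied verbatim.
EMAIL_RE = re.compile(
    r"^[a-z0-9]+([\._-]*[a-z0-9]+)*@[a-z0-9]+([\._-]*[a-z0-9]+)*\.[a-z]{2,4}$"
)

def is_valid_email(address: str) -> bool:
    """Return True if the address matches the pattern above (hypothetical helper)."""
    return EMAIL_RE.match(address) is not None

# Sample addresses invented for this example.
for candidate in [
    "pippo82@x.us",               # simple address
    "john.doe@mail.example.com",  # dots inside both parts
    ".john@example.com",          # starts with a dot
    "john@example",               # no final ".xx" part
]:
    print(candidate, is_valid_email(candidate))
```

The first two addresses are accepted and the last two are rejected.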

I'm not blind, this is surely not pretty, but in a few lines I'll explain what this mess means.

First of all, the ^ and $ characters stand for the start and the end of the searched sequence. They are mandatory here, because we want the whole input to match, not just a piece of it: the regex ^Hello finds all strings that begin with Hello, while end$ finds those which end with end; middle matches sequences with one or more occurrences of the word middle anywhere; and finally ^just this$ matches only the exact string just this.
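The four anchor examples can be tried out with a short Python sketch. The helper found and the test strings are assumptions made for this example; re.search looks for the pattern anywhere in the text unless an anchor pins it down.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if the pattern occurs somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found(r"^Hello", "Hello world"))         # True: begins with Hello
print(found(r"^Hello", "Say Hello"))           # False: Hello is not at the start
print(found(r"end$", "the end"))               # True: ends with end
print(found(r"end$", "endless"))               # False: end is not at the end
print(found(r"middle", "in the middle of it")) # True: middle appears somewhere
print(found(r"^just this$", "just this"))      # True: the whole string matches
print(found(r"^just this$", "not just this"))  # False: extra text before it
```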

Square brackets [ ], used as a pair, stand for a set of characters. For example, the regex [12345] matches every sequence that contains at least one digit between 1 and 5, so 1hello and abcde51z2 are matched but 6a is not.
Of course, by using a backslash \ you can use any special character literally in your sequences.
For numbers and letters, you can use - to obtain a range of characters ([a-z] stands for the whole lowercase alphabet).
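A small Python sketch of sets, escaping and ranges. The variable names and the strings a[b and aB3c are made up for illustration; the three strings tested against [12345] are the ones mentioned above.

```python
import re

# A set of characters: any single digit from 1 to 5.
digit_1_to_5 = re.compile(r"[12345]")
for text in ["1hello", "abcde51z2", "6a"]:
    print(text, digit_1_to_5.search(text) is not None)

# A backslash turns a special character into a literal one.
has_open_bracket = re.search(r"\[", "a[b") is not None
print(has_open_bracket)   # True: the literal [ is found

# A range inside the brackets: every lowercase letter.
lowercase_letters = re.findall(r"[a-z]", "aB3c")
print(lowercase_letters)  # ['a', 'c']
```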

The + after an element means that it must appear at least once, while the * states that it can appear 0 or more times; moreover, the ? means that the preceding element is optional, so it can appear 0 or 1 times.
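The three quantifiers can be compared side by side with a sketch like the following. The helper whole and the sample words are assumptions for this example; re.fullmatch is used so that the entire string has to fit the pattern.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """Return True if the entire text fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# + : the preceding o must appear at least once
print(whole(r"go+gle", "gogle"))      # True
print(whole(r"go+gle", "goooogle"))   # True
print(whole(r"go+gle", "ggle"))       # False: no o at all

# * : the preceding o may appear 0 or more times
print(whole(r"go*gle", "ggle"))       # True

# ? : the preceding u is optional (0 or 1 times)
print(whole(r"colou?r", "color"))     # True
print(whole(r"colou?r", "colour"))    # True
print(whole(r"colou?r", "colouur"))   # False: two u are too many
```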

In the above regex I have written a backslashed dot, because the . is a special character, meaning any character except for the newline character (\n, \r\n or \r depending on your operating system). In fact the regex ^.+$ matches every non-empty single-line string, but not an empty string or one made only of newlines.
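The difference between a plain dot and a backslashed dot can be checked with a short Python sketch. The names and test strings are made up for this example, and the behaviour shown is that of Python's re module with default flags; other regex engines may treat newlines slightly differently.

```python
import re

any_char = re.compile(r"a.c")      # . stands for any character except a newline
literal_dot = re.compile(r"a\.c")  # \. stands for a real dot only

print(any_char.fullmatch("abc") is not None)     # True
print(any_char.fullmatch("a.c") is not None)     # True
print(literal_dot.fullmatch("abc") is not None)  # False
print(literal_dot.fullmatch("a.c") is not None)  # True

# The dot skips the newline character.
dot_chars = re.findall(r".", "a\nb")
print(dot_chars)                                 # ['a', 'b']

def is_non_empty_line(text: str) -> bool:
    """Return True if ^.+$ matches the text (hypothetical helper)."""
    return re.search(r"^.+$", text) is not None

print(is_non_empty_line("hello"))  # True
print(is_non_empty_line(""))       # False: empty string
print(is_non_empty_line("\n"))     # False: only a newline
```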

Finally, we can group several sequences together with round brackets ( ), so that a quantifier placed after the group applies to the whole group.
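Grouping is easier to understand next to the ungrouped version. In this Python sketch the variable names and sample strings are assumptions; the last pattern is the grouped piece used in the email regex above.

```python
import re

repeated_pair = re.compile(r"(ab)+")  # the whole pair ab, one or more times
repeated_b = re.compile(r"ab+")       # an a followed by one or more b

print(repeated_pair.fullmatch("ababab") is not None)  # True
print(repeated_b.fullmatch("ababab") is not None)     # False
print(repeated_pair.fullmatch("abbb") is not None)    # False
print(repeated_b.fullmatch("abbb") is not None)       # True

# The grouped piece of the email regex: separators followed by
# alphanumeric characters, repeated 0 or more times.
tail = re.compile(r"([\._-]*[a-z0-9]+)*")
print(tail.fullmatch(".doe_smith") is not None)  # True
print(tail.fullmatch("") is not None)            # True: the group may be absent
print(tail.fullmatch(".doe-") is not None)       # False: ends with a separator
```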

And now a brief explanation of the complex regex for an email address:

^: beginning of the sequence
[a-z0-9]+: the first part begins with one or more alphanumeric characters
([\._-]*[a-z0-9]+)*: the first part can contain dots, underscores and dashes, but they must be followed by alphanumeric characters (so it cannot end with a non-alphanumeric character); moreover, this kind of sequence may be absent altogether (*), so the previous part alone can recognize a simple email address without non-alphanumeric characters (such as pippo82@x.us)
@: simply the @ character
[a-z0-9]+([\._-]*[a-z0-9]+)*: the same as above
\.[a-z]{2,4}: a dot followed by a sequence of 2 to 4 lowercase letters
$: end of the sequence
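The breakdown above can be mirrored in Python by assembling the pattern from its pieces. This is only a sketch: the constant names and the sample addresses are made up for illustration, and the assembled string is identical to the regex shown earlier in the post.

```python
import re

ALNUM = r"[a-z0-9]+"                # one or more alphanumeric characters
SEPARATED = r"([\._-]*[a-z0-9]+)*"  # separators followed by alphanumerics, 0+ times
NAME = ALNUM + SEPARATED            # used both before and after the @
TLD = r"\.[a-z]{2,4}"               # a dot and 2 to 4 lowercase letters

PATTERN = "^" + NAME + "@" + NAME + TLD + "$"
EMAIL = re.compile(PATTERN)

def accepts(address: str) -> bool:
    """Return True if the assembled pattern matches (hypothetical helper)."""
    return EMAIL.match(address) is not None

print(accepts("pippo82@x.us"))       # True
print(accepts("a@b.info"))           # True: 4 letters after the last dot
print(accepts("a@b.c"))              # False: only 1 letter after the dot
print(accepts("john-@example.com"))  # False: first part ends with a dash
print(accepts("Pippo82@x.us"))       # False: [a-z] covers lowercase only
```

Since the pattern only lists lowercase letters, an address typed with capitals is rejected; lowercasing the input first, or compiling with the re.IGNORECASE flag, is one way around that.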

2 comments:

Unknown said...: Good article. And if you're looking for practical examples of how to use RegEx in Delphi, Visual Basic, VB.NET and Classic ASP, you can find them at the link below:

Replacing and filtering text with regular expression using Delphi, Visual Basic and ASP

Happy Coding :); February 27, 2008 at 7:58 PM
Anonymous said...: I need to learn more about invoking Regexes in C#. I did one program with that but I'm not an expert.; March 8, 2008 at 12:57 AM