Tuesday, February 26, 2008

Regular Expressions: learning with an email regex



In computing, regular expressions provide a concise and flexible means for identifying text of interest, such as particular characters, words, or patterns of characters. Regular expressions are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Wikipedia


So a regular expression is a way to describe a specific sequence of characters, useful for searching inside a long text or for validating user input.
The "simplest" (are we sure?) example is checking an email address that a user typed into, say, a registration form.
Email addresses are composed in this way:
  • alphanumeric characters, optionally mixed with ., - or _ (but not at the start or at the end)

  • @

  • alphanumeric characters, optionally mixed with ., - or _ (but not at the start or at the end)

  • a . followed by 2 to 4 letters


A valid representation of this as a regex looks like this:
^[a-z0-9]+([\._-]*[a-z0-9]+)*@[a-z0-9]+([\._-]*[a-z0-9]+)*\.[a-z]{2,4}$

I'm not blind: this is surely not pretty, but in a few lines I'll explain what this mess means.

First of all, the ^ and $ characters stand for the start and the end of the searched sequence. They matter here, because without them the pattern could match just a part of the input. In fact the regex ^Hello finds all strings that begin with Hello, while end$ finds those that end with end; middle matches any sequence with one or more occurrences of the word middle; and finally ^just this$ matches only the exact string just this.
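To see the anchors at work, here is a minimal sketch in Python, assuming the standard re module (the helper name found is just something I made up for the example). It tries the four patterns from the paragraph above against a few sample strings.

```python
import re

# The four patterns from the paragraph above.
STARTS_WITH_HELLO = r"^Hello"
ENDS_WITH_END = r"end$"
CONTAINS_MIDDLE = r"middle"
EXACTLY_JUST_THIS = r"^just this$"

def found(pattern, text):
    """Return True if pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found(STARTS_WITH_HELLO, "Hello world"))         # True
print(found(STARTS_WITH_HELLO, "Say Hello"))           # False
print(found(ENDS_WITH_END, "the end"))                 # True
print(found(ENDS_WITH_END, "endless"))                 # False
print(found(CONTAINS_MIDDLE, "in the middle of it"))   # True
print(found(EXACTLY_JUST_THIS, "just this"))           # True
print(found(EXACTLY_JUST_THIS, "not just this"))       # False
```

Note that re.search looks for the pattern anywhere in the text, so it is the ^ and $ in the pattern that pin the match to the start or the end.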

Square brackets [ ], used as a pair, stand for a set of characters. For example, the regex [12345] matches every sequence that contains at least one digit between 1 and 5, so 1hello and abcde51z2 match, but 6a does not.
Of course, by putting a backslash \ in front of a special character you can use it literally in your sequences.
For numbers and letters, you can use - to obtain a range of characters ([a-z] stands for the whole lowercase alphabet).
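The same three ideas (a set, an escaped character, a range) can be tried out quickly. This is only a sketch with Python's re module; the variable names and the found helper are made up for the example.

```python
import re

DIGIT_1_TO_5 = r"[12345]"   # a set; [1-5] is the equivalent range
LITERAL_DOT = r"\."         # the backslash makes the dot a plain "."
ONLY_LOWERCASE = r"^[a-z]+$"  # a range, repeated over the whole string

def found(pattern, text):
    """Return True if pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(found(DIGIT_1_TO_5, "1hello"))      # True
print(found(DIGIT_1_TO_5, "abcde51z2"))   # True
print(found(DIGIT_1_TO_5, "6a"))          # False
print(found(LITERAL_DOT, "a.b"))          # True
print(found(LITERAL_DOT, "ab"))           # False
print(found(ONLY_LOWERCASE, "hello"))     # True
print(found(ONLY_LOWERCASE, "Hello"))     # False
```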

The + after an element means that the element must be repeated at least once, while * states that it can be present 0 or more times; moreover, ? means that the preceding element is optional, so it can appear 0 or 1 times.
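A small sketch of the three quantifiers, again assuming Python's re module. The patterns (an a, some number of b, then a c) are invented for the example, and whole is a hypothetical helper that requires the entire string to match.

```python
import re

def whole(pattern, text):
    """Return True if the entire text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# + : at least one b
print(whole(r"ab+c", "abc"))    # True
print(whole(r"ab+c", "abbbc"))  # True
print(whole(r"ab+c", "ac"))     # False

# * : zero or more b
print(whole(r"ab*c", "ac"))     # True
print(whole(r"ab*c", "abbc"))   # True

# ? : zero or one b
print(whole(r"ab?c", "ac"))     # True
print(whole(r"ab?c", "abc"))    # True
print(whole(r"ab?c", "abbc"))   # False
```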

In the regex above I have written a backslashed dot, because . is a special character meaning any character except the newline (the line ending is \n, \r\n or \r, depending on your operating system). In fact the regex ^.+$ matches every non-empty string that contains no newline; it rejects the empty string and strings made only of newlines.
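Here is a short sketch of the difference between the plain dot and the escaped one, assuming Python's re module, where . matches anything except \n unless the re.DOTALL flag is given. Other engines may treat line endings a little differently.

```python
import re

# Unescaped dot: any single character (except a newline).
print(re.fullmatch(r"a.b", "axb") is not None)    # True
print(re.fullmatch(r"a.b", "a.b") is not None)    # True

# Escaped dot: only a literal ".".
print(re.fullmatch(r"a\.b", "a.b") is not None)   # True
print(re.fullmatch(r"a\.b", "axb") is not None)   # False

# ^.+$ needs at least one non-newline character.
print(re.search(r"^.+$", "anything at all") is not None)  # True
print(re.search(r"^.+$", "") is not None)                 # False
print(re.search(r"^.+$", "\n\n") is not None)             # False
```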

Finally, we can group several sequences together with round brackets ( ).
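Grouping changes what a quantifier applies to. The sketch below, assuming Python's re module and invented patterns, compares ab+ (only the b repeats) with (ab)+ (the whole pair repeats). As a bonus, Python also lets you read back what each group matched.

```python
import re

def whole(pattern, text):
    """Return True if the entire text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Without a group, + applies only to the b.
print(whole(r"ab+", "abbb"))    # True
print(whole(r"ab+", "abab"))    # False

# With a group, + applies to the whole "ab" sequence.
print(whole(r"(ab)+", "abab"))  # True
print(whole(r"(ab)+", "abbb"))  # False

# Groups also capture the text they matched.
m = re.fullmatch(r"([a-z]+)-([0-9]+)", "item-42")
print(m.groups())               # ('item', '42')
```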

And now a brief explanation of the complex regex for an email address:

^

beginning of the sequence

[a-z0-9]+

the first part begins with an alphanumeric character (one or more)

([\._-]*[a-z0-9]+)*

the first part can contain dots, underscores and dashes, but they must be followed by alphanumeric characters (so it cannot end with a non-alphanumeric character); moreover, this kind of sequence may be absent altogether (*), so the previous part alone can recognize a simple email address without non-alphanumeric characters (such as pippo82@x.us)

@

simply the @ character

[a-z0-9]+([\._-]*[a-z0-9]+)*

the same as above

\.[a-z]{2,4}

a dot followed by a sequence of 2 to 4 lowercase letters

$

end of the sequence
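Putting the pieces together, here is a minimal sketch of how the whole regex could be used in Python. I am assuming the standard re module with the re.VERBOSE flag, which lets the pattern be written over several commented lines; the function name is_valid_email is made up for the example.

```python
import re

# The email regex from this post, split into the same pieces as above.
EMAIL_RE = re.compile(
    r"""
    ^                        # beginning of the sequence
    [a-z0-9]+                # one or more alphanumeric characters
    ([\._-]*[a-z0-9]+)*      # separators, always followed by alphanumerics
    @                        # simply the @ character
    [a-z0-9]+                # the domain starts with alphanumerics
    ([\._-]*[a-z0-9]+)*      # same structure as the first part
    \.[a-z]{2,4}             # a dot followed by 2 to 4 lowercase letters
    $                        # end of the sequence
    """,
    re.VERBOSE,
)

def is_valid_email(address):
    """Return True if address matches the regex (hypothetical helper)."""
    return EMAIL_RE.search(address) is not None

print(is_valid_email("pippo82@x.us"))                 # True
print(is_valid_email("john.doe@example.com"))         # True
print(is_valid_email("john_doe-1@mail.example.org"))  # True
print(is_valid_email(".john@example.com"))            # False
print(is_valid_email("john.@example.com"))            # False
print(is_valid_email("john@example"))                 # False
print(is_valid_email("John@Example.com"))             # False
```

The last address fails only because the pattern lists lowercase letters; adding re.IGNORECASE to the flags would accept it. Also keep in mind that the separator inside the group is optional (*), so on long inputs that do not match the engine may backtrack a lot. For short form fields this should not be a problem.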


2 comments:

Unknown said...

Good article. And if you're looking for practical examples of how to use RegEx in Delphi, Visual Basic, VB.NET and Classic ASP, you can find them at the link below:

Replacing and filtering text with regular expression using Delphi, Visual Basic and ASP

Happy Coding :)

Anonymous said...

I need to learn more about invoking Regexes in C#. I did one program with that but I'm not an expert.