In computing, regular expressions provide a concise and flexible means for identifying text of interest, such as particular characters, words, or patterns of characters. Regular expressions are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
So a regular expression is a way to describe a specific sequence of characters, which is useful for searching inside a long text or for validating user input.
A "simple" (are we sure??) example is processing an email address that a user typed into, e.g., a registration form.
Email addresses are composed in this way:
- alphanumeric characters, mixed with ., - and/or _ (but not at the start or the end)
- @
- alphanumeric characters, mixed with ., - and/or _ (but not at the start or the end)
- . followed by 2 to 4 letters
A valid regex for this format looks like this:
^[a-z0-9]+([\._-]*[a-z0-9]+)*@[a-z0-9]+([\._-]*[a-z0-9]+)*\.[a-z]{2,4}$
I'm not blind, this is surely not pretty, but in a few lines I'll explain what this mess means.
First of all, the ^ and $ characters stand for the start and the end of the searched sequence. They are needed here because the whole input, and not just a piece of it, has to be an email address. In fact the regex ^Hello finds all strings that begin with Hello, while end$ finds those that end with end; middle, with no anchors, finds strings that contain the word middle anywhere, one or more times; and finally ^just this$ matches only the exact string just this.
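To see the anchors at work, here is a minimal sketch in Python using the standard re module. The variable names and the sample strings are my own, chosen only for illustration.

```python
import re

# re.search scans the whole string, so only the anchors pin the position.
starts_with_hello = re.compile(r"^Hello")
ends_with_end = re.compile(r"end$")
contains_middle = re.compile(r"middle")
exactly_just_this = re.compile(r"^just this$")

print(bool(starts_with_hello.search("Hello world")))      # True
print(bool(starts_with_hello.search("Say Hello")))        # False
print(bool(ends_with_end.search("the end")))              # True
print(bool(contains_middle.search("in the middle of")))   # True
print(bool(exactly_just_this.search("just this")))        # True
print(bool(exactly_just_this.search("not just this")))    # False
```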
Square brackets [ ], used as a pair, stand for a set of characters: the regex [12345] matches every string that contains at least one digit between 1 and 5, so 1hello and abcde51z2 are matched but 6a is not.
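The three sample strings can be checked with a short Python sketch (the name one_to_five is just illustrative):

```python
import re

# A character set matches exactly one character out of those listed.
one_to_five = re.compile(r"[12345]")

for text in ("1hello", "abcde51z2", "6a"):
    # search succeeds as soon as any single character is in the set
    print(text, bool(one_to_five.search(text)))

# findall shows which characters were picked up
print(one_to_five.findall("abcde51z2"))  # ['5', '1', '2']
```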
Of course, by putting a backslash \ in front of a special character you can use it as a literal character in your sequences.
For numbers and letters, you can use - inside the brackets to get a range of characters ([a-z] stands for the whole lowercase alphabet).
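A small Python sketch of escaping and ranges, with sample strings made up for the occasion:

```python
import re

# An escaped dot matches only a literal dot.
literal_dot = re.compile(r"\.")
print(literal_dot.findall("a.b.c"))    # ['.', '.']

# [a-z] is a range: any single lowercase letter.
lowercase = re.compile(r"[a-z]")
print(lowercase.findall("Ab3cD"))      # ['b', 'c']

# [0-9] is the same idea for digits.
digit = re.compile(r"[0-9]")
print(digit.findall("room 42b"))       # ['4', '2']
```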
The + after an element means that the element must appear at least once, while * states that it can appear 0 or more times; the ? means that the preceding element is optional, so it can appear 0 or 1 times.
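The three quantifiers can be compared side by side with a sketch like the following. The helper matches is hypothetical (my own name); it uses re.fullmatch so that the whole string has to fit the pattern.

```python
import re

def matches(pattern, text):
    """Return True if the entire text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# + : the b must appear at least once
print(matches(r"ab+", "a"))     # False
print(matches(r"ab+", "abbb"))  # True

# * : the b may appear zero or more times
print(matches(r"ab*", "a"))     # True
print(matches(r"ab*", "abbb"))  # True

# ? : the b is optional, zero or one time
print(matches(r"ab?", "a"))     # True
print(matches(r"ab?", "ab"))    # True
print(matches(r"ab?", "abb"))   # False
```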
In the regex above I wrote a backslashed dot, because . is a special character that means any character except a newline (the line break itself is written \n, \r\n or \r depending on your operating system). In fact, in the default mode of most engines, the regex ^.+$ matches any non-empty string made of a single line; it does not match the empty string or a string containing only newlines.
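Here is how the dot behaves in Python's re module with its default flags; other engines may differ in the details, so take this as a sketch and not as a universal rule.

```python
import re

any_line = re.compile(r"^.+$")

print(bool(any_line.search("hello world")))  # True
print(bool(any_line.search("")))             # False: + needs at least one character
print(bool(any_line.search("\n")))           # False: . does not match a newline

# Unescaped dot versus escaped dot
print(bool(re.search(r"a.c", "abc")))    # True: . stands for the b
print(bool(re.search(r"a.c", "a\nc")))   # False: . refuses the newline
print(bool(re.search(r"a\.c", "abc")))   # False: \. wants a real dot
print(bool(re.search(r"a\.c", "a.c")))   # True
```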
Finally, we can group several sequences together with round brackets ( ), so that a following +, * or ? applies to the whole group.
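A quick Python sketch of what the round brackets change; the patterns and strings are made up for illustration.

```python
import re

# Without brackets, + applies only to the character right before it (the b).
print(re.fullmatch(r"ab+", "abab") is not None)    # False
print(re.fullmatch(r"ab+", "abbb") is not None)    # True

# With brackets, + applies to the whole group ab.
print(re.fullmatch(r"(ab)+", "abab") is not None)  # True
print(re.fullmatch(r"(ab)+", "abbb") is not None)  # False

# A group followed by * may also be missing entirely.
print(re.fullmatch(r"x(-y)*", "x") is not None)      # True
print(re.fullmatch(r"x(-y)*", "x-y-y") is not None)  # True
```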
And now a brief explanation of the complex regex for an email address:
^
- beginning of the sequence
[a-z0-9]+
- the first part begins with an alphanumeric character (one or more)
([\._-]*[a-z0-9]+)*
- the first part can contain dots, underscores and dashes, but they must be followed by alphanumeric characters (so it cannot end with a non-alphanumeric character); moreover this kind of sequence may be absent altogether (*), so the previous piece alone can recognize a simple email address with no non-alphanumeric characters (such as pippo82@x.us)
@
- simply the @ character
[a-z0-9]+([\._-]*[a-z0-9]+)*
- the same as above, this time for the part after the @
\.[a-z]{2,4}
- a dot followed by a simple sequence of 2 to 4 lowercase letters
$
- end of the sequence
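Putting all the pieces together, the regex can be tried out in Python with a sketch like this one. The names EMAIL_RE and is_valid_email are hypothetical (they are mine, not part of any library), and the sample addresses other than pippo82@x.us are made up.

```python
import re

# The email regex from this article, unchanged.
EMAIL_RE = re.compile(
    r"^[a-z0-9]+([\._-]*[a-z0-9]+)*@[a-z0-9]+([\._-]*[a-z0-9]+)*\.[a-z]{2,4}$"
)

def is_valid_email(address):
    """Return True if the whole address fits the pattern (illustrative helper)."""
    return EMAIL_RE.fullmatch(address) is not None

print(is_valid_email("pippo82@x.us"))          # True
print(is_valid_email("john.doe@example.com"))  # True
print(is_valid_email(".john@example.com"))     # False: starts with a dot
print(is_valid_email("john.@example.com"))     # False: dot right before the @
print(is_valid_email("john@example.museum"))   # False: more than 4 letters at the end
print(is_valid_email("John@example.com"))      # False: the pattern is lowercase only
```

Note that the pattern accepts lowercase letters only, so you may want to lowercase the input before checking it. Also, since a + sits inside a group followed by *, a backtracking engine can get very slow on a long input that does not match, so it is probably wise to limit the input length first.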
2 comments:
Good article. And if you're looking for practical examples of how to use RegEx in Delphi, Visual Basic, VB.NET and Classic ASP, you can find them at the link below:
Replacing and filtering text with regular expression using Delphi, Visual Basic and ASP
Happy Coding :)
I need to learn more about invoking Regexes in C#. I did one program with that but I'm not an expert.