In computing, regular expressions provide a concise and flexible means for identifying text of interest, such as particular characters, words, or patterns of characters. Regular expressions are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
Wikipedia
So a regular expression is a way to describe a specific pattern of characters, useful for searching inside a long text or for validating user input.
The "simplest" (are we sure??) example is processing an email address that a user typed into, e.g., a registration form.
Email addresses are composed this way:
- alphanumeric characters, optionally mixed with ., - and/or _ (but not at the start or the end)
- @
- alphanumeric characters, optionally mixed with ., - and/or _ (but not at the start or the end)
- . followed by 2 to 4 letters
A valid representation of these rules as a regex looks like this:
^[a-z0-9]+([\._-]*[a-z0-9]+)*@[a-z0-9]+([\._-]*[a-z0-9]+)*\.[a-z]{2,4}$
I'm not blind, this is surely not pretty, but in a few lines I'll explain what this mess means.
First of all, the ^ and $ characters stand for the start and the end of the searched sequence. They are optional, but they change what is matched: the regex ^Hello finds all strings that begin with Hello, while end$ finds those that end with end; middle, with no anchors at all, matches any string that contains the word middle at least once; and finally ^just this$ matches only the exact string just this.
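To see the anchors at work, here is a minimal sketch in Python, assuming its standard re module (the article itself is not tied to any language). The variable names are mine, chosen only for illustration.

```python
import re

# re.search scans the whole string, so only the anchors
# decide where a match is allowed to sit.
starts_with_hello = re.compile(r"^Hello")
ends_with_end = re.compile(r"end$")
contains_middle = re.compile(r"middle")
exactly_just_this = re.compile(r"^just this$")

print(bool(starts_with_hello.search("Hello world")))     # True
print(bool(starts_with_hello.search("Say Hello")))       # False
print(bool(ends_with_end.search("the end")))             # True
print(bool(ends_with_end.search("endless")))             # False
print(bool(contains_middle.search("in the middle of")))  # True
print(bool(exactly_just_this.search("just this")))       # True
print(bool(exactly_just_this.search("not just this")))   # False
```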
Square brackets [ ], used as a pair, stand for a set of characters. For example, the regex [12345] matches every sequence that contains at least one digit between 1 and 5, so 1hello and abcde51z2 are matched, but 6a is not.
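The same three strings can be checked with a short Python sketch (again assuming the standard re module; has_digit_1_to_5 is a hypothetical helper name):

```python
import re

DIGIT_1_TO_5 = re.compile(r"[12345]")

def has_digit_1_to_5(text: str) -> bool:
    """Return True if text contains at least one of the digits 1-5."""
    return DIGIT_1_TO_5.search(text) is not None

print(has_digit_1_to_5("1hello"))     # True
print(has_digit_1_to_5("abcde51z2"))  # True
print(has_digit_1_to_5("6a"))         # False
```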
Of course, by putting a backslash \ in front of it, you can use any reserved character literally in your sequences.
For numbers and letters, you can use - inside the brackets to obtain a range of characters ([a-z] stands for the whole lowercase alphabet).
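A small Python sketch of both ideas, assuming the standard re module. I use the $ character for the escaping example only because it was introduced above; the sample strings are made up.

```python
import re

# Unescaped, $ means "end of the string", so "$5" can never match:
# nothing can follow the end. Escaped, it is a plain dollar sign.
print(bool(re.search(r"$5", "it costs $5")))   # False
print(bool(re.search(r"\$5", "it costs $5")))  # True

# [a-z] is a range: any single lowercase ASCII letter.
print(re.findall(r"[a-z]", "Ab3c"))  # ['b', 'c']

# Ranges work for digits too, and can be combined in one set.
print(re.findall(r"[a-c0-2]", "a1d9c2"))  # ['a', '1', 'c', '2']
```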
A + after a sequence means that the sequence must appear at least once, while * states that the sequence can appear 0 or more times; finally, ? means that the preceding sequence is optional, so it can appear 0 or 1 times.
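The three quantifiers can be compared side by side with a minimal Python sketch. It assumes the standard re module and uses re.fullmatch, which requires the whole string to match; matches is a hypothetical helper of mine. Here each quantifier applies only to the single character b that precedes it.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# +  : one or more "b"
print(matches(r"ab+", "a"))     # False
print(matches(r"ab+", "abbb"))  # True

# *  : zero or more "b"
print(matches(r"ab*", "a"))     # True
print(matches(r"ab*", "abbb"))  # True

# ?  : zero or one "b"
print(matches(r"ab?", "a"))     # True
print(matches(r"ab?", "ab"))    # True
print(matches(r"ab?", "abb"))   # False
```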
In the regex above I wrote a backslashed dot because . is a special character, meaning any character except the newline character (\n, \r\n or \r, depending on your operating system). In fact, the regex ^.+$ matches every string except the empty one and those made only of newlines.
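Here is a quick Python check of the dot, assuming the standard re module with its default flags (other engines, or flags such as DOTALL, can change how the dot treats newlines). The sample strings are made up.

```python
import re

# A bare dot accepts any character except a newline;
# an escaped dot accepts only a literal ".".
print(bool(re.search(r"x.us", "x-us")))    # True
print(bool(re.search(r"x.us", "x\nus")))   # False
print(bool(re.search(r"x\.us", "x-us")))   # False
print(bool(re.search(r"x\.us", "x.us")))   # True

# ^.+$ needs at least one non-newline character.
non_empty_line = re.compile(r"^.+$")
print(bool(non_empty_line.search("anything at all")))  # True
print(bool(non_empty_line.search("")))                 # False
print(bool(non_empty_line.search("\n\n")))             # False
```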
Finally, we can group several sequences together with round brackets ( ).
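Grouping matters most when a quantifier follows the brackets, as this short Python sketch suggests (standard re module assumed; whole_match is a hypothetical helper name):

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Without brackets, + repeats only the "b".
print(whole_match(r"ab+", "abbb"))      # True
print(whole_match(r"ab+", "ababab"))    # False

# With brackets, + repeats the whole "ab" group.
print(whole_match(r"(ab)+", "ababab"))  # True
print(whole_match(r"(ab)+", "abbb"))    # False
```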
And now a brief explanation of the complex regex for an email address:
^
- beginning of the sequence
[a-z0-9]+
- the first part begins with an alphanumeric character (one or more)
([\._-]*[a-z0-9]+)*
- the first part can contain dots, underscores and dashes, but they must be followed by alphanumeric characters (so it cannot end with a non-alphanumeric character); moreover, this group may be absent altogether (*), so the previous part alone can recognize a simple email address without non-alphanumeric characters (such as pippo82@x.us)
@
- simply the @ character
[a-z0-9]+([\._-]*[a-z0-9]+)*
- the same as above
\.[a-z]{2,4}
- a dot followed by a simple sequence of 2 to 4 lowercase letters
$
- end of the sequence
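To try the whole thing, here is a minimal sketch in Python, assuming the standard re module. The regex is the one from this article, copied as-is; the names EMAIL_RE and is_valid_email are hypothetical. I use re.fullmatch so that the whole input must match. Keep in mind that the pattern only accepts lowercase letters and follows the simplified rules listed above, not every address that is valid in the real world.

```python
import re

# The article's regex, unchanged. It is case-sensitive and
# limits the final part to 2-4 lowercase letters.
EMAIL_RE = re.compile(
    r"^[a-z0-9]+([\._-]*[a-z0-9]+)*@[a-z0-9]+([\._-]*[a-z0-9]+)*\.[a-z]{2,4}$"
)

def is_valid_email(address: str) -> bool:
    """Check an address against the simplified rules of this article."""
    return EMAIL_RE.fullmatch(address) is not None

print(is_valid_email("pippo82@x.us"))           # True
print(is_valid_email("john.doe@example.com"))   # True
print(is_valid_email("john@mail.example.com"))  # True
print(is_valid_email(".john@example.com"))      # False: starts with a dot
print(is_valid_email("john.@example.com"))      # False: dot right before @
print(is_valid_email("john@example.c"))         # False: only 1 final letter
print(is_valid_email("john@example.museum"))    # False: 6 final letters
print(is_valid_email("John@example.com"))       # False: uppercase letter
```

If uppercase input should be accepted, one option is to lowercase the address before checking it; that is a design choice of the caller, not something the regex does.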