Regular expressions intro and a useful testing tool

Intro

So I have been doing some work on building a web scraper using OOP PHP. It is my first project using objects, and it got off to a rocky start. After I finagled with it a bit, I got my classes and even some methods to work!

RegexPal

This brings me to my topic tonight: regular expressions. I have an excellent book I use for reference, Mastering Regular Expressions; you may want to pick up a book or perhaps a cheat sheet. I was still having some trouble, though, because I was never sure exactly what I was selecting. It is hard to debug something you can’t see working. The errors seemed rather ambiguous and discouraging. That is, until I found this really cool free online tool, RegexPal. You can see the results of your regular expression instantly, so it takes a lot of the guesswork out of it. Granted, like any program it isn’t 100% perfect, but it is still a useful tool when testing.

The basics of Regular Expressions

The first thing you should learn is how to anchor a regular expression to the start and end of a string.

  • ^ – This is the start of the string. Use it if the match should always start with the character or pattern following the ^.
  • $ – This is the end of the string. Use it if the match should always end with the character or pattern preceding the $.

Example 1:

If you have these three strings:

1- Regular Expression
2- This is another Regular Expression
3- Regular Expression to the third power

And this is your regular expression: ^Regular

This is what will match up (the matched text is the word Regular at the start of the string).

1- Regular Expression (match)
2- This is another Regular Expression (no match)
3- Regular Expression to the third power (match)

The reason the second string has no match is that, if you remember, the ^ declares the start of a string. So when you say ^Regular, you are saying the string must start with Regular to match.
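
Here is a minimal sketch of Example 1 that you can run yourself. It is written in Python rather than PHP purely as an assumption for illustration; the ^ and $ anchors should behave the same way in PHP's preg functions for strings like these. The second pattern, Expression$, is not from the example above and is only there to show the end anchor.

```python
import re

strings = [
    "Regular Expression",
    "This is another Regular Expression",
    "Regular Expression to the third power",
]

# ^Regular: the string must begin with "Regular"
starts_with = [s for s in strings if re.search(r"^Regular", s)]

# Expression$: the string must end with "Expression"
ends_with = [s for s in strings if re.search(r"Expression$", s)]

print(starts_with)  # strings 1 and 3
print(ends_with)    # strings 1 and 2
```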

The second thing you should learn is the quantifiers. They control how many times the preceding character or group is matched.

  • * – This is probably the most wild of them all. It matches the preceding character 0 or more times. So essentially it says, ‘hey, this character is allowed to be missing, or to repeat as many times as it wants’.
  • + – Matches the preceding character 1 or more times. So unlike the *, the preceding character must appear at least once.
  • ? – This means you can have 0 or 1. I usually think of it as saying the match is optional.
  • {2} – Match exactly 2 times. The number can be whatever you want; {5} will match exactly 5 times.
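
To see the four quantifiers side by side, here is a small sketch, again in Python as an assumption for illustration. The helper name matches and the sample patterns (ab*, colou?r and so on) are made up for this demo and do not come from the scraper project.

```python
import re

def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# * : the "b" may appear zero or more times
star = [matches(r"ab*", s) for s in ("a", "ab", "abbb")]

# + : the "b" must appear at least once
plus = [matches(r"ab+", s) for s in ("a", "ab", "abbb")]

# ? : the "u" is optional (zero or one time)
optional = [matches(r"colou?r", s) for s in ("color", "colour", "colouur")]

# {2} : the "b" must appear exactly twice
exactly_two = [matches(r"ab{2}", s) for s in ("ab", "abb", "abbb")]

print(star)         # [True, True, True]
print(plus)         # [False, True, True]
print(optional)     # [True, True, False]
print(exactly_two)  # [False, True, False]
```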

The third and last thing to note are the basic ways of selecting a range of values.

  • [a-z] – any lowercase character between a and z
  • [A-Z] – any uppercase character between A and Z
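  • [0-9] – any digit between 0 and 9. The range can be narrower, so [3-7] is also valid. A range always covers single characters, though, so [50-100] does not mean the numbers 50 to 100; it matches just one character that is 0, 1, or 5.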
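
A quick sketch of the three ranges, in Python as an assumption for illustration. It also checks what a class like [50-100] really does: it compiles, but it only ever matches one character at a time. The sample strings and the numeric_range pattern are hypothetical, just one possible way to match the numbers 50 to 100.

```python
import re

text = "PHP 5 rocks"

lower = re.findall(r"[a-z]", text)   # every lowercase letter
upper = re.findall(r"[A-Z]", text)   # every uppercase letter
digits = re.findall(r"[0-9]", text)  # every digit

# [50-100] is a set of single characters: "5", the range "0"-"1", "0", "0"
odd_class = re.findall(r"[50-100]", "75 and 100")

# One way to match the numbers 50 to 100: two digits starting with 5-9, or 100
numeric_range = re.compile(r"(?:[5-9][0-9]|100)")
in_range = [bool(numeric_range.fullmatch(s)) for s in ("49", "50", "99", "100", "101")]

print(lower)      # ['r', 'o', 'c', 'k', 's']
print(upper)      # ['P', 'H', 'P']
print(digits)     # ['5']
print(odd_class)  # ['5', '1', '0', '0']
print(in_range)   # [False, True, True, True, False]
```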

So some examples would be cool, right?

Say you have a string like so: That is a really cool computer, where did you get it? Now what if you wanted to match from the comma to the end of the sentence?

([,][a-z? ]*)
  • Let’s break it down. The parentheses () group the match together, just like they group conditions in PHP, Perl, JavaScript, etc.
  • [,] matches the comma first.
  • [a-z? ] matches any one lowercase character a-z, a question mark, or a space. Inside the square brackets the ? is a literal question mark, not a quantifier.
  • The final * means I want to match the preceding [a-z? ] zero or more times. Each repetition can be an ‘a-z’ or a ‘?’ or a ‘ ’; it doesn’t matter which one.
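
If you would rather check this outside of RegexPal, the breakdown above can be sketched like this. Python is used here as an assumption for illustration (in PHP you would likely reach for preg_match instead), and the variable names are made up.

```python
import re

sentence = "That is a really cool computer, where did you get it?"

# Same pattern as above: a comma, then any run of lowercase letters, "?" or spaces
match = re.search(r"([,][a-z? ]*)", sentence)

# group(1) is the text captured by the parentheses
tail = match.group(1)

print(tail)  # , where did you get it?
```

Note that the match only runs to the end of the sentence because everything after the comma happens to be lowercase letters, spaces, or a question mark. An uppercase letter or a digit would stop it early.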

So if you are still looking for some more reading on the subject, I found these sites to be helpful:
