Regular expressions intro and a useful testing tool

Intro

So I have been doing some work on building a web scraper using OOP PHP. It is my first project using objects and it got off to a rocky start. After I finagled with it a bit I got my classes and even some methods to work!

RegexPal

This brings me to my topic tonight, regular expressions. I have an excellent book I use for reference, Mastering Regular Expressions, and you may want to pick up a book or perhaps a cheat sheet. I was still having some trouble though, because I was never sure exactly what I was selecting. It is hard to debug something you can’t see working, and the errors seemed rather ambiguous and discouraging. That is, until I found this really cool free online tool, RegexPal. You can see the results of your regular expression instantly, so it takes a lot of the guesswork out of it. Granted, like any program it isn’t 100% perfect, but it is still a useful tool when testing.

The basics of Regular Expression

The first thing you should learn is how to start and end a regular expression.

  • ^ – This is the start of a string. Use it if the match should always start with the character or pattern following the ^.
  • $ – This is the end of the string. Use it if the match should always end with the character or pattern preceding the $.

Example 1:

If you have these three strings:

1- Regular Expression
2- This is another Regular Expression
3- Regular Expression to the third power

And this is your regular expression : ^Regular

This is what will match up (the matched text is in square brackets).

1- [Regular] Expression
2- This is another Regular Expression (no match)
3- [Regular] Expression to the third power

The reason the second string has no match is that, if you remember, the ^ declares the start of a string. So when you say ^Regular, you are saying the string must start with Regular to match.
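If you want to try Example 1 outside of RegexPal, here is a minimal JavaScript sketch of it. The variable names are just mine for illustration, and the second check (Expression$) is an extra pattern I added to show the $ anchor on the same three strings.

```javascript
// The three example strings from Example 1.
const strings = [
  "Regular Expression",
  "This is another Regular Expression",
  "Regular Expression to the third power",
];

// ^Regular only matches when the string starts with "Regular".
const startsWithRegular = strings.map((s) => /^Regular/.test(s));

// Expression$ only matches when the string ends with "Expression".
const endsWithExpression = strings.map((s) => /Expression$/.test(s));

console.log(startsWithRegular);  // [ true, false, true ]
console.log(endsWithExpression); // [ true, true, false ]
```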

The second thing you should learn is the quantifiers. They are responsible for how many times the thing in front of them is matched.

  • * – This is probably the most wild of them all. It matches the preceding character 0 or more times. So essentially it says, ‘hey, I don’t care whether this character shows up at all, and if it does you can have as many of it in a row as you want’.
  • + – Matches the preceding character 1 or more times. So unlike the *, the preceding character must show up at least once.
  • ? – This means you can have 0 or 1 of the preceding character. I usually think of it as saying the match is optional.
  • {2} – Matches the preceding character exactly 2 times. The number can be whatever you want; {5} will match 5 times.
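Here is a small JavaScript sketch of the four quantifiers. The patterns and test strings are made up for illustration, and each pattern is anchored with ^ and $ so the whole string has to match. Note that ? allows zero or one of the preceding character, not more.

```javascript
const zeroOrMore = /^ab*c$/;  // "b" zero or more times
const oneOrMore = /^ab+c$/;   // "b" one or more times
const optional = /^colou?r$/; // "u" zero or one time
const exactlyTwo = /^a{2}$/;  // "a" exactly two times

console.log(zeroOrMore.test("ac"));    // true
console.log(zeroOrMore.test("abbbc")); // true
console.log(oneOrMore.test("ac"));     // false
console.log(oneOrMore.test("abc"));    // true
console.log(optional.test("color"));   // true
console.log(optional.test("colour"));  // true
console.log(optional.test("colouur")); // false
console.log(exactlyTwo.test("aa"));    // true
console.log(exactlyTwo.test("aaa"));   // false
```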

The third and last thing to note are the basic ways of selecting a range of values.

  • [a-z] – any lowercase character between a and z
  • [A-Z] – any uppercase character between A and Z
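  • [0-9] – any number between 0 and 9. The range can be narrower, so [2-5] is also valid, but a range always covers single characters: [50-100] does not mean the numbers 50 through 100.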
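A quick JavaScript sketch of the ranges. One thing worth seeing for yourself is that a range inside square brackets works on single characters, so [50-100] is really just the characters 5, 0 and 1. The last pattern is my own suggestion for matching the whole numbers 50 through 100, not the only way to do it.

```javascript
const lowercase = /^[a-z]$/;
const digit = /^[0-9]$/;

// Read as: "5", the range "0" to "1", "0" and "0". So it matches one character: 5, 0 or 1.
const notANumberRange = /^[50-100]$/;

console.log(lowercase.test("q"));        // true
console.log(lowercase.test("Q"));        // false
console.log(digit.test("7"));            // true
console.log(notANumberRange.test("1"));  // true
console.log(notANumberRange.test("7"));  // false
console.log(notANumberRange.test("75")); // false

// One possible way to match the numbers 50 through 100:
// a tens digit of 5-9 followed by any digit, or the literal 100.
const fiftyToHundred = /^([5-9][0-9]|100)$/;

console.log(fiftyToHundred.test("75"));  // true
console.log(fiftyToHundred.test("100")); // true
console.log(fiftyToHundred.test("49"));  // false
console.log(fiftyToHundred.test("101")); // false
```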

So some examples would be cool right?

Say you have a string like so: That is a really cool computer, where did you get it?. Now how about if you wanted to match from the comma to the end of the sentence?

([,][a-z? ]*)
  • Let’s break it down. The parentheses () group the match together, just like with conditional statements in PHP, Perl, JavaScript, etc…
  • [,] matches the comma first.
  • [a-z? ] matches any lowercase character a-z, a question mark, or a space. (Inside the square brackets the ? is a literal question mark, not a quantifier.)
  • The final * means I want to match the preceding ‘a-z’ or ‘?’ or ‘ ‘ zero or more times, and it doesn’t matter which one shows up each time.
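To check the breakdown above, here is the same pattern run against the same sentence in JavaScript. The variable names are mine; match[1] is whatever the parentheses captured.

```javascript
const sentence = "That is a really cool computer, where did you get it?";
const pattern = /([,][a-z? ]*)/;

// match[0] is the whole match, match[1] is the first (and only) group.
const match = sentence.match(pattern);

console.log(match[1]); // ", where did you get it?"
```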

So if you are still looking for some more reading on the subject, I found these sites to be helpful:

Arrays and cookies in JavaScript

So I was just working on a web project involving cookies. The site had a couple of categories of pages that each had their own ‘tips’ for the user. The ‘tips’ can be hidden or shown by clicking on a button. So I figured rather than creating a page-specific function to tell whether to keep the tip open or closed, I would write one for all the pages. It started out rather easy: making the array, setting the cookie and reading the cookie took less than half an hour (I had previously created set, get, and delete cookie functions in JavaScript).

The Problem

I went to test it out and it worked on the first test. I saw ,,,0, meaning the first 3 pages had not been visited and the 4th had the ‘tips’ window closed. I was pumped, ready to move on to another task. I went to test another page and when I clicked the button guess what happened? Yeah, I figured you would know, but in case you didn’t, the array/cookie now looked like this: ,,,,0,,,

I puttered around for an hour or so trying various things. I kept focusing on where I pulled the info from the cookie and set it to the new array.

showCookie= getCookie('showCookie');
showArray = new Array();
if(showCookie!= ''){
	for(i=0;i<=3;i++){
		showArray[i] = showCookie[i];
	}
}

It looked right to me. So I spent the next half hour digging around looking for what else it could have been. I looked at where I set the array, the length of the array before and after setting it. It always came out as what it was supposed to.

The Conclusion

Then it hit me. What if the array was no longer an array? What if when you set an array as a cookie value it becomes a string? Maybe the explanation for the cookie having these extra commas was that when I was looping through the cookie ‘array’ I was actually looping through a string one character at a time, copying the commas themselves into my new array. I tested this theory out by revising my code.

var showCookie = getCookie('showCookie');
var showArray = new Array();

if(showCookie != ''){
	// When pulling the cookie array back in, it is a string, not an array,
	// so split it on the commas to get the values back.
	var showCookieSplit = showCookie.split(",");
	for(var i=0;i<=3;i++){
		showArray[i] = showCookieSplit[i];
	}
}

It worked! Long story short, now I know that I should test everything, even the things I think to be true.
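If you want to see the array-to-string behaviour without any cookies involved, here is a small standalone JavaScript sketch. The showArray contents are a made-up stand-in for my tips array, and String() stands in for what happens when the array is written into the cookie.

```javascript
// Hypothetical tips array: first 3 pages not visited, 4th has tips closed.
const showArray = [undefined, undefined, undefined, 0];

// Storing an array as a cookie value turns it into a comma-separated string.
const cookieValue = String(showArray);
console.log(cookieValue); // ",,,0"

// Wrong: indexing the string copies single characters, commas included.
const wrong = [];
for (let i = 0; i <= 3; i++) {
  wrong[i] = cookieValue[i];
}
console.log(wrong);         // [ ',', ',', ',', '0' ]
console.log(String(wrong)); // ",,,,,,0"  (the commas multiply on every save)

// Right: split the string back into its values first.
const right = cookieValue.split(",");
console.log(right);         // [ '', '', '', '0' ]
console.log(String(right)); // ",,,0"
```

Note that the values come back as strings ('0' rather than 0), so compare them accordingly.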

Google Caffeine

I was just reading some articles and came across something I should have seen months ago! Google is working on a speedier search engine which appears to have a slightly different method of determining where a page ranks. When I Google my name with Caffeine, I am in the 8th position on the first page. When I Google my name normally, I am in the 6th position on the first page. So, I like that I have a better ranking now, but at the same time the pages that overtook me in Caffeine had to do with a former Yale football star with the same name. So the fact that a page that is not considered relevant now is relevant with Caffeine is good news.

The reasons for the changes in the search are not evident from a glance. The architecture of the search engine has changed, and along with that the algorithms that determine ranking have also changed. Even though Caffeine is still in development it is something you should keep an eye on so it doesn’t take you by surprise when your web site either takes a small dive in rank or jumps up a couple of spots!

Cookies don’t set using PHP 4, but they do using PHP 5?

So yesterday when working on a client site all of the cookies stopped working. I had not changed a single thing since I had last worked on it, when everything worked. I spent a couple of hours researching posts from people whose cookies were not setting properly. It yielded a vast amount of results, mostly pointing to how I was setting the cookie. My cookie code appeared to be perfect in structure though, so I cast that aside.

setcookie("cookieName", '1' , time()+7200,'/','.levijackson.net');

The best answer I could find over the two hours was this:

Check the cookie settings of the other browsers and if they’re set to block all or empty on exit.
If the cookies work in one browser, but not another, you will need to make sure that the other browser is letting you set cookies in the first place.
Sometimes it will look like you can create the cookie, but then it will disappear or be deleted with each page reload.

It’s also possible that because you’re setting the cookies in an iframe, that the browsers may view it as a third-party cookie and reject it unless explicitly set out in the browser preferences to allow third-party cookies.

In that case you would need a compact privacy policy (or a compact P3P header) on the pages from where you’re trying to set the cookies from.
For PHP, you would add this as your header for the page setting the cookie:
header('P3P:CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"');

Posted on Stackoverflow

The problem?

It didn’t work for me. For one, I wasn’t using iframes and therefore had no path issues. I also came across another solution of using httpOnly cookies. This solved the problem in Chrome but not IE. I depend on expiring and resetting cookie values for a key part of the site. This didn’t have the functionality of expiring from what I could find.

I always use it as my last option, as I like to figure things out on my own, but I finally turned to Stackoverflow. A user on the site, Kip, suggested I use Web Sniffer to see how the headers were being sent. When I viewed them I at first didn’t notice anything different. I was getting the following:

(PHP 4 site) eventCookie=2; expires=Wed, 30-Sep-2009 02:16:37 GMT; path=/; domain=.levijackson.net
(PHP 5 site) eventCookie=2; expires=Wed, 30 Sep 2009 05:33:37 GMT; path=/; domain=.levijackson.net

The part that finally clicked with me was when the cookies expired. For the PHP 5 site the cookie expired at around 5:30am GMT, which was the equivalent of around midnight here (at the time of testing it was around 11pm). The cookie for the PHP 4 site, however, was set to expire at around 2am GMT, which translated to 9pm. That was already in the past by the time the browser received it, so the cookie was being set and immediately unset.

The solution:

I set the cookie to expire in 24 hours rather than just 2. This allowed for the difference in server time.
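To make the timing concrete, here is a rough JavaScript sketch using the two expiry times from the headers above. The big assumption is that both requests were made at about the same moment and that the PHP 5 server’s clock was correct, so the real time was its expiry minus the 7200 seconds from my setcookie call. The function and variable names are made up for illustration.

```javascript
const TWO_HOURS_MS = 7200 * 1000;
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// Expiry times from the two Set-Cookie headers, in GMT.
// Months are zero-based in JavaScript, so 8 is September.
const php4Expires = Date.UTC(2009, 8, 30, 2, 16, 37);
const php5Expires = Date.UTC(2009, 8, 30, 5, 33, 37);

// Assumption: the real time of the request, worked back from the PHP 5 header.
const assumedNow = php5Expires - TWO_HOURS_MS;

// A cookie whose expiry is not in the future is thrown away by the browser.
function isAlreadyExpired(expiresMs, nowMs) {
  return expiresMs <= nowMs;
}

console.log(new Date(assumedNow).toISOString());        // 2009-09-30T03:33:37.000Z
console.log(isAlreadyExpired(php4Expires, assumedNow)); // true
console.log(isAlreadyExpired(php5Expires, assumedNow)); // false

// With a 24 hour lifetime the PHP 4 server's slow clock no longer matters.
const php4ClockNow = php4Expires - TWO_HOURS_MS; // what the PHP 4 server thought the time was
const php4ExpiresFixed = php4ClockNow + ONE_DAY_MS;
console.log(isAlreadyExpired(php4ExpiresFixed, assumedNow)); // false
```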

Side notes:

I still don’t know how the server changed its time, but it quite literally went from working to not working. So I can only assume the admin of the server either updated or changed a setting. Either way I gained a tool and some more knowledge of cookies so I can’t complain.

elementFormDefault – what is it?

As I researched and wrote schemas I realized I was writing

<schema 
	xmlns="http://www.w3.org/2001/XMLSchema"
	targetNamespace="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified"
>

What was wrong with that, you might be asking? Well, for starters I had no idea what the attribute elementFormDefault did. I had been writing it over and over without knowing what it did, just that it worked, and that was just asking for trouble. So I did some research and more research. It finally all came together and I figured I would share my findings for everyone to enjoy.

elementFormDefault is an attribute of the schema tag, and it works together with that tag’s targetNamespace attribute. The targetNamespace attribute names the namespace that the elements declared in the schema belong to. elementFormDefault takes one of two values, qualified or unqualified, and it decides whether elements declared locally (inside a complex type, rather than directly under the schema tag) have to be namespace-qualified in the XML document. If the form is qualified, all of the elements in the XML document belong to the targetNamespace. If the form is unqualified, which is the default when the attribute is left off, only the globally declared elements belong to the targetNamespace and the local elements belong to no namespace unless you say otherwise. You would do this inline by adding the attribute form="qualified" to the element you would like to be included in the namespace. An example of this would be:

<element name="assignment" type="string" minOccurs="0" maxOccurs="unbounded" form="qualified">

So to conclude, there aren’t too many scenarios where you would want to set elementFormDefault="unqualified". Just know it is out there.

“The only source of knowledge is experience”
-Albert Einstein

When to use attributes and when to use elements in XML

So I was wondering today if there is a rule on when to use attributes in XML. It seems to me it is much easier to parse out data if there are no attributes, but on the other hand I don’t want to put data in an element if it isn’t pertinent to the information being held. So my first option was this:

<location>
	<id>11</id>
	<name>Downtown</name>
	<address>225 Main Street Nowheresville</address>
</location>

My second option was this:

<location id="11">
	<name>Downtown</name>
	<address>225 Main Street Nowheresville</address>
</location>

I did some research and found that everyone and their mother was saying, data goes in elements, metadata in attributes. So I did some more research just to make sure I was understanding it completely. I found these two tips to be clearer.

  1. If the information is pertinent to the record as a whole, then it is best to make it an element.
  2. Conversely, if the information is used as a reference for other aspects of an application or web site, it is best to set it as an attribute.

Quite possibly the best tips I found were from the X12 Reference Model For XML Design.

  1. “Attributes are atomic and cannot be extended and its existence should serve to remove any and all possible ambiguity of the element it describes. They are ‘adjectives’ to the element ‘noun’.”
  2. “Use elements for data that will be produced or consumed by a business application, and attributes for metadata.”

Further reading: Principles of XML design: When to use elements versus attributes
Elements vs. attributes

MySQL notes

It has been some time since I took my intro to Relational Databases back in the spring of ’07. So today I decided to do some research on MySQL and databases in general. So without further ado, here are some helpful notes.

Data Types

  1. char vs varchar – If a string 20 characters in length is inserted into a char(30) field, it will take up 30 characters in the database. If that same string is inserted into a varchar(30) field, it will take up just 20 characters of space (plus a byte or two to record the length).
  2. Int vs smallint – Smallint is used if your number will not exceed the range of -32768 to 32767. Int is used when the number falls between -2147483648 and 2147483647. It is also worth mentioning the other types of integers: tinyint, mediumint, and bigint.
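The integer ranges are not arbitrary; they fall straight out of how many bytes each type uses (1 for tinyint, 2 for smallint, 3 for mediumint, 4 for int, 8 for bigint). Here is a little JavaScript sketch that works the signed ranges out. signedRange is just a helper name I made up, and it uses BigInt so the bigint numbers don’t lose precision.

```javascript
// Signed range for an integer stored in the given number of bytes:
// -(2^(bits-1)) up to 2^(bits-1) - 1.
function signedRange(bytes) {
  const bits = BigInt(bytes * 8);
  const half = 2n ** (bits - 1n);
  return { min: -half, max: half - 1n };
}

// Storage size in bytes of each MySQL integer type.
const storageBytes = { tinyint: 1, smallint: 2, mediumint: 3, int: 4, bigint: 8 };

for (const [type, bytes] of Object.entries(storageBytes)) {
  const { min, max } = signedRange(bytes);
  console.log(`${type}: ${min} to ${max}`);
}
// tinyint: -128 to 127
// smallint: -32768 to 32767
// mediumint: -8388608 to 8388607
// int: -2147483648 to 2147483647
// bigint: -9223372036854775808 to 9223372036854775807
```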

Normalization

Database normalization is quite possibly one of the blurriest areas for me. Going from teachers that profess a database always needs to be normalized to other teachers that say ‘if it works don’t fix it’, I have had to find a middle ground for now until I can learn more about it and make a decision. For most projects I do, I follow through to third normal form (3NF). This prevents redundant, duplicate data and also prevents data anomalies from occurring. Data anomalies are incorrect or inconsistent data that result from updates or insertions that create “duplicate” records.

On another note, I just purchased a new MySQL book, MySQL Crash Course by Ben Forta. It looks to be an excellent replacement for my current book. Just looking through the table of contents and reading a few pages, it covers way more than the book I purchased for my relational database class. So stay tuned for MySQL Notes ver.2.

Make all external links open in a new window/tab using jQuery

Alright, so I wanted a nice easy way to make all of the external links I post open in a new tab. It just makes sense not to facilitate users leaving your web site.

Step 1)
Target all of the anchor tags on the page when they are clicked.

Alright, so if that worked it means you did something right.

Step 2)
Get the url of the current page and split it down to the base URI for the page.

If that worked you will of course see your domain popup.

Step 3)
Check if the url is your own. If it isn’t, add in the attribute target="_blank".

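$(document).ready(function(){
	$('a').click(function(){
		var fullUrl = $(this).attr('href');
		// Anchors without an href and relative links are never external, so skip them.
		if(!fullUrl || !/^https?:\/\//.test(fullUrl)){
			return;
		}
		// "http://example.com/page".split("/") gives ["http:", "", "example.com", "page"],
		// so the domain ends up at index 2.
		var splitUrl = fullUrl.split("/");
		if((splitUrl[2] != 'www.levijackson.net') && (splitUrl[2] != 'levijackson.net')){
			$(this).attr("target","_blank");
		}
	});
});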

If you aren’t familiar with the usage of $(this), it is shorthand for the element clicked on. To finish, obviously replace my domain with your own, and voilà.
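If you would like to test the domain check on its own, without jQuery or a browser, here is a minimal plain JavaScript sketch of the same idea. isExternal and myHosts are names I made up for this example, and treating relative links as internal is my own assumption.

```javascript
// Returns true when href points at a host that is not in ownHosts.
function isExternal(href, ownHosts) {
  // Missing or relative links (no http:// or https://) are treated as internal.
  if (!href || !/^https?:\/\//.test(href)) {
    return false;
  }
  // "http://example.com/page".split("/") -> ["http:", "", "example.com", "page"]
  const host = href.split("/")[2];
  return !ownHosts.includes(host);
}

const myHosts = ["www.levijackson.net", "levijackson.net"];

console.log(isExternal("http://www.levijackson.net/blog", myHosts)); // false
console.log(isExternal("https://www.example.com/page", myHosts));    // true
console.log(isExternal("/about", myHosts));                          // false
```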