Parsing the IMDB movie list

I’ve been working on a personal project to help index the movies I own so it’s easier to search and pick a movie when we feel like watching one. The site will allow you to add movies by name, and it will auto-populate the release date, actors, plot, and any other pertinent info.

What I found after some research is that there are not many options for searching the IMDB movie database, and most certainly nothing official. After a few “close but no cigar” moments with some APIs others had built, I was determined to make use of what IMDB does provide: text files with movie, actor, actress, etc. info. You might be thinking, “Levi that is awesome, why didn’t you start with those?!”. Well, if you take a peek at them you will quickly find out the trouble with working with them.

  1. They do not have IDs to tie things together; they rely on the titles. So you must rely on the movie title to match all of the bits of info together. That isn’t such a problem until you realize there is also a “movie AKA” file, so each movie can be known by more than one name.
  2. The data can be hard to figure out sometimes. For example:
    • “2010-11 Regular Season” (2007) {Hurricanes vs. Red Wings: November 21, 2013 (#2013.17)} 2013 – what this means is that the title is from the 2010-11 season, it was released in 2007, and it is from season 2013, episode 17. Confusing, right?
    • Inherit the Earth (????) 2004 – the release date is normally in the parentheses. In this example it is not; instead it is at the end of the line.


After a lot of work, this is the regular expression I ended up using to match as many titles as I could (assembled here from the pieces explained below). I’m sure I’ll still need to tweak it over time to get it 100% working.

([\s\S]*)\(([\d{4}]*|\?*)(?:\/)?([\w]*)?\)(\s*{([\w!\s:;\/\.\-\'”?`_&@$%^*<>~+=\|\,\(\)]*)(\s*\(#([\d]*)\.([\d]*)\))?})?\s*([\d{4}]*)?
It’ll probably be easiest to understand this using an example movie title.

“Blaulicht” (1959) {Das Gitter (#4.1)}1962

([\s\S]*) – this will match any run of characters, whitespace or not (unlike a plain ., it also matches newlines).
\( – this will stop that first matching when it runs into an opening parenthesis, normally signifying the start of a release date.
([\d{4}]*|\?*) – this matches the year, or a run of question marks when the year is unknown. (One gotcha: inside a character class the {4} is treated as the literal characters {, 4, and } rather than a repeat count, so this really matches any run of digits.)
(?:\/)? – sometimes the year was entered as (2004/I) or (2005/IV).  This allows an optional forward slash after the year.
([\w]*)? – Along with the previous item, this accommodates finding characters after the year (IMDB uses the Roman numerals to tell apart different titles with the same name and year).
\) – this just signifies the end of the release date.
(\s*{([\w!\s:;\/\.\-\'”?`_&@$%^*<>~+=\|\,\(\)]*) – this looks way more complex than it is (it goes along with the next bit, so look at those as a whole as opposed to separate chunks). It looks for a space after the year and then it matches the content inside the curly brace until it hits a pound sign signaling an episode.
(\s*\(#([\d]*)\.([\d]*)\))?})? – This matches the episode name, season, and episode number if it runs into #.
\s* – just allows for as many spaces as needed between this and anything after.
([\d{4}]*)? – this last piece, together with the \s* above, matches an optional trailing year. Some of the shows list a year here for how long they have been running, and some titles (like the “Inherit the Earth” example) put the real release year here.
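To sanity-check the pieces, here they are reassembled and run against the example titles. I’m using Python’s re module for the demo since it’s easy to poke at interactively; these particular constructs behave the same way under PCRE/preg_match.

```python
import re

# The full pattern, reassembled from the breakdown above.
pattern = (
    r"([\s\S]*)\(([\d{4}]*|\?*)(?:\/)?([\w]*)?\)"
    r"(\s*{([\w!\s:;\/\.\-\'”?`_&@$%^*<>~+=\|\,\(\)]*)"
    r"(\s*\(#([\d]*)\.([\d]*)\))?})?\s*([\d{4}]*)?"
)

m = re.match(pattern, "“Blaulicht” (1959) {Das Gitter (#4.1)}1962")
print(m.group(1).strip())      # the title: “Blaulicht”
print(m.group(2))              # the release year: 1959
print(m.group(7), m.group(8))  # season 4, episode 1
print(m.group(9))              # the trailing year: 1962

# The "????" case: no year in the parentheses, real year at the end.
m2 = re.match(pattern, "Inherit the Earth (????) 2004")
print(m2.group(2))             # ????
print(m2.group(9))             # 2004
```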


The last important bit is the modifiers on the preg_match call: “i” makes the match case insensitive and “u” treats the pattern and subject as UTF-8. Without “u” it fails on some foreign characters outside ISO-8859-1.

json_encode setting a value to null?

I was recently working on generating a JSON object using data from a database and ran into a problem: my “description” field was showing as null after I ran it through json_encode(). I checked the array prior to passing it to json_encode() and it showed the description like it should. After pondering, testing, and mentally throwing my computer out the window… I wondered if the problem might be that there was a character json_encode() couldn’t handle. I didn’t spot anything when I was looking at the value in the database, so I turned to Google. I read some posts suggesting it may be caused by an encoding issue (the text was supposed to be encoded as UTF-8, but had been inserted with some invalid characters). The docs for json_encode say that the string being encoded must be UTF-8 to work.

If you need to clean a string before doing json_encode, this will ensure only valid UTF-8 characters are used:

$string = iconv('UTF-8', 'UTF-8//IGNORE', $string);
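As a quick illustration of what //IGNORE is doing (in Python here just for a runnable demo; the \xff byte is a stand-in for whatever invalid sequence was lurking in my database):

```python
import json

raw = b"A movie description \xff with an invalid byte"  # \xff is not valid UTF-8

# The equivalent of iconv('UTF-8', 'UTF-8//IGNORE', $string):
# decode and silently drop any invalid byte sequences.
clean = raw.decode("utf-8", errors="ignore")

# With the bad byte gone, JSON encoding no longer nulls the field.
print(json.dumps({"description": clean}))
```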

CKEditor 3.x is stripping out Font Awesome tags

It took a little bit of digging because a lot of the solutions were related to different versions of CKEditor. I finally came across a solution on Stack Overflow.

Add something along these lines to your CKEditor config.js file and the Font Awesome tags will stick around:

config.protectedSource.push(/<i[^>]*><\/i>/g);
What the answer doesn’t say is why this works. If you look at the documentation, it says that what you push into protectedSource is a “List of regular expressions to be executed on input HTML, indicating HTML source code that when matched, must not be available in the WYSIWYG mode for editing.” So this keeps the <i> tags intact because CKEditor’s source mode does not get HTML validation, whereas the WYSIWYG mode does.

Index Hinting

If a query cannot use an index on a table, MySQL needs to scan every row in the table in order to return the results. This is known as a full table scan. It is extremely undesirable for tables with a large row count, such as products, users, or session information, as it can take a long time to pull the data you need.


I’m still learning a bit about how indexes work, but my high-level understanding is that they create “sub-tables” (that’s my own way to think about them). A sub-table has one column for the column being indexed, ordered low to high, and a second column that contains a reference to the main table rows. For instance, given this People table:

ID First Last Age
1 Levi Jackson 25
2 Lindsay Webster 26
3 Ellie Webster 1

CREATE INDEX person_age ON People (Age);

The index would look like this in my head:

Age ID
1 3
25 1
26 2


So by creating indexes you make it easier to find results by age because it can do something like a binary search to find the row quicker. Normally the query optimizer will pick out the best index to use, but sometimes you may find yourself needing to pick the index you want the query to use, or making sure it DOES NOT use a certain index.
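Here is that “sub-table” mental model as a runnable sketch (Python, using the sample People data from above): the index is just a sorted list of (Age, ID) pairs, so a binary search can find the starting row instead of scanning everything.

```python
import bisect

# The People table, keyed by primary key ID.
people = {
    1: {"First": "Levi", "Last": "Jackson", "Age": 25},
    2: {"First": "Lindsay", "Last": "Webster", "Age": 26},
    3: {"First": "Ellie", "Last": "Webster", "Age": 1},
}

# CREATE INDEX person_age ON People (Age): the sorted "sub-table".
person_age = sorted((row["Age"], pk) for pk, row in people.items())

# WHERE Age > 20: binary search to the first entry past age 20,
# then walk right and follow the ID references back to the rows.
start = bisect.bisect_right(person_age, (20, float("inf")))
matches = [people[pk] for _, pk in person_age[start:]]
print([row["First"] for row in matches])  # ['Levi', 'Lindsay']
```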

Use Index

Tell MySQL to use a particular index. The optimizer can still choose to do a full table scan if it deems that to be quicker than using the index.

SELECT * FROM people USE INDEX(person_age) WHERE ID > 3 AND age > 20;

This will end up doing a full table scan instead of using the person_age index as it is quicker.

Force Index

Tell MySQL to use a particular index, falling back to a full table scan only if it absolutely cannot use the index to return results.

SELECT * FROM people FORCE INDEX(person_age) WHERE ID > 3 AND age > 20;

Without the FORCE INDEX, it would use the primary key to find the results.

Ignore Index

Tell MySQL not to use a particular index.

SELECT * FROM people IGNORE INDEX(person_age) WHERE ID > 3 AND age > 20;

This will use just the primary key to find the results. Without the IGNORE INDEX it would have used the person_age index to narrow the result set down a bit.


Something I found interesting when comparing USE INDEX and FORCE INDEX: the plain ol’ queries would have used the primary key by default, but USE INDEX casts aside the possibility of using the primary key and instead chooses between the key you specified and a full table scan. So USE INDEX can have quite the negative repercussions should the query optimizer find your index is not an efficient one to use.

Tab completion in TextExt

I was working with TextExt for the first time and wanted to add tab completion for efficiency. I was surprised to find it wasn’t a default option (or if it is, I couldn’t find it!). This is how I did it though.

$('#tags').on('keydown', function(e) {
    var keyCode = e.keyCode || e.which;

    if (keyCode == 9) {
        // preventDefault keeps focus in the input so you can keep adding tags
        e.preventDefault();
        $('#tags').textext()[0].tags().addTags([$('.text-selected span').text()]);
    }
});

MySQL Joins Rehashing

I was reading this post from Jeff Atwood about visually showing the different types of SQL joins. It got me thinking about the different MySQL joins and I wanted to learn more about them. So this post is my rehashing of the different types of joins. Enjoy!

I’ll be using tables like the following for my explanations:

People:

ID Name
1 Levi
2 Lindsay
3 Ellie

Places:

ID Location
1 Rhode Island
2 Vermont
3 Maine
4 New Hampshire

Inner Join

Most people have heard of an inner join before. In fact, I’d go as far as saying it’s the most commonly used join. It returns only the rows where the join condition matches in both tables, combined into one result set.

-- This will return all of the columns from both tables
SELECT * FROM tableA INNER JOIN tableB ON tableA.ID = tableB.productID;

-- This will return just the 3 fields specified
SELECT a.ID, a.Name, b.productID FROM tableA AS a INNER JOIN tableB AS b ON a.ID = b.productID;

Cross Join

A cross join will take two tables and mesh them together in every possible combination.

SELECT * FROM People CROSS JOIN Places;

The output of this query would be:

ID Name ID Location
1 Levi 1 Rhode Island
2 Lindsay 1 Rhode Island
3 Ellie 1 Rhode Island
1 Levi 2 Vermont
2 Lindsay 2 Vermont
3 Ellie 2 Vermont
1 Levi 3 Maine
2 Lindsay 3 Maine
3 Ellie 3 Maine
1 Levi 4 New Hampshire
2 Lindsay 4 New Hampshire
3 Ellie 4 New Hampshire

On the surface the Cross Join doesn’t seem very useful. I had to Google around quite a bit to find use cases for it. As best I can tell it is useful when you have a standard set of values in table 2 that pertain to all of the values from table 1, such as color options or times.

Straight Join

Normally the query optimizer will figure out the quickest order in which to join tables together, which may mean it reads the last table joined first and works back to the first table. A straight join is just a way to force MySQL to read the tables left to right. Note that MySQL spells it STRAIGHT_JOIN, with an underscore.

SELECT People.Name, Places.Location FROM People STRAIGHT_JOIN Places ON People.ID = Places.ID;

Left Join (or Left Outer Join)

Returns all of the rows from the left table along with the matching rows from the right table. If there is no match, it returns null for the right table’s columns.

SELECT People.Name, Places.Location FROM People LEFT JOIN Places ON People.ID = Places.ID;
Name Location
Levi Rhode Island
Lindsay Vermont
Ellie Maine

Right Join (or Right Outer Join)

The opposite of the Left Join, it returns all of the rows from the right table along with the matching rows from the left table.

SELECT People.Name, Places.Location FROM People RIGHT JOIN Places ON People.ID = Places.ID;
Name Location
Levi Rhode Island
Lindsay Vermont
Ellie Maine
null New Hampshire

Natural Join

A natural join is when MySQL joins on columns that have the same name in both tables. For instance, given these two tables (I’ll call them States and Towns):

States:

ID State Abbreviation
1 Rhode Island RI
2 Vermont VT
3 Maine ME
4 New Hampshire NH

Towns:

ID Town
1 Warwick
2 Burlington
3 Berwick
4 Laconia

SELECT * FROM States NATURAL JOIN Towns;

ID State Abbreviation Town
1 Rhode Island RI Warwick
2 Vermont VT Burlington
3 Maine ME Berwick
4 New Hampshire NH Laconia

Natural Left/Right Join

Same as a normal left/right join, only it uses the magic of the natural join to join on columns with the same name.


The most useful, in my opinion, are inner, left, and right joins. Inner gives you combined results excluding non-matching rows, and left and right let you easily search for rows that are null in certain spots.

Natural joining is a bit too open for my liking. I prefer to explicitly list the columns to join on.

While I understand cross joining, I don’t think it will be useful in my day to day work at this point. It seems like it has a pretty specific use, and that use isn’t very often.
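To double-check my understanding of the match behavior, here are the inner and right joins over the sample People and Places tables, sketched in Python (a model of the result sets, not of how MySQL actually executes them):

```python
people = [(1, "Levi"), (2, "Lindsay"), (3, "Ellie")]
places = [(1, "Rhode Island"), (2, "Vermont"), (3, "Maine"), (4, "New Hampshire")]

# INNER JOIN: only the rows whose IDs match in both tables.
inner = [(name, loc) for pid, name in people for qid, loc in places if pid == qid]

# RIGHT JOIN: every place, with None (null) where no person matches.
by_id = dict(people)
right = [(by_id.get(qid), loc) for qid, loc in places]

print(inner)
print(right)
```

The last row of the right join is (None, "New Hampshire"), matching the null row in the table above.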

Brainstorming Interfaces

I’ve been reading a bit about interfaces, and what it takes to come up with the best one possible. There is a lot of information to wade through, but so far one of the best suggestions I’ve seen is a slightly different brainstorming procedure than I’m used to.


A good place to start planning an interface is by sitting down with your colleagues or group members to come up with different ideas. In theory this allows multiple ideas to be presented and tweaked so that you end up with one great idea. However, once an idea takes hold within a group it can be tough to convince yourself and others that other good solutions exist. So an inadvertent side effect is that one idea takes over while other potentially good ideas go unshared.

One way to circumvent the “group think” that results from brainstorming is to add a step to the process before coming together to share ideas. This new step has everyone in the group create rough sketches/ideas for the interface before coming together. This makes sure that everyone in the group contributes ideas and it will usually result in ideas being shared that otherwise would not have been.

After each idea has been presented, the group members critique it and pick out the good parts like normal.

Crashplan Pro won’t start

I had a strange issue where CrashPlan Pro wouldn’t start up. I kept getting the error “Cannot connect to backup engine”. After reviewing some help docs from CrashPlan it was still not working. Every time I tried to start the CrashPlanService it would start and immediately stop. The log files weren’t much help; they just confirmed what I already knew: it had started and stopped for some strange reason.

A lot of the results that come up when you Google “CrashPlan Pro cannot connect to backup engine” show solutions where users were able to increase their memory from 512M to 1024M and it would resolve it. For me, when I went to C:\Program Files\CrashPlan\CrashPlanService.ini and checked what was there, it already had 1024M, so I figured maybe it wanted more. So I set it to 2048M to see if that would work. Nope.

After fruitlessly trying other things such as updating Java, turning it off and then on again, and saying pretty please, I finally caved and sent their support team a message, along with the log files from my attempts to start the service. They replied back within a few hours suggesting I try lowering the memory used from 1024M to 512M. So I gave that a shot, and to my surprise it started up perfectly! My guess is that since the computer has some other memory-intensive applications running, CrashPlan was trying to claim more memory than was available at the time.


So the final solution was to open up the file C:\Program Files\CrashPlan\CrashPlanService.ini and update -Xmx1024m to -Xmx512m.

“not valid UTF-8” in your phabricator diff?

I have gotten the “This diff includes files which are not valid UTF-8 (they contain invalid byte sequences). You can either stop this workflow and fix these files, or continue. If you continue, these files will be marked as binary.” error quite a few times lately. I tried various things like using SublimeText to save as UTF-8, and nothing seemed to work. I was convinced there was an invalid character being appended somewhere in the file on save. I stumbled across this post about stripping invalid UTF-8 characters.

Following that suggestion I tried the following:

iconv -c -f UTF8 -t UTF8 /Users/Levi/Desktop/file.php > /Users/Levi/Desktop/file.php

It kept spitting out an empty file though. After a bit of head scratching I realized the redirection was truncating the file it was reading from: the shell creates a new blank file.php before iconv even runs, so iconv was reading from, encoding, and saving a blank file. So what you need to do is save the output to a new file name.

iconv -c -f UTF8 -t UTF8 /Users/Levi/Desktop/file.php > /Users/Levi/Desktop/file-new.php

After all that you may be thinking that fixed it, right? Nope, I was still getting the same error message, so I decided to do some debugging and try to isolate the line arcanist had an issue with. After var_dump’ing down the path the diff takes in libphutil and arcanist, I found out the reason I was continually getting the error was that the string it was sending to Phabricator was the git diff of the file. So even though I had removed the invalid character in the new revision, the old revision still had the invalid character.

So how do you get around this?

The only way I could get around this issue was to:
– edit the original file, remove the invalid character
– push that live
– merge that branch in with my dev branch
– arc diff

So after all the debugging the issue was not with the current file, but with the original file I was editing… which was to say the least a little frustrating to find out. So if the original file has an invalid character, your best bet is to edit the original to remove that character and push that live before diff’ing your branch.

Invalid UTF-8 Sitemap contains C0 or C1 control codes

I came across “Invalid UTF-8 Sitemap contains C0 or C1 control codes” in Webmaster Tools the other day. The sitemaps had recently been changed from a normal sitemap.xml to a sitemap.xml.gz (the compressed version).

I looked through the uncompressed version of the sitemap and didn’t see anything out of the ordinary characters-wise. When I ran it through some UTF-8 checks it came back as valid, so I was at a loss as to what was causing the error.

What do you do when you need some answers? Google it! The only post I found with that exact error message had 0 replies, and it wasn’t from someone having a PHP problem. After broadening my search a little bit, I came across this useful stackoverflow post that suggested fopen/fwrite/fclose were appending characters that were not being encoded. Switching to gzopen, gzwrite, and gzclose worked perfectly for me. gzwrite is binary-safe, so my best guess is that fwrite was including a PHP_EOL in the file that was causing the error.

$xml = '<?xml version="1.0" encoding="UTF-8"?>' . $sitemapBody; // $sitemapBody holds the rest of the sitemap markup
$gz = gzopen('sitemap.xml.gz', 'w9');
gzwrite($gz, $xml);
gzclose($gz);
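The same write in Python form, mainly to show the binary-safe round trip with no stray bytes appended (the file path and XML body here are placeholders):

```python
import gzip
import os
import tempfile

# Placeholder path and body; the real sitemap content comes from the database.
path = os.path.join(tempfile.gettempdir(), "sitemap.xml.gz")
xml = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset></urlset>'

# The gzopen/gzwrite/gzclose equivalent: a compressed, binary-safe write.
with gzip.open(path, "wt", encoding="utf-8") as gz:
    gz.write(xml)

# Read it back to confirm the content round-trips exactly, byte for byte.
with gzip.open(path, "rt", encoding="utf-8") as gz:
    print(gz.read() == xml)  # True
```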