Parsing the IMDB movie list

I’ve been working on a personal project to help index the movies I own so it’s easier to search and pick a movie when we feel like watching one. The site will allow you to add movies by name, and it will auto-populate the release date, actors, plot, and any other pertinent info.

What I found after some research is that there are not many options for searching the IMDB movie database, and certainly nothing official. After a few “close but no cigar” moments with some APIs others had built, I was determined to make use of what IMDB does provide. IMDB provides text files with movie, actor, actress, and other info. You might be thinking, “Levi, that is awesome, why didn’t you start with those?!” Well, if you take a peek at them you will quickly find out the trouble with working with them.

  1. They do not have IDs to tie things together; they rely on the titles. So that means you must rely on the movie title to match all of the bits of info together. That isn’t such a problem until you realize there is also a “movie AKA” file, so each movie can be known by more than one name.
  2. The data can be hard to figure out sometimes. For example:
    • “2010-11 Regular Season” (2007) {Hurricanes vs. Red Wings: November 21, 2013 (#2013.17)} 2013 – what this means is that the title is from 2010-11, it was released in 2007, and it is from season 2013 episode 17. Confusing, right?
    • Inherit the Earth (????) 2004 – the release date is normally in the parentheses. In this example the parentheses only hold question marks, and the year is at the end of the line instead.
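
To make the first problem concrete, here is a minimal sketch of joining records on title when a movie can go by more than one name. Everything in it is an assumption for illustration: the sample records are made up, the arrays do not reflect the real file layouts, and canonical_title() and find_plot() are hypothetical helper names.

```php
<?php
// Hypothetical data: none of these records come from the real IMDB files.
// Plot summaries keyed by the primary title, since there is no ID to use.
$plots = [
    'Example Movie (2001)' => 'A made-up plot summary.',
];

// Alternate title => primary title, as a "movie AKA" file might provide.
$akas = [
    'Exemple de film (2001)' => 'Example Movie (2001)',
];

// Resolve any known title to the primary title used as the join key.
function canonical_title(string $title, array $akas): string
{
    return $akas[$title] ?? $title;
}

// Look up a plot by primary or alternate title; null when nothing matches.
function find_plot(string $title, array $plots, array $akas): ?string
{
    return $plots[canonical_title($title, $akas)] ?? null;
}

$byAka     = find_plot('Exemple de film (2001)', $plots, $akas);
$byPrimary = find_plot('Example Movie (2001)', $plots, $akas);
```

Both lookups land on the same record, which is the extra step the missing IDs force on every join.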

 

After a lot of work this is the regular expression I ended up using to match as many titles as I could. I’m sure I’ll still need to tweak it over time to get it 100% working.

/^([\s\S]*)\(([\d{4}]*|\?*)(?:\/)?([\w]*)?\)(\s*{([\w!\s:;\/\.\-\'"?`_&@$%^*<>~+=\|\,\(\)]*)(\s*\(#([\d]*)\.([\d]*)\))?})?\s*([\d{4}]*)?(?:-)?([\d{4}]*)?/iu

It’ll probably be easiest to understand this using an example movie title.

“Blaulicht” (1959) {Das Gitter (#4.1)}1962

([\s\S]*) – this matches any character, whitespace or not, and captures the title. It is greedy, so it grabs as much as it can and then backs off until the rest of the pattern fits.
\( – this ends that first group at an opening parenthesis, normally signifying the start of a release date.
([\d{4}]*|\?*) – this matches the year, or a run of question marks when the year is unknown. Note that [\d{4}]* is a character class, so it accepts any run of digits (and literal braces) rather than exactly four digits; \d{4} would be the stricter form.
(?:\/)? – sometimes the year was entered as (2004/I) or (2005/IV). This allows an optional forward slash after the year.
([\w]*)? – Along with the previous item, this accommodates finding characters after the year. Still not sure exactly what that means though.
\) – this just signifies the end of the release date.
(\s*{([\w!\s:;\/\.\-\'"?`_&@$%^*<>~+=\|\,\(\)]*) – this looks way more complex than it is (it goes along with the next bit, so look at those as a whole as opposed to separate chunks). It looks for optional whitespace after the year and then it matches the content inside the curly brace until it hits a pound sign signaling an episode.
(\s*\(#([\d]*)\.([\d]*)\))?})? – This matches the season and episode number if it runs into #, and then the closing curly brace. The episode name itself is captured by the previous bit.
\s* – just allows for as many spaces as needed between this and anything after.
([\d{4}]*)? – these last 3 go hand in hand. They match an optional year range. Some of the shows have a year range for how long they have been running.
(?:-)?
([\d{4}]*)?
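
Putting the pieces together, here is a minimal sketch of how the pattern could be applied with preg_match. The pattern is the one from above, unchanged. The parse_title_line() helper and its field names are hypothetical, and the sample lines assume the raw files use straight quotes rather than the curly ones shown in this post.

```php
<?php
// The pattern from the post, kept verbatim. A nowdoc avoids having to
// escape the quotes, backslashes, and "$" it contains.
$pattern = <<<'REGEX'
/^([\s\S]*)\(([\d{4}]*|\?*)(?:\/)?([\w]*)?\)(\s*{([\w!\s:;\/\.\-\'"?`_&@$%^*<>~+=\|\,\(\)]*)(\s*\(#([\d]*)\.([\d]*)\))?})?\s*([\d{4}]*)?(?:-)?([\d{4}]*)?/iu
REGEX;

// Hypothetical helper: maps the numbered capture groups to named fields.
function parse_title_line(string $pattern, string $line): ?array
{
    if (preg_match($pattern, $line, $m) !== 1) {
        return null; // no match, or a PCRE error such as invalid UTF-8
    }
    return [
        'title'      => trim($m[1]),       // everything before "(year)"
        'year'       => $m[2],             // "1959", or "????" when unknown
        'suffix'     => $m[3] ?? '',       // the "I" in "(2004/I)"
        'episode'    => trim($m[5] ?? ''), // text inside { } before "(#"
        'season'     => $m[7] ?? '',
        'episode_no' => $m[8] ?? '',
        'from_year'  => $m[9] ?? '',       // first year after the title
        'to_year'    => $m[10] ?? '',      // end of a range such as 1962-1965
    ];
}

// Sample lines taken from the examples in this post.
$info    = parse_title_line($pattern, '"Blaulicht" (1959) {Das Gitter (#4.1)}1962');
$unknown = parse_title_line($pattern, 'Inherit the Earth (????) 2004');
```

The title and episode name come back with stray spaces around them, which is why the sketch trims both. Calling print_r($info) is an easy way to check the fields against the breakdown above.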

 

The last important bit is the modifiers on the pattern passed to preg_match. “i” makes the match case-insensitive and “u” makes it treat the pattern and the text as UTF-8. Without “u” it fails on some foreign characters not in ISO-8859-1.
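
Here is a small sketch of what “u” changes. It uses a simplified stand-in for the episode-name part of the pattern, not the full expression, and the title line is made up. One thing worth knowing: with “u” the text must be valid UTF-8, so a line still in another encoding makes preg_match return false rather than 0.

```php
<?php
// Simplified stand-in for the episode-name part of the full pattern.
$withU = '/\{([\w\s]*)\}/iu';

// Hypothetical title line; this file must be saved as UTF-8.
$line = '"Série" (2001) {Épisode un}';

// With "u", PHP makes \w Unicode-aware, so the accented "É" is
// treated as a letter and the whole episode name is captured.
$matchedWithU = preg_match($withU, $line, $m);
$episodeName  = $m[1] ?? null;

// "u" also requires valid UTF-8. A lone Latin-1 byte (0xC9 is "É" in
// ISO-8859-1) makes preg_match() return false, and preg_last_error()
// reports the reason.
$latin1Line = "{\xC9pisode un}";
$result     = preg_match($withU, $latin1Line);
$isBadUtf8  = ($result === false
    && preg_last_error() === PREG_BAD_UTF8_ERROR);
```

Because false and 0 are easy to confuse in an if statement, comparing the return value with === is the safer habit when “u” is in play.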
