Saturday, October 30, 2010

FREE! PHP IMDb Scraper/API for new IMDb Template


IMDb is arguably the leading source of movie and TV information and a top web scraping target for movie lovers around the world. Unfortunately, IMDb does not provide an API to access its database, so web scraping is the only resort for us. PHP, one of the most commonly used web development languages, makes web scraping easy with the power of PCRE (Perl Compatible Regular Expressions).
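
To give a feel for what PCRE-based scraping looks like, here is a minimal sketch. It is not taken from the scraper itself: the matchFirst() helper and the sample markup are made up for illustration, and only preg_match() is a real PHP function.

```php
<?php
// Sample markup standing in for a downloaded page; not real IMDb output.
$html = '<html><head><title>The Godfather (1972) - IMDb</title></head><body></body></html>';

/**
 * Return the first capture group of $pattern in $subject, or null when there is no match.
 * (Hypothetical helper, named here only for the example.)
 */
function matchFirst($pattern, $subject)
{
    if (preg_match($pattern, $subject, $matches) === 1 && isset($matches[1])) {
        return trim($matches[1]);
    }
    return null;
}

// "s" lets "." span newlines, "i" ignores tag case, ".*?" keeps the match non-greedy.
$pageTitle = matchFirst('~<title>(.*?)</title>~is', $html);

// Pull a four-digit year out of the parentheses in the page title.
$year = matchFirst('~\((\d{4})\)~', (string) $pageTitle);

echo $pageTitle, "\n";
echo $year, "\n";
```

Returning null on a failed match keeps the calling code simple when a page layout changes and a pattern stops matching.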

For my recent project on a Movie Catalog (http://movies.abhinayrathore.com), I needed an IMDb scraper and found one built by Tyler Hall. His version was not robust enough to scrape all kinds of movie pages, so I extended it to support different types of titles. Recently, however, IMDb changed its page template and most of the old scrapers stopped working, including mine. So I modified my scraper to accommodate the new template and considered it my moral responsibility to contribute it back to the developer community.

This new scraper is robust enough to handle a wide variety of template variations. Apart from the regular information, it also digs deeper to collect extra media images and release dates.

Click here for a Demo

Last Updated: Feb 1, 2014

Major changes in the Feb 20, 2013 version:

  1. The scraper now uses the combined information page to get the data. This page doesn't change very often, and it gives us the complete list for each individual department.
  2. Added a few more entities: producers, musicians, cinematographers, editors etc. Removed metascore information. Removed small poster URL.
  3. You can now pass a second boolean parameter to the getMovieInfo() and getMovieInfoById() functions to disable the extra information. It defaults to true, which fetches the extra information and may slow down the scraping. If you don't need all the extra info like Storyline, Release Dates, Recommendations or Media Images, just pass false as the second parameter to these methods. Example: $movieArray = $imdb->getMovieInfo("The Godfather", false);
  4. Information for individuals in the list of directors, cast, writers etc. is now in an associative array with key being the IMDb id of the individual.
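
As a rough sketch of how such an id-keyed list could be consumed, the snippet below walks a hand-written sample array. The exact shape is an assumption: I am treating each entry as IMDb id => name, and personLinks() is a hypothetical helper, not part of the scraper.

```php
<?php
// Illustrative sample only: assumes each person list maps IMDb id => name.
$movieArray = array(
    'TITLE' => 'The Godfather',
    'CAST'  => array(
        'nm0000008' => 'Marlon Brando',
        'nm0000199' => 'Al Pacino',
    ),
);

/**
 * Turn an id => name list into "Name (profile URL)" strings.
 * (Hypothetical helper for this example.)
 */
function personLinks(array $people)
{
    $links = array();
    foreach ($people as $imdbId => $name) {
        // The array key doubles as the path segment of the person's IMDb page.
        $links[] = $name . ' (http://www.imdb.com/name/' . $imdbId . '/)';
    }
    return $links;
}

// Fall back to an empty list when the key is missing.
$cast = isset($movieArray['CAST']) ? $movieArray['CAST'] : array();
echo implode("\n", personLinks($cast)), "\n";
```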

UPDATE:
As some of you might have noticed, Google is preventing automated script access to its search result pages. I have created 2 search functions for Google and Bing so you can use whichever one works best for you. I have converted the code to use Bing as of now and will look for other alternatives if we run into some hurdles. Keep me updated if you have any better ideas :)
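
Whichever search engine is used, the useful part of the result page is the first link to an IMDb title. The sketch below shows one way that step might look; findImdbTitleId() and the sample markup are my own illustration, not the scraper's actual search code, and real result pages will look different.

```php
<?php
/**
 * Find the first IMDb title id (tt followed by digits) linked from a block of HTML.
 * Returns null when no title link is present.
 * (Hypothetical helper for this example.)
 */
function findImdbTitleId($html)
{
    // Matches links such as http://www.imdb.com/title/tt0068646/
    if (preg_match('~imdb\.com/title/(tt\d+)~i', $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Hand-written stand-in for a search result page; real markup will differ.
$searchHtml = '<li><a href="http://www.imdb.com/title/tt0068646/">The Godfather (1972)</a></li>';
echo findImdbTitleId($searchHtml), "\n";
```

Because the pattern only looks for the title URL, the same function should work regardless of which search engine produced the HTML.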

Here is a list of all the attributes it scrapes from the IMDb page:

  1. TITLE_ID
  2. TITLE
  3. YEAR
  4. RATING
  5. GENRES
  6. STARS
  7. DIRECTORS
  8. WRITERS
  9. CAST
  10. PRODUCERS
  11. MUSICIANS
  12. CINEMATOGRAPHERS
  13. EDITORS
  14. ALSO_KNOWN_AS
  15. RELEASE_DATE
  16. RELEASE_DATES
  17. PLOT
  18. POSTER
  19. POSTER_LARGE
  20. RUNTIME
  21. TOP_250
  22. OSCARS
  23. AWARDS
  24. NOMINATIONS
  25. STORYLINE
  26. TAGLINE
  27. MEDIA_IMAGES
  28. MPAA_RATING
  29. VOTES
  30. RECOMMENDED_TITLES
  31. VIDEOS

How to use this PHP Scraper?
Include the class file on your PHP page:
include("imdb.php");
Instantiate the class and get the results in an array:
$imdb = new Imdb();
$movieArray = $imdb->getMovieInfo("The Godfather");
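
Not every title has every attribute, so it is worth guarding each lookup. Here is a small sketch of that idea using a hand-written sample array in place of a live getMovieInfo() call; summarizeMovie() is a hypothetical helper, and only the key names come from the attribute list above.

```php
<?php
/**
 * Build a one-line summary from a result array, tolerating missing keys.
 * (Hypothetical helper for this example.)
 */
function summarizeMovie(array $movie)
{
    $title  = isset($movie['TITLE']) ? $movie['TITLE'] : 'Unknown title';
    $year   = isset($movie['YEAR']) ? ' (' . $movie['YEAR'] . ')' : '';
    $rating = isset($movie['RATING']) ? ', rated ' . $movie['RATING'] : '';
    return $title . $year . $rating;
}

// Sample values for illustration; in practice this array would come from getMovieInfo().
$movieArray = array('TITLE_ID' => 'tt0068646', 'TITLE' => 'The Godfather', 'YEAR' => '1972');
echo summarizeMovie($movieArray), "\n";
```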

You can try this scraper on my lab page: http://lab.abhinayrathore.com/imdb/

To download the PHP Source Code directly use this link: http://lab.abhinayrathore.com/imdb/imdb_php.htm

Fork it on GitHub: https://github.com/abhinayrathore/PHP-IMDb-Scraper

Example usage: http://lab.abhinayrathore.com/imdb/usage.htm

Proxy script for downloading or displaying Media images on your website: http://lab.abhinayrathore.com/imdb/imdbImage.txt

To implement your own IMDb Web Service API to return data in XML, JSON or JSONP format, use this script along with the API: http://lab.abhinayrathore.com/imdb/imdbWebService.htm
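
If you just want to see the idea behind the JSON and JSONP output, here is a minimal sketch. It is not the linked web service script: encodeResponse() is a hypothetical function, and the rule for which callback names are accepted is my own assumption.

```php
<?php
/**
 * Encode $data as JSON, or as JSONP when a safe callback name is supplied.
 * (Hypothetical function for this example.)
 */
function encodeResponse(array $data, $callback = null)
{
    $json = json_encode($data);

    // Accept only plain identifiers as callback names so arbitrary script cannot be injected.
    if (is_string($callback) && preg_match('~^[A-Za-z_][A-Za-z0-9_]*\z~', $callback) === 1) {
        return $callback . '(' . $json . ');';
    }

    // No callback, or an unsafe one: fall back to plain JSON.
    return $json;
}

echo encodeResponse(array('TITLE' => 'The Godfather')), "\n";
echo encodeResponse(array('TITLE' => 'The Godfather'), 'showMovie'), "\n";
```

In a real endpoint you would also send a matching Content-Type header (application/json or application/javascript) before printing the body.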

To implement IMDb.com's search suggestions on your website, please follow this post: http://web3o.blogspot.com/2011/10/imdb-search-suggestions-with-jquery.html

If you find any part of this scraper broken or incorrect, please drop a comment here and I’ll try to fix it as soon as possible.

IMDb has an anti-leeching (hotlinking) policy in place for media images, so you may not be able to use some of the image URLs directly on your website. As a workaround you can use a PHP proxy to display or download those images. I’ve written a small proxy script to grab the images: http://lab.abhinayrathore.com/imdb/imdbImage.txt. To use this script you just need to pass the URL-encoded image URL as a request parameter:
<img src="imdbImage.php?url=<?=urlencode($url)?>" />
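
A proxy that fetches any URL it is given can be abused, so it may be worth checking the requested URL first. The sketch below shows one possible check. It is not part of my proxy script: isAllowedImageUrl() is hypothetical, and the host in the allow list is only an assumption that you should replace with the image hosts you actually see.

```php
<?php
/**
 * Decide whether $url is an http(s) URL on one of the allowed hosts.
 * (Hypothetical function; $allowedHosts is an allow list you maintain yourself.)
 */
function isAllowedImageUrl($url, array $allowedHosts)
{
    $parts = parse_url($url);
    // Reject anything that does not parse into a scheme and a host.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    // Only plain web URLs; this rules out file://, ftp:// and similar schemes.
    if (!in_array(strtolower($parts['scheme']), array('http', 'https'), true)) {
        return false;
    }
    return in_array(strtolower($parts['host']), $allowedHosts, true);
}

$allowed = array('ia.media-imdb.com'); // assumed image host, adjust as needed
var_dump(isAllowedImageUrl('http://ia.media-imdb.com/images/M/example.jpg', $allowed));
var_dump(isAllowedImageUrl('file:///etc/passwd', $allowed));
```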

NOTE: For users outside the USA
IMDb will automatically redirect you to titles listed in the language used for release in your country.
To see films listed under their original titles regardless of your country, you will have to modify this script to scrape the titles from http://akas.imdb.com, because http://www.imdb.com will automatically redirect you to your country-specific title page.
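
One simple way to make that change is to rewrite the host of each URL before fetching it. Here is a minimal sketch; toAkasUrl() is a hypothetical helper, not something the scraper ships with.

```php
<?php
/**
 * Point an IMDb URL at akas.imdb.com instead of www.imdb.com.
 * URLs on any other host are returned unchanged.
 * (Hypothetical helper for this example.)
 */
function toAkasUrl($url)
{
    // Replace only the host part, and only when it is exactly www.imdb.com.
    return preg_replace('~^(https?://)www\.imdb\.com(?=/|$)~i', '$1akas.imdb.com', $url);
}

echo toAkasUrl('http://www.imdb.com/title/tt0068646/'), "\n";
```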

Happy Scraping :)