Screen Scraping for RSS

I like to read online comics. Unfortunately, some of them do not publish RSS feeds, which is really annoying. I ranted about this on Monday. But hey, if they won’t make one, I will do it for them.

I wrote a nice little Perl script that screen scrapes a page for an image and then generates an RSS feed. It requires the WWW::Mechanize and XML::RSS modules, which can be downloaded from CPAN or some other repository.

How does it work? You simply call it with:

perl grab.pl url pattern

Here url is the URL of your web comic, and pattern is some string that is unique to the URL of the actual comic image. For example, extralife is easy because the front page image is always current.gif (you can use this as the pattern). DorkTower, on the other hand, uses variable image names, but all the pictures are stored in the /comics/dorktower/images/comics/ directory. Furthermore, none of the advertisement or background images are stored in a directory called comics, so I picked “comics” as the pattern.
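The matching idea can be sketched roughly like this. It is only an illustration, not the actual grab.pl: the helper name pick_image and the sample image URLs are made up, and it assumes the pattern is treated as a literal substring (the real script may treat it as a regex instead).

```perl
use strict;
use warnings;

# Hypothetical helper: return the first image URL that contains $pattern
# as a literal substring, or nothing if no image URL matches.
sub pick_image {
    my ($pattern, @image_urls) = @_;
    for my $src (@image_urls) {
        return $src if index($src, $pattern) >= 0;
    }
    return;
}

# Made-up image URLs standing in for the <img> tags found on a page.
my @images = (
    '/images/banner.gif',
    '/ads/sponsor.jpg',
    '/comics/dorktower/images/comics/strip123.jpg',
);

my $hit = pick_image('comics', @images);
print defined $hit ? "$hit\n" : "no match\n";
```

On a live page, WWW::Mechanize can do a comparable lookup with find_image and its url_regex criterion, so a helper like this is only needed to see the rule in isolation.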

Essentially, you have to look closely at the source of the page you are scraping once, and pick a good pattern argument. The feed is created in the same directory as the script. To generate the file name, I drop the http:// part from the URL, remove all the slashes and append .xml at the end. I could add another optional argument to specify the feed name, but I don’t really care about it. Feel free to do it yourself.
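The file naming rule described above could look something like this. Again, this is a sketch under assumptions rather than the script's actual code: the function name feed_filename and the example.com URL are placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a comic URL into a feed file name by
# dropping the leading http://, deleting every slash, and appending .xml.
sub feed_filename {
    my ($url) = @_;
    (my $name = $url) =~ s{^http://}{};   # drop the scheme prefix
    $name =~ tr{/}{}d;                    # remove all remaining slashes
    return "$name.xml";
}

# Placeholder URL, used only to show the transformation.
print feed_filename('http://www.example.com/comics/'), "\n";
# prints www.example.comcomics.xml
```

Note that this only strips http://, exactly as described; an https:// URL would keep its scheme in the file name, colon included, minus the slashes.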

Just a side note: if you plan to run this on Windows with ActiveState Perl and you use ppm for your module management, make sure you get WWW::Mechanize 1.4 or higher. The 0.72 package that can be downloaded from the ActiveState repository does not support the find_image function I’m using.

You might want to add http://theoryx5.uwinnipeg.ca/ppms/ to the ppm repository list. You can download a more recent version from there.

One Response to “Screen Scraping for RSS”

  1. Shahriar Hyder Says:

    Here is a post regarding techniques for ‘Scraping your way to RSS feeds’ albeit in a non-programmatic (layman) way:

    http://technosiastic.wordpress.com/2009/04/08/scraping-your-way-to-rss-feeds/