Kunal Anand's XML Exam, Question 4 · 899 words posted 04/21/2006 06:40 PM

Earlier this month Kunal Anand posted Some XML Exam Questions designed as “fun and practical bedtime exercises.” Hey buddy, bedtime’s all about The Daily Show with Jon Stewart, but afternoons, ah… afternoons were made for XML.

Here’s question #4:

Scrape a dynamic list from a web site (i.e. the Google Zeitgeist) and serialize a well-formed Atom feed.

The only other requirement: you can only implement the solution using Perl, Python, or Ruby. While I’m learning Ruby as part of picking up Rails, I have to admit I’ve never coded in Perl other than tweaking the occasional Movable Type script, so a new (to me) language seemed like a fun way to solve the problem.

A lot of you out there will roll your eyes because this is so obvious, but Perl rocks! Trust me: if you earn your living coding PHP, or ColdFusion, or ActionScript, or C#, spend an afternoon with Perl. All of the problems are already solved. There’s a Perl library for everything under the sun.

The short answer to Kunal’s question: Simon Cozens has already solved the problem for us. See Painless RSS with Template::Extract, Hack #24 in O’Reilly’s excellent Spidering Hacks. But the HTML Simon extracts is simpler than the markup I wanted to scrape, and his script generates RSS instead of Atom.

So here, step by step, is one way to scrape content from a page and atomize it. My solution is based on Simon’s code, tweaking and building on it when necessary. If you’re new to Perl, download the code and follow along. If you’re an old Perl hand, this might be boring. Skip it and read about converting illuminated Persian manuscripts into Flash applications instead.

See lines 34-46 of the script for the template I came up with to match the New York Times popular searches widget. The key: once we’ve found the first ordered list inside the mostSearched div (lines 36 and 37), we loop through the list items and populate an array called records. Each item is simple. We only need two fields: the url, which links to the chosen search on the NYT site, and the query, or search string.

my $data = $x->extract($template, $page);
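For readers without the script in front of them, here is a minimal sketch of how such a template and the extract call might fit together. The HTML sample, the example.org URLs, and the exact template text are assumptions made up for illustration; only the mostSearched div, the records array, and the url and query fields come from the description above. The real page markup and the real template will differ.

```perl
use strict;
use warnings;
use Template::Extract;   # CPAN module, not part of core Perl

# Hypothetical stand-in for the fetched page; the real markup will differ.
my $page = join '',
    '<html><body><div id="mostSearched"><h4>Most Searched</h4><ol>',
    '<li><a href="http://example.org/search?query=perl">perl</a></li>',
    '<li><a href="http://example.org/search?query=atom">atom</a></li>',
    '</ol></div></body></html>';

# Template Toolkit syntax used in reverse: [% ... %] skips over text,
# FOREACH collects one hash per list item, named fields capture values.
my $template = join '',
    '<div id="mostSearched">[% ... %]<ol>',
    '[% FOREACH records %]',
    '<li><a href="[% url %]">[% query %]</a></li>',
    '[% END %]',
    '</ol>';

my $x    = Template::Extract->new;
my $data = $x->extract($template, $page);
die "template did not match the page\n" unless $data;

# Each record is a hash reference with url and query keys.
for my $record (@{ $data->{records} }) {
    print "$record->{query} => $record->{url}\n";
}
```

The sample is kept on one logical line so that the template and the page agree character for character; with a real page, the literal text in the template has to mirror the page's markup just as closely.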

The hard work is done; all that’s left is to loop through the $data and output it as an Atom feed (Simon’s original tutorial used RSS but the logic is the same).
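That loop might look roughly like the sketch below. It assumes a current release of XML::Atom::SimpleFeed (the interface of the 2006 release may differ), and the feed title, author, ids, and example.org URLs are placeholders rather than values from the original script.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use XML::Atom::SimpleFeed;   # CPAN module, not part of core Perl

# Hypothetical sample shaped like the structure extract() returns.
my $data = { records => [
    { url => 'http://example.org/search?query=perl', query => 'perl' },
    { url => 'http://example.org/search?query=atom', query => 'atom' },
] };

# Atom timestamps use the RFC 3339 format; UTC keeps it simple.
my $now = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime);

# Feed-level metadata (all placeholder values).
my $feed = XML::Atom::SimpleFeed->new(
    title   => 'Most Searched',
    link    => 'http://example.org/',
    id      => 'http://example.org/most-searched',
    author  => 'example.org',
    updated => $now,
);

# One entry per scraped record; the search URL doubles as the entry id.
for my $record (@{ $data->{records} }) {
    $feed->add_entry(
        title   => $record->{query},
        link    => $record->{url},
        id      => $record->{url},
        updated => $now,
    );
}

my $xml = $feed->as_string;
print $xml;
```

Using the search URL as the entry id is a shortcut for the sketch; a real feed would want ids that stay stable even if the site changes its URL scheme.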

To run the script yourself, install the required modules, open your terminal and type:

perl atomize_nyt.pl

And that’s it.

Footnote: I’ll parse Kunal’s requirements like any good J.D. should: he said the feed had to be well-formed, but he didn’t say it had to be valid. I typically use RSS and not Atom, and only after I wrote my script did I discover that the library I used, XML::Atom::SimpleFeed, produces an older flavor of Atom which, while well-formed XML, is no longer favored. For a valid feed, use the XML::Atom module instead.

* * *


1. On Apr 24, 09:34 AM jim collins said:

“The only other requirement: you can only implement the solution using Perl, Python, or Ruby.” Why this arbitrary choice? I have an open-source ColdFusion tool, Crouton, that’s designed to do exactly this.
There are plenty of Java libraries too, and you don’t have to deal with hard-to-read Perl code. My feelings about Perl are best expressed here:
http://www.underlevel.net/jordan/erik-perl.txt


2. On Apr 24, 10:24 AM since1968 said:

“Why this arbitrary choice?”

You’d have to ask Kunal. I took it as a fun challenge.
