Kunal Anand's XML Exam, Question 4 · 899 words posted 04/21/2006 06:40 PM
Earlier this month Kunal Anand posted Some XML Exam Questions designed as “fun and practical bedtime exercises.” Hey buddy, bedtime’s all about The Daily Show with Jon Stewart, but afternoons, ah… afternoons were made for XML.
Here’s question #4:
Scrape a dynamic list from a web site (i.e. the Google Zeitgeist) and serialize a well-formed Atom feed.
The only other requirement: you can only implement the solution using Perl, Python, or Ruby. While I’m learning Ruby as part of picking up Rails, I have to admit I’ve never coded in Perl other than tweaking the occasional MovableType script, so a new (to me) language seemed like a fun way to solve the problem.
A lot of you out there will roll your eyes because this is so obvious, but Perl rocks! Trust me: if you earn your living coding PHP, or ColdFusion, or ActionScript, or C#, spend an afternoon with Perl. All of the problems are already solved. There’s a Perl library for everything under the sun.
The short answer to Kunal’s question: Simon Cozens has already solved the problem for us. See Painless RSS with Template::Extract, Hack #24 in O’Reilly’s excellent Spidering Hacks. But the HTML Simon extracts is simpler than the one I wanted to extract, and he generates RSS instead of Atom.
So here, step by step, is one way to scrape content from a page and atomize it. My solution is based on Simon’s code, tweaking and building on it when necessary. If you’re new to Perl, download the code and follow along. If you’re an old Perl hand, this might be boring. Skip it and read about converting illuminated Persian manuscripts into Flash applications instead.

- First step: find some content to scrape. I picked the New York Times list of Most Popular Searches on their home page because the Times’ HTML is clean, and as far as I know their popular searches are not available as an RSS feed.
- View the page source. You’ll see that there’s an HTML comment, MOST POPULAR MODULE STARTS, and a div with the id tabsContainer. Within that div there’s another nested div called mostSearched, which in turn holds an unordered list of search terms. That’s what we want to extract. Now let’s turn to the Perl code.
- Lines 15-19 import Perl modules, or chunks of code that do a lot of the heavy lifting for us. If a module isn’t obvious I’ll address it in the steps below. If you’re learning Perl and want to follow along, you’ll need to install these modules. This Google link can point you in the right direction.
- At lines 21-22, we grab the entire home page of the New York Times, or throw an error message if for some reason we can’t reach it. We then clean up the HTML so Perl doesn’t choke on it later; the regular expressions that do this, on lines 24-26, are lifted directly from Simon’s script.
- Next, we have to figure out how to extract the list of popular search terms from the HTML and discard the rest. Fortunately, the Perl module Template::Extract will do exactly that. Think of Template::Extract as a simplified regular expression library for HTML. Simon’s tutorial walks you through the steps of matching a pattern using a template, but our HTML is slightly more complex: where Simon was able to find a page with a single looping set of elements, we have some different challenges:
  - We can’t just grab every unordered list element on the page; there may be dozens, and most won’t be popular searches.
  - Even after we’ve found the right unordered list, we have to match a pattern for each element within the list.
See lines 34-46 for the template I came up with to match the popular searches widget. The key: once we’ve found the first unordered list beginning inside the mostSearched div (lines 36 and 37), we loop through the contents of the list and populate an array called records. The contents of the list are simple: we only need to extract a url, which will allow us to click a link and conduct the chosen search on the NYT site, and the query, or search string.
- But we don’t have the data yet: once we’ve created a template and grabbed a web page, we have to apply the template to the page (line 49):
my $data = $x->extract($template, $page);
The hard work is done; all that’s left is to loop through the $data and output it as an Atom feed (Simon’s original tutorial used RSS but the logic is the same).
- Lines 59-76 create an Atom feed and populate the feed with the query and url data extracted from the NYT home page. Line 76 outputs a well-formed feed to the terminal window; you could just as easily save it to a file.
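The clean-up and extraction steps above can be sketched roughly as follows. This is not the downloadable script: the sample markup, the helper names tidy_html and extract_searches, and the cleanup regular expressions are all assumptions made for illustration, and the real NYT markup will differ. Only the extract call itself is taken from the post.

```perl
use strict;
use warnings;
use Template::Extract;

# Assumed cleanup (not Simon's exact regexes): strip whitespace between
# tags so the template does not depend on the page's indentation.
sub tidy_html {
    my ($html) = @_;
    $html =~ s/\r\n?/\n/g;    # normalize line endings
    $html =~ s/>\s+</></g;    # drop whitespace between adjacent tags
    $html =~ s/\s+/ /g;       # collapse remaining whitespace runs
    return $html;
}

# Template: start at the mostSearched div, skip ahead to the first <ul>
# after it, then capture url and query for every list item.
my $template = join '',
    '<div id="mostSearched">[% ... %]<ul>',
    '[% FOREACH records %]',
    '<li><a href="[% url %]">[% query %]</a></li>',
    '[% END %]',
    '</ul>';

# Returns a list of hash references, each with url and query keys,
# or an empty list if the template does not match.
sub extract_searches {
    my ($html) = @_;
    my $data = Template::Extract->new->extract($template, tidy_html($html));
    return $data ? @{ $data->{records} || [] } : ();
}

# Hypothetical stand-in for the fetched page; in a real script this
# string would come from an HTTP fetch (for example LWP::Simple's get).
my $sample = <<'HTML';
<ul><li><a href="/ignored">not a search</a></li></ul>
<div id="mostSearched">
  <h4>Most Searched</h4>
  <ul>
    <li><a href="http://example.com/search?q=perl">perl</a></li>
    <li><a href="http://example.com/search?q=atom">atom feeds</a></li>
  </ul>
</div>
HTML

my @records = extract_searches($sample);
print "$_->{query} => $_->{url}\n" for @records;
```

Because the template begins at the mostSearched div, the unrelated list earlier in the sample is never considered, which is how the sketch handles the "dozens of lists" problem described above.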
To run the script yourself, install the required modules, open your terminal and type:
perl atomize_nyt.pl
And that’s it.
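For the feed-building step, here is a minimal sketch using XML::Atom::SimpleFeed. It assumes a current release of the module, whose output may differ from the version the original script used. The feed title, author, example.com URLs, and the hard-coded @records are placeholders standing in for the data extracted from the page.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use XML::Atom::SimpleFeed;

# Hypothetical records, shaped like the url/query pairs the extraction
# step produces.
my @records = (
    { url => 'http://example.com/search?q=perl', query => 'perl' },
    { url => 'http://example.com/search?q=atom', query => 'atom feeds' },
);

# Atom timestamps use the W3C date-time format; UTC is used here.
my $now = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime);

# Feed-level metadata (all values are placeholders).
my $feed = XML::Atom::SimpleFeed->new(
    title   => 'Most Popular Searches',
    link    => 'http://example.com/',
    id      => 'http://example.com/',
    author  => 'Example Author',
    updated => $now,
);

# One entry per search term; the search URL doubles as the entry id.
for my $record (@records) {
    $feed->add_entry(
        title   => $record->{query},
        link    => $record->{url},
        id      => $record->{url},
        updated => $now,
    );
}

# Print to the terminal; writing $xml to a file works just as well.
my $xml = $feed->as_string;
print $xml;
```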
Footnote: I’ll parse Kunal’s requirements like any good J.D. should; he said the feed had to be well-formed but he didn’t say it had to be valid. I typically use RSS and not Atom, but after I wrote my script I discovered that the library I used, XML::Atom::SimpleFeed, produces an older flavor of Atom which, while well-formed XML, is no longer favored. For a valid feed, use the XML::Atom module instead.
* * *


1. On Apr 24, 09:34 AM jim collins said:
“The only other requirement: you can only implement the solution using Perl, Python, or Ruby.” Why this arbitrary choice? I have an open-source ColdFusion tool, Crouton, that’s designed to do exactly this.
There are plenty of Java libraries too and you don’t have to deal with hard-to-read Perl code. My feelings about Perl are best expressed here:
http://www.underlevel.net/jordan/erik-perl.txt