DOM + XPath = Catalogue Screen Scraping Goodness
There is something very nice about liberating your data. For instance, take a predictable HTML structure, add XPath, and you get a nice way to screen scrape. I've been working on a mobile version of an OPAC for a little while, and I came across some premade solutions from other libraries, but each one seemed to be missing a few pieces of information that I want in my mobile display. So I got the great idea to screen scrape the catalogue and evaluate the results with XPath.
Turns out things were easier than I imagined. (This is working with a Millennium OPAC.) I needed the Syndetics cover image and the links that connect to the actual resource. Here is the simple version of the code in the working man's programming language, PHP:
<?php
// A record page in the Millennium OPAC.
$url = "http://catalogue.library.brocku.ca/search~S0?/Yjava+a+beginners&SORT=D/Yjava+a+beginners&SORT=D&SUBKEY=java%20a%20beginners/1%2C4%2C4%2CB/frameset&FF=Yjava+a+beginners&SORT=D&1%2C1%2C";
$data = file_get_contents($url);
if ($data === false) {
    die('Could not fetch the catalogue record.');
}

$dom = new DOMDocument();
// Catalogue HTML is rarely valid, so suppress the parser warnings.
@$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

// Grab every image on the page and keep only the Syndetics cover art.
$imgs = $xpath->evaluate("/html/body//img");
for ($i = 0; $i < $imgs->length; $i++) {
    $src = $imgs->item($i)->getAttribute('src');
    if (strstr($src, "http://syndetics.com")) {
        // Escape the value before writing it back into an attribute.
        echo '<img src="' . htmlspecialchars($src) . '">';
    }
}

// Grab the links to the actual resource from the bibLinks table.
$links = $xpath->evaluate("/html/body//div[@id='main']//table[@class='bibLinks']//tr//a");
for ($i = 0; $i < $links->length; $i++) {
    $href = $links->item($i)->getAttribute('href');
    echo '<div align="center"><a href="' . htmlspecialchars($href) . '">View Item</a><br></div>';
}
?>
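If you would rather not silence every warning with @, libxml can collect the parse messages for you instead. Here is a minimal sketch of that variation; the sloppy HTML string is made up for the example and stands in for a real catalogue page:

```php
<?php
// Deliberately sloppy markup (invented for this example, not a real catalogue page).
$html = '<html><body><div id="main"><p>Unclosed paragraph<img src="cover.jpg"></div></body></html>';

// Ask libxml to buffer parse errors instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Keep whatever the parser complained about, then clear the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$imgs = $xpath->evaluate("/html/body//img");
echo $imgs->length . " image(s) found, " . count($errors) . " parser message(s) collected\n";
```

The upside over @ is that you can still log $errors when a scrape suddenly comes back empty.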
So the magic comes in with the catalogue item page. If you load a record and view source, you'll see that the tables and divs all carry id or class attributes. These exist so that CSS rules can be applied to them, but they also add extra granularity to the page that helps with the scraping:
/html/body//img - This is the simpler XPath expression. Essentially it walks the parsed DOM of the web page and grabs every image tag. Then you iterate over the results looking for any whose src points at syndetics.com, and echo those.
/html/body//div[@id='main']//table[@class='bibLinks']//tr//a - This is the more complex XPath expression. In this case I am trying to find all the links inside a table with the class name 'bibLinks'. The great news is that every book-type item in our catalogue meets this requirement. If a link is there, echo it.
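To see the two expressions at work without hitting a live catalogue, here is a small sketch that runs them against an inline snippet. The markup, URLs and ISBN are all invented to mimic the structure described above; a real record page will differ in the details:

```php
<?php
// Made-up markup that mimics the structure described above; not a real catalogue page.
$html = <<<'HTML'
<html><body>
<img src="/screens/logo.gif">
<div id="main">
  <img src="http://syndetics.com/index.aspx?isbn=0000000000/SC.GIF">
  <table class="bibLinks">
    <tr><th>Connect to</th></tr>
    <tr><td><a href="http://example.org/ebook/1">Online version</a></td></tr>
    <tr><td><a href="http://example.org/ebook/2">Table of contents</a></td></tr>
  </table>
  <table class="bibDetail"><tr><td><a href="/search/a-author">Author</a></td></tr></table>
</div>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// First expression: every <img> under <body>, filtered down to Syndetics in PHP.
$covers = [];
foreach ($xpath->evaluate("/html/body//img") as $img) {
    $src = $img->getAttribute('src');
    if (strpos($src, 'http://syndetics.com') === 0) {
        $covers[] = $src;
    }
}

// Second expression: only the links inside table.bibLinks within div#main.
$links = [];
foreach ($xpath->evaluate("/html/body//div[@id='main']//table[@class='bibLinks']//tr//a") as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($covers);
print_r($links);
```

The logo image and the author link in the second table are both skipped. One thing to watch: @class='bibLinks' is an exact string match, so if your catalogue puts several classes on one element you would need something like contains(@class, 'bibLinks') instead.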
It might take a bit of investigation to find out which CSS classes you can target with XPath in your catalogue, but it shouldn't be that difficult a task. Thinking about it now... What else could you cleverly scrape out of a catalogue record? Status, call number, any of the MARC fields? There must be some good use for being able to grab that information.
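One rough way to do that investigation is to let XPath list the hooks for you. The sketch below counts every tag/class pair on a page; the markup and its class names are placeholders invented for the example, so expect different names in your own catalogue:

```php
<?php
// Invented record markup; in practice $html would come from file_get_contents() on a record URL.
// The class names here are illustrative placeholders, not a documented catalogue schema.
$html = <<<'HTML'
<html><body>
<div id="main">
  <table class="bibDetail"><tr><td class="bibInfoLabel">Title</td><td class="bibInfoData">Java</td></tr></table>
  <table class="bibItems"><tr class="bibItemsEntry"><td>QA 76.73</td><td>AVAILABLE</td></tr></table>
</div>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Count how often each tag.class combination occurs on the page.
$hooks = [];
foreach ($xpath->evaluate("//*[@class]") as $node) {
    $key = $node->nodeName . '.' . $node->getAttribute('class');
    $hooks[$key] = ($hooks[$key] ?? 0) + 1;
}

ksort($hooks);
foreach ($hooks as $key => $count) {
    echo "$key ($count)\n";
}
```

Anything that shows up consistently across a handful of records is a good candidate for an expression like //tr[@class='...']//td.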
Resources
XPather - A great Firefox plugin that helps a great deal with creating XPath statements
XPath @ W3Schools - The 10 minute read through that will set you on your way.
