Scraping the Internet Archive for DSpace
Recently my library started a digital repository based on the DSpace platform. The first large set of data to be added to it was our collection of graduate dissertations. The first obstacle to adding this content to DSpace was digitization: approximately 1800 theses needed to be scanned to PDF with OCR information. A portion of the digitization was outsourced to the Internet Archive. The Internet Archive is a very interesting initiative. In short, they will take any piece of information, create a persistent digital representation of it, and archive it for you. The Archive itself already has a huge selection of material and makes for interesting browsing.
The second challenge was to get the digitized material out of the Archive and into DSpace. The Archive creates a page online for your objects along with all associated metadata. In my case I was interested in the Dublin Core metadata (DSpace's native metadata format) and the PDFs of the theses.
To handle this second challenge I created two helper apps that essentially screen-scrape the Internet Archive and then prepare the information for import into DSpace. Both apps are now hosted on Google Code and will hopefully help others with similar projects.
IA_Scraper - This utility monitors an Internet Archive RSS feed and downloads every item that is added to it. It can also bulk-download items that have already been posted on the Archive.
DS_Ingestor - This second utility processes the downloaded information and prepares it for the DSpace import process.
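To give a feel for the two steps, here is a minimal Python sketch. It is not the actual code of IA_Scraper or DS_Ingestor. It assumes the feed is plain RSS 2.0 with a <link> in each <item>, and that the import target is DSpace's Simple Archive Format (one directory per item holding dublin_core.xml, a contents file, and the PDF). The function names item_links, new_links, and write_saf_item are made up for illustration.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def item_links(rss_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


def new_links(rss_xml, seen):
    """Return only the feed links that are not already in `seen`."""
    return [link for link in item_links(rss_xml) if link not in seen]


def write_saf_item(item_dir, dc_fields, pdf_path):
    """Write one item directory in DSpace Simple Archive Format.

    dc_fields is a list of (element, qualifier, value) tuples; a qualifier
    of None is written as "none". Hypothetical helper, for illustration only.
    """
    item_dir = Path(item_dir)
    pdf_path = Path(pdf_path)
    item_dir.mkdir(parents=True, exist_ok=True)

    # dublin_core.xml: one <dcvalue> per metadata field.
    root = ET.Element("dublin_core")
    for element, qualifier, value in dc_fields:
        node = ET.SubElement(
            root, "dcvalue", element=element, qualifier=qualifier or "none"
        )
        node.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy the PDF next to the metadata and list it in the contents file.
    shutil.copy(str(pdf_path), str(item_dir / pdf_path.name))
    (item_dir / "contents").write_text(pdf_path.name + "\n", encoding="utf-8")
    return item_dir
```

In a sketch like this, a monitoring loop would fetch the feed periodically, pass the text to new_links together with the set of links already downloaded, and call write_saf_item once per downloaded thesis. The resulting directory tree is the kind of input DSpace's batch import expects.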
I'm currently working on a draft article that describes the background of the project and how the software works.
