Perl O-O Harvester


Description

This is a set of object-oriented Perl modules for harvesting an OAI-1.1-compliant Data Provider. The primary module, "Harvester", can easily be subclassed to support the data processing needs of a Service Provider.

Features of this software include:

  • simple to install
  • few files to keep track of
  • supports multiple archives with one Harvester
  • supports overlapped harvesting of different sites while avoiding hitting the same site twice
  • seamless handling of resumptionTokens and Retry-After responses
  • able to generate output only on finding new data (by default)
  • configurable intervals between harvests, between checks for the need to harvest, and between resumptions
  • on errors, will not continue hitting a problematic site
  • can harvest for maximum consistency or minimum traffic
  • all modules have "man" pages
  • sample descendant class (TestHarvest) included
  • sample code to harvest included (harvest.pl)
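
To illustrate what handling resumptionTokens and Retry-After responses involves, here is a minimal standalone sketch. It does not use the Harvester module or reproduce its code. The function name harvest_all, the shape of the response hash, and the crude regular expression are all assumptions made for illustration, and a real harvester would parse the XML properly (for example with XML::DOM).

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of the Harvester module.
# $fetch is a code ref called with the current resumptionToken (undef for
# the first request). It is assumed to return a hash ref of the form
#   { status => <HTTP status>, retry_after => <seconds>, body => <XML text> }
sub harvest_all {
    my ($fetch, $sleep, $max_retries) = @_;
    $sleep ||= sub { sleep $_[0] };
    $max_retries = 3 unless defined $max_retries;

    my (@pages, $token);
    my $retries = 0;
    while (1) {
        my $res = $fetch->($token);

        # A 503 response is treated as "come back later": wait, then retry
        # the same request, but stop hitting a site that keeps refusing.
        if ($res->{status} == 503) {
            die "giving up after $max_retries retries\n"
                if ++$retries > $max_retries;
            $sleep->($res->{retry_after} || 60);
            next;
        }
        die "request failed with status $res->{status}\n"
            unless $res->{status} == 200;

        $retries = 0;
        push @pages, $res->{body};

        # Crude extraction of the token; an empty or absent token ends the list.
        if ($res->{body} =~ m{<resumptionToken[^>]*>([^<]+)</resumptionToken>}) {
            $token = $1;
        }
        else {
            last;
        }
    }
    return @pages;
}
```

Passing the fetch and sleep steps in as code refs keeps the loop independent of the network, so it can be exercised with canned responses before being pointed at a real archive.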


Installation

1. Copy or unzip/untar all files into a directory.


2. Install the following prerequisite Perl modules as 'root' or
   into a local Perl module directory.
   
      XML::DOM
      Cwd
      LWP
   
   If you can install packages as root, run 
     
     perl -e "use CPAN; shell"
     then "install " followed by the module name, once for each package
     
   (this is highly recommended since there are lots of module
   dependencies for XML::DOM)     

     
   If you cannot install packages as root, select a local directory,
   download each package from CPAN (www.cpan.org) and follow the
   installation instructions. Specifically, for each module (and its
   dependencies) you may need to do something like
   
     perl Makefile.PL PREFIX=/home//perlpackages
     make      
     make test  
     make install
     
   See the following sites for more information on how to do this:
     http://www.singlesheaven.com/stas/TULARC/webmaster/myfaq.html#7
     http://www.iserver.com/support/virtual/perl/mod/install.html
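
Whichever installation route is used, it can help to confirm that the three prerequisites actually load before moving on. The following is a small sketch of such a check. The function name missing_modules is made up for this example and is not part of the harvester distribution.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the modules that fail to load.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # Turn "XML::DOM" into "XML/DOM.pm", the form "require" expects
        # when it is given a string rather than a bareword.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules('XML::DOM', 'Cwd', 'LWP');
print @missing
    ? "Missing modules: @missing\n"
    : "All prerequisites found\n";
```

If the modules were installed into a local directory, that directory has to be on Perl's module search path (@INC) for the check to find them, for instance through the PERL5LIB environment variable or a "use lib" line. The exact subdirectory created under the chosen prefix varies between Perl installations.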

     
3. Edit the configuration file (harvest.pl) to include all the archives
   you want to harvest from, along with their ids, metadata formats,
   sets, etc.
   

4. If necessary, run "make" to compile the man pages. These will not
   be installed in the regular locations, so to access them you may need
   to specify something like:
     man ./Harvester.3
     

5. Test the harvester by running harvest.pl


6. Write your own Perl modules that subclass Harvester (using TestHarvest
   as a sample) to perform whatever you need done.
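
The methods that a subclass should override are documented in the Harvester man page and demonstrated by TestHarvest. As a reminder of the Perl mechanics only, the sketch below uses a stand-in base class so that it runs on its own. The package names and the methods run and process_record are invented for this example and are not Harvester's actual interface.

```perl
use strict;
use warnings;

# Stand-in for the real base class. Its method names are hypothetical
# and are NOT taken from the Harvester documentation.
package Sketch::Base;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Default behaviour: ignore the record. Subclasses override this.
sub process_record { return }

# Feed every record to process_record, whichever class defines it.
sub run {
    my ($self, @records) = @_;
    $self->process_record($_) for @records;
    return $self;
}

# The subclass inherits new() and run(), and overrides process_record().
package Sketch::TitleCollector;
our @ISA = ('Sketch::Base');

sub process_record {
    my ($self, $record) = @_;
    push @{ $self->{titles} }, $record->{title};
    return;
}

package main;

my $h = Sketch::TitleCollector->new(titles => []);
$h->run({ title => 'First' }, { title => 'Second' });
print join(', ', @{ $h->{titles} }), "\n";   # prints "First, Second"
```

The same pattern applies to a real subclass: inherit from Harvester, override only the processing hooks it documents, and leave the harvesting logic to the base class.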

   
7. Add a line to "chdir /" at the top of 
   harvest.pl so that the script is always run in the correct context.
   
   Install a line in your crontab file to run the harvester as often as
   you want it to check for the need for more harvesting (the schedule
   in the configuration file determines whether or not harvesting is
   needed). For example, if you want to check every 10 minutes:
     */10 * * * * //harvest.pl
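
As an alternative to a hard-coded path, the script can change to its own directory with the core FindBin module. This sketch assumes that the intended working directory is the one containing harvest.pl. The harvest_due function is a hypothetical illustration of how a schedule check could decide that a cron invocation has nothing to do, and it is not the harvester's actual scheduling code.

```perl
use strict;
use warnings;
use FindBin qw($Bin);

# Run relative to the script's own directory, whatever directory cron
# happens to start the job in.
chdir $Bin or die "Cannot chdir to $Bin: $!\n";

# Hypothetical schedule check: harvesting is due once at least
# $interval seconds have passed since the last harvest.
sub harvest_due {
    my ($last_harvest, $interval, $now) = @_;
    $now = time unless defined $now;
    return ($now - $last_harvest) >= $interval ? 1 : 0;
}
```

With a check of this kind, cron can run the script frequently while the configured interval decides how often an archive is actually contacted.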


Contact

Contact hussein@vt.edu if you have any queries.


Last updated : 14 August 2001