de.pangaea.metadataportal.harvester
Class OAIHarvesterBase

java.lang.Object
  extended by de.pangaea.metadataportal.harvester.Harvester
      extended by de.pangaea.metadataportal.harvester.OAIHarvesterBase
Direct Known Subclasses:
OAIHarvester, OAIStaticRepositoryHarvester

public abstract class OAIHarvesterBase
extends Harvester

Abstract base class for OAI harvesting support in panFMP. Use one of the subclasses for harvesting OAI-PMH or OAI Static Repositories.

This harvester supports the following additional harvester properties:

Author:
Uwe Schindler

Field Summary
static int DEFAULT_RETRY_COUNT
           
static int DEFAULT_RETRY_TIME
           
static int DEFAULT_TIMEOUT
           
protected  boolean filterIncomingSets
          The harvester should filter incoming documents according to its set metadata.
protected  String metadataPrefix
          the metadata prefix from the configuration
static String OAI_NS
           
static String OAI_STATICREPOSITORY_NS
           
protected  int retryCount
          the retryCount from configuration
protected  int retryTime
          the retryTime from configuration
protected  Set<String> sets
          the sets to harvest from the configuration, null to harvest all
protected  int timeout
          the timeout from configuration
 
Fields inherited from class de.pangaea.metadataportal.harvester.Harvester
fromDateReference, harvestCount, harvestMessageStep, iconfig, index, log
 
Constructor Summary
OAIHarvesterBase()
           
 
Method Summary
 void addDocument(MetadataDocument mdoc)
          Adds a document to the Harvester.index working in the background.
 void close(boolean cleanShutdown)
          Closes harvester.
protected  MetadataDocument createMetadataDocumentInstance()
          Creates an instance of MetadataDocument and initializes it with the index config.
protected  boolean doParse(ExtendedDigester dig, String url, AtomicReference<Date> checkModifiedDate)
          Harvests a URL using the supplied digester.
protected  void enumerateValidHarvesterPropertyNames(Set<String> props)
          This method is used by subclasses to enumerate all available harvester properties that are implemented by them.
protected  EntityResolver getEntityResolver(EntityResolver parent)
          Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URL, java.util.concurrent.atomic.AtomicReference).
protected  InputSource getInputSource(URL url, AtomicReference<Date> checkModifiedDate)
          Returns a SAX InputSource for retrieving stream data of a URL.
protected  org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory()
          Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
 void open(SingleIndexConfig iconfig)
          Opens harvester for harvesting documents into the index described by the given SingleIndexConfig.
protected  void reset()
          Resets the internal variables.
 
Methods inherited from class de.pangaea.metadataportal.harvester.Harvester
getValidHarvesterPropertyNames, harvest, isClosed, isDocumentOutdated, isDocumentOutdated, main, runHarvester, runHarvester, setHarvestingDateReference
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

OAI_NS

public static final String OAI_NS
See Also:
Constant Field Values

OAI_STATICREPOSITORY_NS

public static final String OAI_STATICREPOSITORY_NS
See Also:
Constant Field Values

DEFAULT_RETRY_TIME

public static final int DEFAULT_RETRY_TIME
See Also:
Constant Field Values

DEFAULT_RETRY_COUNT

public static final int DEFAULT_RETRY_COUNT
See Also:
Constant Field Values

DEFAULT_TIMEOUT

public static final int DEFAULT_TIMEOUT
See Also:
Constant Field Values

metadataPrefix

protected String metadataPrefix
the metadata prefix from the configuration


sets

protected Set<String> sets
the sets to harvest from the configuration, null to harvest all


retryCount

protected int retryCount
the retryCount from configuration


retryTime

protected int retryTime
the retryTime from configuration
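
How retryCount and retryTime interact is not spelled out on this page. The following is a minimal sketch of one plausible reading, assuming retryCount is the number of additional attempts after a failed request and retryTime is the pause between attempts. RetrySketch and withRetries are hypothetical names, not panFMP API, and the pause is given in milliseconds only to keep the example fast.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class RetrySketch {
    /**
     * Hypothetical helper: runs the action, retrying on IOException up to
     * retryCount additional times with a fixed pause between attempts.
     */
    static <T> T withRetries(Callable<T> action, int retryCount, long retryTimeMillis) throws Exception {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                last = e; // remember the failure; other exception types propagate immediately
                if (attempt < retryCount) {
                    Thread.sleep(retryTimeMillis); // wait before the next attempt
                }
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated network error");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Only I/O failures are retried in this sketch; whether the real harvester also retries on specific HTTP status codes is not stated here.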


timeout

protected int timeout
the timeout from configuration


filterIncomingSets

protected boolean filterIncomingSets
The harvester should filter incoming documents according to their set metadata. This should be disabled when harvesting via the OAI-PMH protocol with only one set. Default is true.
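
The interplay of sets and filterIncomingSets can be sketched as a simple predicate. This is an illustration only: SetFilterSketch and accept are hypothetical names, and the matching rule (a document is kept if it shares at least one set name with the configured sets, compared as exact strings) is an assumption rather than the documented panFMP behaviour.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class SetFilterSketch {
    /**
     * Hypothetical predicate: decides whether an incoming document is kept.
     * A null set configuration means "harvest all", as described for the sets field.
     */
    static boolean accept(Set<String> sets, boolean filterIncomingSets, Set<String> documentSets) {
        if (!filterIncomingSets || sets == null) {
            return true; // filtering disabled or no set restriction configured
        }
        // keep the document if it belongs to at least one configured set (assumed rule)
        return !Collections.disjoint(sets, documentSets);
    }

    public static void main(String[] args) {
        Set<String> configured = new HashSet<>(Arrays.asList("geology", "biology"));
        Set<String> doc = new HashSet<>(Arrays.asList("chemistry"));
        System.out.println(accept(configured, true, doc));
        System.out.println(accept(configured, false, doc));
    }
}
```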

Constructor Detail

OAIHarvesterBase

public OAIHarvesterBase()
Method Detail

open

public void open(SingleIndexConfig iconfig)
          throws Exception
Description copied from class: Harvester
Opens harvester for harvesting documents into the index described by the given SingleIndexConfig. Opens Harvester.index for usage in Harvester.harvest() method.

Overrides:
open in class Harvester
Throws:
Exception - if an exception occurs during opening (various types of exceptions can be thrown).

addDocument

public void addDocument(MetadataDocument mdoc)
                 throws IndexBuilderBackgroundFailure,
                        InterruptedException
Description copied from class: Harvester
Adds a document to the Harvester.index working in the background.

Overrides:
addDocument in class Harvester
Throws:
IndexBuilderBackgroundFailure - if an error occurred in the background thread. Exceptions can be thrown asynchronously and may not relate to the current document. The real exception is thrown again in Harvester.close(boolean).
InterruptedException - if wait operation was interrupted.

createMetadataDocumentInstance

protected MetadataDocument createMetadataDocumentInstance()
Description copied from class: Harvester
Creates an instance of MetadataDocument and initializes it with the index config. This method should be overridden if a harvester uses a different class.

Overrides:
createMetadataDocumentInstance in class Harvester

getMetadataDocumentFactory

protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory()
Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).

See Also:
createMetadataDocumentInstance()

doParse

protected boolean doParse(ExtendedDigester dig,
                          String url,
                          AtomicReference<Date> checkModifiedDate)
                   throws Exception
Harvests a URL using the supplied digester.

Parameters:
dig - the digester instance.
url - the URL that is parsed by this digester instance.
checkModifiedDate - for static repositories, a reference to a Date can be supplied to check the last modification. In this case false is returned if the URL was not modified. If it was modified, the reference contains a new Date object with the new modification date afterwards. Supply null to skip the check; no modification date is returned then, as there is no reference to hold it.
Returns:
true if harvested, false if not modified and no harvesting was done.
Throws:
Exception
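
The checkModifiedDate contract described above can be illustrated in isolation. The sketch below does no HTTP or parsing: ModifiedCheckSketch and fetchIfModified are hypothetical names, and the remote modification date is passed in as a parameter to simulate what a server would report.

```java
import java.util.Date;
import java.util.concurrent.atomic.AtomicReference;

class ModifiedCheckSketch {
    /**
     * Hypothetical stand-in for the doParse contract.
     * Returns true if the resource was "harvested", false if it was not modified.
     */
    static boolean fetchIfModified(Date remoteLastModified, AtomicReference<Date> checkModifiedDate) {
        if (checkModifiedDate != null) {
            Date known = checkModifiedDate.get();
            if (known != null && !remoteLastModified.after(known)) {
                return false; // not modified: nothing harvested, reference left unchanged
            }
            // modified: hand the new modification date back through the reference
            checkModifiedDate.set(new Date(remoteLastModified.getTime()));
        }
        // parsing of the document would happen here
        return true;
    }

    public static void main(String[] args) {
        AtomicReference<Date> ref = new AtomicReference<>(new Date(1000L));
        System.out.println(fetchIfModified(new Date(1000L), ref)); // same date: not modified
        System.out.println(fetchIfModified(new Date(2000L), ref)); // newer date: harvested
        System.out.println(ref.get().getTime());
    }
}
```

The AtomicReference serves purely as an in/out parameter here: the caller supplies the last known date and reads the updated one back after the call.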

getEntityResolver

protected EntityResolver getEntityResolver(EntityResolver parent)
Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URL, java.util.concurrent.atomic.AtomicReference).

Parameters:
parent - an EntityResolver that receives all unprocessed requests
See Also:
getInputSource(java.net.URL, java.util.concurrent.atomic.AtomicReference)
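
The delegation pattern described above (handle HTTP URLs directly, pass everything else to the parent) can be sketched with the standard SAX interfaces. DelegatingResolverSketch and HttpSourceLoader are hypothetical names; the loader stands in for getInputSource, and treating both http and https system IDs as "HTTP URLs" is an assumption.

```java
import java.io.IOException;
import java.util.Locale;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DelegatingResolverSketch {
    /** Hypothetical callback standing in for getInputSource(URL, AtomicReference). */
    interface HttpSourceLoader {
        InputSource load(String systemId) throws IOException;
    }

    static EntityResolver create(final EntityResolver parent, final HttpSourceLoader loader) {
        return new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId)
                    throws SAXException, IOException {
                if (systemId != null) {
                    String lower = systemId.toLowerCase(Locale.ROOT);
                    if (lower.startsWith("http://") || lower.startsWith("https://")) {
                        return loader.load(systemId); // resolved by the harvester itself
                    }
                }
                // everything else goes to the parent; null lets the parser use its default
                return (parent == null) ? null : parent.resolveEntity(publicId, systemId);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        EntityResolver resolver = create(null, id -> new InputSource(id));
        System.out.println(resolver.resolveEntity(null, "http://example.org/schema.dtd").getSystemId());
        System.out.println(resolver.resolveEntity(null, "file:/tmp/schema.dtd"));
    }
}
```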

getInputSource

protected InputSource getInputSource(URL url,
                                     AtomicReference<Date> checkModifiedDate)
                              throws IOException
Returns a SAX InputSource for retrieving stream data of a URL. It is optimized for compression of the HTTP(S) protocol and timeout checking.

Parameters:
url - the URL to open
checkModifiedDate - for static repositories, a reference to a Date can be supplied to check the last modification. In this case null is returned if the URL was not modified. If it was modified, the reference contains a new Date object with the new modification date afterwards. Supply null to skip the check; no modification date is returned then, as there is no reference to hold it.
Throws:
IOException
See Also:
getEntityResolver(org.xml.sax.EntityResolver)
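
To make "compression" and "timeout checking" concrete, here is a minimal sketch using only java.net and java.util.zip. It is not the panFMP implementation: HttpSourceSketch, prepare and decode are hypothetical names, the timeout is assumed to be given in seconds, and the Accept-Encoding value is an illustrative choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.util.Date;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

class HttpSourceSketch {
    /** Configures a connection without opening it; throws ClassCastException for non-HTTP URLs. */
    static HttpURLConnection prepare(URL url, int timeoutSeconds, Date ifModifiedSince) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // no network I/O yet
        conn.setConnectTimeout(timeoutSeconds * 1000);
        conn.setReadTimeout(timeoutSeconds * 1000);
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate"); // ask for a compressed response
        if (ifModifiedSince != null) {
            // the server may then answer with HttpURLConnection.HTTP_NOT_MODIFIED (304)
            conn.setIfModifiedSince(ifModifiedSince.getTime());
        }
        return conn;
    }

    /** Wraps the raw response stream according to the Content-Encoding header value. */
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw;
        }
        String enc = contentEncoding.trim().toLowerCase(Locale.ROOT);
        if (enc.equals("gzip") || enc.equals("x-gzip")) {
            return new GZIPInputStream(raw);
        }
        if (enc.equals("deflate")) {
            return new InflaterInputStream(raw);
        }
        return raw; // unknown or identity encoding: pass through unchanged
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = prepare(URI.create("http://example.org/oai").toURL(), 30, null);
        System.out.println(conn.getConnectTimeout() + " ms connect timeout");
        System.out.println("Accept-Encoding: " + conn.getRequestProperty("Accept-Encoding"));
    }
}
```

A caller would connect, check the response code for 304 when a date was supplied, and pass conn.getInputStream() together with conn.getContentEncoding() to decode before wrapping the result in an InputSource.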

reset

protected void reset()
Resets the internal variables.


close

public void close(boolean cleanShutdown)
           throws Exception
Description copied from class: Harvester
Closes harvester. All resources are freed and the Harvester.index is closed.

Overrides:
close in class Harvester
Parameters:
cleanShutdown - enables writing of status information to the index for the next harvesting. If an error occurred during harvesting this should not be done.
Throws:
Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not relate to the current document.
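
The role of cleanShutdown suggests a simple calling pattern: pass true only if harvesting completed without error. The sketch below shows that pattern against a stand-in interface; LifecycleSketch and HarvesterLike are hypothetical, and the real open method takes a SingleIndexConfig, which is omitted here to keep the example self-contained.

```java
class LifecycleSketch {
    /** Hypothetical stand-in for the open/harvest/close methods of a harvester. */
    interface HarvesterLike {
        void open() throws Exception;
        void harvest() throws Exception;
        void close(boolean cleanShutdown) throws Exception;
    }

    static void run(HarvesterLike harvester) throws Exception {
        harvester.open();
        boolean cleanShutdown = false;
        try {
            harvester.harvest();
            cleanShutdown = true; // reached only if harvesting finished without an exception
        } finally {
            // status information is written only after a clean run
            harvester.close(cleanShutdown);
        }
    }

    public static void main(String[] args) throws Exception {
        run(new HarvesterLike() {
            @Override public void open() { System.out.println("open"); }
            @Override public void harvest() { System.out.println("harvest"); }
            @Override public void close(boolean cleanShutdown) {
                System.out.println("close, cleanShutdown=" + cleanShutdown);
            }
        });
    }
}
```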

enumerateValidHarvesterPropertyNames

protected void enumerateValidHarvesterPropertyNames(Set<String> props)
Description copied from class: Harvester
This method is used by subclasses to enumerate all available harvester properties that are implemented by them. Override this method in your own implementation and append all harvester property names to the supplied Set. The public API for client code requesting property names is Harvester.getValidHarvesterPropertyNames().

Overrides:
enumerateValidHarvesterPropertyNames in class Harvester
See Also:
Harvester.getValidHarvesterPropertyNames()
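
The override pattern described above can be sketched with stand-in classes. BaseHarvesterSketch and CustomHarvesterSketch are hypothetical, as are the property names "baseProperty" and "customProperty" and the Set<String> return type of the public accessor; the point is only that a subclass calls the superclass implementation first and then appends its own names.

```java
import java.util.Set;
import java.util.TreeSet;

abstract class BaseHarvesterSketch {
    /** Subclasses append the property names they implement. */
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        props.add("baseProperty"); // hypothetical property name
    }

    /** Public accessor that collects the names from the whole class hierarchy. */
    public Set<String> getValidHarvesterPropertyNames() {
        Set<String> props = new TreeSet<>();
        enumerateValidHarvesterPropertyNames(props);
        return props;
    }
}

class CustomHarvesterSketch extends BaseHarvesterSketch {
    @Override
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        super.enumerateValidHarvesterPropertyNames(props); // keep inherited names
        props.add("customProperty"); // hypothetical property name
    }

    public static void main(String[] args) {
        System.out.println(new CustomHarvesterSketch().getValidHarvesterPropertyNames());
    }
}
```

Omitting the super call would silently drop the inherited property names, so they would no longer be reported as valid.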


Copyright ©2007-2011 panFMP Developers c/o Uwe Schindler