de.pangaea.metadataportal.harvester
Class WebCrawlingHarvester

java.lang.Object
  extended by de.pangaea.metadataportal.harvester.Harvester
      extended by de.pangaea.metadataportal.harvester.SingleFileEntitiesHarvester
          extended by de.pangaea.metadataportal.harvester.WebCrawlingHarvester

public class WebCrawlingHarvester
extends SingleFileEntitiesHarvester

Harvester for traversing websites and harvesting XML documents. If the baseURL (from the configuration) points to an XML file with the correct MIME type, that file is harvested directly. An HTML page is analyzed instead: all of its links are followed and checked for XML files with the correct MIME type. This is done recursively, but harvesting never leaves the server or the baseURL directory.
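
The scope rule described above (stay on the same server and below the baseURL directory) can be illustrated with a minimal sketch. This is not the harvester's actual implementation: the class CrawlScopeSketch, the method isInScope, and the example.org URLs are assumptions made up for illustration, and the real crawler may apply additional or different checks.

```java
import java.net.URI;

// Hypothetical helper, NOT part of panFMP: decides whether a link found on a
// page stays inside the crawl scope defined by the base URL.
final class CrawlScopeSketch {

    // Returns true when the candidate link, resolved against baseUrl, uses the
    // same scheme, host, and port, and its path lies in or below the directory
    // of baseUrl.
    static boolean isInScope(URI baseUrl, URI candidate) {
        URI base = baseUrl.normalize();
        URI link = base.resolve(candidate).normalize();
        if (base.getScheme() == null || base.getHost() == null
                || link.getScheme() == null || link.getHost() == null) {
            return false; // opaque or incomplete URIs (e.g. mailto:) are never in scope
        }
        if (!base.getScheme().equalsIgnoreCase(link.getScheme())) return false;
        if (!base.getHost().equalsIgnoreCase(link.getHost())) return false;
        if (base.getPort() != link.getPort()) return false;
        // Directory of the base URL: everything up to and including the last slash.
        String basePath = base.getPath();
        String baseDir = basePath.substring(0, basePath.lastIndexOf('/') + 1);
        return link.getPath().startsWith(baseDir);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.org/data/index.html");
        System.out.println(isInScope(base, URI.create("sub/record1.xml")));             // true
        System.out.println(isInScope(base, URI.create("../other/record2.xml")));        // false
        System.out.println(isInScope(base, URI.create("http://other.org/data/r.xml"))); // false
    }
}
```

Resolving each link against the base URL before comparing means relative links such as "../other/record2.xml" are judged by where they actually point, not by how they are written.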

This harvester supports the following additional harvester properties:

Author:
Uwe Schindler

Field Summary
static int DEFAULT_RETRY_COUNT
           
static int DEFAULT_RETRY_TIME
           
static int DEFAULT_TIMEOUT
           
static Set<String> HTML_CONTENT_TYPES
           
static String HTML_SAX_PARSER_CLASS
          This is the parser class used to parse HTML documents to collect URLs for crawling.
 
Fields inherited from class de.pangaea.metadataportal.harvester.Harvester
fromDateReference, harvestCount, harvestMessageStep, iconfig, index, log
 
Constructor Summary
WebCrawlingHarvester()
           
 
Method Summary
 void close(boolean cleanShutdown)
          Closes harvester.
protected  void enumerateValidHarvesterPropertyNames(Set<String> props)
          This method is used by subclasses to enumerate all available harvester properties that are implemented by them.
 void harvest()
          This method is called by the harvester after it has been opened with Harvester.open(de.pangaea.metadataportal.config.SingleIndexConfig).
 void open(SingleIndexConfig iconfig)
          Opens harvester for harvesting documents into the index described by the given SingleIndexConfig.
 
Methods inherited from class de.pangaea.metadataportal.harvester.SingleFileEntitiesHarvester
addDocument, addDocument, cancelMissingDocumentDelete
 
Methods inherited from class de.pangaea.metadataportal.harvester.Harvester
addDocument, createMetadataDocumentInstance, getValidHarvesterPropertyNames, isClosed, isDocumentOutdated, isDocumentOutdated, main, runHarvester, runHarvester, setHarvestingDateReference
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_RETRY_TIME

public static final int DEFAULT_RETRY_TIME
See Also:
Constant Field Values

DEFAULT_RETRY_COUNT

public static final int DEFAULT_RETRY_COUNT
See Also:
Constant Field Values

DEFAULT_TIMEOUT

public static final int DEFAULT_TIMEOUT
See Also:
Constant Field Values

HTML_SAX_PARSER_CLASS

public static final String HTML_SAX_PARSER_CLASS
This is the parser class used to parse HTML documents to collect URLs for crawling. If this class is not on your classpath, the harvester will fail on startup in open(de.pangaea.metadataportal.config.SingleIndexConfig). If you change the implementation (an HTML parser may be embedded in Xerces in the future), change this constant. Do not forget to revisit the features set for this parser in the parsing method.

See Also:
Constant Field Values
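
Because a missing parser class makes open(...) fail, it can be useful to check for the class before starting a harvest. The following is a minimal sketch of such a check using plain reflection. The class ParserAvailabilitySketch and its method are hypothetical and not part of panFMP, and the class name passed in would be the value of HTML_SAX_PARSER_CLASS, which is not reproduced here.

```java
// Hypothetical pre-flight check, NOT part of panFMP: tests whether a class
// with the given fully qualified name can be found on the classpath.
final class ParserAvailabilitySketch {

    static boolean isOnClasspath(String className) {
        try {
            // 'false' avoids running static initializers; we only want to know
            // whether the class can be located.
            Class.forName(className, false, ParserAvailabilitySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the parser class name as the first argument.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " available: " + isOnClasspath(name));
    }
}
```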

HTML_CONTENT_TYPES

public static final Set<String> HTML_CONTENT_TYPES

Constructor Detail

WebCrawlingHarvester

public WebCrawlingHarvester()

Method Detail

open

public void open(SingleIndexConfig iconfig)
          throws Exception
Description copied from class: Harvester
Opens harvester for harvesting documents into the index described by the given SingleIndexConfig. Opens Harvester.index for usage in Harvester.harvest() method.

Overrides:
open in class SingleFileEntitiesHarvester
Throws:
Exception - if an exception occurs during opening (various types of exceptions can be thrown).

close

public void close(boolean cleanShutdown)
           throws Exception
Description copied from class: Harvester
Closes harvester. All resources are freed and the Harvester.index is closed.

Overrides:
close in class SingleFileEntitiesHarvester
Parameters:
cleanShutdown - enables writing of status information to the index for the next harvesting. If an error occurred during harvesting, this should not be done.
Throws:
Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not relate to the current document.

harvest

public void harvest()
             throws Exception
Description copied from class: Harvester
This method is called by the harvester after it has been opened with Harvester.open(de.pangaea.metadataportal.config.SingleIndexConfig). Override this method in your harvester class. This method should harvest files from somewhere, generate MetadataDocuments and add them with Harvester.addDocument(de.pangaea.metadataportal.harvester.MetadataDocument).

Specified by:
harvest in class Harvester
Throws:
Exception - of any type.
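
The descriptions of open, close, and harvest above imply a calling sequence: open, then harvest, then close with cleanShutdown set only if no error occurred. A minimal sketch of that sequence follows. The interface HarvesterLifecycleSketch and the class HarvestRunnerSketch are stand-ins invented for illustration; they are not panFMP types, and the framework's own runHarvester methods may behave differently.

```java
// Hypothetical stand-in for the three lifecycle methods, NOT the panFMP Harvester class.
interface HarvesterLifecycleSketch {
    void open() throws Exception;
    void harvest() throws Exception;
    void close(boolean cleanShutdown) throws Exception;
}

// Hypothetical driver showing the order of calls and how cleanShutdown is chosen.
final class HarvestRunnerSketch {

    // Returns true only if open, harvest, and close all completed without an exception.
    static boolean run(HarvesterLifecycleSketch harvester) {
        boolean clean = false;
        try {
            harvester.open();
            try {
                harvester.harvest();
                clean = true; // reached only when harvest() did not throw
            } finally {
                // After a failed harvest, status information must not be written,
                // so cleanShutdown is false in that case.
                harvester.close(clean);
            }
        } catch (Exception e) {
            System.err.println("Harvesting failed: " + e);
            return false;
        }
        return clean;
    }
}
```

Calling close in a finally block ensures that resources are released even when harvest() throws, while the flag keeps a failed run from being recorded as a successful one.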

enumerateValidHarvesterPropertyNames

protected void enumerateValidHarvesterPropertyNames(Set<String> props)
Description copied from class: Harvester
This method is used by subclasses to enumerate all available harvester properties that are implemented by them. Override this method in your own implementation and add all harvester property names to the supplied Set. The public API for client code requesting property names is Harvester.getValidHarvesterPropertyNames().

Overrides:
enumerateValidHarvesterPropertyNames in class SingleFileEntitiesHarvester
See Also:
Harvester.getValidHarvesterPropertyNames()
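
The override pattern described for enumerateValidHarvesterPropertyNames can be sketched as follows. The two classes and both property names ("baseProperty", "hypotheticalCrawlProperty") are invented for illustration; they only mimic the relationship between a base harvester and a subclass and do not reproduce panFMP's actual classes or property names.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical base class mimicking the pattern, NOT the panFMP Harvester.
class BaseHarvesterSketch {

    // Subclasses override this and add their own property names to the set.
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        props.add("baseProperty"); // made-up name
    }

    // Public entry point for client code: collects names from the whole hierarchy.
    public final Set<String> getValidHarvesterPropertyNames() {
        Set<String> props = new TreeSet<String>();
        enumerateValidHarvesterPropertyNames(props);
        return props;
    }
}

// Hypothetical subclass adding one property of its own.
class CrawlingHarvesterSketch extends BaseHarvesterSketch {

    @Override
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        // Call super first so inherited property names stay valid.
        super.enumerateValidHarvesterPropertyNames(props);
        props.add("hypotheticalCrawlProperty"); // made-up name
    }
}
```

Calling the superclass implementation before adding new names keeps the properties of every ancestor in the result, which is why each level of the hierarchy only has to list its own additions.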


Copyright ©2007-2011 panFMP Developers c/o Uwe Schindler