java.lang.Object
  de.pangaea.metadataportal.harvester.Harvester
    de.pangaea.metadataportal.harvester.SingleFileEntitiesHarvester
      de.pangaea.metadataportal.harvester.WebCrawlingHarvester

public class WebCrawlingHarvester
extends SingleFileEntitiesHarvester
Harvester for traversing websites and harvesting XML documents.
If the baseUrl (from config) points to an XML file with a correct MIME type, that file is harvested directly.
If it points to an HTML page, the page is analyzed, and all links are followed and checked for XML files with a correct MIME type.
This is done recursively, but the crawl never leaves the server or the directory of the baseUrl.
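The scope rule above (same server, same directory as the start URL) can be sketched as follows. This is a minimal illustration, not the harvester's actual code: the class `CrawlScope` and its method `contains` are hypothetical names, and the sketch assumes that "directory" means the path of the start URL up to its last slash.

```java
import java.net.URI;

// Hypothetical helper (not part of panFMP): decides whether a link found on a
// page stays on the same server and below the directory of the start URL.
final class CrawlScope {
    private final URI base;       // normalized start URL
    private final String baseDir; // path of the start URL up to the last '/'

    CrawlScope(String baseUrl) {
        URI parsed = URI.create(baseUrl).normalize();
        // A bare "http://host" has an empty path; treat it as "http://host/".
        if (parsed.getPath() == null || parsed.getPath().isEmpty()) {
            parsed = parsed.resolve("/");
        }
        this.base = parsed;
        String path = parsed.getPath();
        this.baseDir = path.substring(0, path.lastIndexOf('/') + 1);
    }

    boolean contains(String link) {
        URI target;
        try {
            // Resolves relative links against the start URL and removes "../".
            target = base.resolve(link).normalize();
        } catch (IllegalArgumentException e) {
            return false; // syntactically invalid link
        }
        return sameServer(target)
            && target.getPath() != null
            && target.getPath().startsWith(baseDir);
    }

    private boolean sameServer(URI u) {
        return base.getScheme() != null && base.getScheme().equalsIgnoreCase(u.getScheme())
            && base.getHost() != null && base.getHost().equalsIgnoreCase(u.getHost())
            && base.getPort() == u.getPort();
    }

    public static void main(String[] args) {
        CrawlScope scope = new CrawlScope("http://example.org/data/index.html");
        System.out.println(scope.contains("sub/a.xml"));                           // true
        System.out.println(scope.contains("../other/b.xml"));                      // false
        System.out.println(scope.contains("http://elsewhere.example/data/c.xml")); // false
    }
}
```

Resolving each link against the start URL before comparing means that relative links and `../` segments are judged by where they actually point, not by how they are written.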
This harvester supports the following additional harvester properties:
- baseUrl: URL to start crawling (should point to an HTML page).
- retryCount: number of retries on HTTP errors (default: 5)
- retryAfterSeconds: time between retries in seconds (default: 60)
- timeoutAfterSeconds: HTTP timeout for harvesting in seconds
- filenameFilter: regex to match the filename. The regex is applied against the whole filename (this is like ^pattern$)! (default: none)
- contentTypes: MIME types of documents to index (may be additionally limited by filenameFilter). (default: "text/xml,application/xml")
- excludeUrlPattern: a regex that is applied to all URLs encountered during harvesting. URLs that match (partial matches allowed, use ^ and $ to anchor at start/end) are excluded and not traversed further. (default: none)
- pauseBetweenRequests: to avoid overloading the harvested server, wait this many milliseconds after each HTTP request (default: none)
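The two regex properties use different matching rules: filenameFilter must match the whole filename, while excludeUrlPattern may match anywhere in the URL. With `java.util.regex`, that difference corresponds to `Matcher.matches()` versus `Matcher.find()`. The sketch below only illustrates these semantics; the class and method names are hypothetical and it is not claimed to be how the harvester evaluates the properties internally.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the two matching rules described above.
final class FilterSemantics {

    // filenameFilter: the whole filename must match (like ^pattern$).
    // A null filter stands for "default: none" and accepts everything.
    static boolean filenameAccepted(String filenameFilter, String filename) {
        return filenameFilter == null
            || Pattern.compile(filenameFilter).matcher(filename).matches();
    }

    // excludeUrlPattern: a partial match anywhere in the URL excludes it.
    // A null pattern stands for "default: none" and excludes nothing.
    static boolean urlExcluded(String excludeUrlPattern, String url) {
        return excludeUrlPattern != null
            && Pattern.compile(excludeUrlPattern).matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://example.org/data/private/a.xml";
        System.out.println(filenameAccepted(".*\\.xml", "record.xml")); // true
        System.out.println(filenameAccepted("record", "record.xml"));   // false: not a whole match
        System.out.println(urlExcluded("/private/", url));              // true: partial match
        System.out.println(urlExcluded("^/private/", url));             // false: URL does not start with it
    }
}
```

In practice this means a filenameFilter such as `record` accepts only a file named exactly `record`, whereas the same text used as excludeUrlPattern would exclude every URL containing `record`.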
| Field Summary | |
|---|---|
| static int | DEFAULT_RETRY_COUNT |
| static int | DEFAULT_RETRY_TIME |
| static int | DEFAULT_TIMEOUT |
| static Set<String> | HTML_CONTENT_TYPES |
| static String | HTML_SAX_PARSER_CLASS: This is the parser class used to parse HTML documents to collect URLs for crawling. |
| Fields inherited from class de.pangaea.metadataportal.harvester.Harvester |
|---|
| fromDateReference, harvestCount, harvestMessageStep, iconfig, index, log |
| Constructor Summary | |
|---|---|
| WebCrawlingHarvester() | |
| Method Summary | |
|---|---|
| void | close(boolean cleanShutdown): Closes harvester. |
| protected void | enumerateValidHarvesterPropertyNames(Set<String> props): This method is used by subclasses to enumerate all available harvester properties that are implemented by them. |
| void | harvest(): This method is called by the harvester after it has been opened with Harvester.open(de.pangaea.metadataportal.config.SingleIndexConfig). |
| void | open(SingleIndexConfig iconfig): Opens harvester for harvesting documents into the index described by the given SingleIndexConfig. |
| Methods inherited from class de.pangaea.metadataportal.harvester.SingleFileEntitiesHarvester |
|---|
| addDocument, addDocument, cancelMissingDocumentDelete |
| Methods inherited from class de.pangaea.metadataportal.harvester.Harvester |
|---|
| addDocument, createMetadataDocumentInstance, getValidHarvesterPropertyNames, isClosed, isDocumentOutdated, isDocumentOutdated, main, runHarvester, runHarvester, setHarvestingDateReference |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final int DEFAULT_RETRY_TIME
public static final int DEFAULT_RETRY_COUNT
public static final int DEFAULT_TIMEOUT
public static final String HTML_SAX_PARSER_CLASS
This is the parser class used to parse HTML documents to collect URLs for crawling.
See open(de.pangaea.metadataportal.config.SingleIndexConfig).
If you change the implementation (a future Xerces release may embed an HTML parser),
change this constant. Do not forget to revisit the features set for this parser in the parsing method.
public static final Set<String> HTML_CONTENT_TYPES
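The class description implies a three-way decision per response: documents with a configured XML MIME type are harvested, HTML pages are parsed for links, and everything else is ignored. The following sketch shows one way such a decision could look. It is an illustration under assumptions: the class `ContentTypeDispatch`, the enum `Action`, and the concrete values placed in `HTML_TYPES` are hypothetical, since the actual contents of HTML_CONTENT_TYPES are not stated on this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the harvest / crawl / skip decision by MIME type.
final class ContentTypeDispatch {
    enum Action { HARVEST, CRAWL_LINKS, SKIP }

    // ASSUMPTION: stand-in values; the real HTML_CONTENT_TYPES may differ.
    static final Set<String> HTML_TYPES =
        new HashSet<>(Arrays.asList("text/html", "application/xhtml+xml"));

    // Splits a comma-separated contentTypes property value into a set.
    static Set<String> parseContentTypes(String property) {
        Set<String> types = new HashSet<>();
        for (String part : property.split(",")) {
            String type = part.trim().toLowerCase(Locale.ROOT);
            if (!type.isEmpty()) {
                types.add(type);
            }
        }
        return types;
    }

    // Classifies a Content-Type header value, ignoring parameters such as charset.
    static Action classify(String contentTypeHeader, Set<String> harvestTypes) {
        if (contentTypeHeader == null) {
            return Action.SKIP;
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String mime = (semicolon >= 0 ? contentTypeHeader.substring(0, semicolon) : contentTypeHeader)
            .trim().toLowerCase(Locale.ROOT);
        if (harvestTypes.contains(mime)) {
            return Action.HARVEST;
        }
        if (HTML_TYPES.contains(mime)) {
            return Action.CRAWL_LINKS;
        }
        return Action.SKIP;
    }

    public static void main(String[] args) {
        Set<String> harvestTypes = parseContentTypes("text/xml,application/xml");
        System.out.println(classify("text/xml; charset=UTF-8", harvestTypes)); // HARVEST
        System.out.println(classify("text/html", harvestTypes));               // CRAWL_LINKS
        System.out.println(classify("image/png", harvestTypes));               // SKIP
    }
}
```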
| Constructor Detail |
|---|
public WebCrawlingHarvester()
| Method Detail |
|---|
public void open(SingleIndexConfig iconfig)
throws Exception
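Description copied from class: Harvester
Opens harvester for harvesting documents into the index described by the given SingleIndexConfig.
Opens Harvester.index for usage in Harvester.harvest() method.
Overrides: open in class SingleFileEntitiesHarvester
Throws: Exception - if an exception occurs during opening (various types of exceptions can be thrown).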
public void close(boolean cleanShutdown)
throws Exception
Description copied from class: Harvester
Closes harvester. Harvester.index is closed.
Overrides: close in class SingleFileEntitiesHarvester
Parameters: cleanShutdown - enables writing of status information to the index for the next harvesting run. This should not be done if an error occurred during harvesting.
Throws: Exception - if an exception occurs during closing (various types of exceptions can be thrown).
Exceptions can be thrown asynchronously and may not affect the current document.
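The cleanShutdown contract can be sketched as a small driver: pass `true` only when harvesting finished without an error. This is a hedged illustration, not the framework's own driver code. `HarvesterLike` is a hypothetical stand-in that mirrors the documented open/harvest/close signatures (with a `String` in place of SingleIndexConfig), and the choice not to call close when open itself fails is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical driver illustrating the open -> harvest -> close(cleanShutdown) order.
final class LifecycleSketch {

    // Stand-in for the documented methods; the real open() takes a SingleIndexConfig.
    interface HarvesterLike {
        void open(String config) throws Exception;
        void harvest() throws Exception;
        void close(boolean cleanShutdown) throws Exception;
    }

    static void run(HarvesterLike harvester, String config) throws Exception {
        harvester.open(config);
        boolean clean = false;
        try {
            harvester.harvest();
            clean = true; // only reached if harvest() did not throw
        } finally {
            // Status information should be written only after an error-free run.
            harvester.close(clean);
        }
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        HarvesterLike failing = new HarvesterLike() {
            public void open(String config) { calls.add("open"); }
            public void harvest() throws Exception {
                calls.add("harvest");
                throw new Exception("HTTP error");
            }
            public void close(boolean cleanShutdown) { calls.add("close(" + cleanShutdown + ")"); }
        };
        try {
            run(failing, "demo");
        } catch (Exception e) {
            calls.add("failed: " + e.getMessage());
        }
        System.out.println(calls); // [open, harvest, close(false), failed: HTTP error]
    }
}
```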
public void harvest()
throws Exception
Description copied from class: Harvester
This method is called by the harvester after it has been opened with Harvester.open(de.pangaea.metadataportal.config.SingleIndexConfig). Override this
method in your harvester class.
This method should harvest files from somewhere, generate MetadataDocuments, and add
them with Harvester.addDocument(de.pangaea.metadataportal.harvester.MetadataDocument).
Specified by: harvest in class Harvester
Throws: Exception - of any type.
protected void enumerateValidHarvesterPropertyNames(Set<String> props)
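Description copied from class: Harvester
This method is used by subclasses to enumerate all available harvester properties that are implemented by them. The property names are added to the supplied Set.
The public API for client code requesting property names is Harvester.getValidHarvesterPropertyNames().
Overrides: enumerateValidHarvesterPropertyNames in class SingleFileEntitiesHarvester
See Also: Harvester.getValidHarvesterPropertyNames()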