Class WebCrawlingHarvester


  • public class WebCrawlingHarvester
    extends SingleFileEntitiesHarvester
    Harvester for traversing websites and harvesting XML documents. If the baseURL (from the config) points to an XML file with the correct MIME type, it is harvested directly. An HTML webpage is analyzed and all links are followed and checked for XML files with the correct MIME type. This is done recursively, but crawling never leaves the server or the directory of the baseURL.
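
    The following is a minimal, simplified sketch of how such a crawl could work; it is not the implementation of this class. The class name CrawlSketch, the methods crawl/harvestXml and the naive href regex are illustrative assumptions only, and a real crawler would also honor retryCount, pauseBetweenRequests and the other properties listed below.

        import java.io.IOException;
        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.util.HashSet;
        import java.util.Set;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class CrawlSketch {
          // Hypothetical: harvest only these MIME types (cf. the "contentTypes" property).
          private static final Set<String> XML_TYPES = Set.of("text/xml", "application/xml");
          // Naive link extraction; a real crawler would use a proper HTML parser.
          private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

          private final HttpClient client = HttpClient.newHttpClient();
          private final Set<URI> seen = new HashSet<>();
          private final URI base;

          public CrawlSketch(URI base) { this.base = base; }

          public void crawl(URI url) throws IOException, InterruptedException {
            // Never escape the server and baseURL directory, and visit each URL only once.
            if (!url.toString().startsWith(base.toString()) || !seen.add(url)) return;
            HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(url).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            String type = resp.headers().firstValue("Content-Type").orElse("");
            if (XML_TYPES.stream().anyMatch(type::startsWith)) {
              harvestXml(url, resp.body());             // hand the XML document over for indexing
            } else if (type.startsWith("text/html")) {
              Matcher m = HREF.matcher(resp.body());
              while (m.find()) {
                crawl(url.resolve(m.group(1)));         // follow links recursively
              }
            }
          }

          private void harvestXml(URI url, String xml) {
            System.out.println("Harvested: " + url);    // placeholder for the real harvesting step
          }
        }
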

    This harvester supports the following additional harvester properties:

    • baseUrl: URL to start crawling (should point to an HTML page).
    • retryCount: how often to retry on HTTP errors (default: 5)
    • retryAfterSeconds: time between retries in seconds (default: 60)
    • timeoutAfterSeconds: HTTP timeout for harvesting, in seconds
    • authorizationHeader: Optional 'Authorization' HTTP header contents to be sent with request.
    • filenameFilter: regular expression to match the filename. The regex is matched against the whole filename (as if anchored with ^pattern$). (default: none)
    • contentTypes: MIME types of documents to index (possibly further limited by filenameFilter). (default: "text/xml,application/xml")
    • excludeUrlPattern: A regex that is applied to all URLs encountered during the harvesting process. Matching URLs (partial matches are allowed; use ^ and $ for start/end anchors) are excluded and not traversed further. (default: none)
    • pauseBetweenRequests: to avoid overloading the harvested server, wait the given number of milliseconds after each HTTP request (default: none)
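
    To illustrate the regex semantics described above (filenameFilter is matched against the whole filename, excludeUrlPattern allows partial matches), here is a small standalone Java snippet using java.util.regex; the sample patterns, filenames and URL are assumptions for demonstration only.

        import java.util.regex.Pattern;

        public class FilterSemanticsDemo {
          public static void main(String[] args) {
            // filenameFilter is applied against the whole filename (implicitly ^pattern$):
            Pattern filenameFilter = Pattern.compile(".*\\.xml");                        // hypothetical pattern
            System.out.println(filenameFilter.matcher("record-0001.xml").matches());     // true  -> harvested
            System.out.println(filenameFilter.matcher("record-0001.xml.bak").matches()); // false -> skipped

            // excludeUrlPattern allows partial matches; anchor with ^ and $ if needed:
            Pattern excludeUrlPattern = Pattern.compile("/private/");                    // hypothetical pattern
            String url = "http://www.example.org/data/private/file.xml";
            System.out.println(excludeUrlPattern.matcher(url).find());                   // true -> URL excluded
          }
        }
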
    Author:
    Uwe Schindler