java.lang.Object
- de.pangaea.metadataportal.harvester.Harvester
- - de.pangaea.metadataportal.harvester.OAIHarvesterBase

Direct Known Subclasses:

OAIHarvester, OAIStaticRepositoryHarvester
```
public abstract class OAIHarvesterBase
extends Harvester
```
Abstract base class for OAI harvesting support in panFMP. Use one of the subclasses for harvesting OAI-PMH or OAI Static Repositories.
This harvester supports the following additional harvester properties:
- setSpec: OAI set to harvest (default: none)
- retryCount: number of retries on HTTP errors (default: 5)
- retryAfterSeconds: time between retries in seconds (default: 60)
- timeoutAfterSeconds: HTTP Timeout for harvesting in seconds
- authorizationHeader: Optional 'Authorization' HTTP header contents to be sent with request.
- metadataPrefix: OAI metadata prefix to harvest
- identifierPrefix: prepend all identifiers returned by OAI with this string
- ignoreDatestamps: always does full harvesting and ignores all datestamps. Datestamps are still saved, but are ignored if invalid. (default: false)
- deleteMissingDocuments: after harvesting, remove documents that were deleted from the source (this can be a heavy operation). The harvester does this only on full harvesting, not on incremental harvesting. (default: true)
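The property list above can be illustrated with a minimal sketch of reading such properties with fallback defaults. The class `OaiPropertyDefaultsSketch` and its helper methods are hypothetical and not part of panFMP; only the property names and the stated defaults (5 retries, 60 seconds, `deleteMissingDocuments` enabled) come from the list above, and `oai_dc` is just an example prefix.

```java
import java.util.Properties;

/** Hypothetical helper, not part of panFMP: reads harvester properties with defaults. */
final class OaiPropertyDefaultsSketch {
    // Defaults as stated in the property list of this page.
    static final int DEFAULT_RETRY_COUNT = 5;
    static final int DEFAULT_RETRY_AFTER_SECONDS = 60;

    /** Returns the integer value of a property, or the default if it is not set. */
    static int intProperty(Properties props, String name, int defaultValue) {
        String value = props.getProperty(name);
        return (value == null) ? defaultValue : Integer.parseInt(value.trim());
    }

    /** Returns the boolean value of a property, or the default if it is not set. */
    static boolean booleanProperty(Properties props, String name, boolean defaultValue) {
        String value = props.getProperty(name);
        return (value == null) ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("metadataPrefix", "oai_dc"); // example value only
        props.setProperty("retryCount", "3");          // overrides the default of 5

        int retryCount = intProperty(props, "retryCount", DEFAULT_RETRY_COUNT);
        int retryAfter = intProperty(props, "retryAfterSeconds", DEFAULT_RETRY_AFTER_SECONDS);
        boolean deleteMissing = booleanProperty(props, "deleteMissingDocuments", true);

        System.out.println("retryCount=" + retryCount
            + ", retryAfterSeconds=" + retryAfter
            + ", deleteMissingDocuments=" + deleteMissing);
    }
}
```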
Author:

Uwe Schindler

Field Summary

Fields
Modifier and Type	Field	Description
`protected String`	`authorizationHeader`	the authorizationHeader from configuration
`static int`	`DEFAULT_RETRY_COUNT`
`static int`	`DEFAULT_RETRY_TIME`
`static int`	`DEFAULT_TIMEOUT`
`protected boolean`	`deleteMissingDocuments`	If enabled, on any kind of full harvesting it will track all valid identifiers and delete all documents in the index whose identifiers were not seen.
`protected boolean`	`filterIncomingSets`	The harvester should filter incoming documents according to its set metadata.
`protected HttpClient`	`httpClient`	HttpClient to use, configured with correct connect timeout.
`protected String`	`identifierPrefix`	prepend all identifiers returned by OAI with this string
`protected boolean`	`ignoreDatestamps`	If enabled, does full harvesting, while ignoring all datestamps (default is `false`).
`protected String`	`metadataPrefix`	the used metadata prefix from the configuration
`static String`	`OAI_NS`
`static String`	`OAI_STATICREPOSITORY_NS`
`protected int`	`retryCount`	the retryCount from configuration
`protected int`	`retryTime`	the retryTime from configuration
`protected Set<String>`	`sets`	the sets to harvest from the configuration, `null` to harvest all
`protected Duration`	`timeout`	the timeout from configuration
`static String`	`USER_AGENT`

Fields inherited from class de.pangaea.metadataportal.harvester.Harvester
fromDateReference, harvestCount, HARVESTER_METADATA_FIELD_LAST_HARVESTED, harvestMessageStep, iconfig, log, processor

Constructor Summary

Constructors
Constructor Description

OAIHarvesterBase(HarvesterConfig iconfig)

Method Summary

Modifier and Type	Method	Description
`void`	`addDocument(MetadataDocument mdoc)`	Adds a document to the `Harvester.processor` working in the background.
`protected void`	`cancelMissingDocumentDelete()`	Disable the property "deleteMissingDocuments" for this instance.
`void`	`close(boolean cleanShutdown)`	Closes harvester.
`MetadataDocument`	`createMetadataDocumentInstance()`	Creates an instance of MetadataDocument and initializes it with the harvester config.
`protected boolean`	`doParse(Supplier<ExtendedDigester> digSupplier, String url, AtomicReference<Instant> checkModifiedDate)`	Harvests a URL using the supplied digester.
`protected void`	`enableMissingDocumentDelete()`	Enable unseen document deletes.
`protected void`	`enumerateValidHarvesterPropertyNames(Set<String> props)`	This method is used by subclasses to enumerate all available harvester properties that are implemented by them.
`protected EntityResolver`	`getEntityResolver(EntityResolver parent)`	Returns an `EntityResolver` that resolves all HTTP URLs using `getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>)`.
`protected InputSource`	`getInputSource(URI url, AtomicReference<Instant> checkModifiedDate)`	Returns a SAX `InputSource` for retrieving stream data of a URL.
`protected org.apache.commons.digester.ObjectCreationFactory`	`getMetadataDocumentFactory()`	Returns a factory for creating the `MetadataDocument`s in Digester code (using `FactoryCreateRule`).
`void`	`open(ElasticsearchConnection es, String targetIndex)`	Opens harvester for harvesting documents described by the given `HarvesterConfig`.
`protected abstract void`	`recreateDigester()`	Recreates all digesters that are used by parsing the OAI XML.
`protected void`	`reset()`	Resets the internal variables.

Methods inherited from class de.pangaea.metadataportal.harvester.Harvester
deleteDocument, finishReindex, getValidHarvesterPropertyNames, harvest, isAllIndexes, isClosed, isDocumentOutdated, main, prepareReindex, runHarvester, runHarvester, setHarvestingDateReference, setValidIdentifiers

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - OAI_NS
```
public static final String OAI_NS
```
    See Also:
    
    Constant Field Values
  - OAI_STATICREPOSITORY_NS
```
public static final String OAI_STATICREPOSITORY_NS
```
    See Also:
    
    Constant Field Values
  - DEFAULT_RETRY_TIME
```
public static final int DEFAULT_RETRY_TIME
```
    See Also:
    
    Constant Field Values
  - DEFAULT_RETRY_COUNT
```
public static final int DEFAULT_RETRY_COUNT
```
    See Also:
    
    Constant Field Values
  - DEFAULT_TIMEOUT
```
public static final int DEFAULT_TIMEOUT
```
    See Also:
    
    Constant Field Values
  - USER_AGENT
```
public static final String USER_AGENT
```
  - metadataPrefix
```
protected final String metadataPrefix
```
    the used metadata prefix from the configuration
  - identifierPrefix
```
protected final String identifierPrefix
```
    prepend all identifiers returned by OAI with this string
  - sets
```
protected final Set<String> sets
```
    the sets to harvest from the configuration, null to harvest all
  - retryCount
```
protected final int retryCount
```
    the retryCount from configuration
  - retryTime
```
protected final int retryTime
```
    the retryTime from configuration
  - timeout
```
protected final Duration timeout
```
    the timeout from configuration
  - authorizationHeader
```
protected final String authorizationHeader
```
    the authorizationHeader from configuration
  - ignoreDatestamps
```
protected final boolean ignoreDatestamps
```
    If enabled, does full harvesting, while ignoring all datestamps (default is false). Datestamps are still saved, but are ignored if invalid.
  - deleteMissingDocuments
```
protected final boolean deleteMissingDocuments
```
    If enabled, on any kind of full harvesting it will track all valid identifiers and delete all documents in the index whose identifiers were not seen.
  - httpClient
```
protected final HttpClient httpClient
```
    HttpClient to use, configured with correct connect timeout.
  - filterIncomingSets
```
protected boolean filterIncomingSets
```
    The harvester should filter incoming documents according to its set metadata. Should be disabled for OAI-PMH protocol with only one set. Default is true.
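    One plausible reading of this flag is sketched here: a document is accepted if filtering is off, if no sets are configured (`sets` is `null`, meaning harvest all), or if the document belongs to at least one configured set. The class `SetFilterSketch` and the method `accept` are hypothetical, and the actual panFMP filtering logic may differ.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical sketch, not panFMP code: set-based filtering of incoming documents. */
final class SetFilterSketch {
    /**
     * @param filterIncomingSets whether filtering is enabled at all
     * @param configuredSets     sets to harvest, or null to harvest everything
     * @param documentSets       sets the incoming document claims membership in
     * @return true if the document should be kept
     */
    static boolean accept(boolean filterIncomingSets,
                          Set<String> configuredSets,
                          Set<String> documentSets) {
        if (!filterIncomingSets || configuredSets == null) {
            return true; // nothing to filter against
        }
        // Keep the document if it shares at least one set with the configuration.
        return !Collections.disjoint(configuredSets, documentSets);
    }

    public static void main(String[] args) {
        Set<String> configured = Set.of("geology", "biology"); // example set names
        System.out.println(accept(true, configured, Set.of("biology", "chemistry")));
        System.out.println(accept(true, configured, Set.of("chemistry")));
    }
}
```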
- Constructor Detail
  - OAIHarvesterBase
```
public OAIHarvesterBase(HarvesterConfig iconfig)
```
- Method Detail
  - open
```
public void open(ElasticsearchConnection es,
                 String targetIndex)
          throws Exception
```
    Description copied from class: Harvester
    
    Opens harvester for harvesting documents described by the given HarvesterConfig. Opens Harvester.processor for usage in Harvester.harvest() method.
    
    Overrides:
    
    open in class Harvester
    
    Throws:
    
    Exception - if an exception occurs during opening (various types of exceptions can be thrown).
  - addDocument
```
public void addDocument(MetadataDocument mdoc)
                 throws Exception
```
    Description copied from class: Harvester
    
    Adds a document to the Harvester.processor working in the background.
    
    Overrides:
    
    addDocument in class Harvester
    
    Throws:
    
    BackgroundFailure - if an error occurred in the background thread. Exceptions can be thrown asynchronously and may not affect the current document. The real exception is thrown again in Harvester.close(boolean).
    
    Exception
  - createMetadataDocumentInstance
```
public MetadataDocument createMetadataDocumentInstance()
```
    Description copied from class: Harvester
    
    Creates an instance of MetadataDocument and initializes it with the harvester config. This method should be overridden if a harvester uses another class.
    
    Overrides:
    
    createMetadataDocumentInstance in class Harvester
  - getMetadataDocumentFactory
```
protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory()
```
    Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
    
    See Also:
    
    createMetadataDocumentInstance()
  - recreateDigester
```
protected abstract void recreateDigester()
```
    Recreates all digesters that are used for parsing the OAI XML. This method is called once initially and later, after network errors, before parsing the same document again. This makes it possible to recover when parsing fails somewhere in the middle of a document.
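    The recreate-and-retry behavior described above can be sketched as a generic loop. This is an illustration under stated assumptions, not the panFMP implementation: `RetryWithFreshParserSketch`, `ParseAttempt`, and `parseWithRetry` are hypothetical names, a `Runnable` stands in for `recreateDigester()`, and the retry limit and wait time play the role of `retryCount` and `retryTime`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch, not panFMP code: retry parsing with freshly created parser state. */
final class RetryWithFreshParserSketch {
    /** Stand-in for one parse attempt that may fail with a network error. */
    @FunctionalInterface
    interface ParseAttempt {
        void run() throws IOException;
    }

    /**
     * Runs the attempt, recreating the parser before the first try and after each failure.
     * @return the number of failed attempts before the successful one
     */
    static int parseWithRetry(Runnable recreateParser, ParseAttempt attempt,
                              int retryCount, long retryMillis) {
        recreateParser.run(); // initial creation
        int failures = 0;
        while (true) {
            try {
                attempt.run();
                return failures;
            } catch (IOException e) {
                failures++;
                if (failures > retryCount) {
                    throw new UncheckedIOException(
                        "giving up after " + failures + " failed attempts", e);
                }
                sleep(retryMillis);
                recreateParser.run(); // discard half-consumed parser state
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        int failures = parseWithRetry(
            () -> System.out.println("parser state recreated"),
            () -> { if (++calls[0] < 3) throw new IOException("simulated network error"); },
            5, 10L);
        System.out.println("succeeded after " + failures + " failed attempts");
    }
}
```

    Recreating the parser after every failure, rather than reusing it, avoids resuming from a state in which part of the document has already been consumed.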
  - doParse
```
protected boolean doParse(Supplier<ExtendedDigester> digSupplier,
                          String url,
                          AtomicReference<Instant> checkModifiedDate)
                   throws Exception
```
    Harvests a URL using the supplied digester.
    
    Parameters:
    
    digSupplier - a Supplier that gives access to a (possibly recreated) digester instance.
    
    url - the URL to be parsed by this digester instance.
    
    checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification. In this case false is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null to skip the last-modification check; no last modification date is returned in that case (as there is no reference).
    
    Returns:
    
    true if harvested, false if not modified and no harvesting was done.
    
    Throws:
    
    Exception
  - getEntityResolver
```
protected EntityResolver getEntityResolver(EntityResolver parent)
```
    Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>).
    
    Parameters:
    
    parent - an EntityResolver that receives all unprocessed requests
    
    See Also:
    
    getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>)
  - getInputSource
```
protected InputSource getInputSource(URI url,
                                     AtomicReference<Instant> checkModifiedDate)
                              throws IOException
```
    Returns a SAX InputSource for retrieving stream data of a URL. It is optimized for compression of the HTTP(S) protocol and timeout checking.
    
    Parameters:
    
    url - the URL to open
    
    checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification. In this case null is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null to skip the last-modification check; no last modification date is returned in that case (as there is no reference).
    
    Throws:
    
    IOException
    
    See Also:
    
    getEntityResolver(org.xml.sax.EntityResolver)
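    The `checkModifiedDate` contract resembles a conditional HTTP request. The sketch below shows one common way to implement such a check with the JDK's `java.net.http` API, assuming an `If-Modified-Since` request header and a `304 Not Modified` response; whether panFMP does exactly this is not stated on this page. `ConditionalFetchSketch`, `buildRequest`, and `wasModified` are hypothetical names.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch, not panFMP code: last-modified checking via conditional GET. */
final class ConditionalFetchSketch {
    // HTTP date format with a two-digit day, always expressed in GMT.
    static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
        .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
        .withZone(ZoneOffset.UTC);

    /** Builds a GET request; adds If-Modified-Since only if a reference date is supplied. */
    static HttpRequest buildRequest(URI url, Duration timeout,
                                    AtomicReference<Instant> checkModifiedDate) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(url).timeout(timeout).GET();
        if (checkModifiedDate != null && checkModifiedDate.get() != null) {
            builder.header("If-Modified-Since", HTTP_DATE.format(checkModifiedDate.get()));
        }
        return builder.build();
    }

    /**
     * Interprets the response: returns false on 304 Not Modified, otherwise stores the
     * Last-Modified date (if present) in the reference and returns true.
     */
    static boolean wasModified(int statusCode, Optional<String> lastModifiedHeader,
                               AtomicReference<Instant> checkModifiedDate) {
        if (checkModifiedDate == null) {
            return true; // no checking requested
        }
        if (statusCode == 304) {
            return false;
        }
        lastModifiedHeader
            .map(v -> ZonedDateTime.parse(v, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant())
            .ifPresent(checkModifiedDate::set);
        return true;
    }

    public static void main(String[] args) {
        AtomicReference<Instant> ref = new AtomicReference<>(Instant.EPOCH);
        HttpRequest request = buildRequest(
            URI.create("http://example.org/static-repository.xml"), // example URL only
            Duration.ofSeconds(30), ref);
        System.out.println(request.headers().map());
    }
}
```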
  - reset
```
protected void reset()
```
    Resets the internal variables.
  - enableMissingDocumentDelete
```
protected void enableMissingDocumentDelete()
```
    Enables deletion of unseen documents. The harvester should call this before calling addDocument(MetadataDocument), so that tracking can be enabled.
  - cancelMissingDocumentDelete
```
protected void cancelMissingDocumentDelete()
```
    Disables the property "deleteMissingDocuments" for this instance. This can be used when the container (like a ZIP file) was not modified and the documents it contains are therefore not enumerated. Call this to prevent deletion of all these documents.
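    The interplay of enabling, tracking, and cancelling can be sketched with a small tracker. This is an assumption about the general technique, not the panFMP implementation: `MissingDocumentTrackerSketch` and `identifiersToDelete` are hypothetical, and documents are reduced to their identifier strings.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch, not panFMP code: tracking seen identifiers to find missing documents. */
final class MissingDocumentTrackerSketch {
    private Set<String> seen = null; // null means tracking is disabled

    /** Starts tracking; must be called before the first document is added. */
    void enableMissingDocumentDelete() {
        seen = new HashSet<>();
    }

    /** Stops tracking, e.g. when an unmodified container was skipped entirely. */
    void cancelMissingDocumentDelete() {
        seen = null;
    }

    /** Records the identifier of a harvested document if tracking is enabled. */
    void addDocument(String identifier) {
        if (seen != null) {
            seen.add(identifier);
        }
    }

    /** Returns the indexed identifiers that were not seen; empty if tracking is disabled. */
    Set<String> identifiersToDelete(Set<String> indexedIdentifiers) {
        Set<String> result = new TreeSet<>();
        if (seen == null) {
            return result; // without tracking, nothing may be deleted
        }
        for (String id : indexedIdentifiers) {
            if (!seen.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        MissingDocumentTrackerSketch tracker = new MissingDocumentTrackerSketch();
        tracker.enableMissingDocumentDelete();
        tracker.addDocument("oai:example:1");
        tracker.addDocument("oai:example:2");
        System.out.println(tracker.identifiersToDelete(
            Set.of("oai:example:1", "oai:example:2", "oai:example:3")));
    }
}
```

    Cancelling matters because an empty set of seen identifiers is indistinguishable from "everything was deleted at the source"; without it, skipping an unmodified container would wipe its documents from the index.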
  - close
```
public void close(boolean cleanShutdown)
           throws Exception
```
    Description copied from class: Harvester
    
    Closes harvester. All resources are freed and the Harvester.processor is closed.
    
    Overrides:
    
    close in class Harvester
    
    Parameters:
    
    cleanShutdown - enables writing of status information to the Elasticsearch instance for the next harvesting. If an error occurred during harvesting this should not be done.
    
    Throws:
    
    Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not affect the current document.
  - enumerateValidHarvesterPropertyNames
```
protected void enumerateValidHarvesterPropertyNames(Set<String> props)
```
    Description copied from class: Harvester
    
    This method is used by subclasses to enumerate all available harvester properties that are implemented by them. Override this method in your own implementation and append all harvester property names to the supplied Set. The public API for client code requesting property names is Harvester.getValidHarvesterPropertyNames().
    
    Overrides:
    
    enumerateValidHarvesterPropertyNames in class Harvester
    
    See Also:
    
    Harvester.getValidHarvesterPropertyNames()
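    The override pattern described above can be sketched with stand-in classes. `BaseHarvester` and `OaiLikeHarvester` are hypothetical and only mimic the two method names from this page; `exampleBaseProperty` is a placeholder, while the OAI property names are taken from the class description.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch, not panFMP code: subclasses appending their property names. */
final class PropertyEnumerationSketch {
    /** Stand-in for the Harvester base class. */
    static class BaseHarvester {
        protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
            props.add("exampleBaseProperty"); // placeholder, not a real panFMP property
        }

        /** Public entry point: collects the names contributed by the whole class chain. */
        public final Set<String> getValidHarvesterPropertyNames() {
            Set<String> props = new TreeSet<>();
            enumerateValidHarvesterPropertyNames(props);
            return props;
        }
    }

    /** Stand-in for an OAI harvester that adds the properties listed on this page. */
    static class OaiLikeHarvester extends BaseHarvester {
        @Override
        protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
            super.enumerateValidHarvesterPropertyNames(props); // keep inherited names
            props.addAll(List.of(
                "setSpec", "retryCount", "retryAfterSeconds", "timeoutAfterSeconds",
                "authorizationHeader", "metadataPrefix", "identifierPrefix",
                "ignoreDatestamps", "deleteMissingDocuments"));
        }
    }

    public static void main(String[] args) {
        System.out.println(new OaiLikeHarvester().getValidHarvesterPropertyNames());
    }
}
```

    Calling the superclass method first ensures that inherited property names are not lost when a subclass adds its own.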
