Class OAIHarvesterBase
- java.lang.Object
-
- de.pangaea.metadataportal.harvester.Harvester
-
- de.pangaea.metadataportal.harvester.OAIHarvesterBase
-
- Direct Known Subclasses:
OAIHarvester, OAIStaticRepositoryHarvester
public abstract class OAIHarvesterBase extends Harvester
Abstract base class for OAI harvesting support in panFMP. Use one of the subclasses for harvesting OAI-PMH or OAI Static Repositories.
This harvester supports the following additional harvester properties:
- setSpec: OAI set to harvest (default: none)
- retryCount: number of retries on HTTP errors (default: 5)
- retryAfterSeconds: time between retries in seconds (default: 60)
- timeoutAfterSeconds: HTTP timeout for harvesting in seconds
- authorizationHeader: optional contents of the 'Authorization' HTTP header to be sent with the request
- metadataPrefix: OAI metadata prefix to harvest
- identifierPrefix: prepend all identifiers returned by OAI with this string
- ignoreDatestamps: does full harvesting while ignoring all datestamps. They are saved, but ignored if invalid.
- deleteMissingDocuments: after harvesting, remove documents that were deleted from the source (may be a heavy operation). The harvester only does this on full harvesting, not on incremental harvesting. (default: true)
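As an illustration of how the documented defaults apply, the following is a minimal sketch that reads a few of these properties from a java.util.Properties object. The class OaiPropertiesSketch and its fields are hypothetical and not part of panFMP; only the property names and the defaults 5, 60 and true come from the list above.

```java
import java.time.Duration;
import java.util.Properties;

/** Hypothetical holder for some harvester properties; not the panFMP implementation. */
final class OaiPropertiesSketch {
    // Defaults as documented in the property list above.
    static final int SKETCH_DEFAULT_RETRY_COUNT = 5;
    static final int SKETCH_DEFAULT_RETRY_AFTER_SECONDS = 60;

    final String setSpec;                 // null means no set restriction
    final int retryCount;
    final Duration retryAfter;
    final boolean deleteMissingDocuments;

    OaiPropertiesSketch(Properties props) {
        // getProperty(key) returns null if the key is absent.
        setSpec = props.getProperty("setSpec");
        // getProperty(key, default) falls back to the documented default.
        retryCount = Integer.parseInt(
            props.getProperty("retryCount", Integer.toString(SKETCH_DEFAULT_RETRY_COUNT)));
        retryAfter = Duration.ofSeconds(Long.parseLong(
            props.getProperty("retryAfterSeconds", Integer.toString(SKETCH_DEFAULT_RETRY_AFTER_SECONDS))));
        deleteMissingDocuments = Boolean.parseBoolean(
            props.getProperty("deleteMissingDocuments", "true"));
    }
}
```

Properties that are absent fall back to the documented defaults, while setSpec stays null when no set is configured.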
- Author:
- Uwe Schindler
-
-
Field Summary
Fields (modifier and type, field, description):
- protected String authorizationHeader: the authorizationHeader from configuration
- static int DEFAULT_RETRY_COUNT
- static int DEFAULT_RETRY_TIME
- static int DEFAULT_TIMEOUT
- protected boolean deleteMissingDocuments: If enabled, on any kind of full harvesting it will track all valid identifiers and delete all documents from the index whose identifiers were not seen.
- protected boolean filterIncomingSets: The harvester should filter incoming documents according to its set metadata.
- protected HttpClient httpClient: HttpClient to use, configured with correct connect timeout.
- protected String identifierPrefix: prepend all identifiers returned by OAI with this string
- protected boolean ignoreDatestamps: If enabled, does full harvesting, while ignoring all datestamps (default is false).
- protected String metadataPrefix: the used metadata prefix from the configuration
- static String OAI_NS
- static String OAI_STATICREPOSITORY_NS
- protected int retryCount: the retryCount from configuration
- protected int retryTime: the retryTime from configuration
- protected Set<String> sets: the sets to harvest from the configuration, null to harvest all
- protected Duration timeout: the timeout from configuration
- static String USER_AGENT
Fields inherited from class de.pangaea.metadataportal.harvester.Harvester
fromDateReference, harvestCount, HARVESTER_METADATA_FIELD_LAST_HARVESTED, harvestMessageStep, iconfig, log, processor
-
-
Constructor Summary
Constructors (constructor, description):
- OAIHarvesterBase(HarvesterConfig iconfig)
-
Method Summary
Methods (modifier and type, method, description):
- void addDocument(MetadataDocument mdoc): Adds a document to the Harvester.processor working in the background.
- protected void cancelMissingDocumentDelete(): Disable the property "deleteMissingDocuments" for this instance.
- void close(boolean cleanShutdown): Closes harvester.
- MetadataDocument createMetadataDocumentInstance(): Creates an instance of MetadataDocument and initializes it with the harvester config.
- protected boolean doParse(Supplier<ExtendedDigester> digSupplier, String url, AtomicReference<Instant> checkModifiedDate): Harvests a URL using the supplied digester.
- protected void enableMissingDocumentDelete(): Enable unseen document deletes.
- protected void enumerateValidHarvesterPropertyNames(Set<String> props): This method is used by subclasses to enumerate all available harvester properties that are implemented by them.
- protected EntityResolver getEntityResolver(EntityResolver parent): Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>).
- protected InputSource getInputSource(URI url, AtomicReference<Instant> checkModifiedDate): Returns a SAX InputSource for retrieving stream data of a URL.
- protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory(): Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
- void open(ElasticsearchConnection es, String targetIndex): Opens harvester for harvesting documents described by the given HarvesterConfig.
- protected abstract void recreateDigester(): Recreates all digesters that are used by parsing the OAI XML.
- protected void reset(): Resets the internal variables.
Methods inherited from class de.pangaea.metadataportal.harvester.Harvester
deleteDocument, finishReindex, getValidHarvesterPropertyNames, harvest, isAllIndexes, isClosed, isDocumentOutdated, main, prepareReindex, runHarvester, runHarvester, setHarvestingDateReference, setValidIdentifiers
-
-
-
-
Field Detail
-
OAI_NS
public static final String OAI_NS
- See Also:
- Constant Field Values
-
OAI_STATICREPOSITORY_NS
public static final String OAI_STATICREPOSITORY_NS
- See Also:
- Constant Field Values
-
DEFAULT_RETRY_TIME
public static final int DEFAULT_RETRY_TIME
- See Also:
- Constant Field Values
-
DEFAULT_RETRY_COUNT
public static final int DEFAULT_RETRY_COUNT
- See Also:
- Constant Field Values
-
DEFAULT_TIMEOUT
public static final int DEFAULT_TIMEOUT
- See Also:
- Constant Field Values
-
USER_AGENT
public static final String USER_AGENT
-
metadataPrefix
protected final String metadataPrefix
the used metadata prefix from the configuration
-
identifierPrefix
protected final String identifierPrefix
prepend all identifiers returned by OAI with this string
-
sets
protected final Set<String> sets
the sets to harvest from the configuration, null to harvest all
-
retryCount
protected final int retryCount
the retryCount from configuration
-
retryTime
protected final int retryTime
the retryTime from configuration
-
timeout
protected final Duration timeout
the timeout from configuration
-
authorizationHeader
protected final String authorizationHeader
the authorizationHeader from configuration
-
ignoreDatestamps
protected final boolean ignoreDatestamps
If enabled, does full harvesting, while ignoring all datestamps (default is false). They are saved, but ignored, if invalid.
-
deleteMissingDocuments
protected final boolean deleteMissingDocuments
If enabled, on any kind of full harvesting it will track all valid identifiers and delete all documents from the index whose identifiers were not seen.
-
httpClient
protected final HttpClient httpClient
HttpClient to use, configured with correct connect timeout.
-
filterIncomingSets
protected boolean filterIncomingSets
The harvester should filter incoming documents according to its set metadata. Should be disabled for OAI-PMH protocol with only one set. Default is true.
-
-
Constructor Detail
-
OAIHarvesterBase
public OAIHarvesterBase(HarvesterConfig iconfig)
-
-
Method Detail
-
open
public void open(ElasticsearchConnection es, String targetIndex) throws Exception
Description copied from class: Harvester
Opens harvester for harvesting documents described by the given HarvesterConfig. Opens Harvester.processor for usage in Harvester.harvest() method.
-
addDocument
public void addDocument(MetadataDocument mdoc) throws Exception
Description copied from class: Harvester
Adds a document to the Harvester.processor working in the background.
- Overrides:
addDocument in class Harvester
- Throws:
BackgroundFailure - if an error occurred in background thread. Exceptions can be thrown asynchronously and may not affect the current document. The real exception is thrown again in Harvester.close(boolean).
Exception
-
createMetadataDocumentInstance
public MetadataDocument createMetadataDocumentInstance()
Description copied from class: Harvester
Creates an instance of MetadataDocument and initializes it with the harvester config. This method should be overridden if a harvester uses another class.
- Overrides:
createMetadataDocumentInstance in class Harvester
-
getMetadataDocumentFactory
protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory()
Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
- See Also:
createMetadataDocumentInstance()
-
recreateDigester
protected abstract void recreateDigester()
Recreates all digesters that are used for parsing the OAI XML. This method is called once initially and again after network errors, before the same document is parsed again. This makes it possible to recover from parsing that failed somewhere in the middle of a document.
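The recover-and-retry behaviour described here can be sketched as a generic loop. This is only an illustration under assumptions: the class RetrySketch, the method withRetries and its parameters are hypothetical, and the actual panFMP retry logic may differ (for example in which exceptions count as network errors).

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical retry helper; not the panFMP implementation. */
final class RetrySketch {
    /**
     * Runs the attempt. On an IOException it waits, calls recreate (standing in
     * for recreateDigester()) and tries again, up to retryCount retries.
     */
    static <T> T withRetries(Callable<T> attempt, Runnable recreate,
                             int retryCount, long retryMillis) throws Exception {
        int failures = 0;
        while (true) {
            try {
                return attempt.call();
            } catch (IOException e) {
                failures++;
                if (failures > retryCount) {
                    throw e; // retries exhausted, give up with the last error
                }
                Thread.sleep(retryMillis);
                // Parser state may be inconsistent after a failure in the middle
                // of a document, so rebuild it before parsing the same URL again.
                recreate.run();
            }
        }
    }
}
```

With retryCount set to 5, the attempt is made at most six times: one initial call plus five retries.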
-
doParse
protected boolean doParse(Supplier<ExtendedDigester> digSupplier, String url, AtomicReference<Instant> checkModifiedDate) throws Exception
Harvests a URL using the supplied digester.
- Parameters:
digSupplier - a Supplier that gives access to a (possibly recreated) digester instance.
url - the URL that is parsed by this digester instance.
checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification; in this case false is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null to skip the last-modification check; a last modification date is then not returned (as there is no reference).
- Returns:
true if harvested, false if not modified and no harvesting was done.
- Throws:
Exception
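The contract of checkModifiedDate can be illustrated with a small stand-alone sketch. The class ModifiedCheckSketch and the method harvestIfModified are hypothetical. The comparison rule (a resource counts as modified only if its date is strictly after the referenced one) is an assumption made for this example, not something the documentation states.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration of the checkModifiedDate contract; not panFMP code. */
final class ModifiedCheckSketch {
    /**
     * Returns false if the resource is not newer than the referenced date.
     * Otherwise stores the new date in the reference and returns true.
     * A null reference disables the check and nothing is reported back.
     */
    static boolean harvestIfModified(Instant lastModified,
                                     AtomicReference<Instant> checkModifiedDate) {
        if (checkModifiedDate != null) {
            Instant known = checkModifiedDate.get();
            // Assumption: "not modified" means not strictly after the known date.
            if (known != null && !lastModified.isAfter(known)) {
                return false; // not modified, no harvesting done
            }
            checkModifiedDate.set(lastModified); // hand the new date back to the caller
        }
        // The actual parsing of the document would happen here.
        return true;
    }
}
```

The AtomicReference serves as an in/out parameter: the caller passes in the last known date and reads the updated date back after a successful harvest.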
-
getEntityResolver
protected EntityResolver getEntityResolver(EntityResolver parent)
Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>).
- Parameters:
parent - an EntityResolver that receives all unprocessed requests
- See Also:
getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>)
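A delegating resolver of this kind might look like the following sketch. The names EntityResolverSketch, UrlOpener and httpAware are hypothetical; UrlOpener stands in for getInputSource, and the prefix test on the system ID is an assumption about how HTTP URLs are recognized.

```java
import java.io.IOException;
import java.net.URI;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical delegating resolver; not the panFMP implementation. */
final class EntityResolverSketch {
    /** Stand-in for getInputSource(URI, ...). */
    interface UrlOpener {
        InputSource open(URI url) throws IOException;
    }

    static EntityResolver httpAware(EntityResolver parent, UrlOpener opener) {
        return (publicId, systemId) -> {
            // Assumption: HTTP(S) entities are recognized by their system ID prefix.
            if (systemId != null
                    && (systemId.startsWith("http://") || systemId.startsWith("https://"))) {
                return opener.open(URI.create(systemId));
            }
            // Everything else is passed on to the parent resolver, if there is one.
            // Returning null tells the SAX parser to use its default behaviour.
            return parent == null ? null : parent.resolveEntity(publicId, systemId);
        };
    }
}
```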
-
getInputSource
protected InputSource getInputSource(URI url, AtomicReference<Instant> checkModifiedDate) throws IOException
Returns a SAX InputSource for retrieving stream data of a URL. It is optimized for compression of the HTTP(S) protocol and timeout checking.
- Parameters:
url - the URL to open
checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification; in this case null is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null to skip the last-modification check; a last modification date is then not returned (as there is no reference).
- Throws:
IOException
- See Also:
getEntityResolver(org.xml.sax.EntityResolver)
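For orientation, the sketch below shows how a client with a connect timeout and a request carrying an optional Authorization header could be set up. It assumes that the HttpClient mentioned in this class is java.net.http.HttpClient (Java 11 or later), which this page does not confirm. The class HttpSetupSketch and its methods are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical HTTP setup; assumes java.net.http, which may not match panFMP. */
final class HttpSetupSketch {
    static HttpClient newClient(Duration timeout) {
        return HttpClient.newBuilder()
            .connectTimeout(timeout) // limit for establishing the connection
            .build();
    }

    static HttpRequest newRequest(URI url, Duration timeout, String authorizationHeader) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(url)
            .timeout(timeout) // limit for receiving the response
            // Asks the server for a compressed response. java.net.http does not
            // decompress automatically, so the caller would have to do that.
            .header("Accept-Encoding", "gzip")
            .GET();
        if (authorizationHeader != null) {
            builder.header("Authorization", authorizationHeader);
        }
        return builder.build();
    }
}
```

No request is sent by this sketch. It only builds the client and request objects.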
-
reset
protected void reset()
Resets the internal variables.
-
enableMissingDocumentDelete
protected void enableMissingDocumentDelete()
Enable unseen document deletes. This should be enabled by harvester before calling addDocument(MetadataDocument), so tracking can be enabled.
-
cancelMissingDocumentDelete
protected void cancelMissingDocumentDelete()
Disable the property "deleteMissingDocuments" for this instance. This can be used when the container (like a ZIP file) was not modified and the documents it contains are therefore not enumerated. Call this to prevent deletion of all these documents.
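The interplay of enabling, tracking and cancelling can be sketched with a small stand-alone tracker. MissingDocumentTrackerSketch and its methods are hypothetical and only mirror the behaviour described for enableMissingDocumentDelete(), addDocument(MetadataDocument) and cancelMissingDocumentDelete(). How panFMP stores identifiers internally is not documented here.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical identifier tracker; not the panFMP implementation. */
final class MissingDocumentTrackerSketch {
    private Set<String> seen = null; // null means tracking is disabled

    /** Start tracking; must happen before the first document is added. */
    void enable() {
        if (seen == null) {
            seen = new HashSet<>();
        }
    }

    /** Stop tracking, for example when an unmodified container was skipped. */
    void cancel() {
        seen = null;
    }

    /** Remember an identifier that was seen during this harvest. */
    void add(String identifier) {
        if (seen != null) {
            seen.add(identifier);
        }
    }

    /** Returns indexed identifiers not seen in this harvest; empty if tracking is disabled. */
    Set<String> findMissing(Set<String> indexed) {
        Set<String> missing = new TreeSet<>();
        if (seen != null) {
            missing.addAll(indexed);
            missing.removeAll(seen);
        }
        return missing;
    }
}
```

Cancelling matters because an empty set of seen identifiers would otherwise mark every indexed document as missing.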
-
close
public void close(boolean cleanShutdown) throws Exception
Description copied from class: Harvester
Closes harvester. All resources are freed and the Harvester.processor is closed.
- Overrides:
close in class Harvester
- Parameters:
cleanShutdown - enables writing of status information to the Elasticsearch instance for the next harvesting. If an error occurred during harvesting this should not be done.
- Throws:
Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not affect the current document.
-
enumerateValidHarvesterPropertyNames
protected void enumerateValidHarvesterPropertyNames(Set<String> props)
Description copied from class: Harvester
This method is used by subclasses to enumerate all available harvester properties that are implemented by them. Override this method in your own implementation and append all harvester property names to the supplied Set. The public API for client code requesting property names is Harvester.getValidHarvesterPropertyNames().
- Overrides:
enumerateValidHarvesterPropertyNames in class Harvester
- See Also:
Harvester.getValidHarvesterPropertyNames()
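The override pattern described above can be sketched with stand-in classes. BaseHarvesterSketch and OaiHarvesterSketch are hypothetical and do not reproduce the real Harvester hierarchy. The name "baseProperty" is made up for the example, while the OAI property names are those listed in the class description.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the base class; not the real Harvester. */
abstract class BaseHarvesterSketch {
    /** Public entry point: collects names from the whole class hierarchy. */
    public final Set<String> getValidHarvesterPropertyNames() {
        Set<String> props = new TreeSet<>();
        enumerateValidHarvesterPropertyNames(props);
        return Collections.unmodifiableSet(props);
    }

    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        props.add("baseProperty"); // made-up name for illustration
    }
}

/** Hypothetical subclass that appends its own property names. */
class OaiHarvesterSketch extends BaseHarvesterSketch {
    @Override
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        // Keep the names contributed by the superclass.
        super.enumerateValidHarvesterPropertyNames(props);
        props.addAll(Arrays.asList("setSpec", "retryCount", "retryAfterSeconds",
            "timeoutAfterSeconds", "authorizationHeader", "metadataPrefix",
            "identifierPrefix", "ignoreDatestamps", "deleteMissingDocuments"));
    }
}
```

Calling super first means each level of the hierarchy only has to list the properties it implements itself.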
-
-