public abstract class OAIHarvesterBase extends Harvester
This harvester supports the following additional harvester properties:
setSpec
: OAI set to harvest (default: none)
retryCount
: how many times to retry on HTTP errors (default: 5)
retryAfterSeconds
: time between retries in seconds (default: 60)
timeoutAfterSeconds
: HTTP timeout for harvesting in seconds
metadataPrefix
: OAI metadata prefix to harvest
identifierPrefix
: prepend all identifiers returned by OAI with this string
ignoreDatestamps
: does full harvesting, while ignoring all datestamps. They are saved, but ignored, if invalid. (default: false)
deleteMissingDocuments
: remove documents after harvesting that were deleted from source (possibly a heavy operation). The harvester only does this on full harvesting, not on incremental harvesting. (default: true)
Modifier and Type | Field and Description |
---|---|
static int |
DEFAULT_RETRY_COUNT |
static int |
DEFAULT_RETRY_TIME |
static int |
DEFAULT_TIMEOUT |
protected boolean |
deleteMissingDocuments
If enabled, any kind of full harvesting tracks all valid identifiers and afterwards deletes all documents from the index whose identifiers were not seen.
|
protected boolean |
filterIncomingSets
The harvester should filter incoming documents according to its set
metadata.
|
protected String |
identifierPrefix
prepend all identifiers returned by OAI with this string
|
protected boolean |
ignoreDatestamps
If enabled, does full harvesting, while ignoring all datestamps (default is false). |
protected String |
metadataPrefix
the used metadata prefix from the configuration
|
static String |
OAI_NS |
static String |
OAI_STATICREPOSITORY_NS |
protected int |
retryCount
the retryCount from configuration
|
protected int |
retryTime
the retryTime from configuration
|
protected Set<String> |
sets
the sets to harvest from the configuration,
null to harvest all |
protected int |
timeout
the timeout from configuration
|
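The relationship between the harvester properties and the fields in this table can be illustrated with a minimal sketch. It is not the panFMP implementation: the class `OaiSettingsSketch`, its helper `intProp`, and the use of a plain `Map` as the property source are hypothetical. The defaults for `retryCount`, `retryAfterSeconds`, `ignoreDatestamps` and `deleteMissingDocuments` are the ones stated in the property list; the documentation gives no default for `timeoutAfterSeconds`, so the value used here is an assumed placeholder.

```java
import java.util.Map;

/** Hypothetical holder mirroring the documented properties; not the panFMP implementation. */
final class OaiSettingsSketch {
    // Defaults stated in the property list of this page.
    static final int DEFAULT_RETRY_COUNT = 5;
    static final int DEFAULT_RETRY_TIME = 60; // seconds
    // ASSUMPTION: the page does not state a default timeout; 180 is a placeholder.
    static final int ASSUMED_DEFAULT_TIMEOUT = 180; // seconds

    final int retryCount;
    final int retryTime;
    final int timeout;
    final boolean ignoreDatestamps;
    final boolean deleteMissingDocuments;

    OaiSettingsSketch(Map<String, String> props) {
        retryCount = intProp(props, "retryCount", DEFAULT_RETRY_COUNT);
        retryTime = intProp(props, "retryAfterSeconds", DEFAULT_RETRY_TIME);
        timeout = intProp(props, "timeoutAfterSeconds", ASSUMED_DEFAULT_TIMEOUT);
        ignoreDatestamps = Boolean.parseBoolean(props.getOrDefault("ignoreDatestamps", "false"));
        deleteMissingDocuments = Boolean.parseBoolean(props.getOrDefault("deleteMissingDocuments", "true"));
    }

    /** Returns the integer value of a property, or the default if it is absent or blank. */
    private static int intProp(Map<String, String> props, String key, int defaultValue) {
        String value = props.get(key);
        return (value == null || value.trim().isEmpty()) ? defaultValue : Integer.parseInt(value.trim());
    }
}
```

Reading every value once in the constructor and storing it in a final field matches the way the fields above are documented as coming "from the configuration".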
Fields inherited from class Harvester
fromDateReference, harvestCount, HARVESTER_METADATA_FIELD_LAST_HARVESTED, harvestMessageStep, iconfig, log, processor
Constructor and Description |
---|
OAIHarvesterBase(HarvesterConfig iconfig) |
Modifier and Type | Method and Description |
---|---|
void |
addDocument(MetadataDocument mdoc)
Adds a document to the
Harvester.processor working in the background. |
protected void |
cancelMissingDocumentDelete()
Disable the property "deleteMissingDocuments" for this instance.
|
void |
close(boolean cleanShutdown)
Closes harvester.
|
MetadataDocument |
createMetadataDocumentInstance()
Creates an instance of MetadataDocument and initializes it with the harvester
config.
|
protected boolean |
doParse(Supplier<ExtendedDigester> digSupplier,
String url,
AtomicReference<Instant> checkModifiedDate)
Harvests a URL using the supplied digester.
|
protected void |
enableMissingDocumentDelete()
Enable unseen document deletes.
|
protected void |
enumerateValidHarvesterPropertyNames(Set<String> props)
This method is used by subclasses to enumerate all available harvester
properties that are implemented by them.
|
protected EntityResolver |
getEntityResolver(EntityResolver parent)
Returns an
EntityResolver that resolves all HTTP URLs using
getInputSource(java.net.URL, java.util.concurrent.atomic.AtomicReference<java.time.Instant>). |
protected InputSource |
getInputSource(URL url,
AtomicReference<Instant> checkModifiedDate)
Returns a SAX
InputSource for retrieving stream data of a
URL. |
protected org.apache.commons.digester.ObjectCreationFactory |
getMetadataDocumentFactory()
Returns a factory for creating the
MetadataDocuments in Digester
code (using FactoryCreateRule). |
void |
open(ElasticsearchConnection es,
String targetIndex)
Opens harvester for harvesting documents described by the
given
HarvesterConfig . |
protected abstract void |
recreateDigester()
Recreates all digesters that are used by parsing the OAI XML.
|
protected void |
reset()
Resets the internal variables.
|
Methods inherited from class Harvester
deleteDocument, finishReindex, getValidHarvesterPropertyNames, harvest, isAllIndexes, isClosed, isDocumentOutdated, main, prepareReindex, runHarvester, runHarvester, setHarvestingDateReference, setValidIdentifiers
public static final String OAI_NS
public static final String OAI_STATICREPOSITORY_NS
public static final int DEFAULT_RETRY_TIME
public static final int DEFAULT_RETRY_COUNT
public static final int DEFAULT_TIMEOUT
protected final String metadataPrefix
protected final String identifierPrefix
protected final Set<String> sets
the sets to harvest from the configuration, null to harvest all
protected final int retryCount
protected final int retryTime
protected final int timeout
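The retryCount and retryTime settings describe a bounded retry on HTTP errors. The following is a minimal sketch of such a loop, not the harvester's actual code: the class `RetrySketch` and the method `withRetries` are hypothetical, the wait is given in milliseconds for easy testing, and it is an assumption that `retryCount` counts retries after the first attempt rather than total attempts.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical bounded-retry helper; not part of panFMP. */
final class RetrySketch {
    /**
     * Runs the action. On an IOException it waits and tries again,
     * at most retryCount additional times, then rethrows the last error.
     */
    static <T> T withRetries(Callable<T> action, int retryCount, long waitMillis) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                if (attempt >= retryCount) {
                    throw e; // retries exhausted, give up
                }
                Thread.sleep(waitMillis); // pause before the next attempt
            }
        }
    }
}
```

With the documented defaults this would correspond to a call such as `withRetries(fetch, 5, 60_000L)`, where `fetch` stands for whatever performs the HTTP request.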
protected final boolean ignoreDatestamps
If enabled, does full harvesting, while ignoring all datestamps (default is false). They are saved, but ignored, if invalid.
protected final boolean deleteMissingDocuments
protected boolean filterIncomingSets
The harvester should filter incoming documents according to its set metadata. … true.
public OAIHarvesterBase(HarvesterConfig iconfig)
public void open(ElasticsearchConnection es, String targetIndex) throws Exception
Description copied from class: Harvester
Opens harvester for harvesting documents described by the given HarvesterConfig. Opens Harvester.processor for usage in Harvester.harvest() method.
public void addDocument(MetadataDocument mdoc) throws Exception
Description copied from class: Harvester
Adds a document to the Harvester.processor working in the background.
Overrides:
addDocument in class Harvester
Throws:
BackgroundFailure - if an error occurred in background thread. Exceptions can be thrown asynchronously and may not affect the current document. The real exception is thrown again in Harvester.close(boolean).
Exception
public MetadataDocument createMetadataDocumentInstance()
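Description copied from class: Harvester
Creates an instance of MetadataDocument and initializes it with the harvester config.
Overrides:
createMetadataDocumentInstance in class Harvester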
protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory()
Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
See Also:
createMetadataDocumentInstance()
protected abstract void recreateDigester()
protected boolean doParse(Supplier<ExtendedDigester> digSupplier, String url, AtomicReference<Instant> checkModifiedDate) throws Exception
Harvests a URL using the supplied digester.
Parameters:
digSupplier - a Supplier that gives access to a (possibly recreated) digester instance.
url - the URL that is parsed by this digester instance.
checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification; in this case false is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null to skip the check of last modification; a last modification date is then not returned (as there is no reference).
Returns:
true if harvested, false if not modified and no harvesting was done.
Throws:
Exception
protected EntityResolver getEntityResolver(EntityResolver parent)
Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URL, java.util.concurrent.atomic.AtomicReference<java.time.Instant>).
Parameters:
parent - an EntityResolver that receives all unprocessed requests
See Also:
getInputSource(java.net.URL, java.util.concurrent.atomic.AtomicReference<java.time.Instant>)
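The delegation described for getEntityResolver (HTTP URLs handled locally, everything else passed to the parent) can be sketched with the standard SAX interfaces. This is an illustration only: the class `HttpFirstResolverSketch` is hypothetical, and the `httpOpener` function is a stand-in for getInputSource, whose real behaviour is not reproduced here.

```java
import java.io.IOException;
import java.util.Locale;
import java.util.function.Function;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical resolver: handles HTTP(S) system IDs itself, delegates the rest. */
final class HttpFirstResolverSketch implements EntityResolver {
    private final EntityResolver parent;                    // may be null
    private final Function<String, InputSource> httpOpener; // stand-in for getInputSource

    HttpFirstResolverSketch(EntityResolver parent, Function<String, InputSource> httpOpener) {
        this.parent = parent;
        this.httpOpener = httpOpener;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
        if (systemId != null) {
            String lower = systemId.toLowerCase(Locale.ROOT);
            if (lower.startsWith("http://") || lower.startsWith("https://")) {
                return httpOpener.apply(systemId); // HTTP URLs are opened by our own opener
            }
        }
        // All unprocessed requests go to the parent; returning null lets the
        // SAX parser fall back to its default resolution.
        return (parent == null) ? null : parent.resolveEntity(publicId, systemId);
    }
}
```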
protected InputSource getInputSource(URL url, AtomicReference<Instant> checkModifiedDate) throws IOException
Returns a SAX InputSource for retrieving stream data of a URL. It is optimized for compression of the HTTP(S) protocol and timeout checking.
Parameters:
url - the URL to open
checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification; in this case null is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null to skip the check of last modification; a last modification date is then not returned (as there is no reference).
Throws:
IOException
See Also:
getEntityResolver(org.xml.sax.EntityResolver)
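The checkModifiedDate contract (an optional reference that is compared against the source's modification time and updated when the source changed) can be sketched as a small decision function. The class `ModifiedCheckSketch` and the method `shouldFetch` are hypothetical, and how the modification time is obtained from the server is deliberately left out; only the reference handling described in the parameter documentation is shown.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper showing the reference handling only; not the panFMP code. */
final class ModifiedCheckSketch {
    /**
     * @param serverLastModified modification time reported by the source, or null if unknown
     * @param checkModifiedDate  reference to the last known modification time, or null for no checking
     * @return true if the content should be read, false if it was not modified
     */
    static boolean shouldFetch(Instant serverLastModified, AtomicReference<Instant> checkModifiedDate) {
        if (checkModifiedDate == null) {
            return true; // no reference supplied: always read, nothing is reported back
        }
        Instant known = checkModifiedDate.get();
        if (known != null && serverLastModified != null && !serverLastModified.isAfter(known)) {
            return false; // not modified since the last check
        }
        if (serverLastModified != null) {
            checkModifiedDate.set(serverLastModified); // report the new modification date
        }
        return true;
    }
}
```

A caller following the documented contract would return null (getInputSource) or false (doParse) whenever this check yields false.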
protected void reset()
protected void enableMissingDocumentDelete()
Enable unseen document deletes. … addDocument(MetadataDocument), so tracking can be enabled.
protected void cancelMissingDocumentDelete()
Disable the property "deleteMissingDocuments" for this instance.
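The tracking behind deleteMissingDocuments (collect every identifier seen during a full harvest, then remove what the index still holds but the source no longer delivers) can be sketched with two sets. The class `MissingDocumentTrackerSketch` and all of its method names are hypothetical; the real harvester works against Elasticsearch, which is replaced here by a plain set of indexed identifiers.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory model of missing-document tracking; not the panFMP code. */
final class MissingDocumentTrackerSketch {
    private Set<String> seen; // null while tracking is disabled

    /** Starts tracking; must happen before documents are added. */
    void enable() {
        seen = new HashSet<>();
    }

    /** Stops tracking, e.g. when the harvest turns out to be incremental. */
    void cancel() {
        seen = null;
    }

    /** Records an identifier delivered by the source. */
    void onDocumentAdded(String identifier) {
        if (seen != null) {
            seen.add(identifier);
        }
    }

    /** Returns identifiers that are indexed but were not seen; empty if tracking is disabled. */
    Set<String> findMissing(Set<String> indexedIdentifiers) {
        Set<String> missing = new HashSet<>();
        if (seen != null) {
            missing.addAll(indexedIdentifiers);
            missing.removeAll(seen);
        }
        return missing;
    }
}
```

Returning an empty set when tracking is cancelled keeps an incremental harvest from deleting documents it simply did not revisit.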
public void close(boolean cleanShutdown) throws Exception
Description copied from class: Harvester
Closes harvester. … Harvester.processor is closed.
Overrides:
close in class Harvester
Parameters:
cleanShutdown - enables writing of status information to the Elasticsearch instance for the next harvesting. If an error occurred during harvesting this should not be done.
Throws:
Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not affect the current document.
protected void enumerateValidHarvesterPropertyNames(Set<String> props)
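Description copied from class: Harvester
This method is used by subclasses to enumerate all available harvester properties that are implemented by them. … Set. The public API for client code requesting property names is Harvester.getValidHarvesterPropertyNames().
Overrides:
enumerateValidHarvesterPropertyNames in class Harvester
See Also:
Harvester.getValidHarvesterPropertyNames()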
Copyright ©2007-2013 panFMP Developers c/o Uwe Schindler